Welcome to MilkyWay@home

problem with de_nbody tasks never finishing

Message boards : Number crunching : problem with de_nbody tasks never finishing

mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,597,350
RAC: 30,434
Message 70731 - Posted: 13 Apr 2021, 11:36:58 UTC - in response to Message 70730.  

[quote][quote]But you seem to be running mobile CPUs. I run my machines 24/7, since they are dedicated.
It could be one of the power-down tricks that Intel or Microsoft uses that causes the problem. I set my power options to "high performance" mode.[/quote]
Thanks again for the reply. You are right. Both my multi-core computers are laptops. They are older and the batteries are shot. I run them plugged into the charger, pretty much 24/7, for BOINC. I run several different BOINC projects, and it's only the 3-CPU Nbody tasks that have any problems.
[quote]Are the work units being suspended?[/quote]
BOINC does not show the hung-up tasks as suspended. They are shown as Running, with Elapsed time counting up, but Progress is frozen. Most Nbody tasks take 20 to 25 minutes to finish, so when I see one with a longer elapsed time, I restart BOINC. When BOINC restarts, the hung-up task starts running again, but its Elapsed time has been reset to a much earlier time (less than 20 minutes). I'm guessing that only around 10% of tasks hang up. Some tasks hang up multiple times.[/quote]


Instead of exiting BOINC, try suspending it and then restarting the crunching after a slow 10 count. The other problem could be memory: if your laptops don't have enough memory to handle the tasks, they will slow to a crawl. You might try running one less task at a time and see if it helps.

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,637,045
RAC: 160
Message 70736 - Posted: 14 Apr 2021, 18:37:40 UTC - in response to Message 70731.  

[quote]Instead of exiting BOINC, try suspending it and then restarting the crunching after a slow 10 count.[/quote]
I was pretty sure I had tried that before I first reported the problem a couple of years ago. But, just to make sure, I waited for another hang-up to occur and checked again. Suspending a hung-up task and later resuming it has no effect.
[quote]The other problem could be memory: if your laptops don't have enough memory to handle the tasks, they will slow to a crawl.[/quote]
I don't think this is the problem either. I have never seen the slowdown symptoms. But I have seen BOINC automatically handle a memory issue related to Einstein, and that always works seamlessly.
[quote]You might try running one less task at a time and see if it helps.[/quote]
Obviously, you don't remember or understand how the Nbody task works. It will take over any spare cores available.
[quote]The n-body workunits don't work for everyone; they work for a lot of people but not everyone, and it's a work in progress to keep up with all the new features and CPUs that come out all the time. I suggest just running the standard units; you can run 11 of them at a time if you also use your GPU for crunching.[/quote]
This is a quote from your 4 Jun 2018 post on this thread. In retrospect, I should have taken the advice and switched to running standard units only. Instead, I bought into the work-in-progress theory and did my part to report issues, assuming there might be efforts to fix them. But I was wrong in that assumption.

mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,597,350
RAC: 30,434
Message 70737 - Posted: 14 Apr 2021, 21:50:31 UTC - in response to Message 70736.  

[quote][quote]Instead of exiting BOINC, try suspending it and then restarting the crunching after a slow 10 count.[/quote]
I was pretty sure I had tried that before I first reported the problem a couple of years ago. But, just to make sure, I waited for another hang-up to occur and checked again. Suspending a hung-up task and later resuming it has no effect.
[quote]The other problem could be memory: if your laptops don't have enough memory to handle the tasks, they will slow to a crawl.[/quote]
I don't think this is the problem either. I have never seen the slowdown symptoms. But I have seen BOINC automatically handle a memory issue related to Einstein, and that always works seamlessly.[/quote]

That's a common problem across the boards: just because it works at that BOINC project has almost no bearing at all on this BOINC project.

[quote][quote]You might try running one less task at a time and see if it helps.[/quote]
Obviously, you don't remember or understand how the Nbody task works. It will take over any spare cores available.[/quote]

Ahh, stupid me. I knew that but was guilty of not being fully awake when I typed it. Sorry!!

[quote][quote]The n-body workunits don't work for everyone; they work for a lot of people but not everyone, and it's a work in progress to keep up with all the new features and CPUs that come out all the time. I suggest just running the standard units; you can run 11 of them at a time if you also use your GPU for crunching.[/quote]
This is a quote from your 4 Jun 2018 post on this thread. In retrospect, I should have taken the advice and switched to running standard units only. Instead, I bought into the work-in-progress theory and did my part to report issues, assuming there might be efforts to fix them. But I was wrong in that assumption.[/quote]

IMHO it's worth trying every once in a while just to see if they work. If they don't, no problem: just abort them and move on. If they do, enjoy the progress the project has made.

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,637,045
RAC: 160
Message 70742 - Posted: 15 Apr 2021, 17:41:16 UTC - in response to Message 70737.  
Last modified: 15 Apr 2021, 17:42:44 UTC

[quote]IMHO it's worth trying every once in a while just to see if they work. If they don't, no problem: just abort them and move on. If they do, enjoy the progress the project has made.[/quote]
Mikey,
Thank you for your help and suggestions on this issue. I truly appreciate the efforts you and Jim1348 made in responding to my posts. If my last post sounded a little cynical, please know my cynicism is directed at the project hierarchy and not at you. In the roughly 3 years since I first reported the problem, Tom Donlon's post on 9 Apr 2021 was the project's first response to this thread. I would also note that the problem existed with Nbody v1.68 and was not fixed when the current v1.76 came out. I can only conclude that the project developers are not concerned with user-reported issues. But, to your point, if and when a newer version of Nbody is released, I will try it.
Stick

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,637,045
RAC: 160
Message 70813 - Posted: 21 May 2021, 0:07:41 UTC

Just wanted to say that Nbody V1.80 has the same hang-up problem that I first reported on this thread about 3 years ago with V1.68 and then later with V1.76. As always, exiting BOINC and then restarting it gets things going again.

Unrelated to the hang-up problem: some people are reporting immediate task failures with V1.80 because it is not compatible with older Nbody WUs. If you are having that problem, resetting the project will fix it.

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,637,045
RAC: 160
Message 71069 - Posted: 27 Aug 2021, 13:29:38 UTC - in response to Message 70813.  

[quote]Just wanted to say that Nbody V1.80 has the same hang-up problem that I first reported on this thread about 3 years ago with V1.68 and then later with V1.76. As always, exiting BOINC and then restarting it gets things going again.[/quote]

Recently tried Nbody V1.82 and confirmed that the hang-up problem is still not fixed. I wasted about 6 days of CPU time before discovering the hang-ups and restarting BOINC.

DaveH52
Joined: 22 Apr 10
Posts: 3
Credit: 804,800
RAC: 0
Message 71073 - Posted: 28 Aug 2021, 19:08:24 UTC

I've had several cases where the longer it "runs" (with no CPU resources being used), the longer it will take.
Here are some times I recorded:
Elapsed: 3:26:30, Remaining 10:18:40 25.047% Complete
Elapsed: 10:01:00, Remaining 1d 10:41:12 25.047% Complete
Elapsed: 13:00:32, Remaining 1d 14:55:41 25.047% Complete
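
The three readings above show the pattern discussed in this thread: Elapsed keeps growing while the progress percentage stays at 25.047%. A minimal Python sketch of how one might flag that pattern from periodic readings follows. The `Sample` type, the `hms` and `is_stalled` helpers, and the 30-minute threshold are hypothetical names and values chosen for illustration; they are not part of BOINC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    elapsed_s: int       # elapsed time reported for the task, in seconds
    progress_pct: float  # "% Complete" reported for the task


def hms(text: str) -> int:
    """Convert an 'H:MM:SS' string into seconds."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s


def is_stalled(samples: List[Sample], min_window_s: int = 1800) -> bool:
    """True if progress never changed across at least min_window_s of elapsed time."""
    if len(samples) < 2:
        return False
    first, last = samples[0], samples[-1]
    same_progress = all(s.progress_pct == first.progress_pct for s in samples)
    return same_progress and (last.elapsed_s - first.elapsed_s) >= min_window_s


# Readings copied from the post above.
readings = [
    Sample(hms("3:26:30"), 25.047),
    Sample(hms("10:01:00"), 25.047),
    Sample(hms("13:00:32"), 25.047),
]
print(is_stalled(readings))
```

A check like this only reports the symptom; it says nothing about why the task stopped using the CPU.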

BOINC Manager Version 7.16.11 (x64)
wxWidgets Version 3.0.1
VirtualBox 6.1.26r145957 (Qt 5.6.2)

Dell Optiplex 9010, BIOS O9010 A30
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 32.0 GB RAM, 64-bit operating system, x64-based processor
Windows 10 Pro, 21H1, Build 19043.1165
NVIDIA GeForce GT-710 Bios Version 80.28.a6.0.5f
(or built-in Intel HD Graphics 4000)

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,637,045
RAC: 160
Message 71074 - Posted: 28 Aug 2021, 23:36:26 UTC - in response to Message 71073.  

[quote]I've had several cases where the longer it "runs" (with no CPU resources being used), the longer it will take.[/quote]

What happens if you exit BOINC and then restart it?

Tom Veik
Joined: 11 Dec 16
Posts: 1
Credit: 396,131
RAC: 0
Message 71302 - Posted: 3 Nov 2021, 1:16:00 UTC

I'm having this problem too. Started using MilkyWay again after being away for some time. Work units just stop processing even though they say "Running" in the list. CPU load drops to very low when this happens. If I restart BOINC that gets them going again for a short time, then they stop again. I set BOINC to stop getting new work units. I'll try to nurse these work units along until they complete. Hope I can get them done before the deadline. Then I'm switching to something else to work on.

BOINC 7.16.20
Windows 10 Home fully updated.
16 gig ram

Andreas
Joined: 18 Feb 09
Posts: 2
Credit: 2,212,132
RAC: 0
Message 71542 - Posted: 19 Dec 2021, 10:30:27 UTC

I had a similar problem with all MilkyWay@home tasks (nBody & Separation 1.46). They always ran through to 100% and then immediately started again at 0%. OpenCL tasks had the same problem. I partly solved the problem by excluding the projects folder from the antivirus scan of "Acronis Cyber Protect Home Office". From then on things ran smoothly again; only the OpenCL tasks kept going in circles. I found no solution for that, and these tasks cannot even be aborted: they disappear from the BOINC Manager for only a few seconds and then immediately reappear at 0%!

Nobody expects a backup solution to block anything... I figured it out because the blocking of the Rosetta application popped up as a prompt window; that was not the case with Milkyway. Also, the extremely slow start of the projects after launching the BOINC Manager was fixed by this as well. For me, the Manager was empty for the first 4 minutes, as if I had not added any projects or had no tasks in the queue.

Andreas
Joined: 18 Feb 09
Posts: 2
Credit: 2,212,132
RAC: 0
Message 71543 - Posted: 19 Dec 2021, 10:30:29 UTC
Last modified: 19 Dec 2021, 10:35:42 UTC

Sorry, mouse was faster than me translating...:
I had a similar problem here with M@h tasks (both nBody and Separation 1.46). They kept circling around, restarting at 0% after reaching 100%. OpenCL tasks had the same issue. The solution was to exclude the projects folder from "Acronis Cyber Protect Home Office" antivirus scans. Only the OpenCL tasks keep going round and round... And those tasks additionally cannot be aborted. They appear again at 0% after a few seconds.

The solution above solved another issue I had on BOINC Manager startup: I had an empty window for several minutes (as if I had no projects attached or tasks in queue). That's much better now.

mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,597,350
RAC: 30,434
Message 71544 - Posted: 19 Dec 2021, 12:36:46 UTC - in response to Message 71543.  

[quote]Sorry, mouse was faster than me translating...:
I had a similar problem here with M@h tasks (both nBody and Separation 1.46). They kept circling around, restarting at 0% after reaching 100%. OpenCL tasks had the same issue. The solution was to exclude the projects folder from "Acronis Cyber Protect Home Office" antivirus scans. Only the OpenCL tasks keep going round and round... And those tasks additionally cannot be aborted. They appear again at 0% after a few seconds.

The solution above solved another issue I had on BOINC Manager startup: I had an empty window for several minutes (as if I had no projects attached or tasks in queue). That's much better now.[/quote]

You can take that a little further and exclude the whole set of BOINC folders instead of just the projects folder if you'd like. Any real virus will try to escape the BOINC folders and get caught, while the numerous 'false positive' notifications will no longer be a problem. One problem with the A/V companies is that they are looking for patterns now, and sending and receiving small bits of data from the same place could be an indication of a virus sending its info back to wherever, but BOINC does exactly the same thing and it is not a virus.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,895,974
RAC: 353
Message 73823 - Posted: 12 Jun 2022, 18:56:35 UTC - in response to Message 71544.  
Last modified: 12 Jun 2022, 18:59:20 UTC

I have not had the problem. When doing Nbody I always ensure that it gets a fixed number of CPUs. I have 16 processors; when doing Nbody I set the CPU % to 25%, i.e. 4 CPUs. I allow new jobs to download and build up a queue, all requiring 4 CPUs, then stop any more downloads. Once under way, I change the CPU allocation for BOINC to, say, 50 or 75%, which allows 2 or 3 WUs concurrently. Is it possible that those with the problem are ending up with a fractional number of processors, e.g. 80% of 16 processors is 12.8? Can Nbody cope with that? Just me rambling; my method may be a bit archaic, but it works. I then still have enough to run Einstein, which needs a CPU as well as the GPU; machine temp stays in the low 70s.
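
The arithmetic in this post can be checked with a short Python sketch. It assumes that a fractional result such as 12.8 is rounded down to a whole number of usable CPUs and that each Nbody task claims a fixed number of CPUs; both are assumptions made for illustration rather than confirmed BOINC behaviour, and `usable_cpus` and `concurrent_tasks` are hypothetical helper names.

```python
import math


def usable_cpus(total_cpus: int, percent: float) -> int:
    """Whole CPUs available at a given 'use at most N% of the CPUs' setting.

    Assumption: fractional results are rounded down, with a minimum of 1.
    """
    return max(1, math.floor(total_cpus * percent / 100))


def concurrent_tasks(total_cpus: int, percent: float, cpus_per_task: int) -> int:
    """How many tasks of cpus_per_task CPUs fit into the allowed CPUs."""
    return usable_cpus(total_cpus, percent) // cpus_per_task


# Percentage, usable CPUs out of 16, and number of 4-CPU tasks that fit.
for pct in (25, 50, 75, 80):
    print(pct, usable_cpus(16, pct), concurrent_tasks(16, pct, 4))
```

Under these assumptions 80% of 16 CPUs behaves like 12 CPUs, which still fits three 4-CPU tasks, the same as 75%.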

mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,597,350
RAC: 30,434
Message 73827 - Posted: 13 Jun 2022, 10:06:41 UTC - in response to Message 73823.  

[quote]I have not had the problem. When doing Nbody I always ensure that it gets a fixed number of CPUs. I have 16 processors; when doing Nbody I set the CPU % to 25%, i.e. 4 CPUs. I allow new jobs to download and build up a queue, all requiring 4 CPUs, then stop any more downloads. Once under way, I change the CPU allocation for BOINC to, say, 50 or 75%, which allows 2 or 3 WUs concurrently. Is it possible that those with the problem are ending up with a fractional number of processors, e.g. 80% of 16 processors is 12.8? Can Nbody cope with that? Just me rambling; my method may be a bit archaic, but it works. I then still have enough to run Einstein, which needs a CPU as well as the GPU; machine temp stays in the low 70s.[/quote]

Yes, your way works just fine, but with an app_config.xml file like this:

<app_config>

<app>
<name>milkyway</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>0.5</cpu_usage>
</gpu_versions>
</app>
<app_version>
<app_name>milkyway_nbody</app_name>
<max_concurrent>2</max_concurrent>
<plan_class>mt</plan_class>
<avg_ncpus>2</avg_ncpus>
<cmdline>--nthreads> 6</cmdline>
</app_version>
</app_config>

you can control BOTH the number of CPU cores the Nbody tasks use AND the number of GPU tasks you run at the same time. The above is set to run 2 GPU tasks at a time and run the Nbody tasks using only 2 CPU cores per task with a maximum of 6 tasks running on the PC.

As for how many CPU cores the Nbody tasks can cope with, I think it depends on each PC, though NOT using all the PC's CPU cores for each task does seem to be the key.
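
To see what the file above actually requests, the values can be read back with a short Python sketch. It uses the standard-library `xml.etree.ElementTree` parser, which is not the parser the BOINC client uses, so this only checks the text of the file. The `summarize` helper is a hypothetical name, and the embedded copy of the file writes the command line as `--nthreads 6` on the assumption that this is what was intended.

```python
import xml.etree.ElementTree as ET

# Copy of the posted file; the cmdline is written as "--nthreads 6" (assumed intent).
APP_CONFIG = """
<app_config>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <max_concurrent>2</max_concurrent>
    <plan_class>mt</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--nthreads 6</cmdline>
  </app_version>
</app_config>
"""


def summarize(xml_text: str) -> dict:
    """Pull the numbers discussed in the post out of an app_config.xml string."""
    root = ET.fromstring(xml_text)
    gpu_usage = float(root.findtext("app/gpu_versions/gpu_usage"))
    version = root.find("app_version")
    cmdline = version.findtext("cmdline").split()
    return {
        # A gpu_usage of 0.5 budgets half a GPU per task, i.e. two tasks per GPU.
        "gpu_tasks_per_gpu": int(1 / gpu_usage),
        "avg_ncpus": float(version.findtext("avg_ncpus")),
        "max_concurrent": int(version.findtext("max_concurrent")),
        # Value that follows the --nthreads flag on the command line.
        "nthreads": int(cmdline[cmdline.index("--nthreads") + 1]),
    }


print(summarize(APP_CONFIG))
```

Read this way, the file budgets 2 CPUs per Nbody task, passes a thread count of 6 to the application, and lists a max_concurrent of 2, so it may be worth double-checking those numbers against the description above. How the BOINC client interprets each value is not checked here.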

captainjack
Joined: 22 Jun 13
Posts: 44
Credit: 64,258,609
RAC: 0
Message 73829 - Posted: 13 Jun 2022, 11:35:40 UTC

Hey mikey,

Your app_config.xml has a typo in it.

In the line
<cmdline>--nthreads> 6</cmdline>
there is an extra ">" after --nthreads.

Looks to me like it should be
<cmdline>--nthreads 6</cmdline>

I don't know if it will make a difference but thought it might.
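
Whether the extra character matters can be partly checked without BOINC. In XML a bare `>` inside element text is still well-formed, so a strict parser would not reject the line; the character simply becomes part of the command-line string. The Python sketch below shows this with the standard-library `xml.etree.ElementTree` parser. BOINC uses its own parser, so how the client or the Nbody application treats the resulting `--nthreads>` token is not something this sketch can confirm.

```python
import xml.etree.ElementTree as ET

with_typo = ET.fromstring("<cmdline>--nthreads> 6</cmdline>")
corrected = ET.fromstring("<cmdline>--nthreads 6</cmdline>")

# Both lines parse; only the text handed on as the command line differs.
print(with_typo.text.split())
print(corrected.text.split())
```

So the typo would not show up as an XML error; if it causes trouble, it would be because the flag name itself no longer matches.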

mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,597,350
RAC: 30,434
Message 73830 - Posted: 13 Jun 2022, 15:35:26 UTC - in response to Message 73829.  
Last modified: 13 Jun 2022, 15:46:09 UTC

[quote]Hey mikey,

Your app_config.xml has a typo in it.

In the line
<cmdline>--nthreads> 6</cmdline>
there is an extra ">" after --nthreads.

Looks to me like it should be
<cmdline>--nthreads 6</cmdline>

I don't know if it will make a difference but thought it might.[/quote]

Yup, you are correct, and there is another error in the same line where I think the dash should be an underline before nthreads. I will fix both in my file. Thanks.

captainjack
Joined: 22 Jun 13
Posts: 44
Credit: 64,258,609
RAC: 0
Message 73831 - Posted: 13 Jun 2022, 16:30:18 UTC

According to the example given at https://boinc.berkeley.edu/wiki/client_configuration, it looks to me like it is supposed to be dashes. I have dashes in my app_config.xml and they work fine.

mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,597,350
RAC: 30,434
Message 73832 - Posted: 13 Jun 2022, 23:40:31 UTC - in response to Message 73831.  

[quote]According to the example given at https://boinc.berkeley.edu/wiki/client_configuration, it looks to me like it is supposed to be dashes. I have dashes in my app_config.xml and they work fine.[/quote]

You, sir, are correct, and I will change it back right now!!

Client configuration
[<cmdline>--nthreads 7</cmdline>]

©2024 Astroinformatics Group