Welcome to MilkyWay@home

Multiple CPU tasks stall

Fujidave
Joined: 15 Jun 11
Posts: 1
Credit: 174,210,453
RAC: 0
Message 69615 - Posted: 22 Mar 2020, 15:25:05 UTC

I have been seeing tasks that target multiple CPUs run for a while and then stall completely. The timers keep going, but progress halts until BOINC Manager is restarted. My PC pulls tasks for 9 CPUs no matter what I set the preferences to (use 100% of the CPUs), which should pull up to 12. The CPU + GPU tasks progress without issue, but once an all-CPU task stalls, I have to shut down BOINC Manager and restart it. After that it processes for a while, then stalls again.

Is this a known issue? Any advice on correcting it?

Kuroyuki
Joined: 1 Apr 20
Posts: 1
Credit: 109,566
RAC: 0
Message 69644 - Posted: 3 Apr 2020, 21:32:53 UTC - in response to Message 69615.  

I have the same issue here. For me, the 4-CPU "N-Body Simulation 1.76 (mt)" task has gotten stuck several times (in fact, every time). When it "stops", its CPU usage drops to ~0.2%.
Is there some way I can report this, or figure out what the error is?

Aaron
Joined: 2 May 10
Posts: 5
Credit: 255,667,653
RAC: 0
Message 69657 - Posted: 6 Apr 2020, 19:12:52 UTC
Last modified: 6 Apr 2020, 19:14:50 UTC

I have the same issue with every task. I found that if you suspend computation, exit BOINC, reopen BOINC, and then resume computation, the task continues where it left off.
This is very tedious to do since it's impossible to know which percentage it will stall at. Last night a task stalled at 96% done. When I got up today the next task stalled at 11%.
This only happens with n-body simulations.

Is there a theory on why this is happening?

EDIT: I haven't tried the GPU apps, since my GPU is being used for Folding@home.

mikey
Joined: 8 May 09
Posts: 3310
Credit: 519,244,124
RAC: 20,888
Message 69658 - Posted: 6 Apr 2020, 21:04:02 UTC - in response to Message 69615.  

Do you have BOINC set to use 100% of the processors AND have Hyper-Threading turned ON in the BIOS? If so, that could be the problem: it's trying to use 12 CPU cores when it really only has 6 actual CPU cores. At some point it would lock up if you are trying to use both the actual and virtual CPU cores at the same time, because there just isn't enough CPU time.

Aaron
Joined: 2 May 10
Posts: 5
Credit: 255,667,653
RAC: 0
Message 69659 - Posted: 7 Apr 2020, 0:28:15 UTC - in response to Message 69658.  

I can't speak for the original poster, but I'm currently using 9 of 12 CPU threads and BOINC is using 70% of CPU time.

mikey
Joined: 8 May 09
Posts: 3310
Credit: 519,244,124
RAC: 20,888
Message 69660 - Posted: 7 Apr 2020, 10:23:17 UTC - in response to Message 69659.  

Try cutting back to using only 6 CPU cores, just the real ones, and see if that helps.

Aaron
Joined: 2 May 10
Posts: 5
Credit: 255,667,653
RAC: 0
Message 69663 - Posted: 7 Apr 2020, 20:06:12 UTC - in response to Message 69660.  

Looks like I'll have to run out some CPU tasks since the job cache is full. Is there a way to limit the number of tasks per app so I can increase the amount of work BOINC stores?
I don't want to get 60 more tasks I'll have to babysit or abort.

mikey
Joined: 8 May 09
Posts: 3310
Credit: 519,244,124
RAC: 20,888
Message 69664 - Posted: 7 Apr 2020, 23:39:59 UTC - in response to Message 69663.  
Last modified: 7 Apr 2020, 23:42:04 UTC

Sure. Go into BOINC Manager and, under Options, Computing preferences, set the top line to 50%. As soon as you click OK at the bottom of the page, it will limit the CPU cores your PC uses. Done this way, it is a PC-by-PC setting; to do it globally, change it in the website settings instead. Those will take effect the next time your PCs connect to the website, or as soon as you click Update under Projects in BOINC Manager.

To limit the number of tasks you get, change the percentage setting for the project in the website settings under Your Account, or change the cache settings in BOINC Manager or on the website.
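
For a limit on one app specifically, another option is an app_config.xml file in the project's folder under the BOINC data directory. The sketch below is only an illustration: it assumes the n-body app's short name is milkyway_nbody (check the app names listed in client_state.xml) and uses the max_concurrent option, which caps how many tasks of that app run at the same time. It is a cap on running tasks rather than a download quota, so the cache settings still matter.

```xml
<app_config>
   <!-- app name is an assumption; confirm it in client_state.xml -->
   <app>
       <name>milkyway_nbody</name>
       <!-- run at most one task of this app at a time (illustrative value) -->
       <max_concurrent>1</max_concurrent>
   </app>
</app_config>
```

The client should pick the file up after Options, Read config files in BOINC Manager (Advanced view), or after the client is restarted.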

Aaron
Joined: 2 May 10
Posts: 5
Credit: 255,667,653
RAC: 0
Message 69679 - Posted: 9 Apr 2020, 19:07:28 UTC
Last modified: 9 Apr 2020, 19:16:45 UTC

I cut back to using only 6 cores, and the first n-body task to run stalled at about 40% with 8 hours of runtime. I suspended computation, exited BOINC, relaunched BOINC, and resumed computation, and it continued where it left off (CPU usage around 70-75%).

This method seems to be the only way to get n-body tasks to run, so I'll only be using my GPU for this project.

TrevorGoesB00
Joined: 23 Oct 17
Posts: 3
Credit: 2,797,983
RAC: 0
Message 69730 - Posted: 17 Apr 2020, 21:47:23 UTC

I have the same issue. Tasks stall after various run times, will restart if BOINC is closed/re-opened, and then stall again after various run times.

I'm running (rather, attempting to run) an old laptop as a dedicated cruncher; Intel Core i5-4200m; 2 cores, 4 threads.

I've tried various combinations of % of CPUs (100, 90, 50), % of CPU time (100, 90, 50), enabling/disabling threading and/or core-multi-processing. All fail.

I was running both Rosetta and MilkyWay, but scaled back to just MilkyWay in hopes of isolating the problem. No success.

What am I missing?

mikey
Joined: 8 May 09
Posts: 3310
Credit: 519,244,124
RAC: 20,888
Message 69735 - Posted: 18 Apr 2020, 11:32:19 UTC - in response to Message 69730.  

How much memory does each task take when it's running? Go into BOINC Manager, click on a running task, then click Properties in the left-hand box and scroll down to the bottom. It will say something like:
Virtual memory size 96.96 MB
Working set size 37.01 MB

Those numbers are for a NON-MilkyWay task, but you can see BOINC set aside almost 100 MB of RAM for the task. Since you only have 8 GB of RAM in that laptop, you could be running out, and the workunit is pausing to wait for more memory. The only ways to fix that are to restart BOINC like you have been, run fewer tasks at one time, or add more memory.

There is a setting in BOINC Manager under Computing preferences, Disk and memory, in the bottom section, Memory. The default for when the computer is in use is 50%, I believe; you could bump that up and see if it helps. BUT it all depends on how much memory each task needs to run. If it needs 2 GB per task and you are running 4 tasks at a time, that will never work: the OS takes at least 1 GB just to load itself, leaving you a max of 7 GB in that machine before BOINC even starts, and 4 x 2 GB = 8 GB is more than that.

The memory number is set by the project, as they try to keep the workunit running in memory to prevent constant swapping to a hard drive that is at least 10 times slower, which would also wear out the drive. BOINC always reserves the max it will need for each task, even if, as in my case, the task was only using 37 MB at the time I looked at it.

Aaron
Joined: 2 May 10
Posts: 5
Credit: 255,667,653
RAC: 0
Message 69742 - Posted: 19 Apr 2020, 17:15:31 UTC - in response to Message 69735.  

I'm not 100% sure that a lack of RAM is the problem. I have BOINC using up to 80% of my 32 GB of RAM and still had issues with the n-body simulation.
Right now I'm running 2 MilkyWay tasks on my GPU (0.5 CPU/0.5 GPU) and 9 CPU tasks for Rosetta with no issues.
I think this may be a bug in the code of the n-body simulation program.

TrevorGoesB00
Joined: 23 Oct 17
Posts: 3
Credit: 2,797,983
RAC: 0
Message 69751 - Posted: 22 Apr 2020, 15:37:48 UTC

Let me begin with the observation that the 'stalling' seems to be isolated to n-body simulations. I haven't documented that the ONLY time it stalls is while running MilkyWay simulations, but I can say that it does stall EVERY time an n-body simulation is running. These are my current settings for COMPUTING and MEMORY:

MilkyWay:
Computing: Use at most 50 % of the CPUs
Computing: Use at most 50 % of CPU time
Memory: When computer is in use, use at most 90 %
Memory: When computer is not in use, use at most 95 %

Non-MilkyWay (Rosetta):
Computing: Use at most 50 % of the CPUs
Computing: Use at most 95 % of CPU time
Memory: When computer is in use, use at most 90 %
Memory: When computer is not in use, use at most 95 %

MilkyWay (currently suspended):
Virtual memory size 13.03 MB
Working set size 992.00 KB
Calculated: 0.08% (again, currently suspended)

Non-MilkyWay (Rosetta) (currently active tasks):
Virtual memory size 308.13 MB
Working set size 306.24 MB
Calculated: 99.39%

Virtual memory size 255.60 MB
Working set size 254.42 MB
Calculated: 99.54%

I chose the 50% limit on both MilkyWay and Rosetta because other posts indicated that there may be a CPU core/virtual core issue. As this dedicated machine has 4 cores (2 physical and 2 virtual), I set the usage limit to 50% in the application settings and disabled hyperthreading in the BIOS, in hopes of using only the physical cores. Prior tests with higher CPU usage limits and with hyperthreading enabled or disabled yielded the same results. Please let me know if you think my settings should be changed; the goal is for this computer to be a dedicated crunching machine with little to no other use.

I'm of the mindset that RAM is not the problem. I typically have 2 active tasks, when running either MilkyWay or Rosetta. I'm beginning to think that it's a problem with the n-body simulation, not my hardware or user settings. It sucks as I want to run MilkyWay and/or other tasks at the maximum capability of this machine.

Thoughts?

TrevorGoesB00
Joined: 23 Oct 17
Posts: 3
Credit: 2,797,983
RAC: 0
Message 69759 - Posted: 24 Apr 2020, 19:44:04 UTC

After a little more "forum reconnaissance" I've found that issues with n-body tasks are somewhat pervasive.

My research found that "N-Body tasks are large difficult CPU memory demanding tasks. By editing preferences, and deselecting N-Body tasks, you will only get Milkyway@home separation tasks which will work better".

The full thread can be found here:

https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4589

I've deselected the N-body tasks and am happy to report that the stalling/hanging issue seems to have ceased. After a few days of running without issue, I re-enabled hyperthreading. I'll report back on that front once I have a larger data set.

Markus Torstensson
Joined: 28 Apr 20
Posts: 1
Credit: 62,710
RAC: 0
Message 69765 - Posted: 30 Apr 2020, 9:35:46 UTC
Last modified: 30 Apr 2020, 9:36:42 UTC

It smells like a deadlock issue.
Multithreaded programming is hard to get right.

I force them to run with a single thread only, and then it works.

Place this in a file called app_config.xml in the project folder:
<app_config>
   <app_version>
       <app_name>milkyway_nbody</app_name>
       <plan_class>mt</plan_class>
       <avg_ncpus>1</avg_ncpus>
       <cmdline>--nthreads 1</cmdline>
   </app_version>
</app_config>
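
The two settings in that file do different jobs: avg_ncpus tells the BOINC client how many CPUs to budget for each task, while cmdline passes --nthreads to the application itself. If a single thread turns out to be too slow, a middle ground might look like the sketch below. The value 2 is purely illustrative and untested here, and nothing in this thread shows whether the stall comes back with more than one thread.

```xml
<app_config>
   <app_version>
       <app_name>milkyway_nbody</app_name>
       <plan_class>mt</plan_class>
       <!-- CPUs the client budgets per task; 2 is an illustrative, untested value -->
       <avg_ncpus>2</avg_ncpus>
       <!-- thread count handed to the app; kept equal to avg_ncpus -->
       <cmdline>--nthreads 2</cmdline>
   </app_version>
</app_config>
```

Keeping the two numbers equal means the client's CPU budget matches what the app actually uses.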

©2024 Astroinformatics Group