Welcome to MilkyWay@home

Multi cpu tasks stalling

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71662 - Posted: 3 Feb 2022, 14:14:12 UTC

This is similar to the problem described in the "Long-Running tasks" thread, possibly the same problem?

I'm having a problem with multi-CPU tasks stalling out, in this case 12-CPU tasks on a 6-core i7-8850H, but I was also seeing a similar or the same problem on my other laptop, a 4-core i3, with 4-CPU tasks. The problem only happens with multi-CPU tasks from Milkyway@home.
Basically, a task will run fine at normal speed until some random percentage, then stall. Time elapsed keeps ticking, of course, but time remaining begins counting upwards. This goes on indefinitely. Fully exiting BOINC and restarting resets both elapsed and remaining time and gets the task running until it stalls again.
Playing around with it, it seems that clicking the X to minimize to the system tray will sometimes cause a stall. But even if I've got it open, it WILL stall when my display turns off. It also seems to happen if it's open but just minimized. BUT stalling doesn't happen with other tasks; WCG tasks, say, run fine.
I have all the "When to Suspend" options unchecked. The computer's power settings are set to turn off the display after a minute (closing the lid causes a stall just the same), keep the hard drive active, and never sleep. Prior to this, things ran fine for 2 days straight crunching WCG tasks.
Hoping not to have to abort the whole stack.
I'm relatively new to the community and this is my first post. Thanks for the support!

Boinc version 7.16.20 (x64)

Model: HP ZBook 15 G5
Computer Type: Mobile
Manufactured by: HP
Baseboard ID: 842A
BIOS version: Q70 Ver. 01.18.01
Processor: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Enabled Processor Count: 12
Total Memory: 16 GB
Local Storage: 1.36 TB (3 drives)
Graphics Card & Driver 1: UHD Graphics 630 -- 27.20.100.8681
Graphics Card & Driver 2: Quadro P1000 -- 472.08
Operating System: Windows 10 (64-bit) - Microsoft Windows 10 Pro
Current Culture: en-US
.NET Framework: 528372

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,037,238
RAC: 35,415
Message 71663 - Posted: 3 Feb 2022, 16:27:06 UTC - in response to Message 71662.  

I have not seen one count upwards indefinitely. In my case, they did count up but eventually completed with very high cpu second counts, and all were eventually credited. As an experiment, what happens if you pick one and let it run for 18 or 20 hours? Otherwise, I think the MW@home staff will need to resolve this. Good luck!

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71669 - Posted: 3 Feb 2022, 17:47:11 UTC - in response to Message 71663.  

Thanks, I'll give it a shot. I'm babysitting one currently, screen on, BOINC open, and yep, it stalled again at 7.723%. No progress, 36 min elapsed, 7 hr 15 min remaining. It looks like the time remaining estimate is gaining around 15 sec for every sec elapsed. With 12 CPUs, it should only be a 2 min task. If I reset, it will revert and run at proper speed for a while. Now up to 50 min elapsed / 10 hr remaining!

Last year I was running Milkyway with no problems. Since then I've updated Windows, moved the OS to a new SSD, updated drivers, etc. I just started crunching again last week, and everything works fine except these multi-CPU tasks. I'm hoping it's something simple I can fix with settings, but the strange thing is I'm having the same problem with 4-CPU Milkyway tasks on the other laptop. I'll see about reproducing it on that one again.
cheers!
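
For what it's worth, the figures in this post fit a simple linear extrapolation. Assuming the client estimates remaining time as elapsed × (1 − fraction done) / fraction done (an assumption about how the estimate behaves, not something taken from the BOINC source), a task frozen at 7.723% would gain roughly 12 seconds of estimate per second of wall-clock time. A minimal Python sketch, with hypothetical function names:

```python
def estimated_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Linear extrapolation: remaining = elapsed * (1 - f) / f.

    This formula is an assumption used for illustration only.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1.0 - fraction_done) / fraction_done


def format_hm(seconds: float) -> str:
    """Format a duration in seconds as hours and minutes."""
    minutes = int(seconds // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


stalled_fraction = 0.07723  # 7.723%, the stall point quoted in the post
for elapsed_min in (36, 50):
    remaining = estimated_remaining(elapsed_min * 60, stalled_fraction)
    print(f"{elapsed_min} min elapsed -> about {format_hm(remaining)} remaining")
```

With the fraction done stuck, the estimate grows in proportion to elapsed time: a little over 7 hours at 36 minutes and just under 10 hours at 50 minutes, close to what was reported.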

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71670 - Posted: 3 Feb 2022, 18:25:03 UTC

Well, looking around I see this is a de_nbody problem that comes and goes; other users have reported the same thing.
Changed power settings to "High Performance" and it seems to be working OK. The i3 is also crunching away. Now to leave it all day.
Feeling sheepish about a possibly superfluous post, but live and learn!

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,037,238
RAC: 35,415
Message 71677 - Posted: 5 Feb 2022, 2:02:59 UTC - in response to Message 71671.  

This one might be a case in point. It was originally estimated to be a 20-minute task, but at 20 minutes the estimate had ballooned to well over 16 hours. I think I will let it run. My only fear is running it for that long only to have it fail validation.


Application: Milkyway@home N-Body Simulation 1.82 (mt)
Name: de_nbody_08_31_2021_v176_40k__data__12_1640968180_337678
State: Running
Received: 2/2/2022 10:29:04 AM
Report deadline: 2/14/2022 10:29:05 AM
Resources: 4 CPUs
Estimated computation size: 10,431 GFLOPs
CPU time: 02:31:59
CPU time since checkpoint: 00:03:29
Elapsed time: 00:42:22
Estimated time remaining: 15:23:01
Fraction done: 4.390%
Virtual memory size: 14.70 MB
Working set size: 18.68 MB
Directory: slots/1
Process ID: 6528
Progress rate: 6.480% per hour
Executable: milkyway_nbody_1.82_windows_x86_64__mt.exe

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71678 - Posted: 5 Feb 2022, 5:05:06 UTC

After running for a while, I'm having the same problem, on both the 12-CPU i7 and the 4-CPU i3.
I took a screenshot of a task which, after 2 hr 40 min, had ballooned to a 290-day estimated completion time.
So it looks like, once stalled, a task will remain stalled indefinitely.
Fully restarting BOINC will reset it and allow the task to run, sometimes long enough to finish. But it will always stall again after a few minutes.
Tasks that finish seem to report and validate just fine; it's just taking multiple restarts to finish a task, and these are supposed to take only 3 min, or 15 min or so for the 4-CPU ones.
I thought it was related to the display shutting off or minimizing to the system tray, but even with BOINC open and running, it stalls eventually just the same.
Next I'm going to suspend all the multi-CPU tasks and see if the single-CPU ones process.
The last update gave me only about 8 singles and 20+ multi-CPU tasks. On the i3 laptop, I received only multi-CPU tasks.
It looks to me like Milkyway tasks are broken until this is resolved.
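
The "running but not progressing" state described in this post can be told apart from a merely slow task by looking at whether the fraction done moves at all over a window of time. A rough Python sketch under stated assumptions: the function name, the 5-minute window, and the progress threshold are all hypothetical, and the samples are made up rather than read from the BOINC client.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (wall-clock seconds, fraction done in 0..1)


def is_stalled(samples: Sequence[Sample], window_s: float = 300.0,
               min_progress: float = 1e-4) -> bool:
    """Return True if fraction done rose by less than min_progress
    over the most recent window_s seconds of samples."""
    if len(samples) < 2:
        return False
    latest_t, latest_f = samples[-1]
    # Keep only the samples that fall inside the trailing window.
    in_window = [(t, f) for t, f in samples if latest_t - t <= window_s]
    first_t, first_f = in_window[0]
    if latest_t - first_t < window_s * 0.9:
        return False  # not enough history yet to judge
    return (latest_f - first_f) < min_progress


# Made-up samples taken once a minute.
healthy = [(t, 0.01 * t / 60) for t in range(0, 361, 60)]
stalled = [(t, min(0.396, 0.075 * t / 60)) for t in range(0, 901, 60)]

print("healthy task stalled?", is_stalled(healthy))
print("frozen task stalled? ", is_stalled(stalled))
```

A check like this only flags the symptom; it says nothing about the cause, and restarting the client remains the workaround described above.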

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,950,748
RAC: 21,733
Message 71679 - Posted: 5 Feb 2022, 10:46:03 UTC - in response to Message 71678.  

How many multi-CPU tasks are you trying to run at one time? You may need to check the properties of those tasks: in BOINC Manager, go to Tasks, pick a running one, and then click Properties on the left. If each task is using, e.g., 10 GB of memory, your PC has only 16 GB, and you are trying to do something else that needs memory and CPU too, you could be running out of memory, and the task may not be designed to handle that situation.

HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,037,238
RAC: 35,415
Message 71682 - Posted: 5 Feb 2022, 17:06:48 UTC - in response to Message 71679.  

290 days!?!?!! Ouch! that'll leave a mark!

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71688 - Posted: 7 Feb 2022, 8:38:37 UTC - in response to Message 71679.  

Just one at a time. I do have 16 GB of RAM on both laptops; Properties says the working set size is about 15 MB, however. Here are the copied properties of a task stalled at 39.614%. Task Manager doesn't show excessive RAM use, only 40% when the task is running. It does show high power usage for the nbody process, but only while it's running. Once it stalls, it still says "Running", but CPU usage in Task Manager is nil.


Application: Milkyway@home N-Body Simulation 1.82 (mt)
Name: de_nbody_08_31_2021_v176_40k__data__13_1640968180_2498241
State: Running
Received: 2/2/2022 2:46:28 AM
Report deadline: 2/14/2022 2:46:20 AM
Resources: 12 CPUs
Estimated computation size: 12,267 GFLOPs
CPU time: 00:15:16
CPU time since checkpoint: 00:00:05
Elapsed time: 00:05:46
Estimated time remaining: 00:08:47
Fraction done: 39.615%
Virtual memory size: 11.59 MB
Working set size: 15.11 MB
Directory: slots/0
Process ID: 14440
Progress rate: 7.470% per minute
Executable: milkyway_nbody_1.82_windows_x86_64__mt.exe
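
One number that can be read off a Properties listing like this is how busy the allotted CPUs were on average: CPU time divided by (elapsed time × number of CPUs). The Python sketch below does that arithmetic for the two listings posted in this thread. The function names are hypothetical, and since elapsed time reportedly resets when BOINC restarts, the result is only a rough indicator.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def avg_core_utilization(cpu_time: str, elapsed: str, ncpus: int) -> float:
    """CPU time / (elapsed time x allotted CPUs), as a 0..1 fraction."""
    return hms_to_seconds(cpu_time) / (hms_to_seconds(elapsed) * ncpus)


# Values copied from the two Properties listings in this thread.
print(f"12-CPU task: {avg_core_utilization('00:15:16', '00:05:46', 12):.0%}")
print(f" 4-CPU task: {avg_core_utilization('02:31:59', '00:42:22', 4):.0%}")
```

This works out to about 22% for the 12-CPU task and about 90% for the 4-CPU task. A low figure is what one would expect if most threads sat idle for much of the run, but it does not say why.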

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,881
RAC: 267
Message 71689 - Posted: 7 Feb 2022, 20:01:09 UTC
Last modified: 7 Feb 2022, 20:36:14 UTC

I think the WUs are OK. As a test I just did 23 and they ran OK. I run an i7 with 16 GB. I allocated 6 CPUs for the N-body simulation runs. Each run took between 3 and 5 minutes elapsed time. They all completed OK and are now awaiting validation. Only one task at a time was running; GPU tasks (Einstein) were also running.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,950,748
RAC: 21,733
Message 71691 - Posted: 8 Feb 2022, 13:00:56 UTC - in response to Message 71688.  

Windows Task Manager is notorious for being very bad at recognizing what's actually running and how much memory it takes; e.g., it doesn't even show that Windows itself takes almost 1 GB. But the listing above says MilkyWay is using only 15 MB, not GB, of memory, so that shouldn't be the problem. Try changing the CPU cores available to BOINC for the next batch of tasks to only 50% of the total: in BOINC Manager, click Options on the top line above your tasks, then Computing Preferences, and change the top box to 50%. I think your problem could be that you are using every CPU core, both real and HT, and the PC is just bogging down because it can't handle all the swapping around the tasks do in cached memory. Your PCs are hidden so I can't be positive, but that's what it seems like to me.

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71693 - Posted: 9 Feb 2022, 6:42:37 UTC - in response to Message 71691.  

Thanks! Hmm, I would unhide my PCs, but I'm not sure where that option is. I can print the details here if it helps. Thing is, those 12-CPU tasks worked fine last year with the same settings. Since then I've updated to a new Win10 version with all new drivers on a fresh install, so if it's a driver issue, there's no telling which one. Single-CPU and GPU tasks, and other projects, run fine.
Setting it to 50% seems to help; it still stalled twice, but it's running longer. I'm upping memory to 75% while in use to match the 75% when not in use.
It seems like, once I reset BOINC, it'll run for about a task and a half before stalling. I'll try those settings on the i3 and we'll see if it changes.

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,077,776
RAC: 86,705
Message 71694 - Posted: 9 Feb 2022, 6:51:09 UTC - in response to Message 71693.  

Unhide your PCs in the Milkyway project preferences: https://milkyway.cs.rpi.edu/milkyway/prefs.php?subset=project
The setting is "Should MilkyWay@home show your computers on its web site?"

MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71723 - Posted: 10 Feb 2022, 14:42:50 UTC

Thanks, I made the PCs visible. Still having the stalling issue. I tried setting CPU usage to 50% and to 25%, and CPU time at the default 60% and at 30%. Also upped memory. I'm mostly convinced the problem is with the nbody task itself somehow, as it's happening on both computers. Also, since installing BOINC, all other tasks run fine, but the multi-CPU tasks from Milkyway have stalled from the first.

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,077,776
RAC: 86,705
Message 71727 - Posted: 10 Feb 2022, 20:09:24 UTC - in response to Message 71723.  
Last modified: 10 Feb 2022, 20:33:06 UTC

You are letting the application use all 12 of your processor's threads. This is generally not advisable, as you don't leave any threads for the host to do normal housekeeping of background tasks. Those will have to share time slices with the N-body application.

I would advise restricting the app to only 50% to 80% of the available threads. You cannot achieve this by changing your general compute preferences. You need to use an app_config.xml file to restrict the MT app to fewer CPU threads. The BOINC client configuration docs provide examples of how to do this.

https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration

<app_config>
[<app>
<name>Application_Name</name>
<max_concurrent>1</max_concurrent>
[<report_results_immediately/>]
[<fraction_done_exact/>]
<gpu_versions>
<gpu_usage>.5</gpu_usage>
<cpu_usage>.4</cpu_usage>
</gpu_versions>
</app>]
...
[<app_version>
<app_name>Application_Name</app_name>
[<plan_class>mt</plan_class>]
[<avg_ncpus>x</avg_ncpus>]
[<ngpus>x</ngpus>]
[<cmdline>--nthreads 7</cmdline>]
</app_version>]

...
[<project_max_concurrent>N</project_max_concurrent>]
[<report_results_immediately/>]
</app_config>

The --nthreads parameter is the one you need to configure to limit how many threads the N-body tasks are allowed to use.
You can also restrict the host to running only one N-body MT task at a time, which gives the PC access to the leftover threads for general housekeeping.

This is what I would try at first as an example.

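<app_config>
  <app>
    <name>milkyway_nbody</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <!-- Per the template above, plan_class, avg_ncpus and cmdline belong in app_version -->
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>6</avg_ncpus>
    <cmdline>--nthreads 6</cmdline>
  </app_version>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>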
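
For anyone who would rather generate the file than type it by hand, here is a minimal Python sketch that builds the N-body part of such a config for the machine it runs on. It follows the template from the BOINC docs quoted above, putting max_concurrent under <app> and the thread settings under <app_version>. The function name is hypothetical, and picking half of the logical CPUs is just the low end of the 50% to 80% suggestion, not a tested recommendation.

```python
import os
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, nthreads: int) -> str:
    """Return app_config.xml text limiting one mt app to nthreads threads."""
    root = ET.Element("app_config")

    # Run only one task of this app at a time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = "1"

    # Per-version settings: budget nthreads CPUs and pass the same
    # count to the application on its command line.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = "mt"
    ET.SubElement(version, "avg_ncpus").text = str(nthreads)
    ET.SubElement(version, "cmdline").text = f"--nthreads {nthreads}"

    return ET.tostring(root, encoding="unicode")


logical_cpus = os.cpu_count() or 1
nthreads = max(1, logical_cpus // 2)  # assumption: use 50% of the threads
print(build_app_config("milkyway_nbody", nthreads))
```

Saved as app_config.xml in the Milkyway project folder inside the BOINC data directory, a file like this should be picked up after Options > Read config files in BOINC Manager or after a client restart.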


MishraMirabai
Joined: 23 Jan 21
Posts: 10
Credit: 277,217
RAC: 0
Message 71754 - Posted: 14 Feb 2022, 1:13:59 UTC - in response to Message 71727.  

Thanks, I'll work on implementing that and report how it works.
Much appreciate the help; I would not have realized this needed to be done!