Multi cpu tasks stalling
Author | Message |
---|---|
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
This is similar to the problem described in the "Long-Running tasks" thread, possibly the same one. I'm having a problem with multi-CPU tasks stalling out, in this case 12-CPU tasks on an i7-8850H (6 cores), but I was also experiencing a similar or the same problem on my other laptop, a 4-core i3, with 4-CPU tasks. The problem only happens with multi-CPU tasks from Milkyway@home. Basically, a task will run fine at normal speed until some random percentage, then stall. Time elapsed keeps ticking of course, but time remaining begins counting upwards. This goes on indefinitely. Fully exiting BOINC and restarting resets both elapsed and remaining time and gets the task running until it stalls again. Playing around with it, it seems that clicking the X to minimize to the system tray will sometimes cause a stall. But even if I've got it open, it WILL stall when my display turns off. It also seems to happen if it's open but just minimized. BUT stalling doesn't happen with other tasks; WCG tasks, say, run fine. I have all the "When to Suspend" options unchecked. The computer's power settings are set to turn off the display after a minute (closing the lid causes a stall just the same), but keep the hard drive active, with no sleeping. Prior to this, things ran fine for 2 days straight crunching WCG tasks. Hoping not to have to abort the whole stack. I'm relatively new to the community, first post, thanks for the support! System details: BOINC version 7.16.20 (x64); Model: HP ZBook 15 G5; Computer Type: Mobile; Manufactured by: HP; Baseboard ID: 842A; BIOS version: Q70 Ver. 01.18.01; Processor: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz; Enabled Processor Count: 12; Total Memory: 16 GB; Local Storage: 1.36 TB (3 drives); Graphics Card & Driver 1: UHD Graphics 630 -- 27.20.100.8681; Graphics Card & Driver 2: Quadro P1000 -- 472.08; Operating System: Windows 10 (64-bit) - Microsoft Windows 10 Pro; Current Culture: en-US; .NET Framework: 528372 |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 2,525 |
I have not seen one count upwards indefinitely. In my case, they did count up but eventually completed with very high cpu second counts, and all were eventually credited. As an experiment, what happens if you pick one and let it run for 18 or 20 hours? Otherwise, I think the MW@home staff will need to resolve this. Good luck! |
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
Thanks, I'll give it a shot. I'm babysitting one currently, screen on, BOINC open, and yep, stalled again, at 7.723%. No progress, 36 min elapsed, 7 hr 15 min remaining. Looks like the time remaining estimate is gaining around 15 sec for every sec elapsed. With 12 CPUs, it should only be a 2 min task. If I reset, it will revert and run at proper speed for a while. Now up to 50 min elapsed / 10 hr remaining! Last year I was running Milkyway no problem; since then I updated Windows, switched the OS to a new SSD, updated drivers, etc. I just started crunching again last week, and everything works fine except these multi-CPU tasks. I'm hoping it's something simple I can fix with settings, but the strange thing is I'm having the same problem with 4-CPU Milkyway tasks on the other laptop. I'll see about reproducing it on that one again. Cheers! |
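The figures in the post above are consistent with a simple linear extrapolation of the remaining time from the fraction done. The sketch below is only an illustration under that assumption: the formula `remaining = elapsed * (1 - f) / f` and the function names are hypothetical, not BOINC's actual estimator, and it assumes the 7.723 figure is the percent done at the stall.

```python
def estimated_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Linear extrapolation (an assumption): remaining = elapsed * (1 - f) / f."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1.0 - fraction_done) / fraction_done


def growth_per_second(fraction_done: float) -> float:
    """Seconds added to the estimate per elapsed second while progress is frozen."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return (1.0 - fraction_done) / fraction_done


if __name__ == "__main__":
    f = 0.07723  # fraction done at the stall, taken from the post above
    for minutes in (36, 50):
        remaining_h = estimated_remaining(minutes * 60, f) / 3600
        print(f"{minutes} min elapsed -> about {remaining_h:.1f} h remaining")
    print(f"estimate grows about {growth_per_second(f):.0f} s per elapsed second")
```

With the fraction done frozen at 7.723%, this gives roughly 7.2 h remaining at 36 min and 10.0 h at 50 min, close to the reported 7 hr 15 min and 10 hr. If that reading is right, the climbing estimate is just what a task that has stopped making progress would look like, rather than a separate problem.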
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
Well, looking around I see it is a de_nbody problem that comes and goes; other users have reported the same thing. Changed power settings to "High Performance" and it seems to be working OK. The i3 is also crunching away. Now to leave it all day. Sheepish about a possibly superfluous post, but live and learn! |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 2,525 |
This one might be a case in point. It was originally estimated to be a 20 minute task, but at 20 minutes the estimate had ballooned to well over 16 hours. I think I will let it run. My only fear is running it for that long only to have it fail validation. Application: Milkyway@home N-Body Simulation 1.82 (mt); Name: de_nbody_08_31_2021_v176_40k__data__12_1640968180_337678; State: Running; Received: 2/2/2022 10:29:04 AM; Report deadline: 2/14/2022 10:29:05 AM; Resources: 4 CPUs; Estimated computation size: 10,431 GFLOPs; CPU time: 02:31:59; CPU time since checkpoint: 00:03:29; Elapsed time: 00:42:22; Estimated time remaining: 15:23:01; Fraction done: 4.390%; Virtual memory size: 14.70 MB; Working set size: 18.68 MB; Directory: slots/1; Process ID: 6528; Progress rate: 6.480% per hour; Executable: milkyway_nbody_1.82_windows_x86_64__mt.exe |
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
After running for a while, I'm having the same problem on both the 12-CPU i7 and the 4-CPU i3. I took a screenshot of a task which, after 2 hr 40 min, had ballooned to an estimated completion time of 290 days. So it looks like, once stalled, a task will remain stalled indefinitely. Fully restarting BOINC resets it and allows the task to run, sometimes long enough to finish. But it will always stall again after a few minutes. Tasks that finish seem to report and validate just fine; it's just taking multiple restarts to finish a task, and these are supposed to take only 3 min, or 15 min or so for the 4-CPU ones. I thought it was related to the display shutting off or minimizing to the system tray, but even with BOINC kept open and running, it stalls eventually just the same. Next I'm going to suspend all the multi-CPU tasks and see if the single-CPU ones process. The last update only gave me about 8 singles and 20+ multi-CPU tasks; on the i3 laptop, I only received multi-CPU tasks. It looks to me like Milkyway tasks are broken until this is resolved. |
Send message Joined: 8 May 09 Posts: 3321 Credit: 520,614,626 RAC: 31,037 |
"After running for a while, I'm having the same problem on both the 12-CPU i7 and the 4-CPU i3." How many multi-CPU tasks are you trying to run at one time? You may need to check the properties of those tasks: in BOINC Manager, go to Tasks, pick a running one, and then click Properties on the left. If each task is using, e.g., 10 GB of memory and your PC only has 16 GB, and you are trying to do something else that needs memory and CPU too, you could be running out of memory, and the task may not be designed to handle that situation. |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 2,525 |
290 days!?!?!! Ouch! that'll leave a mark! |
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
"How many multi-CPU tasks are you trying to run at one time? You may need to check the properties of those tasks: in BOINC Manager, go to Tasks, pick a running one, and then click Properties on the left. If each task is using, e.g., 10 GB of memory and your PC only has 16 GB, and you are trying to do something else that needs memory and CPU too, you could be running out of memory, and the task may not be designed to handle that situation." At a time, just one. I do have 16 GB RAM on both laptops; Properties says the working set size is about 15 MB, however. Here are the copied properties of a task stalled at 39.614%. Task Manager doesn't show excessive RAM use, only 40% when the task is running. It does say power usage is high on the nbody process, but only when it's running. Once it stalls, it still says it's "running", but CPU usage in Task Manager is nil. Application: Milkyway@home N-Body Simulation 1.82 (mt); Name: de_nbody_08_31_2021_v176_40k__data__13_1640968180_2498241; State: Running; Received: 2/2/2022 2:46:28 AM; Report deadline: 2/14/2022 2:46:20 AM; Resources: 12 CPUs; Estimated computation size: 12,267 GFLOPs; CPU time: 00:15:16; CPU time since checkpoint: 00:00:05; Elapsed time: 00:05:46; Estimated time remaining: 00:08:47; Fraction done: 39.615%; Virtual memory size: 11.59 MB; Working set size: 15.11 MB; Directory: slots/0; Process ID: 14440; Progress rate: 7.470% per minute; Executable: milkyway_nbody_1.82_windows_x86_64__mt.exe |
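One rough check on properties like these is to divide CPU time by elapsed time, which gives the average number of CPUs the task actually kept busy. The sketch below is a hypothetical diagnostic, not a BOINC feature; the helper names are made up for illustration, and it assumes both timers cover the same stretch of the run.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def average_busy_cpus(cpu_time: str, elapsed: str) -> float:
    """Average number of CPUs kept busy: total CPU time divided by elapsed time."""
    elapsed_s = hms_to_seconds(elapsed)
    if elapsed_s == 0:
        raise ValueError("elapsed time must be non-zero")
    return hms_to_seconds(cpu_time) / elapsed_s


if __name__ == "__main__":
    # Values copied from the task properties quoted in this thread.
    samples = [
        ("12-CPU task", "00:15:16", "00:05:46", 12),
        ("4-CPU task", "02:31:59", "00:42:22", 4),
    ]
    for label, cpu, elapsed, ncpus in samples:
        busy = average_busy_cpus(cpu, elapsed)
        print(f"{label}: {busy:.2f} of {ncpus} CPUs busy on average ({busy / ncpus:.0%})")
```

By this measure the 12-CPU task above averaged well under 3 busy CPUs, while the 4-CPU task quoted earlier in the thread averaged about 3.6 of 4. That fits with a task that sits idle once stalled, though the numbers alone do not say why.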
Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,896,054 RAC: 323 |
I think the WUs are OK. As a test I just did 23 and they ran OK. I run an i7 with 16 GB. I allocated 6 CPUs for the N-body simulation runs. Each run took between 3 and 5 minutes elapsed time. They all completed OK and are now awaiting validation. Only one task at a time was running; GPU tasks (Einstein) were also running. |
Send message Joined: 8 May 09 Posts: 3321 Credit: 520,614,626 RAC: 31,037 |
"How many multi-CPU tasks are you trying to run at one time? You may need to check the properties of those tasks: in BOINC Manager, go to Tasks, pick a running one, and then click Properties on the left. If each task is using, e.g., 10 GB of memory and your PC only has 16 GB, and you are trying to do something else that needs memory and CPU too, you could be running out of memory, and the task may not be designed to handle that situation." Windows Task Manager is notorious for being very, very bad at recognizing what's actually running and how much memory it takes; for example, it doesn't even show that Windows itself takes almost 1 GB. But the listing above says MilkyWay is using only 15 MB, not GB, of memory, so that shouldn't be the problem. For the next batch of tasks, try changing the CPU cores available to BOINC to only 50% of the total. You can do this in BOINC Manager by clicking Options on the top line above your tasks, then Computing Preferences, and changing the top box to 50%. I think your problem could be that you are using every CPU core, both real and HT, and the PC is just bogging down because it can't handle all the swapping the tasks do in cached memory. Your PCs are hidden so I can't be positive, but that's what it seems like to me. |
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
"Windows Task Manager is notorious for being very, very bad at recognizing what's actually running and how much memory it takes; for example, it doesn't even show that Windows itself takes almost 1 GB. But the listing above says MilkyWay is using only 15 MB, not GB, of memory, so that shouldn't be the problem. For the next batch of tasks, try changing the CPU cores available to BOINC to only 50% of the total. You can do this in BOINC Manager by clicking Options on the top line above your tasks, then Computing Preferences, and changing the top box to 50%. I think your problem could be that you are using every CPU core, both real and HT, and the PC is just bogging down because it can't handle all the swapping the tasks do in cached memory. Your PCs are hidden so I can't be positive, but that's what it seems like to me." Thanks! Hmm, I would unhide my PCs, but I'm not sure where that option is. I can print the details here if it helps. Thing is, those 12-CPU tasks worked fine last year, no problem, with the same settings. Since then I've updated to a new Win10 version with all new drivers on a fresh install, so if it's a driver issue, there's no telling which one. Singles, GPU tasks, and other projects run fine. Setting to 50% seems to help; it still stalled twice, but it's running longer. I'm upping memory to 75% while in use to match the 75% when not in use. It seems like, once I reset BOINC, it'll run for about a task and a half before stalling. I'll try those settings on the i3 and we'll see if it changes. |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 544,067,692 RAC: 125,261 |
Unhide your PCs in the MilkyWay project preferences: https://milkyway.cs.rpi.edu/milkyway/prefs.php?subset=project The setting is "Should MilkyWay@home show your computers on its web site?" |
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
Thanks, I made my PCs visible. Still having the stalling issue. I tried setting CPU usage to 50% and to 25%, and CPU time to the default 60% and to 30%. Also upped memory. I'm mostly convinced the problem is with the nbody task itself somehow, as it's happening on both computers. Also, since installing BOINC, all other tasks run fine, but the multi-CPU tasks from Milkyway have stalled from the first. |
Send message Joined: 24 Jan 11 Posts: 708 Credit: 544,067,692 RAC: 125,261 |
You are letting the application use all 12 of your processor's threads. This is generally not advisable, as it leaves no threads free for the host's normal housekeeping and background tasks, which then have to share time slices with the N-body application. I would advise restricting the app to 50% to 80% of the available threads. You cannot achieve this by changing your general computing preferences. You need an app_config.xml file to restrict the MT app to fewer CPU threads. The BOINC client configuration docs provide examples of how to do this: https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration <app_config> [<app> <name>Application_Name</name> <max_concurrent>1</max_concurrent> [<report_results_immediately/>] [<fraction_done_exact/>] <gpu_versions> <gpu_usage>.5</gpu_usage> <cpu_usage>.4</cpu_usage> </gpu_versions> </app>] ... [<app_version> <app_name>Application_Name</app_name> [<plan_class>mt</plan_class>] [<avg_ncpus>x</avg_ncpus>] [<ngpus>x</ngpus>] [<cmdline>--nthreads 7</cmdline>] </app_version>] ... [<project_max_concurrent>N</project_max_concurrent>] [<report_results_immediately/>] </app_config> The --nthreads parameter is the one you need to configure to limit how many threads the N-body tasks are allowed to use. You can also restrict the host to running only one N-body MT task at a time, to give the PC access to the leftover threads for general housekeeping. This is what I would try first as an example: <app_config> <app> <name>milkyway_nbody</name> <max_concurrent>1</max_concurrent> </app> <app_version> <app_name>milkyway_nbody</app_name> <plan_class>mt</plan_class> <avg_ncpus>6</avg_ncpus> <cmdline>--nthreads 6</cmdline> </app_version> <app> <name>milkyway</name> <gpu_versions> <gpu_usage>1.0</gpu_usage> <cpu_usage>1.0</cpu_usage> </gpu_versions> </app> </app_config> |
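For anyone who would rather generate the file than type it by hand, here is a minimal Python sketch that builds an app_config.xml along the lines of the BOINC docs template quoted above, with the thread-limiting tags inside `<app_version>`. The app name `milkyway_nbody` and the thread count of 6 come from the post; the function name `build_app_config` is hypothetical, and where the file must be saved depends on your install, so treat this as an illustration and not a tested recipe.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, nthreads: int, max_concurrent: int = 1) -> str:
    """Return app_config.xml text that limits a multithreaded (mt) app to nthreads."""
    root = ET.Element("app_config")

    # <app>: caps how many tasks of this application run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: per-version settings for the "mt" plan class.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = "mt"
    ET.SubElement(version, "avg_ncpus").text = str(nthreads)
    ET.SubElement(version, "cmdline").text = f"--nthreads {nthreads}"

    ET.indent(root)  # pretty-print the tree; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # App name and thread count taken from the post above; adjust for your host.
    print(build_app_config("milkyway_nbody", nthreads=6))
```

The printed text would be saved as app_config.xml in the MilkyWay project folder under the BOINC data directory. BOINC should pick it up after a client restart or after re-reading the config files from the Manager's Options menu, and the event log is the place to check for any complaints about unrecognized tags.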
Send message Joined: 23 Jan 21 Posts: 10 Credit: 277,217 RAC: 0 |
Thanks, I'll work on implementing that and report how it works. Much appreciate the help; I would not have realized this needed to be done! |
©2024 Astroinformatics Group