1)
Message boards :
Number crunching :
Stalled computation
(Message 71949)
Posted 14 Mar 2022 by d_a_dempsey Post: Are you guys seeing stalls on both N-Body AND Separation? I'm only getting them on N-Body, and a lot of them. No issues with Separation. David |
2)
Message boards :
Number crunching :
Stalled computation
(Message 71936)
Posted 12 Mar 2022 by d_a_dempsey Post: Hello, Ed, What event log options do you have selected? I'm not seeing any messages when mine stall. |
3)
Message boards :
Number crunching :
Stalled computation
(Message 71916)
Posted 10 Mar 2022 by d_a_dempsey Post: You're not alone. I'm having this happen on 4 separate computers, but each time it is an N-Body simulation. Number of cores/age of chip doesn't seem to matter. Very frustrating to find that your crunching has been held hostage by these units for hours. I don't babysit my computers and shouldn't have to. David |
4)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70386)
Posted 18 Jan 2021 by d_a_dempsey Post: Glad you figured it out. Just wanted to comment: you may have further issues with that many exclusive apps defined that interrupt or prevent crunching. It does look like a lot, but they're different subprograms of a single application, Dungeons & Dragons Online, and you don't get two running at the same time; the overlap with crunching time is typically just 10pm-11pm. Midnight if I'm having a good game and no work in the morning. :) The game doesn't need a lot of CPU, but it certainly uses the 1080 TI. All crunching and no play would be--boring. |
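For reference, the exclusive apps discussed above are declared in BOINC's cc_config.xml. A minimal sketch, assuming the executable names shown in the event log elsewhere in this thread (only two of the five are listed here for brevity):

```xml
<cc_config>
  <options>
    <!-- Suspend GPU work while the game client is running -->
    <exclusive_gpu_app>dndclient64.exe</exclusive_gpu_app>
    <exclusive_gpu_app>DNDLauncher.exe</exclusive_gpu_app>
  </options>
</cc_config>
```

BOINC re-reads this file via Options > Read config files, or on client restart.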
5)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70384)
Posted 17 Jan 2021 by d_a_dempsey Post: Installing nvidia driver 460.79 has fixed the problem for me, tried 461.09 again to double check and there is definitely something with the new driver that is breaking milkyway, or at least the current work units. That was it! Sometime between 1/7 and 1/12 I must have updated the Nvidia driver, possibly the evening before it patched, so that the issue coincided with the patch/reboot. Thank you!! I am happily crunching through a backlog of MW@H WUs on this computer. |
6)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70378)
Posted 16 Jan 2021 by d_a_dempsey Post: Here's some of the output from the rr_simulation option while both WUs appear to be stalled. The task ending in 169_1 is on my GTX 980 and the task ending in 344_1 is running on my GTX 1080 TI.

1/16/2021 1:02:08 PM | | Re-reading cc_config.xml
1/16/2021 1:02:08 PM | | Config: don't use GPUs while dndclient.exe is running
1/16/2021 1:02:08 PM | | Config: don't use GPUs while dndclient_awesomium.exe is running
1/16/2021 1:02:08 PM | | Config: don't use GPUs while dndclient64.exe is running
1/16/2021 1:02:08 PM | | Config: don't use GPUs while DNDLauncher.exe is running
1/16/2021 1:02:08 PM | | Config: don't use GPUs while turbineclientlauncher.exe is running
1/16/2021 1:02:08 PM | | Config: use all coprocessors
1/16/2021 1:02:08 PM | | log flags: file_xfer, sched_ops, task
1/16/2021 1:02:08 PM | Milkyway@Home | Found app_config.xml
1/16/2021 1:26:36 PM | | Re-reading cc_config.xml
1/16/2021 1:26:36 PM | | Config: don't use GPUs while dndclient.exe is running
1/16/2021 1:26:36 PM | | Config: don't use GPUs while dndclient_awesomium.exe is running
1/16/2021 1:26:36 PM | | Config: don't use GPUs while dndclient64.exe is running
1/16/2021 1:26:36 PM | | Config: don't use GPUs while DNDLauncher.exe is running
1/16/2021 1:26:36 PM | | Config: don't use GPUs while turbineclientlauncher.exe is running
1/16/2021 1:26:36 PM | | Config: use all coprocessors
1/16/2021 1:26:36 PM | | log flags: file_xfer, sched_ops, task, rr_simulation
1/16/2021 1:26:36 PM | Milkyway@Home | Found app_config.xml
1/16/2021 1:26:37 PM | | [rr_sim] doing sim: CPU sched
1/16/2021 1:26:37 PM | | [rr_sim] start: work_buf min 8640 additional 86400 total 95040 on_frac 0.961 active_frac 0.565
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 0.02: de_modfit_84_bundle4_4s_south4s_bgset_4_1603804501_61530169_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (3.25G/135.60G)
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 74.99: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10168.93G/135.60G)
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 310.73: de_modfit_81_bundle4_4s_south4s_bgset_4_1603804501_61576619_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42133.60G/135.60G)
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 385.71: de_modfit_83_bundle4_4s_south4s_bgset_4_1603804501_61559680_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42134.20G/135.60G)
<snip>
1/16/2021 1:26:37 PM | | [rr_sim] doing sim: work fetch
1/16/2021 1:26:37 PM | | [rr_sim] already did at this time
1/16/2021 1:27:37 PM | | [rr_sim] doing sim: CPU sched
1/16/2021 1:27:37 PM | | [rr_sim] start: work_buf min 8640 additional 86400 total 95040 on_frac 0.961 active_frac 0.565
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 0.02: de_modfit_84_bundle4_4s_south4s_bgset_4_1603804501_61530169_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (2.30G/135.61G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 77.15: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10462.83G/135.61G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 310.71: de_modfit_81_bundle4_4s_south4s_bgset_4_1603804501_61576619_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42133.60G/135.61G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 387.84: de_modfit_83_bundle4_4s_south4s_bgset_4_1603804501_61559680_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42134.20G/135.61G)
<snip>
1/16/2021 1:27:37 PM | | [rr_sim] doing sim: work fetch
1/16/2021 1:27:37 PM | | [rr_sim] already did at this time
1/16/2021 1:28:38 PM | | [rr_sim] doing sim: CPU sched
1/16/2021 1:28:38 PM | | [rr_sim] start: work_buf min 8640 additional 86400 total 95040 on_frac 0.961 active_frac 0.565
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 0.01: de_modfit_84_bundle4_4s_south4s_bgset_4_1603804501_61530169_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (1.62G/135.62G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 79.32: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10757.05G/135.62G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 310.68: de_modfit_81_bundle4_4s_south4s_bgset_4_1603804501_61576619_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42133.60G/135.62G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 389.99: de_modfit_83_bundle4_4s_south4s_bgset_4_1603804501_61559680_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42134.20G/135.62G)
<snip>
1/16/2021 1:28:38 PM | | [rr_sim] doing sim: work fetch
1/16/2021 1:28:38 PM | | [rr_sim] already did at this time
1/16/2021 1:28:53 PM | | Re-reading cc_config.xml
1/16/2021 1:28:53 PM | | Config: don't use GPUs while dndclient.exe is running
1/16/2021 1:28:53 PM | | Config: don't use GPUs while dndclient_awesomium.exe is running
1/16/2021 1:28:53 PM | | Config: don't use GPUs while dndclient64.exe is running
1/16/2021 1:28:53 PM | | Config: don't use GPUs while DNDLauncher.exe is running
1/16/2021 1:28:53 PM | | Config: don't use GPUs while turbineclientlauncher.exe is running
1/16/2021 1:28:53 PM | | Config: use all coprocessors
1/16/2021 1:28:53 PM | | log flags: file_xfer, sched_ops, task
1/16/2021 1:28:53 PM | Milkyway@Home | Found app_config.xml |
7)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70377)
Posted 16 Jan 2021 by d_a_dempsey Post: While the problem is happening, change your logging preferences to set rr_simulation, leave it on through at least one cycle of task completion/reporting/request for work, and see what it says. BOINC will tell you in the output why it won't start new work. I will try that next. Here's what's been tried so far.
|
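The rr_simulation flag referred to above lives in cc_config.xml alongside the other log flags. A minimal sketch, keeping the three default flags that appear in the event log output in this thread:

```xml
<cc_config>
  <log_flags>
    <!-- defaults -->
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
    <task>1</task>
    <!-- extra round-robin scheduler-simulation detail -->
    <rr_simulation>1</rr_simulation>
  </log_flags>
</cc_config>
```

After editing, have the client re-read the file (Options > Read config files in BOINC Manager) rather than restarting it, so the stalled tasks stay in their current state.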
8)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70372)
Posted 15 Jan 2021 by d_a_dempsey Post: Yes, I am aware; I was just getting better information to BOINC for scheduling. Current app_config.xml files:

<app_config>
  <app>
    <name>milkyway</name>
    <max_concurrent>2</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

For WCG:

<app_config>
  <project_max_concurrent>7</project_max_concurrent>
</app_config>

With "use at most 84% of CPUs" and the above settings I have 9 tasks: 2 MW (1 CPU + 1 NVIDIA GPU) and 7 WCG packets against 10 of 12 total threads. Two threads are reserved for the OS and me. The problem still occurs as I have described. After stop/suspend and reboot, it will complete the packets it was working on and stall on the next set. Sometimes it makes it through 2 additional packets. CPU starvation is an interesting thought, but I'm not sure why it would show up 6 days into the project when it had been running side-by-side with WCG. I suspect it's related to the patch and forced reboot, but I don't have the skill to track down whether it's a corrupted config file or other weirdness. |
9)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70365)
Posted 15 Jan 2021 by d_a_dempsey Post: Setting it to 84% helped a little. One of the MW@H tasks started working better, along with a 9th WCG task. Clearly, BOINC is having trouble doing math with fractional CPU usages, e.g. 0.985. Would it be better to add an app_config.xml for MW@H and set <gpu_versions><cpu_usage>1</cpu_usage></gpu_versions> so that BOINC can allocate resources better, or do an app_config.xml for WCG and try to limit it to 8 CPUs? Note: I only put the relevant XML pieces in this post; I know there's more to it. :) |
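Spelled out as complete files, the two alternatives asked about above would look roughly like this. This is a sketch: the app name milkyway matches the project's GPU app as used elsewhere in the thread, and the limit of 8 for WCG is the illustrative value from the question. Each app_config.xml goes in the respective project's directory under the BOINC data folder.

```xml
<!-- projects/milkyway.cs.rpi.edu_milkyway/app_config.xml -->
<app_config>
  <app>
    <name>milkyway</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>  <!-- reserve a whole CPU thread per GPU task -->
    </gpu_versions>
  </app>
</app_config>

<!-- WCG project directory's app_config.xml -->
<app_config>
  <project_max_concurrent>8</project_max_concurrent>
</app_config>
```

The path in the first comment is an assumption about the project directory name; check your own BOINC data folder for the actual one.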
10)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70364)
Posted 15 Jan 2021 by d_a_dempsey Post: I don't see anything wrong with most of your tasks. Just a few outliers that appear to be starved of CPU support and thus have double the runtimes and CPU times. I'm not sure whether the WCG CPU tasks are stealing too much from MW@H.
|
11)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70362)
Posted 15 Jan 2021 by d_a_dempsey Post: Should be available soon, I adjusted preferences. Didn't know you could do that. I have 2 computers, an ancient HP with an NVIDIA GTX 660. That one's not having problems (of course!). The one having problems is a Dell Alienware Area 51 R2, i7-582K @3.3GHz with an NVIDIA GTX 980 and an NVIDIA GTX 1080 TI. Computing preferences are "use at most 80% of CPUs" and "use at most 85% of CPU time". I have not adjusted config files to run more than one task per GPU. I'm very new to this project but a 600M credit cruncher on GPUgrid. They're empty, so I came here, and everything was great until yesterday. CPU-wise, I'm crunching for WCG too, but my problems are specifically with the GPU WUs. Hope this helps, David |
12)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70354)
Posted 14 Jan 2021 by d_a_dempsey Post:
I reinstalled the Nvidia driver, rebooted PC. The two stalled tasks completed almost immediately, and promptly stalled at 50% on the next 2 it started. Just sits there with Elapsed Time and Time Remaining merrily incrementing away. :( |
13)
Message boards :
Number crunching :
Work units not completing or are stalled
(Message 70348)
Posted 13 Jan 2021 by d_a_dempsey Post: After Windows rebooted last night from an update, my MW tasks either a) reach 100% but never complete, continuing to accrue time, or b) sit at 25% or 50%, accruing time and increasing the estimated time to finish at the same rate. Before this, my 2 GPUs were completing WUs in 2 minutes and 4 minutes. Rebooting the PC or a full power off/power on seems to clear it for one WU, and then the problem above returns. Nvidia driver current: 461.09 (01/07/2021). BOINC client current: 7.6.11 (x64). CPU-based WUs for WCG are working fine. Any thoughts or suggestions? David |
14)
Message boards :
News :
New Milkyway Badges Online
(Message 70336)
Posted 11 Jan 2021 by d_a_dempsey Post: Thank you for this. David |
©2024 Astroinformatics Group