Welcome to MilkyWay@home

Posts by d_a_dempsey

1) Message boards : Number crunching : Stalled computation (Message 71949)
Posted 14 Mar 2022 by d_a_dempsey
Post:
Are you guys seeing stalls on both n body AND separation?


I'm only getting them on N-Body, and a lot of them. No issues with Separation.

David
2) Message boards : Number crunching : Stalled computation (Message 71936)
Posted 12 Mar 2022 by d_a_dempsey
Post:
Hello,

Lately on my laptop, active work units chug along fine with decreasing time remaining and then apparently go idle: usage on each of the eight cores drops to a few percent, while elapsed time and time to complete both keep increasing. I have to exit and restart BOINC to get usage back to normal.

More recently I get the following message in the event log.
"3/8/2022 12:52:02 PM | Milkyway@Home | Task de_nbody_08_31_2021_v176_40k__data__11_1645561443_646216_1 postponed for 600 seconds: Waiting to acquire slot directory lock. Another instance may be running."

According to Task Manager, I see only one instance.

I'd appreciate any thoughts on what I'm doing wrong.

Thank you,

Ed Machak


Ed,

What event log options do you have selected? I'm not seeing any messages when mine stall.
3) Message boards : Number crunching : Stalled computation (Message 71916)
Posted 10 Mar 2022 by d_a_dempsey
Post:
You're not alone. I'm having this happen on 4 separate computers, but each time it is an N-Body simulation. Number of cores/age of chip doesn't seem to matter.
Very frustrating to find that your crunching has been held hostage by these units for hours. I don't babysit my computers and shouldn't have to.

David
4) Message boards : Number crunching : Work units not completing or are stalled (Message 70386)
Posted 18 Jan 2021 by d_a_dempsey
Post:
Glad you figured it out. Just wanted to comment, you may have future or further issues with that many exclusive apps defined that intrude or prevent crunching.

It does look like a lot, but they're different subprograms of a single application, Dungeons & Dragons Online, and you don't get 2 running at the same time. The overlap with crunching time is typically just 10pm-11pm, or midnight if I'm having a good game and have no work in the morning. :)
Game doesn't need a lot of CPU, but it certainly uses the 1080 TI. All crunching and no play would be--boring.
5) Message boards : Number crunching : Work units not completing or are stalled (Message 70384)
Posted 17 Jan 2021 by d_a_dempsey
Post:
Installing nvidia driver 460.79 has fixed the problem for me, tried 461.09 again to double check and there is definitely something with the new driver that is breaking milkyway, or at least the current work units.

To find older nvidia drivers, just fill in your gpu details here and grab 460.79 to install. Shouldn't need to do a clean install either, express worked fine for me.

(edit) I know system restore should have reverted your driver to the previous version, but I still think it is worth trying to install the older driver from nvidia. I'm 100% sure it was the new driver that caused wu's to hang/crunch forever without completing.


That was it! Sometime between 1/7 and 1/12 I must have updated the Nvidia driver, possibly the evening before it patched, so that the issue coincided with the patch/reboot.

Thank you!! I am happily crunching through a backlog of MW@H WUs on this computer.
6) Message boards : Number crunching : Work units not completing or are stalled (Message 70378)
Posted 16 Jan 2021 by d_a_dempsey
Post:
Here's some of the output from the rr_simulation option while both WUs appear to be stalled.

Task ending 169_1 is running on my GTX 980 and task ending 344_1 is running on my GTX 1080 TI.
1/16/2021 1:02:08 PM |  | Re-reading cc_config.xml
1/16/2021 1:02:08 PM |  | Config: don't use GPUs while dndclient.exe is running
1/16/2021 1:02:08 PM |  | Config: don't use GPUs while dndclient_awesomium.exe is running
1/16/2021 1:02:08 PM |  | Config: don't use GPUs while dndclient64.exe is running
1/16/2021 1:02:08 PM |  | Config: don't use GPUs while DNDLauncher.exe is running
1/16/2021 1:02:08 PM |  | Config: don't use GPUs while turbineclientlauncher.exe is running
1/16/2021 1:02:08 PM |  | Config: use all coprocessors
1/16/2021 1:02:08 PM |  | log flags: file_xfer, sched_ops, task
1/16/2021 1:02:08 PM | Milkyway@Home | Found app_config.xml
1/16/2021 1:26:36 PM |  | Re-reading cc_config.xml
1/16/2021 1:26:36 PM |  | Config: don't use GPUs while dndclient.exe is running
1/16/2021 1:26:36 PM |  | Config: don't use GPUs while dndclient_awesomium.exe is running
1/16/2021 1:26:36 PM |  | Config: don't use GPUs while dndclient64.exe is running
1/16/2021 1:26:36 PM |  | Config: don't use GPUs while DNDLauncher.exe is running
1/16/2021 1:26:36 PM |  | Config: don't use GPUs while turbineclientlauncher.exe is running
1/16/2021 1:26:36 PM |  | Config: use all coprocessors
1/16/2021 1:26:36 PM |  | log flags: file_xfer, sched_ops, task, rr_simulation
1/16/2021 1:26:36 PM | Milkyway@Home | Found app_config.xml
1/16/2021 1:26:37 PM |  | [rr_sim] doing sim: CPU sched
1/16/2021 1:26:37 PM |  | [rr_sim] start: work_buf min 8640 additional 86400 total 95040 on_frac 0.961 active_frac 0.565
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 0.02: de_modfit_84_bundle4_4s_south4s_bgset_4_1603804501_61530169_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (3.25G/135.60G)
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 74.99: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10168.93G/135.60G)
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 310.73: de_modfit_81_bundle4_4s_south4s_bgset_4_1603804501_61576619_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42133.60G/135.60G)
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 385.71: de_modfit_83_bundle4_4s_south4s_bgset_4_1603804501_61559680_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42134.20G/135.60G)
<snip>
1/16/2021 1:26:37 PM |  | [rr_sim] doing sim: work fetch
1/16/2021 1:26:37 PM |  | [rr_sim] already did at this time
1/16/2021 1:27:37 PM |  | [rr_sim] doing sim: CPU sched
1/16/2021 1:27:37 PM |  | [rr_sim] start: work_buf min 8640 additional 86400 total 95040 on_frac 0.961 active_frac 0.565
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 0.02: de_modfit_84_bundle4_4s_south4s_bgset_4_1603804501_61530169_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (2.30G/135.61G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 77.15: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10462.83G/135.61G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 310.71: de_modfit_81_bundle4_4s_south4s_bgset_4_1603804501_61576619_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42133.60G/135.61G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 387.84: de_modfit_83_bundle4_4s_south4s_bgset_4_1603804501_61559680_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42134.20G/135.61G)
<snip>
1/16/2021 1:27:37 PM |  | [rr_sim] doing sim: work fetch
1/16/2021 1:27:37 PM |  | [rr_sim] already did at this time
1/16/2021 1:28:38 PM |  | [rr_sim] doing sim: CPU sched
1/16/2021 1:28:38 PM |  | [rr_sim] start: work_buf min 8640 additional 86400 total 95040 on_frac 0.961 active_frac 0.565
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 0.01: de_modfit_84_bundle4_4s_south4s_bgset_4_1603804501_61530169_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (1.62G/135.62G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 79.32: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10757.05G/135.62G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 310.68: de_modfit_81_bundle4_4s_south4s_bgset_4_1603804501_61576619_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42133.60G/135.62G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 389.99: de_modfit_83_bundle4_4s_south4s_bgset_4_1603804501_61559680_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (42134.20G/135.62G)
<snip>
1/16/2021 1:28:38 PM |  | [rr_sim] doing sim: work fetch
1/16/2021 1:28:38 PM |  | [rr_sim] already did at this time
1/16/2021 1:28:53 PM |  | Re-reading cc_config.xml
1/16/2021 1:28:53 PM |  | Config: don't use GPUs while dndclient.exe is running
1/16/2021 1:28:53 PM |  | Config: don't use GPUs while dndclient_awesomium.exe is running
1/16/2021 1:28:53 PM |  | Config: don't use GPUs while dndclient64.exe is running
1/16/2021 1:28:53 PM |  | Config: don't use GPUs while DNDLauncher.exe is running
1/16/2021 1:28:53 PM |  | Config: don't use GPUs while turbineclientlauncher.exe is running
1/16/2021 1:28:53 PM |  | Config: use all coprocessors
1/16/2021 1:28:53 PM |  | log flags: file_xfer, sched_ops, task
1/16/2021 1:28:53 PM | Milkyway@Home | Found app_config.xml
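
For anyone who wants to pull the numbers out of [rr_sim] lines like the ones above, here is a minimal Python sketch. The regular expression and the field names (`eta`, `left`, `rate`) are my own labels, not BOINC terminology. I am assuming the first figure in the last parentheses is estimated work remaining and the second is the processing rate, because dividing one by the other reproduces the leading number in each line (10168.93 / 135.60 is about 74.99).

```python
import re

# Three lines copied from the log above: the same task (ending 344_1)
# as reported by three consecutive rr_sim passes, one minute apart.
LOG = """\
1/16/2021 1:26:37 PM | Milkyway@Home | [rr_sim] 74.99: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10168.93G/135.60G)
1/16/2021 1:27:37 PM | Milkyway@Home | [rr_sim] 77.15: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10462.83G/135.61G)
1/16/2021 1:28:38 PM | Milkyway@Home | [rr_sim] 79.32: de_modfit_85_bundle4_4s_south4s_bgset_4_1603804501_61520344_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (10757.05G/135.62G)
"""

# Field names are hypothetical labels chosen for this sketch.
PATTERN = re.compile(
    r"\[rr_sim\]\s+(?P<eta>[\d.]+): (?P<task>\S+) finishes .*"
    r"\((?P<left>[\d.]+)G/(?P<rate>[\d.]+)G\)\s*$"
)


def parse_rr_sim(text):
    """Return (task, eta, left, rate) tuples for each 'finishes' line."""
    rows = []
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            rows.append(
                (
                    m.group("task"),
                    float(m.group("eta")),
                    float(m.group("left")),
                    float(m.group("rate")),
                )
            )
    return rows


rows = parse_rr_sim(LOG)
for task, eta, left, rate in rows:
    # left / rate should land close to the eta printed in the log line.
    print(f"...{task[-5:]}  eta={eta:7.2f}  left={left:9.2f}G  left/rate={left / rate:7.2f}")
```

Read this way, the remaining-work figure for the task ending 344_1 grows from one pass to the next instead of shrinking, which matches what the stall looks like in the BOINC Manager.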
7) Message boards : Number crunching : Work units not completing or are stalled (Message 70377)
Posted 16 Jan 2021 by d_a_dempsey
Post:
While the problem is happening, change your logging preferences to set rr_simulation through at least one cycle of task completion/reporting/request for work and see what it says. BOINC will tell you why it won't start new work in the output.

Turn it off after you get the information as it is especially wordy and the log entries get out of hand.


I will try that next.

Here's what's been tried so far.


  1. Reinstalled video driver and rebooted
  2. Changed "Use at most X% of CPUs" from 80 to 84 to avoid getting 9.6 CPUs when 10 are desired
  3. Created app_config.xml for MW@H, specifying a full CPU instead of the project default of 0.9xx
  4. Created app_config.xml for WCG to run no more than 7 tasks
  5. Restored to the system savepoint from before patching on 1/12
  6. Reinstalled video driver and rebooted



At no point have I been able to process more than 4 packets (2 per GPU) before the problem shows up. Typically it is 0 or 2 before the problem.

8) Message boards : Number crunching : Work units not completing or are stalled (Message 70372)
Posted 15 Jan 2021 by d_a_dempsey
Post:
Yes, I am aware. I was just getting better information to BOINC for scheduling.

Current app_config.xml files.
<app_config>
    <app>
        <name>milkyway</name>
        <max_concurrent>2</max_concurrent>
        <gpu_versions>
            <gpu_usage>1</gpu_usage>
            <cpu_usage>1</cpu_usage>
        </gpu_versions>
    </app>
</app_config>


For WCG:
<app_config>
<project_max_concurrent>7</project_max_concurrent>
</app_config>


With "Use at most 84% of CPUs" and the above settings, I have 9 tasks: 2 MW (1 CPU + 1 NVIDIA GPU each) and 7 WCG packets, against 10 of 12 total threads. Two threads are reserved for the OS and me. The problem still occurs as I have described. After a stop/suspend and reboot, it will complete the packets it was working on and then stall on the next set. Sometimes it makes it through 2 additional packets.

CPU starvation is an interesting thought, but I'm not sure why it would show up 6 days into the project, when it had been running side-by-side with WCG. I suspect it's related to the patch and forced reboot, but I don't have the skill to track down whether it's a corrupted config file or some other weirdness.
9) Message boards : Number crunching : Work units not completing or are stalled (Message 70365)
Posted 15 Jan 2021 by d_a_dempsey
Post:
Setting it to 84% helped a little. One of the MW@H tasks started working better, and a 9th WCG task started. It looks like BOINC is having trouble doing math with fractional CPU usages, e.g. 0.985.

Would it be better to add an app_config.xml for MW@H and set <gpu_versions><cpu_usage>1</cpu_usage></gpu_versions> so that BOINC can allot resources better, or do an app_config.xml for WCG and try to limit it to 8 CPUs?
Note: I only put relevant XML pieces in statement, I know there's more to it. :)
10) Message boards : Number crunching : Work units not completing or are stalled (Message 70364)
Posted 15 Jan 2021 by d_a_dempsey
Post:
I don't see anything wrong with most of your tasks. Just a few outliers that appear to be starved of enough cpu support and thus have double the runtimes and cputimes.

Why not run at 100% cpu usage?

Save a few threads for machine housekeeping and keep them free of BOINC loading.

Your use at most 80% of your cpu threads is fine.

I suspect your WCG cpu tasks are stealing too much support from your gpu tasks occasionally.


I'm not sure that the WCG cpu tasks are stealing too much from MW@H.

  1. Work units are similar to GPUgrid in that they need 0.xx of a CPU and 1 GPU; GPUgrid has not had this problem
  2. I did not have this problem for the first week that I ran MW@H
  3. 80% of 12 threads is 9.6, and WCG is only running 8 tasks


I'll try upping it to 84% so that it comes out to slightly more than 10.
I run 85% as I like to use my machine for extended periods, and need more than 2 threads.
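
To make the percentage arithmetic above concrete, here is a minimal Python sketch. It assumes BOINC truncates the fractional CPU count down to a whole number with a minimum of 1; that is my understanding rather than something confirmed in this thread. `usable_cpus` is a made-up helper name, not a BOINC function.

```python
def usable_cpus(total_threads: int, max_pct: int) -> int:
    """Whole CPUs available under a 'use at most N% of CPUs' setting.

    Assumption: the fractional result is truncated, never rounded up,
    and at least one CPU is always allowed.
    """
    return max(1, total_threads * max_pct // 100)


# The 12-thread machine discussed in this thread, at three settings.
for pct in (80, 84, 85):
    exact = 12 * pct / 100
    print(f"{pct}% of 12 threads = {exact:.2f} -> {usable_cpus(12, pct)} usable")
```

Under that assumption, 80% gives 9 usable threads, while 84% and 85% both give 10.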

11) Message boards : Number crunching : Work units not completing or are stalled (Message 70362)
Posted 15 Jan 2021 by d_a_dempsey
Post:
Should be available soon, I adjusted preferences. Didn't know you could do that.

I have 2 computers. One is an ancient HP with an NVIDIA GTX 660. That one's not having problems (of course!).
The one having problems is a Dell Alienware Area 51 R2, i7-5820K @3.3GHz with an NVIDIA GTX 980 and an NVIDIA GTX 1080 TI.
Computing preferences are Use at most 80% of CPUs, Use at most 85% of CPU time.

I have not adjusted config files to run more than one task per GPU. I'm very new to this project but a 600M credit cruncher on GPUgrid. They're out of work, so I came here, and everything was great until yesterday. CPU-wise, I'm crunching for WCG, too, but my problems are specifically with the GPU WUs.

Hope this helps,

David
12) Message boards : Number crunching : Work units not completing or are stalled (Message 70354)
Posted 14 Jan 2021 by d_a_dempsey
Post:


I would remove and/or reload the gpu drivers as MS has a VERY bad habit of replacing the Nvidia ones with their own ones that seem similar but don't work well or at all for crunching


I reinstalled the Nvidia driver, rebooted PC. The two stalled tasks completed almost immediately, and promptly stalled at 50% on the next 2 it started. Just sits there with Elapsed Time and Time Remaining merrily incrementing away. :(
13) Message boards : Number crunching : Work units not completing or are stalled (Message 70348)
Posted 13 Jan 2021 by d_a_dempsey
Post:
After Windows rebooted last night from an update, my MW tasks are either a) reaching 100% and not completing, continuing to accrue time, or b) sitting at 25% or 50%, accruing time while the estimated time to finish increases at the same rate.

Before this, my 2 GPUs were completing WUs in 2 minutes and 4 minutes. Rebooting the PC or powering it off and on seems to work for a WU, and then it goes back to the problem above.

Nvidia driver current: 461.09 01/07/2021
BOINC Client current: 7.6.11 (x64)

CPU-based WU for WCG working fine.

Any thoughts or suggestions?

David
14) Message boards : News : New Milkyway Badges Online (Message 70336)
Posted 11 Jan 2021 by d_a_dempsey
Post:
Thank you for this.

David



