Welcome to MilkyWay@home

Hanging Work Units

Message boards : Number crunching : Hanging Work Units

Profile KWSN imcrazynow
Joined: 22 Nov 08
Posts: 136
Credit: 319,414,799
RAC: 0
Message 25850 - Posted: 17 Jun 2009, 22:53:43 UTC

I've had another work unit hang badly. Note the GPU time and the wall clock time. This is for task:
ps_sgr_218F5_3s_wtest_564836_1245261734_0

Work Unit ID:81999922



<core_client_version>6.4.7</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home ATI GPU application version 0.19f by Gipsel
CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (4 cores/threads) 3.54596 GHz (289ms)

CAL Runtime: 1.3.145
Found 1 CAL device

Device 0: ATI Radeon HD 4800 (RV770) 1024 MB local RAM (remote 28 MB cached + 1024 MB uncached)
GPU core clock: 750 MHz, memory clock: 900 MHz
800 shader units organized in 10 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

3 WUs already running on GPU 0
No free GPU! Waiting ... 119.516 seconds.
Starting WU on GPU 0

main integral, 320 iterations
predicted runtime per iteration is 191 ms (33.3333 ms are allowed), dividing each iteration in 6 parts
borders of the domains at 0 272 536 800 1072 1336 1600
Calculated about 9.89542e+012 floatingpoint ops on GPU, 1.23583e+008 on FPU. Approximate GPU time 16130 seconds.

probability calculation (stars)
Calculated about 3.05993e+009 floatingpoint ops on FPU.

WU completed.
CPU time: 8.85938 seconds, GPU time: 16130 seconds, wall clock time: 16389.6 seconds, CPU frequency: 3.546 GHz

I'm running the PrimeGrid Challenge on 3 cores with 1 core reserved for MW.
Does anybody have any idea what may be going on here? This is the second hanging WU I've caught this week. It causes MW to stop processing on the GPU until it completes. Apparently it completes successfully, as I'm given credit for it. Not much for over 16K seconds, but it does validate and get credit.
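To put a number on the hang, the figures in the stderr output quoted above can be compared directly. The following is a minimal sketch, assuming the predicted per-iteration runtime is a fair estimate for the main integral; the function name `summarize` and the dictionary keys are made up for illustration and are not part of the MilkyWay application or BOINC.

```python
import re

# Lines copied from the stderr output quoted above.
LOG = """\
main integral, 320 iterations
predicted runtime per iteration is 191 ms (33.3333 ms are allowed), dividing each iteration in 6 parts
WU completed.
CPU time: 8.85938 seconds, GPU time: 16130 seconds, wall clock time: 16389.6 seconds, CPU frequency: 3.546 GHz
"""


def summarize(log: str) -> dict:
    """Compare the predicted main-integral runtime with the reported times.

    Hypothetical helper: the regular expressions only match the wording
    seen in the log above and may not fit other application versions.
    """
    iterations = int(re.search(r"main integral, (\d+) iterations", log).group(1))
    per_iter_ms = float(
        re.search(r"predicted runtime per iteration is ([\d.]+) ms", log).group(1)
    )
    gpu_s = float(re.search(r"GPU time: ([\d.]+) seconds", log).group(1))
    wall_s = float(re.search(r"wall clock time: ([\d.]+) seconds", log).group(1))

    # Predicted time for the whole main integral, in seconds.
    predicted_s = iterations * per_iter_ms / 1000.0
    return {
        "iterations": iterations,
        "predicted_s": predicted_s,
        "gpu_s": gpu_s,
        "wall_s": wall_s,
        # How many times longer the reported GPU time is than the prediction.
        "slowdown": gpu_s / predicted_s,
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

With these numbers the prediction for the main integral is about 61 seconds, while the reported GPU time is 16130 seconds, a factor of roughly 260. The reported GPU time is also almost equal to the wall clock time, which suggests (but does not prove) that the counter kept running while the task was not actually computing.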

4870 GPU
Profile kashi

Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 25858 - Posted: 18 Jun 2009, 0:40:06 UTC - in response to Message 25850.  

Perhaps the CPU core you allocated to MilkyWay switched to PrimeGrid before the MilkyWay task completed. The MilkyWay status would show as "Waiting to run" until BOINC calculated that the accumulated debt had been repaid. Then, after about 4.5 hours, it switched back and the task completed.

It seems to me that if you run with an <avg_ncpus> value of less than 1, the estimated "To completion" time for MilkyWay tasks may be incorrect, and thus a scheduling debt could build up. This is just a possibility; others with more experience and knowledge may be able to advise you better.

I now run only one MilkyWay task at a time with an <avg_ncpus> value of 1 and have had few problems. I had some trouble with this configuration once after stopping and restarting BOINC, so I now have a cc_config.xml file with <zero_debts>1</zero_debts> (works with BOINC 6.6.11 and above). I also download 1.5 days of Einstein work at a time and then set Einstein to "No new tasks" so that MilkyWay will continue to download new work. This is working for me with BOINC 6.6.31, an HD 3850, a Xeon W3520 and Einstein as my other project. Your HD 48xx, Q6600 and PrimeGrid may require a different configuration.
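For anyone who wants to try the same setting, here is a minimal sketch that writes out a cc_config.xml containing the <zero_debts> option mentioned above. It assumes the option belongs inside an <options> element under <cc_config>, which is the usual layout for BOINC client options but should be checked against the documentation for your client version. The function name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(zero_debts: bool = True) -> str:
    """Return the text of a cc_config.xml that sets <zero_debts>.

    Assumption: <zero_debts> sits inside <options> under <cc_config>.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "zero_debts").text = "1" if zero_debts else "0"

    # Pretty-print when the running Python supports it (3.9 and later).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the file contents; save them as cc_config.xml in the BOINC
    # data directory and restart the client to apply them.
    print(build_cc_config())
```

The script only prints the text, so nothing is overwritten by accident. If you already have a cc_config.xml, merge the <zero_debts> line into the existing <options> section instead of replacing the file.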
Profile Conan
Joined: 2 Jan 08
Posts: 122
Credit: 69,480,206
RAC: 1,380
Message 26007 - Posted: 19 Jun 2009, 12:21:32 UTC

I have also caught a WU that hung on my GPU:

WU completed.
CPU time: 7.875 seconds, GPU time: 25267.7 seconds, wall clock time: 25561 seconds,

Mine appears to have been caused by BOINC allocating too many resources to the other projects that I run, which did not leave enough CPU power to run the GPU, so it did not do anything for about 7 hours.

It had started and said it was running, but all that was happening was that the time to completion kept ticking over.

As soon as I suspended a couple of WUs Milkyway started and completed all work it had in the queue, including the 7 hour WU (got a mighty 10 cr/h for that one).

Now, unfortunately, the long-term debt (LTD) says Milkyway owes time to other projects, so I can't get any new work. I might have to micromanage BOINC for a while. I'm running Einstein, Docking, AQUA and Ralph at the moment.

Conan.

