Welcome to MilkyWay@home

Invalid WUs

Message boards : Number crunching : Invalid WUs
Profile SETI.USA Cluster

Joined: 19 Dec 07
Posts: 4
Credit: 19,943,618
RAC: 0
Message 33655 - Posted: 24 Nov 2009, 17:52:53 UTC

I am getting about a 10% invalid rate on one of my HD 4850s, but cannot figure out why. The stderr out text looks exactly the same as for a valid WU. Here is what it looks like:
Name	de_s222_3s_best_4p_05r_22_596170_1259083707_0
Workunit	597171
Created	24 Nov 2009 17:28:30 UTC
Sent	24 Nov 2009 17:29:11 UTC
Received	24 Nov 2009 17:35:17 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	118608
Report deadline	27 Nov 2009 17:29:11 UTC
Run time	52.78125
stderr out	

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home ATI GPU application version 0.20b (Win32, x87, CAL 1.4) by Gipsel
instructed by BOINC client to use device 1
CPU: Pentium(R) Dual-Core  CPU      E5200  @ 2.50GHz (2 cores/threads) 2.63307 GHz (409ms)

CAL Runtime: 1.4.255
Found 2 CAL devices

Device 0: ATI Radeon HD4700/4800 (RV740/RV770) 512 MB local RAM (remote 64 MB cached + 256 MB uncached)
GPU core clock: 700 MHz, memory clock: 993 MHz
800 shader units organized in 10 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

Device 1: ATI Radeon HD4700/4800 (RV740/RV770) 512 MB local RAM (remote 64 MB cached + 256 MB uncached)
GPU core clock: 700 MHz, memory clock: 993 MHz
800 shader units organized in 10 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

Starting WU on GPU 1

main integral, 320 iterations
predicted runtime per iteration is 150 ms (33.3333 ms are allowed), dividing each iteration in 5 parts
borders of the domains at 0 320 640 960 1280 1600
Calculated about 8.22242e+012 floatingpoint ops on GPU, 1.23583e+008 on FPU. Approximate GPU time 52.7813 seconds.

probability calculation (stars)
Calculated about 3.34818e+009 floatingpoint ops on FPU.

WU completed.
CPU time: 6.625 seconds,  GPU time: 52.7813 seconds,  wall clock time: 56.508 seconds,  CPU frequency: 2.63308 GHz

</stderr_txt>
]]>

Validate state	Invalid
Claimed credit	0.242683513700349
Granted credit	0
application version	0.20

Does anyone out there see anything I am missing? I can see no reason for it to be invalid.
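For anyone reading the log above, the split it reports ("dividing each iteration in 5 parts", borders at 0 to 1600) can be reproduced with a little arithmetic. This is only a sketch of what the numbers imply, not the application's actual code: the helper name, the round-up rule, and the evenly spaced borders are assumptions read off the stderr text.

```python
import math

def split_iteration(predicted_ms, allowed_ms, domain_size):
    """Return (parts, borders) for cutting one iteration into equal chunks.

    Hypothetical helper: assumes predicted/allowed is rounded up to a whole
    number of parts and that domain_size divides evenly by that number.
    """
    parts = math.ceil(predicted_ms / allowed_ms)
    step = domain_size // parts
    borders = [i * step for i in range(parts + 1)]
    return parts, borders

# 150 ms predicted, 33.3333 ms allowed (taken here as 1000/30), range 0..1600
parts, borders = split_iteration(150.0, 1000.0 / 30.0, 1600)
print(parts, borders)

# Rough sustained rate implied by the log: GPU ops divided by GPU time.
gflops = 8.22242e12 / 52.7813 / 1e9
print(round(gflops, 1))
```

The second figure is just the reported operation count over the reported GPU time, so it is an average over the whole WU rather than a peak number.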
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 33661 - Posted: 24 Nov 2009, 19:17:29 UTC - in response to Message 33655.  

Gonna see what's up.
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 33663 - Posted: 24 Nov 2009, 19:21:48 UTC - in response to Message 33662.  

Going through your results it looks like the ones that have an error are returning NaN or some very weird result (as in an impossible result). There might be some kind of hardware issue.

Is your GPU overclocked?
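To picture the kind of check that catches these results, here is a minimal sketch. It assumes the returned result is a single floating-point value and that "impossible" means non-finite or outside an expected range. The bounds and the function name are made up for illustration; the project's real validator may work quite differently.

```python
import math

def looks_sane(value, lower=-1e3, upper=0.0):
    """Reject NaN, infinities, and values outside a plausible range.

    lower/upper are hypothetical bounds, not the project's real limits.
    """
    if math.isnan(value) or math.isinf(value):
        return False
    return lower <= value <= upper

print(looks_sane(-3.12))          # finite and inside the assumed range
print(looks_sane(float("nan")))   # NaN, as seen in the bad results
print(looks_sane(1e300))          # finite but far outside the range
```

A marginal overclock tends to produce exactly this pattern: the task finishes with exit status 0 and a normal-looking stderr, but the number it returns is garbage.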
Profile SETI.USA Cluster

Joined: 19 Dec 07
Posts: 4
Credit: 19,943,618
RAC: 0
Message 33665 - Posted: 24 Nov 2009, 19:30:14 UTC - in response to Message 33663.  

Yes Travis. They were both mildly overclocked. I have set them both back to factory default. Let's see what happens now. Thanks.
Profile SETI.USA Cluster

Joined: 19 Dec 07
Posts: 4
Credit: 19,943,618
RAC: 0
Message 33673 - Posted: 24 Nov 2009, 22:20:01 UTC

Well, that fixed it, I think. No invalids for over 2 hours now. Strange that they were both overclocked equally but only one was having the problem. Oh well, full steam ahead again.
Profile David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33686 - Posted: 25 Nov 2009, 1:11:01 UTC

OK, CUDA working fine, but ATI showing Computation error for all.

Windows coming up with: astronomy_0.20b_ATI_amd.exe has stopped working

I click on the Close program button, then it moves to the next one and errors that as well, as below:

25/11/2009 1:54:31 p.m. Milkyway@home Computation for task de_s222_3s_best_4p_05r_25_689985_1259092066_0 finished
25/11/2009 1:54:31 p.m. Milkyway@home Output file de_s222_3s_best_4p_05r_25_689985_1259092066_0_0 for task de_s222_3s_best_4p_05r_25_689985_1259092066_0 absent
25/11/2009 1:54:33 p.m. Milkyway@home Computation for task de_s222_3s_best_4p_05r_25_689986_1259092066_0 finished
25/11/2009 1:54:33 p.m. Milkyway@home Output file de_s222_3s_best_4p_05r_25_689986_1259092066_0_0 for task de_s222_3s_best_4p_05r_25_689986_1259092066_0 absent
25/11/2009 1:54:34 p.m. Milkyway@home Computation for task de_s222_3s_best_4p_05r_25_689993_1259092066_0 finished
25/11/2009 1:54:34 p.m. Milkyway@home Output file de_s222_3s_best_4p_05r_25_689993_1259092066_0_0 for task de_s222_3s_best_4p_05r_25_689993_1259092066_0 absent
25/11/2009 1:54:34 p.m. Milkyway@home Computation for task de_s222_3s_best_4p_05r_25_689987_1259092066_0 finished
25/11/2009 1:54:34 p.m. Milkyway@home Output file de_s222_3s_best_4p_05r_25_689987_1259092066_0_0 for task de_s222_3s_best_4p_05r_25_689987_1259092066_0 absent

My start up config:
24/11/2009 10:44:58 p.m. Starting BOINC client version 6.10.17 for windows_x86_64
24/11/2009 10:44:58 p.m. Config: use all coprocessors
24/11/2009 10:44:58 p.m. log flags: file_xfer, sched_ops, task
24/11/2009 10:44:58 p.m. Libraries: libcurl/7.19.4 OpenSSL/0.9.8k zlib/1.2.3
24/11/2009 10:44:58 p.m. Data directory: C:\Users\Public\boinc data
24/11/2009 10:44:58 p.m. Running under account David
24/11/2009 10:44:58 p.m. Processor: 4 AuthenticAMD AMD Phenom(tm) II X4 965 Processor [AMD64 Family 16 Model 4 Stepping 2]
24/11/2009 10:44:58 p.m. Processor: 512.00 KB cache
24/11/2009 10:44:58 p.m. Processor features: fpu tsc pae nx sse sse2 pni
24/11/2009 10:44:58 p.m. OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
24/11/2009 10:44:58 p.m. Memory: 8.00 GB physical, 16.05 GB virtual
24/11/2009 10:44:58 p.m. Disk: 498.05 GB total, 386.85 GB free
24/11/2009 10:44:58 p.m. Local time is UTC +13 hours
24/11/2009 10:44:58 p.m. ATI GPU 0: ATI Radeon HD 4700/4800 (RV740/RV770) (CAL version 1.4.427, 1024MB, 1000 GFLOPS peak)
24/11/2009 10:44:58 p.m. ATI GPU 1: ATI Radeon HD 4700/4800 (RV740/RV770) (CAL version 1.4.427, 1024MB, 1000 GFLOPS peak)
24/11/2009 10:44:58 p.m. Milkyway@home Found app_info.xml; using anonymous platform

It was running Collatz while the project was down, and still is without problems.

I will try installing 6.10.18 and see if that makes a difference.
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 33692 - Posted: 25 Nov 2009, 20:23:19 UTC - in response to Message 33673.  

Well that fixed it I think. No invalids for over 2 hours now. Strange that they were both overclocked equally but only one was having the problem. Oh well, full steam ahead again.



Well, processor manufacturing isn't an exact science, so some processors overclock better than others :)
Profile kashi

Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 33742 - Posted: 26 Nov 2009, 2:47:15 UTC - in response to Message 33686.  
Last modified: 26 Nov 2009, 2:49:44 UTC

OK, Cuda working fine, but ATI showing Computation error for all.

Windows coming up with: astronomy_0.20b_ATI_amd.exe has stopped working

I click on the close program button then it moves to the next one and errors that as well. as below

24/11/2009 10:44:58 p.m. OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
24/11/2009 10:44:58 p.m. ATI GPU 1: ATI Radeon HD 4700/4800 (RV740/RV770) (CAL version 1.4.427, 1024MB, 1000 GFLOPS peak)
24/11/2009 10:44:58 p.m. Milkyway@home Found app_info.xml; using anonymous platform


Not sure, but it looks like you have the incorrect MilkyWay application version variant installed. I think you require the 64-bit one for Catalyst drivers 9.2 and above. This application variant contains "astronomy_0.20b_ATI_x64_ati.exe". You will need to change all 3 files to the correct version, not just the application itself.

Details and explanations on which application variant is the correct one for your operating system and Catalyst driver version are contained in the readme file included with the download.

If you are swapping graphics cards between different computers with different Windows operating systems (32-bit and 64-bit) and different Catalyst driver versions installed you have to ensure that you are using the correct MilkyWay application variant to match the operating system and Catalyst driver version that is installed on that box.
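One way to confirm that all the files of a variant are actually in place is to compare what app_info.xml names against what is in the project directory. The sketch below assumes the usual anonymous-platform layout, with `<file_info><name>` and `<file_ref><file_name>` entries. The sample XML and the DLL name `example_runtime.dll` are invented for the demo and are not the real contents of the download.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def missing_app_files(app_info_path):
    """List files named in app_info.xml that are absent from its directory."""
    project_dir = os.path.dirname(os.path.abspath(app_info_path))
    root = ET.parse(app_info_path).getroot()
    wanted = set()
    for path in (".//file_info/name", ".//file_ref/file_name"):
        for el in root.findall(path):
            if el.text and el.text.strip():
                wanted.add(el.text.strip())
    return sorted(n for n in wanted
                  if not os.path.isfile(os.path.join(project_dir, n)))

# Made-up sample for illustration only; a real app_info.xml will differ.
SAMPLE = """<app_info>
  <app><name>milkyway</name></app>
  <file_info><name>astronomy_0.20b_ATI_x64_ati.exe</name><executable/></file_info>
  <file_info><name>example_runtime.dll</name><executable/></file_info>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>20</version_num>
    <file_ref><file_name>astronomy_0.20b_ATI_x64_ati.exe</file_name><main_program/></file_ref>
    <file_ref><file_name>example_runtime.dll</file_name></file_ref>
  </app_version>
</app_info>
"""

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "app_info.xml"), "w") as fh:
        fh.write(SAMPLE)
    # Create only the executable, so the DLL is reported as missing.
    open(os.path.join(tmp, "astronomy_0.20b_ATI_x64_ati.exe"), "w").close()
    missing = missing_app_files(os.path.join(tmp, "app_info.xml"))

print(missing)
```

Pointing `missing_app_files` at the real app_info.xml in the MilkyWay project folder should give an empty list when everything the file refers to is present. It only checks that the files exist, not that they are the right variant for the OS and driver.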
Profile JockMacMad TSBT
Joined: 28 Jan 09
Posts: 31
Credit: 85,934,108
RAC: 0
Message 33752 - Posted: 26 Nov 2009, 10:58:16 UTC

Has something changed with the workunits?

A machine that has been running well is now erroring out:-

de_s222_3s_best_4p_05r_21_433001_1259232804_0
Workunit 1245689
Created 26 Nov 2009 10:53:26 UTC
Sent 26 Nov 2009 10:54:10 UTC
Received 26 Nov 2009 10:55:03 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741515 (0xffffffffc0000135)
Computer ID 87699
Report deadline 29 Nov 2009 10:54:10 UTC
Run time 0
stderr out <core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
- exit code -1073741515 (0xc0000135)
</message>
]]>
Validate state Invalid
Claimed credit 0
Granted credit 0
application version 0.20

Every unit I receive is like this. What is exit status -1073741515?
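On the exit status itself: the negative number is the same 32-bit value as the hex code in the message block, just printed as a signed integer. A small sketch of the conversion follows; the helper name and the one-entry lookup table are for illustration only. 0xC0000135 is the Windows status code STATUS_DLL_NOT_FOUND, which would suggest the application could not load a DLL it needs, though that is an inference from the code rather than something the log states.

```python
def to_ntstatus_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return "0x{:08X}".format(exit_code & 0xFFFFFFFF)

# Deliberately tiny lookup table; only the code from this thread is listed.
KNOWN = {
    "0xC0000135": "STATUS_DLL_NOT_FOUND (a required DLL could not be loaded)",
}

code = to_ntstatus_hex(-1073741515)
print(code, KNOWN.get(code, "unknown"))
```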
Profile David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33754 - Posted: 26 Nov 2009, 11:42:33 UTC - in response to Message 33742.  

Thanks.

All running again. It seems that when I changed CAL versions it took out the AMD*.dll files as well.

FYI: the 5970 is showing a 25% speed improvement over the 4850. Not really good value, still we need to try.

Cheers.
Profile JockMacMad TSBT
Joined: 28 Jan 09
Posts: 31
Credit: 85,934,108
RAC: 0
Message 33755 - Posted: 26 Nov 2009, 12:40:35 UTC
Last modified: 26 Nov 2009, 12:41:18 UTC

Don't tell me that I have two in the post.

In fact, that has prompted me to go write down the questions about what would be the best 5970 app_info setup in the Optimized App thread. I was wondering what coproc value I need to utilize all those shaders, i.e. if it's a 25% speed increase per unit and I can run double the units concurrently, then that's a lot more than a 25% performance boost. Here's hoping :) Otherwise I will sell the 5970s and keep the 4850 X2s running if 0.5 is still the best answer :)
Profile kashi

Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 33759 - Posted: 26 Nov 2009, 14:51:41 UTC - in response to Message 33754.  

Thanks.

All running again, seems that when I changed Cal versions it took out the AMD*.dll files as well.

FYI: the 5970 is showing a 25% speed improvement over the 4850. Not really good value, still we need to try.

Cheers.


Your 5970s seem to be throttling on the second core. GPU 0 and 2 are taking about 30 seconds which is about right for 1600 shaders at 725 MHz, but GPU 1 and 3 are taking about 50 seconds which is slower than it should be.

I would be experimenting with trying to reduce the load by only running one concurrent task, and/or increasing the w value from your current setting of 1.1. Perhaps you could also experiment with the f parameter to try and stop the throttling. Even if it slows down your fast GPU 0 and 2 cores a bit your total throughput should be greater and there will be less strain on your cards if you can reduce the load and stop the throttling on GPU core 1 and 3.

Have you reduced the memory speed for MilkyWay? Currently, with the hot weather, I am running my 4890 on MilkyWay with w1.2 to keep the temperature under 90°C. I always run MilkyWay with the memory set at 500 MHz. If I am running Collatz with memory at 1,000 MHz and switch back to MilkyWay but forget to reduce the memory to 500 MHz, then the temperature quickly rises to over 100°C, which is too hot for my liking.
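To put rough numbers on the throughput point, here is a quick sketch. It assumes one task at a time per GPU core and uses the approximate 30 s and 50 s times mentioned above. The 35 s "balanced" figure is purely hypothetical, just to show how a small slowdown on the fast cores can still come out ahead.

```python
def tasks_per_hour(seconds_per_task):
    """Total tasks per hour across GPU cores, one task at a time per core."""
    return sum(3600.0 / s for s in seconds_per_task)

throttled = tasks_per_hour([30, 50, 30, 50])  # GPU 0..3 as described above
balanced = tasks_per_hour([35, 35, 35, 35])   # hypothetical: 35 s each after tuning
ideal = tasks_per_hour([30, 30, 30, 30])      # no throttling at all

print(round(throttled), round(balanced), round(ideal))
```

With these assumed times, the balanced setup beats the throttled one even though GPU 0 and 2 got slower.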
Profile silent Float

Joined: 3 Jul 09
Posts: 1
Credit: 22,557,885
RAC: 0
Message 34034 - Posted: 30 Nov 2009, 22:33:11 UTC

New WUs, about 4x the size they were before.

5 WUs executed and all "Completed, marked as invalid".

System: WinXP, GTX285, driver 190.62, BOINC 6.6.36, app version 0.21 (cuda23)

No problem until these new WUs.
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 34055 - Posted: 1 Dec 2009, 8:35:54 UTC

Not sure what my actual error rate is, but, I have over 100 on my system with GTX 295 cards ... there are almost as many that have validated. I know it is possible it is my hardware, though I rate that low for the simple reason that Collatz and GPU Grid also run on these cards and I don't see this kind of error rate. I also did not see this kind of error rate prior to the extension of run times.

I think you should check some of these that are coming back as invalid to see if there might be an application / validator issue ...
Profile David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34059 - Posted: 1 Dec 2009, 11:27:54 UTC - in response to Message 34055.  

Not sure what my actual error rate is, but, I have over 100 on my system with GTX 295 cards ... there are almost as many that have validated. I know it is possible it is my hardware, though I rate that low for the simple reason that Collatz and GPU Grid also run on these cards and I don't see this kind of error rate. I also did not see this kind of error rate prior to the extension of run times.

I think you should check some of these that are coming back as invalid to see if there might be an application / validator issue ...


My GTX 295's and GTX 260's are all ok with the new WU's, as is the 4850 and the 5970's. One happy cruncher here.
Oh yeah, stats. BOINC 6.10.18, App 20b, cuda 190.62, all on Vista.
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 34096 - Posted: 2 Dec 2009, 9:28:54 UTC

Well, my ATI cards have no problems with the tasks it seems. So the question is why do many if not most of them fail on CUDA cards? I have a hard time believing that 4 cards suddenly went bad...


©2024 Astroinformatics Group