Welcome to MilkyWay@home

Sudden mass of WU's finishing with Computation Error



Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error
XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34112 - Posted: 2 Dec 2009, 20:04:06 UTC - in response to Message 34109.  

The only thing I can think of is that, now that the WU size has increased 'dramatically', the CUDA app is reaching the maximum time that a CUDA kernel is allowed to run and therefore crashes.

One could try dividing the 'domain size' to prevent that.
Anyway, that's just a guess without having looked at the CUDA code at all.

Might be something totally different. Only Anthony will know for sure.



But why is it not crashing on all machines? I found some GTX260 GPUs that are crunching fine.

I did another WU with only BOINC running. I disabled antivirus and all other running programs, but the result was still invalid. Next I will try a clean install of BOINC 6.10.18.

Here's the content of the WU parameter file:


de_s222_3s_best_1p_01r_41
parameters [20]: 0.849478381012573 7.954547382437530 -5.675839216148361 151.090487200445807 12.349743231700344 4.081479068461466 2.303908188777022 3.287842683092169 -1.305747455655341 169.772682049654520 24.374000713727746 6.032829617788646 2.985739186116882 6.146475181027592 -11.078468810318160 180.234612303055911 17.070581989881532 0.558447994066062 0.000000000000000 1.000000000000000
metadata: i: 94


Here's the content of the result file:

de_s222_3s_best_1p_01r_41
parameters[20]: 0.84947838101257300000, 7.95454738243753030000, -5.67583921614836130000, 151.09048720044581000000, 12.34974323170034400000, 4.08147906846146570000, 2.30390818877702190000, 3.28784268309216900000, -1.30574745565534100000, 169.77268204965452000000, 24.37400071372774600000, 6.03282961778864560000, 2.98573918611688200000, 6.14647518102759170000, -11.07846881031816000000, 180.23461230305591000000, 17.07058198988153200000, 0.55844799406606205000, 0.00000000000000000000, 1.00000000000000000000
metadata: i: 94
fitness: -1.#QNAN000000000000000
stock_win32_gpu: 0.21 double


Maybe this is of some use to you. Does fitness: -1 mean the result is invalid?
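For anyone who wants to check result files without eyeballing them, here is a minimal Python sketch. It assumes the result file contains a single `fitness:` line laid out like the one above, and that non-finite values are written the way older Microsoft C runtimes print them (`1.#QNAN`, `1.#IND`, `1.#INF`). The function names are made up for illustration and are not part of BOINC or the MilkyWay application.

```python
import math
import re

# Older Microsoft C runtimes print non-finite doubles as 1.#QNAN, 1.#SNAN,
# 1.#IND or 1.#INF, optionally signed and padded with zeros.
MSVC_NONFINITE = re.compile(r"^[+-]?1\.#(QNAN|SNAN|IND|INF)", re.IGNORECASE)


def parse_fitness(result_text):
    """Return the fitness as a float, or None if there is no fitness line."""
    for line in result_text.splitlines():
        if line.strip().startswith("fitness:"):
            token = line.split(":", 1)[1].strip()
            match = MSVC_NONFINITE.match(token)
            if match:
                if match.group(1).upper() == "INF":
                    return -math.inf if token.startswith("-") else math.inf
                return math.nan
            return float(token)
    return None


def fitness_is_usable(result_text):
    """True only if a fitness line exists and holds a finite number."""
    value = parse_fitness(result_text)
    return value is not None and math.isfinite(value)


bad_result = "metadata: i: 94\nfitness: -1.#QNAN000000000000000\n"
good_result = "metadata: i: 77\nfitness: -3.200465856044408\n"
print(fitness_is_usable(bad_result), fitness_is_usable(good_result))
```

Read this way, the leading `-1` is not a fitness of minus one: the whole token is a NaN marker, so the file holds no usable fitness value at all.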

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34113 - Posted: 2 Dec 2009, 20:41:55 UTC

OK, did a clean install of BOINC 6.10.18, rebooted machine, attached to Milkyway and crunched 1 WU. Result = invalid

Your turn, I'm runnin' out of ideas.

Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 34115 - Posted: 2 Dec 2009, 20:44:41 UTC - in response to Message 34113.  
Last modified: 2 Dec 2009, 20:45:11 UTC

OK, did a clean install of BOINC 6.10.18, rebooted machine, attached to Milkyway and crunched 1 WU. Result = invalid

Your turn, I'm runnin' out of ideas.


FWIW, try using the 190.62 drivers. I've seen the 191.07 driver crash my 8800GT on Collatz. Since I reverted to 190.62, everything has been fine again.

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34116 - Posted: 2 Dec 2009, 22:22:58 UTC - in response to Message 34109.  

The only thing I can think of is that, now that the WU size has increased 'dramatically', the CUDA app is reaching the maximum time that a CUDA kernel is allowed to run and therefore crashes.

One could try dividing the 'domain size' to prevent that.
Anyway, that's just a guess without having looked at the CUDA code at all.

Might be something totally different. Only Anthony will know for sure.




It's not the size; I have SETI units that take hours to finish on CUDA.
I had a big problem with Win 7, so I went back to Vista and it was OK. My GTX 260s are running fine, both on i7 cores running Vista SP2, driver 190.62, BOINC 6.10.18.

I seem to recall that at some point I had to uninstall BOINC, install 6.10.17, uninstall 6.10.17, then reinstall 6.10.18. I just can't remember what I was trying to fix.

Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 34118 - Posted: 2 Dec 2009, 23:00:48 UTC - in response to Message 34116.  

The only thing I can think of is that, now that the WU size has increased 'dramatically', the CUDA app is reaching the maximum time that a CUDA kernel is allowed to run and therefore crashes.

One could try dividing the 'domain size' to prevent that.
Anyway, that's just a guess without having looked at the CUDA code at all.

Might be something totally different. Only Anthony will know for sure.




It's not the size; I have SETI units that take hours to finish on CUDA.
I had a big problem with Win 7, so I went back to Vista and it was OK. My GTX 260s are running fine, both on i7 cores running Vista SP2, driver 190.62, BOINC 6.10.18.

I seem to recall that at some point I had to uninstall BOINC, install 6.10.17, uninstall 6.10.17, then reinstall 6.10.18. I just can't remember what I was trying to fix.


I'm not talking about WU size here... please read my post, especially the bold part.
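As a rough illustration of the idea quoted above (keeping each kernel launch short by dividing the domain), here is a minimal Python sketch. It is not the MilkyWay CUDA code: `split_domain`, `integrate_in_chunks` and the step counts are hypothetical, and each chunk simply stands in for one kernel launch.

```python
def split_domain(total_steps, max_steps_per_launch):
    """Yield (start, stop) ranges that together cover [0, total_steps)."""
    if max_steps_per_launch <= 0:
        raise ValueError("max_steps_per_launch must be positive")
    for start in range(0, total_steps, max_steps_per_launch):
        yield start, min(start + max_steps_per_launch, total_steps)


def integrate_in_chunks(f, total_steps, max_steps_per_launch):
    """Sum f(i) over the whole domain, one bounded chunk at a time."""
    total = 0.0
    for start, stop in split_domain(total_steps, max_steps_per_launch):
        # Stand-in for one short kernel launch over [start, stop).
        total += sum(f(i) for i in range(start, stop))
    return total


# A domain of 10 steps with at most 4 steps per launch needs 3 launches.
print(list(split_domain(10, 4)))
print(integrate_in_chunks(lambda i: 1.0, 10, 4))
```

The total amount of work is unchanged; only the length of each individual launch is bounded, which is what would matter if a per-kernel run-time limit were the cause.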

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34119 - Posted: 2 Dec 2009, 23:09:28 UTC - in response to Message 34115.  

FWIW, try using the 190.62 drivers. I've seen the 191.07 driver crash my 8800GT on Collatz. Since I reverted to 190.62, everything has been fine again.


Nope, it's not about drivers.

I have an emergency install of WinXP x86 on another partition, so I booted that one up and installed driver 190.62 and BOINC 6.10.18, but the results are still invalid.

And I did NOT install BOINC as a service because I remembered that there was an issue with CUDA errors related to service installation but it was only on Vista and Windows 7. This was due to security and services running in different sessions on those versions. That's why I'm still using XP.

Can someone please suspend network activity for one WU and look into the result file of the finished WU to see what this "fitness" thing is good for? If fitness on a valid WU is set to -1 too then it means nothing, I think.

This was my last shot for now, so I'm out of MW for today. It's a real shame, because MW is the project I bought these two GTX260 crunchers for.

Starfire
Joined: 19 Feb 09
Posts: 32
Credit: 32,843,308
RAC: 0
Message 34120 - Posted: 2 Dec 2009, 23:32:53 UTC - in response to Message 34119.  

Can someone please suspend network activity for one WU and look into the result file of the finished WU to see what this "fitness" thing is good for? If fitness on a valid WU is set to -1 too then it means nothing, I think.


The following results were calculated using Gipsel's ATI app - fitness is displaying a real number here:

de_s222_3s_best_1p_01r_44
parameters [20]: 0.726530534475273 1.000000000000000 -11.309963264750689 164.068245039415330 29.351026000962452 3.773916232705108 1.607818178975123 1.291489947920174 -17.570005048174835 158.561785668195650 18.510896578133526 5.738129192817610 3.323711933363694 11.394196875578736 2.936906508509978 162.759205639023720 0.700000000000000 1.444900940888767 0.408269541430983 14.596364746148591
metadata: i: 77
fitness: -3.200465856044408
Gipsel_GPU_CAL_0.20_x64: 0.20


de_s222_3s_best_1p_01r_44
parameters [20]: 0.968326917012557 1.000000000000000 -10.203487737339231 164.345084867324940 25.920729630235499 5.304565981764101 2.004622029391751 7.127507855253272 -16.032826229817438 154.840719713304740 11.483539537133694 3.739142191627291 3.987127183052219 10.319835656129992 1.401691284083597 162.058382692022750 12.793184654502657 0.833880293137536 0.817510569264601 16.330915579064580
metadata: i: 97
fitness: -3.231483990977578
Gipsel_GPU_CAL_0.20_x64: 0.20


de_s222_3s_best_1p_01r_44
parameters [20]: 1.000000000000000 1.055320290988556 -10.887774134384429 180.667161994741240 32.615322424742459 5.778394578605021 1.360893960164031 4.648841479380868 -15.579053670534684 159.303785370296170 19.866325091491856 3.582786323103195 3.739731946757967 12.959971223927784 -0.576617745668031 156.690938802085500 0.700000000000000 1.326891535596566 1.348254152737073 14.256789232543523
metadata: i: 98
fitness: -3.197918134189094
Gipsel_GPU_CAL_0.20_x64: 0.20

Starfire

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34122 - Posted: 3 Dec 2009, 0:30:31 UTC - in response to Message 34118.  

Sorry, my bad.

Bruce
Joined: 28 Apr 08
Posts: 1415
Credit: 2,716,428
RAC: 0
Message 34125 - Posted: 3 Dec 2009, 1:06:15 UTC - in response to Message 34122.  

@ David
Have you tried the new nVidia drivers? I'm using 195.62; I think this is the newest driver out.
;-p

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34127 - Posted: 3 Dec 2009, 1:31:34 UTC - in response to Message 34125.  

@ David
Have you tried the new nVidia drivers? I'm using 195.62; I think this is the newest driver out.
;-p


I tried and got so many VDU errors that I went back. I have only one puter on Win 7 with the 195 driver, but don't run MW on it because I can't. Struggle to run Seti as well for some reason.

Of course, it could be the eight bloody Cosmo Wu's sucking up 98% of the eight GB of ram available.

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34128 - Posted: 3 Dec 2009, 1:31:50 UTC - in response to Message 34120.  
Last modified: 3 Dec 2009, 1:37:22 UTC

The following results were calculated using Gipsel's ATI app - fitness is displaying a real number here:

fitness: -3.200465856044408
Gipsel_GPU_CAL_0.20_x64: 0.20



OK, thanks Starfire, this is something completely different. On my results, this is:

fitness: -1.#QNAN000000000000000
stock_win32_gpu: 0.21 double


I'm using the stock application like others do, and it never caused me any trouble until Tuesday the 1st of December.

OK, very last chance for tonight. I'll try the latest driver before I go to bed, even though the 191.07 is running fine for others.
CUL8R

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34130 - Posted: 3 Dec 2009, 1:42:46 UTC

XJR, from one of your WU's

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<stderr_txt>
Device index specified on the command line was 0
Looking for a Double Precision capable NVIDIA GPU
The device GeForce GTX 260 specified on the command line can be used
called boinc_finish

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0.0186216440164967
Granted credit 0
application version 0.21


It's weird; it should be okay, but it isn't. It seems you are doing everything right, and I am stumped. The app seems to validate the card and then finish the WU. Interesting bug.

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34132 - Posted: 3 Dec 2009, 2:11:26 UTC

OK folks, this was the last driver test I've done for you ;-)))

Even with the 195.62 my results are invalid.

@David: This is what's driving me nuts. Everything looks fine; only the line with the fitness value looks a bit weird. But where are the coders when you need them? There must be someone who knows what this fitness thingy is all about.

I'm out now. Have to get the cruncher up again for the other projects and then I'll have another Talisker for some sweet dreams or it will become a MW nightmare!

boinc64
Joined: 5 Oct 09
Posts: 3
Credit: 1,328,091
RAC: 0
Message 34138 - Posted: 3 Dec 2009, 8:59:28 UTC - in response to Message 34132.  

Fitness is the result of the calculation. In your case the result is -1.#QNAN, a quiet NaN (Not a Number), due to an error in the computation. For what can cause these things, see: http://en.wikipedia.org/wiki/QNaN
What the cause is in your case, I haven't got a clue (hardware?).
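
To illustrate how one undefined intermediate value is enough to turn the final fitness into NaN, here is a small Python sketch. The numbers are illustrative only and the `average` helper is hypothetical; it says nothing about where the NaN arises in the actual application.

```python
import math

# Two operations that IEEE 754 treats as invalid and that yield a quiet NaN.
nan_from_subtraction = math.inf - math.inf
nan_from_product = 0.0 * math.inf


def average(values):
    """Plain arithmetic mean; any NaN among the inputs makes the result NaN."""
    return sum(values) / len(values)


clean = average([-3.2004, -3.2315, -3.1979])
poisoned = average([-3.2004, nan_from_subtraction, -3.1979])

print(math.isnan(nan_from_subtraction), math.isnan(nan_from_product))
print(math.isfinite(clean), math.isnan(poisoned))
# NaN compares unequal to everything, including itself.
print(poisoned == poisoned)
```

Because NaN propagates through every later addition and division, a single bad term anywhere in a long sum is enough to make the whole result NaN.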

Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 34139 - Posted: 3 Dec 2009, 9:48:03 UTC

My problem is that on two systems which had been working fine I was getting a high failure rate. On the system with a single Nvidia GPU I get valid results, ATI cards (dual in system) I get valid results.

On the pair of GTX295 cards and GTX260 cards I was getting a few valid results but mostly invalid results.

I can root around some more I suppose to see if I can get more data on memory sizes, but, my question is if the cards are limited in one way or another, why did they get any valid results at all?

With "Insta-purge" (TM) turned on the results are gone so we cannot look back ... but I think there is a weakness in the CUDA application that this expanded task size brings to the fore.

Travis? Comments?

In the meantime I have those systems that were returning errors in NNT so I don't waste my time or yours ... or screw up the streams with bad values ...

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34142 - Posted: 3 Dec 2009, 12:47:11 UTC - in response to Message 34138.  

Fitness is the result of the calculation. In your case the result is -1.#QNAN, a quiet NaN (Not a Number), due to an error in the computation. For what can cause these things, see: http://en.wikipedia.org/wiki/QNaN
What the cause is in your case, I haven't got a clue (hardware?).


OK, this is a small hint at a calculation error. But a hardware problem seems very unlikely, since two GPUs would have had to fail on exactly the same day. And it's not only my GPUs that fail. Are the GTX260 cards some kind of degraded material, e.g. GTX280 chips that didn't pass QA or something? As Paul said in his last post, it's more likely that there's something wrong with the application that comes to light due to the increased WU size.

Would it be possible to reduce the WU size to e.g. twice the size of the former ones instead of four times?

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34144 - Posted: 3 Dec 2009, 13:16:25 UTC - in response to Message 34142.  

I have a machine with two GTX 295s and one GTX 260; it is running the new WUs with NO problems.

NB: None of my cards or CPUs are overclocked.

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34147 - Posted: 3 Dec 2009, 15:24:05 UTC
Last modified: 3 Dec 2009, 16:14:09 UTC

I've found an old backup with a short MW WU and it finishes successfully. Fitness is a real number, like the results posted by Starfire.

My GPUs are NOT overclocked, at least not by me.

Here are the clock settings:

Core: 576MHz
Shader: 1242MHz
Memory: 1000MHz

Clocks read by RivaTuner

OK, I've tried the following:

Core: 400MHz
Shader: 862MHz
Memory: 800MHz

And, what a surprise, the new long WUs still finish invalid.

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34149 - Posted: 3 Dec 2009, 16:32:38 UTC
Last modified: 3 Dec 2009, 16:38:04 UTC

Next I tried overclocking, thinking there might be no errors if the WU finished faster, but the results are still invalid. What a surprise.

Can someone please post the clocks of his GTX260?

@David: Can you please take a look at your result files, just to see what the fitness parameter says? Especially on your GTX260. It should be a real number like the results from Starfire. Thank you!

jotun263
Joined: 24 Aug 09
Posts: 5
Credit: 519,653
RAC: 0
Message 34150 - Posted: 3 Dec 2009, 17:29:13 UTC - in response to Message 34149.  

I have a GTX280 at standard speed rates, installed in a slightly overclocked Q9550-system with Win XP64 on it.
Boinc version: 6.10.11
NVIDIA driver: 190.62

I got the same errors as XJR-Maniac... WUs completed, but most of them marked as invalid. Fitness: -1.#QNAN... No shared memory errors or something similar. My next step is to update BOINC and the NVIDIA driver, but after reading this thread I don't believe that I will have more success than all others here. SETI and other CUDA-based apps work fine.
If anyone finds the reason please let me know...