Sudden mass of WU's finishing with Computation Error
Author | Message |
---|---|
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
The only thing I can think of is that, now that the WU size has increased 'dramatically', the CUDA app is reaching the maximum time a CUDA kernel is allowed to run and therefore crashes. But why is it not crashing on all machines? I found some GTX260 GPUs that are crunching fine. I ran another WU with only BOINC running, with antivirus and all other programs disabled, but the result was still invalid. Next I will try a clean install of BOINC 6.10.18. Here's the content of the WU parameter file: de_s222_3s_best_1p_01r_41 parameters [20]: 0.849478381012573 7.954547382437530 -5.675839216148361 151.090487200445807 12.349743231700344 4.081479068461466 2.303908188777022 3.287842683092169 -1.305747455655341 169.772682049654520 24.374000713727746 6.032829617788646 2.985739186116882 6.146475181027592 -11.078468810318160 180.234612303055911 17.070581989881532 0.558447994066062 0.000000000000000 1.000000000000000 metadata: i: 94 Here's the content of the result file: de_s222_3s_best_1p_01r_41 parameters[20]: 0.84947838101257300000, 7.95454738243753030000, -5.67583921614836130000, 151.09048720044581000000, 12.34974323170034400000, 4.08147906846146570000, 2.30390818877702190000, 3.28784268309216900000, -1.30574745565534100000, 169.77268204965452000000, 24.37400071372774600000, 6.03282961778864560000, 2.98573918611688200000, 6.14647518102759170000, -11.07846881031816000000, 180.23461230305591000000, 17.07058198988153200000, 0.55844799406606205000, 0.00000000000000000000, 1.00000000000000000000 metadata: i: 94 fitness: -1.#QNAN000000000000000 stock_win32_gpu: 0.21 double Maybe this is of some use to you. Does fitness: -1 mean the result is invalid? |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
OK, I did a clean install of BOINC 6.10.18, rebooted the machine, attached to Milkyway and crunched 1 WU. Result = invalid. Your turn, I'm runnin' out of ideas. |
Send message Joined: 17 Feb 08 Posts: 363 Credit: 258,227,990 RAC: 0 |
OK, did a clean install of BOINC 6.10.18, rebooted machine, attached to Milkyway and crunched 1 WU. Result = invalid FWIW, try using the 190.62 drivers. I've seen the 191.07 driver crash my 8800GT on Collatz. Since I reverted to 190.62, everything has been fine again. |
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
The only thing i can think of is that now that the WU size increased 'damatically', is that the cuda app is reaching the maximum time that a cuda kernel is allowed to run and therefore crashes. It's not the size; I have SETI units that take hours to finish on CUDA. I had a big problem with Win 7, so I went back to Vista and it was OK. My GTX 260s are running fine, both on i7 machines running Vista SP2, driver 190.62 and BOINC 6.10.18. I seem to recall that at some point I had to uninstall BOINC, install 6.10.17, uninstall 6.10.17, and then reinstall 6.10.18. I just can't remember what I was trying to fix. |
Send message Joined: 17 Feb 08 Posts: 363 Credit: 258,227,990 RAC: 0 |
The only thing i can think of is that now that the WU size increased 'damatically', is that the cuda app is reaching the maximum time that a cuda kernel is allowed to run and therefore crashes. I'm not talking about WU size here... please read my post, especially the bold part. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
FWIW, try using the 190.62 drivers. I've seen the 191.07 driver crash my 8800GT on Collatz. Since i reverted back to 190.62 everything was fine again. Nope, it's not about drivers. I have an emergency install of WinXP x86 on another partition, so I booted that one up and installed driver 190.62 and BOINC 6.10.18, but the results are still invalid. And I did NOT install BOINC as a service, because I remembered that there was an issue with CUDA errors related to service installation, but it only affected Vista and Windows 7. This was due to security and services running in different sessions on those versions. That's why I'm still using XP. Can someone please suspend network activity for one WU and look into the result file of the finished WU to see what this "fitness" thing is good for? If fitness on a valid WU is set to -1 too, then it means nothing, I think. This was my last shot for now, so I'm out of MW for today. It's a real shame, because MW is the project I bought these two GTX260 crunchers for. |
Send message Joined: 19 Feb 09 Posts: 32 Credit: 32,843,308 RAC: 0 |
Can someone please suspend network activity for one WU and look into the result file of the finished WU to see what this "fitness" thing is good for? If fitness on a valid WU is set to -1 too then it means nothing, I think. The following results were calculated using Gipsel's ATI app; fitness shows a real number here: de_s222_3s_best_1p_01r_44 parameters [20]: 0.726530534475273 1.000000000000000 -11.309963264750689 164.068245039415330 29.351026000962452 3.773916232705108 1.607818178975123 1.291489947920174 -17.570005048174835 158.561785668195650 18.510896578133526 5.738129192817610 3.323711933363694 11.394196875578736 2.936906508509978 162.759205639023720 0.700000000000000 1.444900940888767 0.408269541430983 14.596364746148591 metadata: i: 77 fitness: -3.200465856044408 Gipsel_GPU_CAL_0.20_x64: 0.20 de_s222_3s_best_1p_01r_44 parameters [20]: 0.968326917012557 1.000000000000000 -10.203487737339231 164.345084867324940 25.920729630235499 5.304565981764101 2.004622029391751 7.127507855253272 -16.032826229817438 154.840719713304740 11.483539537133694 3.739142191627291 3.987127183052219 10.319835656129992 1.401691284083597 162.058382692022750 12.793184654502657 0.833880293137536 0.817510569264601 16.330915579064580 metadata: i: 97 fitness: -3.231483990977578 Gipsel_GPU_CAL_0.20_x64: 0.20 de_s222_3s_best_1p_01r_44 parameters [20]: 1.000000000000000 1.055320290988556 -10.887774134384429 180.667161994741240 32.615322424742459 5.778394578605021 1.360893960164031 4.648841479380868 -15.579053670534684 159.303785370296170 19.866325091491856 3.582786323103195 3.739731946757967 12.959971223927784 -0.576617745668031 156.690938802085500 0.700000000000000 1.326891535596566 1.348254152737073 14.256789232543523 metadata: i: 98 fitness: -3.197918134189094 Gipsel_GPU_CAL_0.20_x64: 0.20 Starfire |
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
Sorry, my bad. |
Send message Joined: 28 Apr 08 Posts: 1415 Credit: 2,716,428 RAC: 0 |
@ David Have you tried the new NVIDIA drivers? I'm using 195.62; I think this is the newest driver out. ;-p |
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
@ David I tried and got so many VDU errors that I went back. I have only one computer on Win 7 with the 195 driver, but I don't run MW on it because I can't. It struggles to run SETI as well for some reason. Of course, it could be the eight bloody Cosmo WUs sucking up 98% of the 8 GB of RAM available. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
The following results were calculated using Gipsels ATI app - fitness is displaying a real number here: OK, thanks Starfire, this is something completely different. In my results, the line reads: fitness: -1.#QNAN000000000000000 stock_win32_gpu: 0.21 double I'm using the stock application like others do, and it had never caused me any trouble since Tuesday the 1st of December. OK, very last chance for tonight. I'll try the latest driver before I go to bed, even though the 191.07 is running fine for others. CUL8R |
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
XJR, from one of your WUs: <core_client_version>6.10.18</core_client_version> It's weird; it should be okay, but it isn't. It seems you are doing everything right, and I am stumped. It seems to validate the card and then finish the WU. Interesting bug. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
OK folks, this was the last driver test I've done for you ;-))) Even with the 195.62 my results are invalid. @David: This is what's driving me nuts. All looks fine; only the line with the fitness argument looks a bit weird. But where are the coders when you need them? There must be someone who knows what this fitness thingy is all about. I'm out now. I have to get the cruncher up again for the other projects, and then I'll have another Talisker for some sweet dreams or it will become a MW nightmare! |
Send message Joined: 5 Oct 09 Posts: 3 Credit: 1,328,091 RAC: 0 |
Fitness is the result of the calculation. In your case the result is -1.#QNAN, a quiet NaN (Not a Number), which indicates an error in the computation. For what can cause these things, see: http://en.wikipedia.org/wiki/QNaN What the cause is in your case, I haven't got a clue (hardware?). |
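To make the point above concrete, here is a minimal Python sketch of how one might check a result file for a non-numeric fitness. It assumes the layout shown in the result files posted in this thread (a `fitness:` label followed by one token) and assumes the `#QNAN`, `#IND` and `#INF` spellings that older Microsoft C runtimes print for NaN and infinity. The function names `parse_fitness` and `looks_valid` are hypothetical and are not part of BOINC or the MilkyWay application.

```python
import math
import re


def parse_fitness(result_text):
    """Return the fitness from a result file as a float (NaN if not a number)."""
    match = re.search(r"fitness:\s*(\S+)", result_text)
    if match is None:
        raise ValueError("no fitness line found")
    token = match.group(1)
    # Assumption: older Microsoft C runtimes print NaN as "1.#QNAN" or "-1.#IND"
    # and infinity as "1.#INF"; requested precision digits are appended as zeros.
    if "#QNAN" in token or "#SNAN" in token or "#IND" in token:
        return math.nan
    if "#INF" in token:
        return -math.inf if token.startswith("-") else math.inf
    return float(token)


def looks_valid(result_text):
    """A result is only plausible if its fitness is a finite number."""
    return math.isfinite(parse_fitness(result_text))
```

With this check, the `-1.#QNAN000000000000000` line from the stock CUDA result would be flagged, while the fitness values from the ATI results posted earlier would pass. The leading `-1` is only part of how the NaN is printed; it is not a fitness of minus one.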
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
My problem is that on two systems which had been working fine I was getting a high failure rate. On the system with a single Nvidia GPU I get valid results, and on the ATI cards (dual in system) I get valid results. On the pair of GTX295 cards and the GTX260 cards I was getting a few valid results but mostly invalid results. I can root around some more, I suppose, to see if I can get more data on memory sizes, but my question is: if the cards are limited in one way or another, why did they get any valid results at all? With "Insta-purge" (TM) turned on, the results are gone so we cannot look back ... but I think there is a weakness in the CUDA application that this expanded task size brings to the fore. Travis? Comments? In the meantime I have set the systems that were returning errors to NNT so I don't waste my time or yours ... or screw up the streams with bad values ... |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
Fitness is the result of the calculation, in your case the result is -1.#QNAN OK, this is a small hint of a calculation error. But a hardware problem seems to be very unlikely, or two GPUs have to fail exactly the same day. And it's not only my GPUs that fail. Are the GTX260 cards some kind of degraded material e.g. GTX280 chips that didn't pass QA or something? As Paul said in his last post it's more likely that there's something wrong with the application that comes to light due to the increased WU size. Would it be possible to reduce the WU size to e.g. twice the size of the former ones instead of four times? |
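As a generic illustration of why a single bad intermediate value is enough to produce a NaN result, the following Python sketch sums many terms, one of which comes from an invalid floating-point operation. It says nothing about where the fault lies in the actual application; the `accumulate` helper and the term values are made up for the example.

```python
import math


def accumulate(terms):
    """Sum the terms one by one, the way a long-running reduction would."""
    total = 0.0
    for term in terms:
        total += term
    return total


good_terms = [0.5] * 1000

# Same data, but one term is the result of an invalid operation (inf - inf).
bad_terms = list(good_terms)
bad_terms[123] = math.inf - math.inf

print(accumulate(good_terms))
print(math.isnan(accumulate(bad_terms)))
```

Once a NaN enters the running total, every later addition keeps it a NaN, so the final value carries no information about which term went wrong. A longer work unit simply has more terms in which such a fault can occur.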
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
I have a machine with two GTX 295's and one GTX 260, it is running the new WU's with NO problems. NB: None of my cards, or cpu's, are overclocked. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
I've found an old backup with a short MW WU, and it finishes successfully. Fitness is a real number, like the results posted by Starfire. My GPUs are NOT overclocked, at least not by me. Here are the clock settings: Core: 576MHz Shader: 1242MHz Memory: 1000MHz (clocks read by RivaTuner). OK, I've tried the following: Core: 400MHz Shader: 862MHz Memory: 800MHz And, what a surprise, the new long WUs still finish invalid. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
Next I tried overclocking, thinking there might be no errors if the WU finishes faster, but the results are still invalid. What a surprise. Can someone please post the clocks of their GTX260? @David: Can you please take a look at your result files, just to see what the fitness parameter says? Especially on your GTX260. It should be a real number like the results from Starfire. Thank you! |
Send message Joined: 24 Aug 09 Posts: 5 Credit: 519,653 RAC: 0 |
I have a GTX280 at standard clock rates, installed in a slightly overclocked Q9550 system running Win XP64. BOINC version: 6.10.11 NVIDIA driver: 190.62 I got the same errors as XJR-Maniac... WUs completed, but most of them were marked as invalid. Fitness: -1.#QNAN... No shared memory errors or anything similar. My next step is to update BOINC and the NVIDIA driver, but after reading this thread I don't believe that I will have more success than the others here. SETI and other CUDA-based apps work fine. If anyone finds the reason, please let me know... |