Welcome to MilkyWay@home

Sudden mass of WU's finishing with Computation Error


Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error
Bruce
Joined: 28 Apr 08
Posts: 1415
Credit: 2,716,428
RAC: 0
Message 34151 - Posted: 3 Dec 2009, 17:55:24 UTC - in response to Message 34149.  

Next I tried overclocking, on the theory that there might be no errors if the WU finishes faster, but the results are still invalid. What a surprise.

Can someone please post the clocks of their GTX260?

@David: Can you please take a look at your result files, just to see what the fitness parameter says? Especially on your GTX260. It should be a real number like the results from Starfire. Thank you!


These are the clock speeds for a GTX260 from the nVidia site specs:
Graphics: 576 MHz
Processor: 1242 MHz
Memory: 999 MHz
Hope it helps.
;-p

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34153 - Posted: 3 Dec 2009, 18:03:00 UTC - in response to Message 34151.  

These are the clock speeds for a GTX260 from the nVidia site specs:
Graphics: 576 MHz
Processor: 1242 MHz
Memory: 999 MHz
Hope it helps.
;-p


Thanks Bruce, that's exactly what my GPUs are running on (see below).

OK, as long as there is not much help from the project staff here and I'm out of new ideas, I'm going to disable CUDA on MW and enable it on Einstein. That's less of a waste than leaving my GPUs unused. OK, there's Seti and Collatz, but not that often ;-))) It's a shame because I think MW is a very interesting project.

To check if there's something wrong with my GPUs and to keep my fans spinning, I joined GPUGrid. I'll keep you informed, if I find something interesting there.

Thamir Ghaslan
Joined: 31 Mar 08
Posts: 61
Credit: 18,325,284
RAC: 0
Message 34155 - Posted: 3 Dec 2009, 19:27:08 UTC

MW is still my backup project, collatz is my primary.

I have to say I'm pleased with the 4x increase in workunit size, and I've had no problems whatsoever on a 4870x2.


Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 34187 - Posted: 4 Dec 2009, 12:06:59 UTC - in response to Message 34155.  

MW is still my backup project, collatz is my primary.

I have to say I'm pleased with the 4x increase in workunit size, and I've had no problems whatsoever on a 4870x2.

I am pretty happy with them on my GTX280 and ATI cards too ... but the fact that some of them, or even most of them, run and complete is no comfort to those whose systems suddenly started failing tasks when the size was increased. Again, it is possible that several hundred cards suddenly failed ... but it is far more likely that something else is going on ...

If the calculations are running into NaN situations then it would seem that there is the potential for overflows or underflows on some classes of cards.
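
As an illustration of the overflow/underflow path mentioned above, here is a minimal Python sketch of how a NaN can arise and then spread through a running sum. It is not taken from the MilkyWay application: the values and the `accumulate` helper are hypothetical, and ordinary double-precision floats are used purely to show the IEEE 754 behaviour.

```python
import math

def accumulate(terms):
    """Sum terms one by one, the way a long-running integral might."""
    total = 0.0
    for term in terms:
        total += term
    return total

# Overflow: the product exceeds the double range and becomes +inf.
overflowed = 1e308 * 10.0

# Underflow: the product is too small to represent and becomes 0.0.
underflowed = 1e-200 * 1e-200

# Two classic ways to turn those into NaN.
nan_from_overflow = overflowed - overflowed    # inf - inf
nan_from_underflow = underflowed * overflowed  # 0 * inf

# A single NaN term poisons the whole sum (hypothetical "fitness" value).
fitness = accumulate([1.0, 2.0, nan_from_overflow, 3.0])
print(math.isnan(fitness))
```

Once one term is NaN, every later addition stays NaN, which would fit an app that runs to the end and still reports a non-numeric result. Whether this is what actually happens on the affected cards is only a guess.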

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34202 - Posted: 4 Dec 2009, 15:30:49 UTC
Last modified: 4 Dec 2009, 15:50:01 UTC

After successfully finishing four WUs on GPUGRID, two on each GPU, I'm pretty sure that my GPUs are OK. There is a thread in the GPUGRID forum where a problem with failing GTX260 GPUs is discussed. It seems that some cards don't like fast Fourier transforms (FFT), mainly the older ones with 192 shaders and the reference design (one fan).

Both of mine are the newer ones with 216 shaders and the 55nm architecture, and they have two fans. One is a Palit, the other a Gainward, but they look much the same; only the colours are different.

BIOS version of both cards is 62.00.49.00.03


Can someone with a non-failing, two-fan 55nm GTX260/216 post their BIOS version, please? You can use GPU-Z to get it. Maybe I can find a version that runs on my cards, too.

Is there a debug mode for the MW CUDA application with enhanced logging, so one can see what causes the NaN result? It seems that the app isn't crashing. It runs to the end but creates an invalid result. There should be some kind of debug mode, or how do the developers test the apps??? OK, sometimes it seems that there are no tests at all ;-)))


Edit: I've read through this thread again and it seems that, unlike at GPUGRID, here at MW it's not only GTX260 GPUs that fail. There's a GTX280 and Paul's GTX295, too. I can understand why the GTX280 could fail too because, if I understand it correctly, a GTX260 is a GTX280 that didn't pass QA due to defective shaders. So they cut the shader ALUs down to 216, reduce clocks and memory, and sell it as a GTX260.

But where's the link to the 295s? Aren't they dual 275s? So the 275/295 chips are not the same as the 260/280 ones?

Who knows. Maybe there will come a day when you can read that a pear is an apple that didn't make it through QA due to its irregular form.

| ganja |
Joined: 11 Oct 08
Posts: 8
Credit: 8,821,808
RAC: 0
Message 34205 - Posted: 4 Dec 2009, 16:51:17 UTC - in response to Message 34202.  

I'm not having any problems on my GPUs, 2 GTX 260-216 cores
Windows 7 x64
nVidia driver 195.39, CUDA 3
BOINC 6.10.19 Windows x64

kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,222,294
RAC: 25,948
Message 34207 - Posted: 4 Dec 2009, 17:42:46 UTC

It is possible that this is a memory buffer issue. If so, those with a smaller amount of system memory or less bandwidth on an older architecture and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory/newer system and/or only a single nVidia GPU core.

I have no experience with nVidia cards, so this is only theory. It would need to be tested by someone who has the errors, either by installing more memory or by running only one nVidia GPU per box and seeing if that fixes it. I suppose you could also test it by moving an erroring nVidia card to another system that has more available memory.

The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 34209 - Posted: 4 Dec 2009, 17:49:17 UTC

I've been following this thread and I'm just tossing up ideas...

What is the physical set up for the computers that return invalid wu's? Could it be that the power supply is a little light? Is there an issue with the BIOS of the motherboard so the rev needs updating?

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34210 - Posted: 4 Dec 2009, 18:04:25 UTC - in response to Message 34207.  

It is possible that this is a memory buffer issue. If so, those with a smaller amount of system memory or less bandwidth on an older architecture and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory/newer system and/or only a single nVidia GPU core.

I have no experience with nVidia cards, so this is only theory. It would need to be tested by someone who has the errors, either by installing more memory or by running only one nVidia GPU per box and seeing if that fixes it. I suppose you could also test it by moving an erroring nVidia card to another system that has more available memory.


There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz.

Both machines have 4GB of RAM installed. And before someone starts to complain, yes, I'm aware that 32-bit WinXP only has access to 3GB! I saw that I forgot to mention RAM in my system overview below.

But how could a CUDA application be affected by the amount of system memory? OK, there's a small part that runs on the CPU, but is that significant? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short ones finish with valid results.

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34211 - Posted: 4 Dec 2009, 19:02:29 UTC - in response to Message 34209.  

I've been following this thread and I'm just tossing up ideas...

What is the physical set up for the computers that return invalid wu's? Could it be that the power supply is a little light? Is there an issue with the BIOS of the motherboard so the rev needs updating?


Here are my PSUs:

Enermax Pro 82+ 425W, in use for 9 months
Enermax Pro 82+ 625W, in use for 1 month

OK, the 425W is a little bit at the limit but the machine with the 625W fails, too. The machine with the 625W PSU has another 9400GT running, but as I mentioned before it is only for display use. And according to nvidia the 9400GT needs another 50W, so the 625W PSU should be sufficient.

BIOS of both MBs is flashed to the latest available version. The boards are different types, Asus P5QL-E and Asus P5Q Pro Turbo.

And hey, it's only Milkyway that has a problem with my hardware! The GPUGRID WUs run for nearly 8 hours without interruption and they finish valid! So do Seti, Seti Beta, Collatz and Einstein. So who do you think should be doing all that troubleshooting? Me? I'm getting more and more sick of this!

OK, I should stop typing because I'm getting angry right now! And then I usually lose my composure.


Thamir Ghaslan
Joined: 31 Mar 08
Posts: 61
Credit: 18,325,284
RAC: 0
Message 34214 - Posted: 4 Dec 2009, 21:48:51 UTC

I doubt this is a hardware issue.

And I haven't heard from Gipsel in a while. </hint> :P

Notice how these errors only crept up with the 4x increase in WU size.

Time for another GPU software update to correctly handle the 4x increase?

| ganja |
Joined: 11 Oct 08
Posts: 8
Credit: 8,821,808
RAC: 0
Message 34215 - Posted: 4 Dec 2009, 21:58:16 UTC - in response to Message 34205.  
Last modified: 4 Dec 2009, 22:01:45 UTC

I'm not having any problems on my GPUs, 2 GTX 260-216 cores
Windows 7 x64
nVidia driver 195.39, CUDA 3
BOINC 6.10.19 Windows x64

My box is only like 3 or 4 months old...
Asus a6t motherboard, Intel i7 920 2.67 GHz o/c to 3.4 GHz, and I have 12 GB of DDR3 RAM.
The video cards are both EVGA GTX260-216 core, factory o/c, which I have o/c'd even more.

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
Message 34218 - Posted: 4 Dec 2009, 23:13:09 UTC - in response to Message 34215.  


My box is only like 3 or 4 months old...
Asus a6t motherboard, Intel i7 920 2.67 GHz o/c to 3.4 GHz, and I have 12 GB of DDR3 RAM.
The video cards are both EVGA GTX260-216 core, factory o/c, which I have o/c'd even more.


Hello ganja, can you please post the bios version of your GTX260 cards? Thank you.

Let me ask everyone again: does anyone know if the MW app can be run in some kind of debug mode?


Donnie
Joined: 19 Jul 08
Posts: 67
Credit: 272,086,462
RAC: 0
Message 34221 - Posted: 4 Dec 2009, 23:49:41 UTC - in response to Message 34218.  


My box is only like 3 or 4 months old...
Asus a6t motherboard, Intel i7 920 2.67 GHz o/c to 3.4 GHz, and I have 12 GB of DDR3 RAM.
The video cards are both EVGA GTX260-216 core, factory o/c, which I have o/c'd even more.


Hello ganja, can you please post the bios version of your GTX260 cards? Thank you.

Let me ask everyone again: does anyone know if the MW app can be run in some kind of debug mode?


Dude, there's a config file you can create to track operations of WUs, but I've never used it. This is the link:

Client Configuration
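
For anyone who wants to try the client configuration route, below is a small Python sketch that builds the text of a `cc_config.xml` with extra log flags switched on. The choice of `task_debug` and `coproc_debug` is an assumption about which flags might be relevant here, and the `build_cc_config` helper is hypothetical. Note that this increases logging in the BOINC client, not inside the MilkyWay application, so it may not show where a NaN comes from.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

# Assumed flags; the Client Configuration page lists the available ones.
config_text = build_cc_config(["task_debug", "coproc_debug"])
print(config_text)
```

The printed text would be saved as `cc_config.xml` in the BOINC data directory, after which the client needs to re-read its config files or be restarted before the extra messages appear.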

Mark Henderson
Joined: 18 Jul 09
Posts: 7
Credit: 2,373,140
RAC: 0
Message 34228 - Posted: 5 Dec 2009, 0:17:01 UTC
Last modified: 5 Dec 2009, 0:36:11 UTC

All of mine are invalid now, too. 2 EVGA 260, core 216s. They complete fine but are marked invalid.

1 is bios: 62.00.38.00.50
2 is bios: 62.00.4c.00.50

Nvidia 195.62
Win. XP64
Boinc 6.10.18

Overclocked or stock, same invalid result.

kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,222,294
RAC: 25,948
Message 34229 - Posted: 5 Dec 2009, 0:37:07 UTC - in response to Message 34210.  
Last modified: 5 Dec 2009, 0:55:03 UTC

There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz.

Both machines have 4GB of RAM installed. And before someone starts to complain, yes, I'm aware that 32-bit WinXP only has access to 3GB! I saw that I forgot to mention RAM in my system overview below.

But how could a CUDA application be affected by the amount of system memory? OK, there's a small part that runs on the CPU, but is that significant? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short ones finish with valid results.


I don't know the technical details of how GPU applications use system memory or use it to remap the video memory, I just know that they do. Errors were reported a while ago with the Collatz ATI application from people who were running multiple 1GB graphics memory ATI GPUs and had only 1 GB of system memory. More recently someone with only 2GB of memory also got those buffer errors when trying to run 2 HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application with the use of the r parameter.

I know Collatz ATI is a different application from MilkyWay ATI or CUDA and is more dependent on video card memory bandwidth. However, system memory would still need to be used, and when the MilkyWay WU is 4 times longer it possibly uses more than previously.

Because I had noticed that some who reported no problems, even with multiple nVidia cards, had 8GB or 12GB of system memory, while some of those reporting problems had 3GB reported as available, or 4GB available with multiple GPU cores, I reasoned that memory buffer issues may be causing the problem. I could be wrong; I'm just raising the possibility to try to help narrow down what is causing these errors for some while others are unaffected.

Therefore it is possible that those who have a large amount of system memory installed and use a Windows 64-bit operating system may not encounter these errors. Also graphics card memory needs to be remapped to system memory so multiple cards could still cause a problem even if one of them is not being used for MilkyWay CUDA processing.
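
The 32-bit limit can be put into rough numbers. The sketch below assumes a plain 4 GiB address space in which every device aperture takes away from the room left for RAM. The `usable_ram_mib` helper is hypothetical and the aperture sizes are made-up round figures, not measurements of any card in this thread.

```python
ADDRESS_SPACE_MIB = 4096  # 2**32 bytes, the limit of a 32-bit address space

def usable_ram_mib(installed_mib, apertures_mib):
    """RAM a 32-bit OS can address once device apertures are reserved."""
    reserved = sum(apertures_mib)
    return min(installed_mib, ADDRESS_SPACE_MIB - reserved)

# Hypothetical apertures in MiB: graphics card(s) first, other devices last.
one_card = usable_ram_mib(4096, [512, 256])
two_cards = usable_ram_mib(4096, [512, 512, 256])
print(one_card, two_cards)
```

Under these assumptions a second card shrinks the usable RAM even if it only drives the display. A 64-bit OS can usually map the apertures elsewhere, which is consistent with the suggestion above but does not prove that memory is the cause of the invalid results.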

Donnie
Joined: 19 Jul 08
Posts: 67
Credit: 272,086,462
RAC: 0
Message 34230 - Posted: 5 Dec 2009, 0:52:24 UTC - in response to Message 34229.  
Last modified: 5 Dec 2009, 0:53:33 UTC

There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz.

Both machines have 4GB of RAM installed. And before someone starts to complain, yes, I'm aware that 32-bit WinXP only has access to 3GB! I saw that I forgot to mention RAM in my system overview below.

But how could a CUDA application be affected by the amount of system memory? OK, there's a small part that runs on the CPU, but is that significant? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short ones finish with valid results.


I don't know the technical details of how GPU applications use system memory or use it to remap the video memory, I just know that they do. Errors were reported a while ago with the Collatz ATI application from people who were running multiple 1GB graphics memory ATI GPUs and had only 1 GB of system memory. More recently someone with only 2GB of memory also got those buffer errors when trying to run 2 HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application with the use of the r parameter.

I know Collatz ATI is a different application from MilkyWay ATI or CUDA and is more dependent on video card memory bandwidth. However, system memory would still need to be used, and when the MilkyWay WU is 4 times longer it possibly uses more than previously.

Because I had noticed that some who reported no problems, even with multiple nVidia cards, had 8GB or 12GB of system memory, while some of those reporting problems had 3GB reported as available, or 4GB available with multiple GPU cores, I reasoned that memory buffer issues may be causing the problem. I could be wrong; I'm just raising the possibility to try to help narrow down what is causing these errors for some while others are unaffected.

Therefore it is possible that those that have a large amount of system memory installed and use a Windows 64-bit operating system may not encounter these errors.


I also remember someone had to use r=20 in the app_info file to keep WUs from erroring out. These people may not use the app_info file though? I believe the default is 30?

kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,222,294
RAC: 25,948
Message 34231 - Posted: 5 Dec 2009, 1:04:56 UTC - in response to Message 34230.  

I also remember someone had to use r=20 in the app_info file to keep WUs from erroring out. These people may not use the app_info file though? I believe the default is 30?

Yes the recent example needed to use r20 with his 2 HD 5970s and previous ones used r25 and r26. I know there is no r parameter available for MilkyWay ATI and MilkyWay CUDA does not use an app_info.xml file, it was just part of explaining the significance of memory buffer issues.

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 34232 - Posted: 5 Dec 2009, 1:06:23 UTC - in response to Message 34229.  

I agree with this assessment. I have one machine, an i7 with 8 cores, 2 x 295s, one 260 and 8 GB of RAM, running Cosmology, which is a memory pig. I cannot run MW on this machine. But on a similar machine running CDN I have no problems.

Prior to the increased WU size I could (mostly) run the WU's on the Cosmo box. I have several 4GB sticks on order to help with the out of memory problems that crop up on the Cosmo box, so when I get them installed (12 GB total) I will try MW again and see if they run OK.

kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,222,294
RAC: 25,948
Message 34242 - Posted: 5 Dec 2009, 3:00:05 UTC
Last modified: 5 Dec 2009, 3:47:25 UTC

Well that seems to confirm my supposition that it is something to do with memory.

I think the CUDA version was developed by the project, so it may take some time before this can be fixed. Due to the architecture of current nVidia cards it is possible that it is not able to be fixed without reducing performance. Therefore in the meantime here are some suggestions for possible workarounds:

* For those with 4GB of memory and Windows 32-bit, use a 64-bit Windows version so that usable memory is not limited to 3GB.
* Install more memory.
* Install only a single video card, or a single-core one.
* Remove additional video cards even if they are not being used by MilkyWay CUDA.
* Don't run other projects or applications that are memory intensive.
* Every little bit may help, so disable unnecessary services and set some others to manual instead of auto. Blackviper's Service Configurations may be useful for this.

These are only suggestions; I do not know which combination of them may work for a particular configuration. Others with personal experience of nVidia cards will be able to give better advice or corrections. I know some of these suggestions are not possible or will be unacceptable for some contributors; it is not my intention to offend.

©2019 Astroinformatics Group