Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
Author | Message |
---|---|
Send message Joined: 28 Apr 08 Posts: 1415 Credit: 2,716,428 RAC: 0 |
Next I tried overclocking, thinking maybe there wouldn't be errors if the WU finished faster, but the results are still invalid. What a surprise. For reference, these are the clock speeds for a GTX260 from the nVidia site specs: Graphics 576 MHz, Processor 1242 MHz, Memory 999 MHz. Hope it helps ;-p |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
These are the clock speeds for a GTX260 from nVidia site specs Thanks Bruce, that's exactly what my GPUs are running at (see below). OK, as long as there is not much help from the people responsible for the project here, and as long as I'm out of new ideas, I'm going to disable CUDA on MW and enable it on Einstein. It's less of a waste than not using my GPUs at all. OK, there's Seti and Collatz, but they don't have work that often ;-))) It's a shame, because I think MW is a very interesting project. To check whether there's something wrong with my GPUs, and to keep my fans spinning, I joined GPUGrid. I'll keep you informed if I find something interesting there. |
Send message Joined: 31 Mar 08 Posts: 61 Credit: 18,325,284 RAC: 0 |
MW is still my backup project, collatz is my primary. I have to say I'm pleased with the 4x increase in workunit size, and no problems whatsoever on a 4870x2. |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
MW is still my backup project, collatz is my primary. I am pretty happy with them on my GTX280 and ATI cards too ... but the fact that some tasks run and complete, or even that most of them run and complete, is no comfort to those whose systems suddenly started failing tasks when the tasks were increased in size. Again, it is possible that several hundred cards suddenly failed ... but it is far more likely that something else is going on ... If the calculations are running into NaN situations, then it would seem there is the potential for overflows or underflows on some classes of cards. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
After I successfully finished four WUs on GPUGRID, two on each GPU, I'm pretty sure that my GPUs are OK. There is a thread in the GPUGRID forum where a problem with failing GTX260 GPUs is discussed. It seems that some cards don't like fast Fourier transforms (FFTs), mainly the older ones with 192 shaders, and those with the reference design (one fan), too. Both of mine are the new ones with 216 shaders and the 55nm process, and they have two fans. One is a Palit, the other is a Gainward, but they look nearly identical, only the colours are different. The BIOS version of both cards is 62.00.49.00.03. Can someone with a non-failing, two-fan 55nm GTX260/216 post his/her BIOS version, please? You can use GPU-Z to get it. Maybe I can find a version that runs on my cards, too. Is there a debug mode for the MW CUDA application with enhanced logging, so one can see what causes the NaN result? It seems the app isn't crashing; it runs to the end but creates an invalid result. There should be some kind of debug mode, or how do the developers test the apps??? OK, sometimes it seems that there are no tests at all ;-))) Edit: I've read through this thread again, and it seems that, unlike at GPUGRID, here at MW it is not only GTX260 GPUs that fail. There's a GTX280 and Paul's GTX 295s, too. I can understand why the GTX280 could fail too, because if I got it right, a GTX260 is a GTX280 that didn't pass QA due to defective shaders. So they cut the defective shader ALUs down to 216, reduce clocks and memory, and sell it as a GTX260. But where's the link to the 295ers? Aren't they dual 275ers? So the 275/295 chips are not the same as the 260/280 ones? Who knows. Maybe there will come a day when you can read that a pear is an apple that didn't make it through QA due to its irregular form. |
Send message Joined: 11 Oct 08 Posts: 8 Credit: 8,821,808 RAC: 0 |
I'm not having any problems on my GPUs: 2 GTX 260-216 cores, Windows 7 x64, nVidia driver 195.39, CUDA 3, BOINC 6.10.19 Windows x64 |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
It is possible that this is a memory buffer issue. If so, those with a smaller amount of system memory, less bandwidth on an older architecture, and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory, a newer system, and/or only a single nVidia GPU core. I have no experience with nVidia cards, so this is only a theory. It would need to be tested by someone who has the errors, either by installing more memory or by running only one nVidia GPU per box and seeing if that fixes it. I suppose you could also test it by swapping an nVidia card that is erroring into another system that has more available memory. |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I've been following this thread and I'm just tossing up ideas... What is the physical set-up of the computers that return invalid WUs? Could it be that the power supply is a little light? Is there an issue with the BIOS of the motherboard, so the revision needs updating? |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
It is possible that this is a memory buffer issue. If so those with a smaller amount of system memory or less bandwidth on an older architecture and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory/newer system and/or only a single nVidia GPU core. There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT, but it's for display use only, because it created errors on both Seti and Collatz. Both machines have 4GB of RAM installed. And before someone starts to complain: yes, I'm aware of the fact that WinXP 32-bit has access to only 3GB! I see that I forgot to mention RAM in my system overview below. But how could a CUDA application be affected by the amount of system memory? OK, there's a small part that runs on the CPU, but is that of any relevance? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short WUs finish with valid results. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
I've been following this thread and I'm just tossing up ideas... Here are my PSUs: an Enermax Pro 82+ 425W, in use for 9 months, and an Enermax Pro 82+ 625W, in use for 1 month. OK, the 425W one is a little close to the limit, but the machine with the 625W PSU fails, too. The machine with the 625W PSU has another 9400GT running, but as I mentioned before, it is only for display use. And according to nVidia, the 9400GT needs another 50W, so the 625W PSU should be sufficient. The BIOS of both motherboards is flashed to the latest available version. The boards are different types, an Asus P5QL-E and an Asus P5Q Pro Turbo. And hey, it's only Milkyway that has a problem with my hardware! The GPUGRID WUs run nearly 8 hours without interruption and they finish valid! So do Seti, Seti Beta, Collatz and Einstein. So who do you think should be doing all this troubleshooting? Me? I'm getting more and more sick of this! OK, I should stop typing, because I'm getting angry right now, and then I usually lose my composure. |
Send message Joined: 31 Mar 08 Posts: 61 Credit: 18,325,284 RAC: 0 |
I doubt this is a hardware issue. And I haven't heard from Gipsel in a while. </hint> :P Notice how these errors only cropped up with the 4x increase in WU size. Time for another GPU software update to correctly handle the 4x increase? |
Send message Joined: 11 Oct 08 Posts: 8 Credit: 8,821,808 RAC: 0 |
I'm not having any problems on my GPUs: 2 GTX 260-216 cores. My box is only about 3 or 4 months old: Asus A6T motherboard, an Intel i7 920 at 2.67 GHz overclocked to 3.4 GHz, and 12 GB of DDR3 RAM. The video cards are both EVGA GTX260-216 core, factory overclocked, which I have overclocked even more. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
Hello ganja, can you please post the BIOS version of your GTX260 cards? Thank you. Let me ask the public again: does anyone know if the MW app can be run in some kind of debug mode? |
Send message Joined: 19 Jul 08 Posts: 67 Credit: 272,086,462 RAC: 0 |
Dude, there's a config file you can create to track what the client does with WUs, but I've never used it. This is the link: Client Configuration |
Send message Joined: 18 Jul 09 Posts: 7 Credit: 2,373,140 RAC: 0 |
All of mine are invalid now also. 2 EVGA 260, core 216s. They complete fine but are marked invalid. Card 1 BIOS: 62.00.38.00.50; card 2 BIOS: 62.00.4c.00.50. nVidia 195.62, Win XP64, BOINC 6.10.18. Overclocked or stock, same invalid result. |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz. I don't know the technical details of how GPU applications use system memory, or use it to remap video memory; I just know that they do. Errors were reported a while ago with the Collatz ATI application by people who were running multiple ATI GPUs with 1GB of graphics memory each and had only 1GB of system memory. More recently, someone with only 2GB of memory also got those buffer errors when trying to run 2 HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application with the r parameter. I know Collatz ATI is a different application from MilkyWay ATI or CUDA, and is more dependent on video card memory bandwidth. However, system memory would still need to be used, and when the MilkyWay WU is 4 times longer it possibly uses more than previously. Because I had noticed that some who reported no problems, even with multiple nVidia cards, had 8GB or 12GB of system memory, while some of those reporting problems had 3GB reported as available, or 4GB available with multiple GPU cores, I reasoned that memory buffer issues may be causing the problem. I could be wrong; I'm just raising the possibility to try to help narrow down what is causing these errors for some while others are unaffected. It is therefore possible that those who have a large amount of system memory installed and use a 64-bit Windows operating system may not encounter these errors. Also, graphics card memory needs to be remapped into system memory, so multiple cards could still cause a problem even if one of them is not being used for MilkyWay CUDA processing. |
Send message Joined: 19 Jul 08 Posts: 67 Credit: 272,086,462 RAC: 0 |
There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz. I also remember someone had to use r=20 in the app_info file to keep WUs from erroring out. These people may not use the app_info file though? I believe the default is 30? |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
I also remember someone had to use r=20 in the app_info file to keep WUs from erroring out. These people may not use the app_info file though? I believe the default is 30? Yes, the recent example needed to use r20 with his 2 HD 5970s, and previous ones used r25 and r26. I know there is no r parameter available for MilkyWay ATI, and MilkyWay CUDA does not use an app_info.xml file; it was just part of explaining the significance of memory buffer issues. |
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
I agree with this assessment. I have one machine, an i7 with 8 cores, 2 x 295s and one 260, and 8 GB of RAM, running Cosmology, which is a memory pig. I cannot run MW on this machine. But on a similar machine running CDN I have no problems. Prior to the increased WU size I could (mostly) run the WUs on the Cosmo box. I have several 4GB sticks on order to help with the out-of-memory problems that crop up on the Cosmo box, so when I get them installed (12 GB total) I will try MW again and see if they run OK. |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
Well, that seems to confirm my supposition that it has something to do with memory. I think the CUDA version was developed by the project, so it may take some time before this can be fixed. Due to the architecture of current nVidia cards, it is possible that it cannot be fixed without reducing performance. Therefore, in the meantime, here are some suggestions for possible workarounds:
* For those with 4GB of memory and 32-bit Windows, use a 64-bit Windows version so that usable memory is not limited to 3GB.
* Install more memory.
* Install only a single or single-core video card.
* Remove additional video cards even if they are not being used by MilkyWay CUDA.
* Don't run other projects or applications that are memory-intensive.
* Every little bit may help, so disable unnecessary services and set some others to manual instead of auto. Blackviper's Service Configurations may be useful for this.
These are only suggestions; I do not know which combination of them may work for a particular configuration. Others with personal experience of nVidia cards will be able to give better advice or corrections. I know some of these suggestions are not possible or will be unacceptable for some contributors; it is not my intention to offend. |
©2024 Astroinformatics Group