Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
Author | Message |
---|---|
Send message Joined: 28 Apr 08 Posts: 1415 Credit: 2,716,428 RAC: 0 |
Next I tried overclocking, thinking maybe there wouldn't be errors if the WU finished faster, but the results are still invalid. What a surprise. For reference, these are the clock speeds for a GTX260 from the nVidia site specs: Graphics 576 MHz, Processor 1242 MHz, Memory 999 MHz. Hope it helps ;-p |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
These are the clock speeds for a GTX260 from nVidia site specs Thanks Bruce, that's exactly what my GPUs are running at (see below). OK, as long as there is not much help from the people responsible for the project here, and as long as I'm out of new ideas, I'm going to disable CUDA on MW and enable it on Einstein. It's less of a waste than not using my GPUs at all. OK, there's Seti and Collatz, but they don't have work that often ;-))) It's a shame, because I think MW is a very interesting project. To check whether there's something wrong with my GPUs, and to keep my fans spinning, I joined GPUGrid. I'll keep you informed if I find something interesting there. |
Send message Joined: 31 Mar 08 Posts: 61 Credit: 18,325,284 RAC: 0 |
MW is still my backup project, collatz is my primary. I have to say I'm pleased with the 4x increase in workunit size, and no problems whatsoever on a 4870x2. |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
MW is still my backup project, collatz is my primary. I am pretty happy with them on my GTX280 and ATI cards too ... but the fact that some tasks run and complete, or even that most of them run and complete, is no comfort to those whose systems suddenly started failing tasks when the tasks were increased in size. Again, it is possible that several hundred cards suddenly failed ... but it is far more likely that something else is going on ... If the calculations are running into NaN situations, then it would seem there is the potential for overflows or underflows on some classes of cards. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
After I successfully finished four WUs on GPUGRID, two on each GPU, I'm pretty sure that my GPUs are OK. There is a thread in the GPUGRID forum where a problem with failing GTX260 GPUs is discussed. It seems that some cards don't like fast Fourier transforms (FFTs), mainly the older ones with 192 shaders, and those with the reference design (one fan), too. Both of mine are the new ones with 216 shaders and the 55nm process, and they have two fans. One is a Palit, the other is a Gainward, but they look nearly identical, only the colours are different. The BIOS version of both cards is 62.00.49.00.03. Can someone with a non-failing, two-fan 55nm GTX260/216 post his/her BIOS version, please? You can use GPU-Z to get it. Maybe I can find a version that runs on my cards, too. Is there a debug mode for the MW CUDA application with enhanced logging, so one can see what causes the NaN result? It seems the app isn't crashing; it runs to the end but creates an invalid result. There should be some kind of debug mode, or how do the developers test the apps??? OK, sometimes it seems that there are no tests at all ;-))) Edit: I've read through this thread again, and it seems that, unlike at GPUGRID, here at MW it is not only GTX260 GPUs that fail. There's a GTX280 and Paul's GTX 295s, too. I can understand why the GTX280 could fail too, because if I got it right, a GTX260 is a GTX280 that didn't pass QA due to defective shaders. So they cut the defective shader ALUs down to 216, reduce clocks and memory, and sell it as a GTX260. But where's the link to the 295ers? Aren't they dual 275ers? So the 275/295 chips are not the same as the 260/280 ones? Who knows. Maybe there will come a day when you can read that a pear is an apple that didn't make it through QA due to its irregular form. |
Send message Joined: 11 Oct 08 Posts: 8 Credit: 8,821,808 RAC: 0 |
I'm not having any problems on my GPUs: 2 GTX 260-216 cores, Windows 7 x64, nVidia driver 195.39, CUDA 3, BOINC 6.10.19 Windows x64 |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
It is possible that this is a memory buffer issue. If so, those with a smaller amount of system memory, less bandwidth on an older architecture, and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory, a newer system, and/or only a single nVidia GPU core. I have no experience with nVidia cards, so this is only a theory. It would need to be tested by someone who has the errors, either by installing more memory or by running only one nVidia GPU per box and seeing if that fixes it. I suppose you could also test it by swapping an nVidia card that is erroring into another system that has more available memory. |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I've been following this thread and I'm just tossing up ideas... What is the physical set-up of the computers that return invalid WUs? Could it be that the power supply is a little light? Is there an issue with the BIOS of the motherboard, so the revision needs updating? |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
It is possible that this is a memory buffer issue. If so those with a smaller amount of system memory or less bandwidth on an older architecture and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory/newer system and/or only a single nVidia GPU core. There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT, but it's for display use only, because it created errors on both Seti and Collatz. Both machines have 4GB of RAM installed. And before someone starts to complain: yes, I'm aware of the fact that WinXP 32-bit has access to only 3GB! I see that I forgot to mention RAM in my system overview below. But how could a CUDA application be affected by the amount of system memory? OK, there's a small part that runs on the CPU, but is that of any relevance? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short WUs finish with valid results. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
I've been following this thread and I'm just tossing up ideas... Here are my PSUs: an Enermax Pro 82+ 425W, in use for 9 months, and an Enermax Pro 82+ 625W, in use for 1 month. OK, the 425W one is a little close to the limit, but the machine with the 625W PSU fails, too. The machine with the 625W PSU has another 9400GT running, but as I mentioned before, it is only for display use. And according to nVidia, the 9400GT needs another 50W, so the 625W PSU should be sufficient. The BIOS of both motherboards is flashed to the latest available version. The boards are different types, an Asus P5QL-E and an Asus P5Q Pro Turbo. And hey, it's only Milkyway that has a problem with my hardware! The GPUGRID WUs run nearly 8 hours without interruption and they finish valid! So do Seti, Seti Beta, Collatz and Einstein. So who do you think should be doing all this troubleshooting? Me? I'm getting more and more sick of this! OK, I should stop typing, because I'm getting angry right now, and then I usually lose my composure. |
Send message Joined: 31 Mar 08 Posts: 61 Credit: 18,325,284 RAC: 0 |
I doubt this is a hardware issue. And I haven't heard from Gipsel in a while. </hint> :P Notice how these errors only cropped up with the 4x increase in WU size. Time for another GPU software update to correctly handle the 4x increase? |
Send message Joined: 11 Oct 08 Posts: 8 Credit: 8,821,808 RAC: 0 |
I'm not having any problems on my GPUs: 2 GTX 260-216 cores. My box is only about 3 or 4 months old: Asus A6T motherboard, an Intel i7 920 at 2.67 GHz overclocked to 3.4 GHz, and 12 GB of DDR3 RAM. The video cards are both EVGA GTX260-216 core, factory overclocked, which I have overclocked even more. |
Send message Joined: 18 Oct 07 Posts: 35 Credit: 4,684,314 RAC: 0 |
Hello ganja, can you please post the BIOS version of your GTX260 cards? Thank you. Let me ask the public again: does anyone know if the MW app can be run in some kind of debug mode? |
Send message Joined: 19 Jul 08 Posts: 67 Credit: 272,086,462 RAC: 0 |
Dude, there's a config file you can create to track what the client does with WUs, but I've never used it. This is the link: Client Configuration |
Send message Joined: 18 Jul 09 Posts: 7 Credit: 2,373,140 RAC: 0 |
All of mine are invalid now also. 2 EVGA 260, core 216s. They complete fine but are marked invalid. Card 1 BIOS: 62.00.38.00.50; card 2 BIOS: 62.00.4c.00.50. nVidia 195.62, Win XP64, BOINC 6.10.18. Overclocked or stock, same invalid result. |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz. I don't know the technical details of how GPU applications use system memory, or use it to remap video memory; I just know that they do. Errors were reported a while ago with the Collatz ATI application by people who were running multiple ATI GPUs with 1GB of graphics memory each and had only 1GB of system memory. More recently, someone with only 2GB of memory also got those buffer errors when trying to run 2 HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application with the r parameter. I know Collatz ATI is a different application from MilkyWay ATI or CUDA, and is more dependent on video card memory bandwidth. However, system memory would still need to be used, and when the MilkyWay WU is 4 times longer it possibly uses more than previously. Because I had noticed that some who reported no problems, even with multiple nVidia cards, had 8GB or 12GB of system memory, while some of those reporting problems had 3GB reported as available, or 4GB available with multiple GPU cores, I reasoned that memory buffer issues may be causing the problem. I could be wrong; I'm just raising the possibility to try to help narrow down what is causing these errors for some while others are unaffected. It is therefore possible that those who have a large amount of system memory installed and use a 64-bit Windows operating system may not encounter these errors. Also, graphics card memory needs to be remapped into system memory, so multiple cards could still cause a problem even if one of them is not being used for MilkyWay CUDA processing. |
Send message Joined: 19 Jul 08 Posts: 67 Credit: 272,086,462 RAC: 0 |
There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz. I also remember someone had to use r=20 in the app_info file to keep WUs from erroring out. These people may not use the app_info file though? I believe the default is 30? |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
I also remember someone had to use r=20 in the app_info file to keep WUs from erroring out. These people may not use the app_info file though? I believe the default is 30? Yes, the recent example needed to use r20 with his 2 HD 5970s, and previous ones used r25 and r26. I know there is no r parameter available for MilkyWay ATI, and MilkyWay CUDA does not use an app_info.xml file; it was just part of explaining the significance of memory buffer issues. |
Send message Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
I agree with this assessment. I have one machine, an i7 with 8 cores, 2 x 295s and one 260, and 8 GB of RAM, running Cosmology, which is a memory pig. I cannot run MW on this machine. But on a similar machine running CDN I have no problems. Prior to the increased WU size I could (mostly) run the WUs on the Cosmo box. I have several 4GB sticks on order to help with the out-of-memory problems that crop up on the Cosmo box, so when I get them installed (12 GB total) I will try MW again and see if they run OK. |
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
Well, that seems to confirm my supposition that it has something to do with memory. I think the CUDA version was developed by the project, so it may take some time before this can be fixed. Due to the architecture of current nVidia cards, it is possible that it cannot be fixed without reducing performance. Therefore, in the meantime, here are some suggestions for possible workarounds:
* For those with 4GB of memory and 32-bit Windows, use a 64-bit Windows version so that usable memory is not limited to 3GB.
* Install more memory.
* Install only a single or single-core video card.
* Remove additional video cards even if they are not being used by MilkyWay CUDA.
* Don't run other projects or applications that are memory-intensive.
* Every little bit may help, so disable unnecessary services and set some others to manual instead of auto. Blackviper's Service Configurations may be useful for this.
These are only suggestions; I do not know which combination of them may work for a particular configuration. Others with personal experience of nVidia cards will be able to give better advice or corrections. I know some of these suggestions are not possible or will be unacceptable for some contributors; it is not my intention to offend. |
©2024 Astroinformatics Group