Welcome to MilkyWay@home

Sudden mass of WU's finishing with Computation Error



Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error

XJR-Maniac
Joined: 18 Oct 07
Posts: 35
Credit: 4,684,314
RAC: 0
3 million credit badge · 10 year member badge
Message 34246 - Posted: 5 Dec 2009, 4:16:47 UTC - in response to Message 34221.  

Dude, there's a config file you can create to track operations of WUs, but I've never used it. This is the link:

Client Configuration


I'm aware of that, but I want to debug the MW application, not the core client. On that page I couldn't find any command that would do that. I did find the <coproc_debug> command, but it's not really for debugging MW. The output of this command was this:

[Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_s222_3s_best_1p_01r_41_701094_1259781480_0


There should be a debug command for the MW app.
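For anyone wanting to reproduce that output: the <coproc_debug> flag goes into cc_config.xml in the BOINC data directory. A minimal sketch (flag names as documented on the BOINC client configuration page; as noted above, these trace the core client's coprocessor handling, not the MW application itself):

```xml
<!-- cc_config.xml, placed in the BOINC data directory and read at client
     startup (or via "Re-read config file").  These flags trace how the
     core client schedules coprocessors; they do not debug the MW app. -->
<cc_config>
    <log_flags>
        <!-- produces lines like "[coproc_debug] Assigning CUDA instance 0 to ..." -->
        <coproc_debug>1</coproc_debug>
        <!-- task start/exit details -->
        <task_debug>1</task_debug>
    </log_flags>
</cc_config>
```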

ID: 34246
XJR-Maniac
Message 34247 - Posted: 5 Dec 2009, 4:32:00 UTC - in response to Message 34229.  

I don't know the technical details of how GPU applications use system memory, or how video memory gets remapped into it; I just know that they do. Errors were reported a while ago with the Collatz ATI application by people who were running multiple ATI GPUs with 1GB of graphics memory each and had only 1 GB of system memory. More recently, someone with only 2GB of memory also got those buffer errors when trying to run 2 HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application via the r parameter.

I know Collatz ATI is a different application from MilkyWay ATI or CUDA and is more dependent on video card memory bandwidth. However, system memory would still need to be used, and since the MilkyWay WUs are now 4 times longer, they possibly use more of it than before.

Because I had noticed that some who reported no problems, even with multiple nVidia cards, had 8GB or 12GB of system memory, while some of those reporting problems had only 3GB available (or 4GB with multiple GPU cores), I reasoned that memory buffer issues might be causing a problem. I could be wrong; I'm just raising the possibility to try to help narrow down what is causing these errors for some while others are unaffected.

Therefore it is possible that those who have a large amount of system memory installed and use a Windows 64-bit operating system may not encounter these errors. Also, graphics card memory needs to be remapped into system memory, so multiple cards could still cause a problem even if one of them is not being used for MilkyWay CUDA processing.


OK, I understand what you mean. What I always thought was that video memory is mapped BEHIND the physical address range, which is why WinXP 32 cannot handle more than 3GB: it cannot use PAE by design. On Windows 2003 Enterprise Server I can enable PAE and enable memory remapping in the BIOS, so that video memory gets remapped above the installed memory, no matter how much is installed.

So tomorrow I will give it a shot: move the 4GB to the Win2003 machine with PAE on, so that the full 8GB are available, and try to run MW stand-alone.
ID: 34247
XJR-Maniac
Message 34265 - Posted: 5 Dec 2009, 17:44:20 UTC
Last modified: 5 Dec 2009, 17:45:55 UTC

Bad news: it's not about insufficient system memory. I ran four of the long 4x MW WUs with 8GB of system RAM, only one GPU installed, and all unnecessary services and processes disabled on my Win2003 Server Enterprise Edition with PAE enabled, and they still finished with "fitness: 1.#QNAN000000000000000". I ran them with the network disabled, so don't be surprised that they haven't been reported yet.

I have a screen shot of the memory usage here.

My next idea was to try an optimized app but there is no opt app for CUDA. Why? Is the stock CUDA app so well optimized that it's not necessary?
ID: 34265
Thamir Ghaslan
Joined: 31 Mar 08
Posts: 61
Credit: 18,325,284
RAC: 0
10 million credit badge · 10 year member badge
Message 34267 - Posted: 5 Dec 2009, 18:46:38 UTC - in response to Message 34265.  

Bad news: it's not about insufficient system memory. I ran four of the long 4x MW WUs with 8GB of system RAM, only one GPU installed, and all unnecessary services and processes disabled on my Win2003 Server Enterprise Edition with PAE enabled, and they still finished with "fitness: 1.#QNAN000000000000000". I ran them with the network disabled, so don't be surprised that they haven't been reported yet.

I have a screen shot of the memory usage here.

My next idea was to try an optimized app but there is no opt app for CUDA. Why? Is the stock CUDA app so well optimized that it's not necessary?


From what I've been following it seems the majority of Nvidia users are erroring out.

Again, the problems coincided with the increased WU size. ATIs seem to handle it while Nvidias don't.

I haven't seen any feedback from any of the ATI or CUDA (in-house?) coders, so their insight would be highly valuable.
ID: 34267
Anthony Waters
Joined: 16 Jun 09
Posts: 85
Credit: 172,476
RAC: 0
100 thousand credit badge · 10 year member badge
Message 34268 - Posted: 5 Dec 2009, 19:38:51 UTC

Is this still an issue?

If so please let me know how much Video RAM the graphics card has. The larger work units appear to be using 312MB of Video RAM.
ID: 34268
Nightlord
Joined: 29 Jul 08
Posts: 12
Credit: 60,445,018
RAC: 0
50 million credit badge · 10 year member badge
Message 34269 - Posted: 5 Dec 2009, 19:41:12 UTC
Last modified: 5 Dec 2009, 19:42:19 UTC

Not sure if this is related or otherwise, but immediately following the increase in WU size, most of my ATI GPU's failed on most WU's. Work commitments meant I only had a short time to investigate before deciding to set no new tasks. However, what I observed was that WU's would run to completion, but then the next WU would fail.

More specifically, I had two or three WU's running - after the end of the first of those, the 3rd or 4th WU and all subsequent WU's would fail.

Having had time yesterday to investigate, I recalled I had seen something similar in the past on standard WU's on an old, slow Pentium 4 based machine. It had something to do with the CPU utilisation at the start of a new WU, combined with the response time of the starting-up WU. On that old machine, I had to set the app_info to run 2 WU's and set w1.05 in the command line tag. Without those tweaks I couldn't reliably run MW, let alone any CPU projects at the same time. My guess is there is a timeout that must be serviced within a specific time frame; if it is not serviced, the WU fails.

With that in mind, I experimented a bit yesterday. I dropped the boxes down to running 1 WU at a time (n1) and set w1.01 (w1.15 on the old P4). Hey presto, they have all been running happy and smooth.

So a long story, and I know that specific solution is not available to Nvidia users through the app_info, but if you have a busy machine on CPU projects, try reducing the load a little. It might also give some pointers to the project staff to look into the CPU utilization and timings at WU start up.
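For context, the n and w values above are command-line parameters passed to the optimized ATI application through the <cmdline> tag of app_info.xml. A hedged sketch of just the relevant fragment (the app name, version number, and file name are placeholders for whatever optimized build is actually installed; this is not a complete, drop-in app_info.xml):

```xml
<!-- Illustrative app_info.xml fragment only:
     n1    = process 1 WU at a time
     w1.01 = the wait factor tuned above (w1.15 on the old P4) -->
<app_version>
    <app_name>milkyway</app_name>
    <version_num>20</version_num>
    <cmdline>n1 w1.01</cmdline>
    <file_ref>
        <file_name>astronomy_ATI.exe</file_name>
        <main_program/>
    </file_ref>
</app_version>
```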


ID: 34269
XJR-Maniac
Message 34270 - Posted: 5 Dec 2009, 20:32:42 UTC - in response to Message 34268.  
Last modified: 5 Dec 2009, 20:41:15 UTC

Is this still an issue?

If so please let me know how much Video RAM the graphics card has. The larger work units appear to be using 312MB of Video RAM.


My GPUs both have 896MB of RAM but there are others with more that fail:

salyavin with GTX285/1024

OK, you don't see anything because of fast purging.

Others with the same GPU and amount of RAM finish successfully:

David Glogau
ganja

So I don't think it's about RAM at all, neither system nor GPU.

In the meantime, I downloaded a fresh CUDA WU, and while both MW CUDA AND MW CPU were running simultaneously, the system started jerking around, the network connection got interrupted from time to time, music playback kept hanging, and then the system crashed so hard that it didn't even have time to create a memory dump.

Needless to say, the CUDA WU failed again.
ID: 34270
JockMacMad TSBT
Joined: 28 Jan 09
Posts: 31
Credit: 85,059,711
RAC: 0
50 million credit badge · 10 year member badge
Message 34271 - Posted: 5 Dec 2009, 20:34:16 UTC

I am beginning to think there is something very wrong with 195.62.

I keep losing GPUs: 4 to 3 to 2 after a WU crash. Reboot, and now I see only 3. So I tried booting a Vista partition with 185.xx drivers: 4 GPUs, and I can blast FurMark with no issues or GPU failures.

So on my part I am thinking of winding back a few driver versions to see if my WUs pass and I stop losing GPUs after a crash. It's so bad it needs a power-off rather than a reboot to free them back up.

Also, my 260 is erroring on pretty much 3 out of 4 units, and that's up on 195.62 as well.

So whilst it's not just that driver version (others are seeing the WU errors on other versions), for me at least 195.62 is not a good place to be.
ID: 34271
Anthony Waters
Message 34272 - Posted: 5 Dec 2009, 22:09:42 UTC
Last modified: 5 Dec 2009, 23:17:32 UTC

There appears to be something wrong with the way the application is compiled for the GPU when using larger WU sizes. It works correctly when running in emulation mode (meaning the GPU code is compiled for the CPU, and executes on the CPU).

Edit: It appears to be a bigger problem than I thought; the results actually change between multiple runs of the application on the same WU.
ID: 34272
XJR-Maniac
Message 34290 - Posted: 6 Dec 2009, 3:59:49 UTC - in response to Message 34271.  

I am beginning to think there is something very wrong with 195.62.

I keep losing GPUs: 4 to 3 to 2 after a WU crash. Reboot, and now I see only 3. So I tried booting a Vista partition with 185.xx drivers: 4 GPUs, and I can blast FurMark with no issues or GPU failures.

So on my part I am thinking of winding back a few driver versions to see if my WUs pass and I stop losing GPUs after a crash. It's so bad it needs a power-off rather than a reboot to free them back up.

Also, my 260 is erroring on pretty much 3 out of 4 units, and that's up on 195.62 as well.

So whilst it's not just that driver version (others are seeing the WU errors on other versions), for me at least 195.62 is not a good place to be.


@Jock: Could you please make your machines visible so we can see what's going on? I have a sneaking suspicion but want to be sure and collect some more information before I shout it to the public. Or can you tell us on which OS you get the failures and what the failures look like?

The failures we are talking about are results that finish successfully but are invalid. There are others with error 0x1 (incorrect function), but those are not our problem. Do you run both MW CUDA and CPU? I had a crash today running both the CUDA and CPU applications simultaneously.
ID: 34290
Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
100 million credit badge · 10 year member badge
Message 34291 - Posted: 6 Dec 2009, 4:09:57 UTC - in response to Message 34272.  

There appears to be something wrong with the way the application is compiled for the GPU when using larger WU sizes. It works correctly when running in emulation mode (meaning the GPU code is compiled for the CPU, and executes on the CPU).

Edit: It appears to be a bigger problem than I thought; the results actually change between multiple runs of the application on the same WU.

Just to add a tad of information.

I had failures on the systems with GTX295 cards (895MB) and GTX260 cards (895MB); each of those systems has two of the cards ... though the data is no longer available, I did have some successes but mostly failures. The system with the single GTX280 card has 1024MB, and all the times I checked I had no failures.

It is entirely possible that it could be memory related, in that the limit of 895MB may not be high enough on the CUDA cards ... though that does not explain why my ATI card with 512MB runs the tasks just fine, thank you ...

Both because I cannot see the point of returning masses of bad results and because I am a credit whore (junior grade), I have set the systems that return bad results to NNT, so they are doing GPUGrid for the most part (Collatz is still down ... <sob>) ...
ID: 34291
Thamir Ghaslan
Message 34296 - Posted: 6 Dec 2009, 6:02:49 UTC

OK, if these tasks require roughly 300 MB each, then I suggest running 1 task at a time, depending on GPU RAM, rather than running multiple tasks.

I think most used to run 2 to 3 tasks at a time in the past?

My system is set to run one task at a time. I have 1 GB shared on the 4870x2, so that's roughly 600 MB for two tasks, one task per GPU.

ID: 34296
Thamir Ghaslan
Message 34297 - Posted: 6 Dec 2009, 6:11:46 UTC - in response to Message 34269.  


With that in mind, I experimented a bit yesterday. I dropped the boxes down to running 1 WU at a time (n1)and set w1.01 (w1.15 on the old P4). Hey presto, they have all been running happy and smooth.


Yes, it seems running multiple tasks will dry up the GPU RAM!

300 MB per task on the longer WUs.
ID: 34297
David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
500 million credit badge · 10 year member badge
Message 34306 - Posted: 6 Dec 2009, 12:27:10 UTC - in response to Message 34269.  

Nightlord you are a God!

I have been totally unable to run MW on my i7 running Cosmology since I "upgraded" to Windows 7. I have now set Cosmology to run on only seven cores, reducing CPU usage to ~88%, and lo and behold, MW is now crunching happily on the five CUDA GPUs.

Not sure if this is related or otherwise, but immediately following the increase in WU size, most of my ATI GPU's failed on most WU's. Work commitments meant I only had a short time to investigate before deciding to set no new tasks. However, what I observed was that WU's would run to completion, but then the next WU would fail.

More specifically, I had two or three WU's running - after the end of the first of those, the 3rd or 4th WU and all subsequent WU's would fail.

Having had time yesterday to investigate, I recalled I had seen something similar in the past on standard WU's on an old, slow Pentium 4 based machine. It had something to do with the CPU utilisation at the start of a new WU, combined with the response time of the starting-up WU. On that old machine, I had to set the app_info to run 2 WU's and set w1.05 in the command line tag. Without those tweaks I couldn't reliably run MW, let alone any CPU projects at the same time. My guess is there is a timeout that must be serviced within a specific time frame; if it is not serviced, the WU fails.

With that in mind, I experimented a bit yesterday. I dropped the boxes down to running 1 WU at a time (n1) and set w1.01 (w1.15 on the old P4). Hey presto, they have all been running happy and smooth.

So a long story, and I know that specific solution is not available to Nvidia users through the app_info, but if you have a busy machine on CPU projects, try reducing the load a little. It might also give some pointers to the project staff to look into the CPU utilization and timings at WU start up.


ID: 34306
XJR-Maniac
Message 34307 - Posted: 6 Dec 2009, 14:28:22 UTC

OK, the last posts seem to describe another problem which is completely different from the one that I and some others are suffering from.

I tried a bunch of WUs with CPU usage AND CPU time both reduced to 50%, but my WUs are still failing. I had only MW CUDA running for my tests, but it made no difference.

But I noticed another thing that could be of use in narrowing things down.

If I take a closer look at the users who suffer from invalid results, I see that, apart from salyavin (who is not answering my PM asking him to run a few more WUs), all machines are running WinXP or Win2003, which is essentially WinXP for servers. Even WinXP x64 seems to have the same problem with MW. And it's not about service packs, because there are WinXP machines with both SP2 and SP3 that have invalid results. All the others seem to have different problems, like 0x1 errors (incorrect function). If not, my theory is invalid.
ID: 34307
Anthony Waters
Message 34308 - Posted: 6 Dec 2009, 15:17:09 UTC

The bug has been identified and fixed. It is currently going through testing, and a new version should be up sometime within the next 48 hours (pending the results of the test; it will most likely be within an hour, but I like to give myself a larger window for unforeseeable problems). The new version also contains performance enhancements that give a 5-10% decrease in running times.

Thank you for your patience.
ID: 34308
jotun263
Joined: 24 Aug 09
Posts: 5
Credit: 519,653
RAC: 0
500 thousand credit badge · 10 year member badge
Message 34309 - Posted: 6 Dec 2009, 15:19:40 UTC - in response to Message 34307.  
Last modified: 6 Dec 2009, 15:20:20 UTC

If I take a closer look at the users who suffer from invalid results, I see that, apart from salyavin (who is not answering my PM asking him to run a few more WUs), all machines are running WinXP or Win2003, which is essentially WinXP for servers. Even WinXP x64 seems to have the same problem with MW. And it's not about service packs, because there are WinXP machines with both SP2 and SP3 that have invalid results. All the others seem to have different problems, like 0x1 errors (incorrect function). If not, my theory is invalid.


Your theory seems to be correct. I'm running my machine with Win XP64 SP2 (there's no SP3 for the 64-bit version). Most of the WUs were invalid. I reduced the CPU load, increased the system memory from 6 to 10 GB, tried all possible combinations of BOINC and graphics driver versions, and other things. In the meantime it looks like the MW CUDA app isn't very happy with XP...
ID: 34309
XJR-Maniac
Message 34310 - Posted: 6 Dec 2009, 16:41:22 UTC - in response to Message 34308.  
Last modified: 6 Dec 2009, 16:42:39 UTC

The bug has been identified and fixed. It is currently going through testing, and a new version should be up sometime within the next 48 hours (pending the results of the test; it will most likely be within an hour, but I like to give myself a larger window for unforeseeable problems). The new version also contains performance enhancements that give a 5-10% decrease in running times.

Thank you for your patience.


Hello Anthony,

First I want to thank you for your much-appreciated help! Well done, dude!

Can you tell us something more about the bug? It seems that the current CUDA app doesn't like WinXP, no matter whether it's x86 or x64, and no matter how much RAM is installed or available.

I was in PM contact with redgoldendragon, and he was kind enough to install Vista x64 on one of his failing WinXP x86 machines; after that, the results of the CUDA app were finishing successfully!

OK, on Vista x64 all the installed 8GB of RAM can be used but there is another user, jotun263 (see below), with WinXP x64 who installed up to 10GB of system RAM with no success.

Thank you!
ID: 34310
Anthony Waters
Message 34311 - Posted: 6 Dec 2009, 16:48:52 UTC

The memory in the GPU was not being initialized properly. I have to install Visual Studio on Windows, so it may take a little longer than expected.
ID: 34311
XJR-Maniac
Message 34312 - Posted: 6 Dec 2009, 17:00:18 UTC - in response to Message 34311.  

The memory in the GPU was not being initialized properly. I have to install Visual Studio on Windows, so it may take a little longer than expected.


No problem, GPUGRID will be happy to hear that it will take a little bit longer ;-)))

Can you tell us later why this GPU memory initialization problem only causes trouble on WinXP? Or does the improper initialization only occur on XP?

No need to hurry. I don't want to disturb you; I'm just curious, and I think all the others are, too.
ID: 34312

©2019 Astroinformatics Group