1)
Message boards :
Number crunching :
Error while computing again and again
(Message 45886)
Posted 30 Jan 2011 by XJR-Maniac Post: Yes, you're right. They have all been de_separation_23_3s WUs; the last one crashed just now. So I will wait for the fix. In the meantime, there are other projects waiting for my GPUs ;-))) WU: de_separation_23_3s |
2)
Message boards :
Number crunching :
Error while computing again and again
(Message 45881)
Posted 30 Jan 2011 by XJR-Maniac Post: Same issue for me on MW v0.50 with an NVIDIA GTX 260. All tasks run up to 100% and crash at the finish line. I just updated the driver to 266.58 on one machine, to no avail. I checked my wingmen and it seems that MW v0.23 on ATI is crashing, too. Here's the log:

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)   [German for "Incorrect function"]
</message>
<stderr_txt>
<search_application> milkywayathome separation 0.50 Windows x86 double OpenCL </search_application>
Found 1 platforms
Platform 0 information:
  Platform name: NVIDIA CUDA
  Platform version: OpenCL 1.0 CUDA 3.2.1
  Platform vendor:
  Platform profile:
  Platform extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
Using device 0 on platform 0
Found 2 CL devices
Device GeForce GTX 260 (NVIDIA Corporation:0x10de)
  Type: CL_DEVICE_TYPE_GPU
  Driver version: 266.58
  Version: OpenCL 1.0 CUDA
  Compute capability: 1.3
  Little endian: CL_TRUE
  Error correction: CL_FALSE
  Image support: CL_TRUE
  Address bits: 32
  Max compute units: 27
  Clock frequency: 1104 Mhz
  Global mem size: 939327488
  Max mem alloc: 234831872
  Global mem cache: 0
  Cacheline size: 0
  Local mem type: CL_LOCAL
  Local mem size: 16384
  Max const args: 9
  Max const buf size: 65536
  Max parameter size: 4352
  Max work group size: 512
  Max work item dim: 3
  Max work item sizes: { 512, 512, 64 }
  Mem base addr align: 2048
  Min type align size: 128
  Timer resolution: 1000 ns
  Double extension: MW_CL_KHR_FP64
  Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64
Found a compute capability 1.3 device.
Using -cl-nv-maxrregcount=32
Compiler flags: -cl-mad-enable -cl-no-signed-zeros -cl-strict-aliasing -cl-finite-math-only -DUSE_CL_MATH_TYPES=0 -DUSE_MAD=1 -DUSE_FMA=0 -cl-nv-verbose -cl-nv-maxrregcount=32 -DDOUBLEPREC=1 -DMILKYWAY_MATH_COMPILATION -DNSTREAM=3 -DFAST_H_PROB=1 -DAUX_BG_PROFILE=0 -DUSE_IMAGES=1 -DI_DONT_KNOW_WHY_THIS_DOESNT_WORK_HERE=0
Build status: CL_BUILD_SUCCESS
Build log:
: Considering profile 'compute_13' for gpu='sm_13' in 'cuModuleLoadDataEx_4'
: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_13', usage mode=' --verbose --maxrregcount 32 '
: Considering profile 'compute_13' for gpu='sm_13' in 'cuModuleLoadDataEx_4'
: Control flags for 'cuModuleLoadDataEx_4' disable search path
: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_13'
: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_13', ocg options=' --verbose --maxrregcount 32 '
ptxas info : Compiling entry function 'mu_sum_kernel' for 'sm_13'
ptxas info : Used 32 registers, 800+0 bytes lmem, 48+16 bytes smem, 56 bytes cmem[1], 4 bytes cmem[2], 4 bytes cmem[3], 4 bytes cmem[4], 4 bytes cmem[5], 4 bytes cmem[6]
Kernel work group info:
  Work group size = 512
  Kernel local mem size = 64
  Compile work group size = { 0, 0, 0 }
Group size = 64, per CU = 8, threads per CU = 512
Block size = 13824
Desired = 163
Min sol: 163 13312
Lower n solution: n = 163, x = 13312
Higher n solution: n = 163, x = 13312
Using solution: n = 163, x = 13312
Range: { nu_steps = 640, mu_steps = 1600, r_steps = 1400 }
Iteration area: 2240000
Chunk estimate: 163
Num chunks: 163
Added area: 13312
Effective area: 2253312
Integration time: 957.124801 s. Average time per iteration = 1495.507502 ms
Kernel work group info:
  Work group size = 512
  Kernel local mem size = 64
  Compile work group size = { 0, 0, 0 }
Group size = 64, per CU = 8, threads per CU = 512
Block size = 13824
Desired = 21
Min sol: 1 0   (this line repeated 20 times)
Didn't find a solution. Using fallback solution n = 20, x = 0
Using solution: n = 20, x = 0
Range: { nu_steps = 160, mu_steps = 400, r_steps = 700 }
Iteration area: 280000
Chunk estimate: 21
Num chunks: 20
Added area: 0
Effective area: 280000
Global dimensions not divisible by local
Failed to find good run sizes
Failed to calculate integral 1
12:30:48 (2372): called boinc_finish
</stderr_txt>
]]>

It can't be a BOINC client version issue, because the ATI hosts are crashing on 6.10.56 while I'm using 6.10.17. I will set my boxes to NNW (No New Work) until further notice. |
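The decisive lines in that log are "Global dimensions not divisible by local" and "Failed to find good run sizes": under OpenCL 1.0/1.1 every global work size passed to clEnqueueNDRangeKernel must be an exact multiple of the corresponding local work-group size, and the v0.50 application evidently gave up when its fallback chunking for the second integral violated that rule. The usual host-side workaround is to pad the global size up and mask the extra work items inside the kernel. A minimal sketch of that idea follows; it is illustrative only (round_up, enqueue_integral and the parameter names are hypothetical, not the actual MilkyWay@home source):

    #include <CL/cl.h>

    /* Round 'global' up to the next multiple of 'local' so that the
       OpenCL 1.x requirement (global % local == 0) always holds. */
    static size_t round_up(size_t global, size_t local)
    {
        size_t rem = global % local;
        return (rem == 0) ? global : global + (local - rem);
    }

    /* Enqueue a 2-D integration kernel over mu_steps x r_steps items.
       The padding work items must be masked out inside the kernel. */
    static cl_int enqueue_integral(cl_command_queue queue, cl_kernel kernel,
                                   size_t mu_steps, size_t r_steps,
                                   size_t local_x, size_t local_y)
    {
        size_t global[2] = { round_up(mu_steps, local_x),
                             round_up(r_steps,  local_y) };
        size_t local[2]  = { local_x, local_y };

        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global, local, 0, NULL, NULL);
    }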
3)
Message boards :
Number crunching :
Invalid results with XP (64-bit) (or maybe processor generation?)
(Message 34358)
Posted 7 Dec 2009 by XJR-Maniac Post: The bug should be fixed now. To be sure, please reset MW via the BOINC Manager on your WinXP x86 and x64 machines, or detach and reattach, to get the new version 0.24 application. That should do the trick. For most of us it did; see this thread over here: Sudden mass of WU's finishing with Computation Error |
4)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34325)
Posted 6 Dec 2009 by XJR-Maniac Post: Finished my first valid CUDA WUs on WinXP x86! Well done, folks, and thank you for your participation in this interesting little bug hunt. Keep on crunchin'! |
5)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34312)
Posted 6 Dec 2009 by XJR-Maniac Post: "The memory in the GPU was not being initialized properly. I have to install Visual Studio on Windows, so it may take a little longer than expected." No problem, GPUGRID will be happy to hear that it will last a little bit longer ;-))) Can you tell us later why this GPU memory initialization problem only causes trouble on WinXP? Or does the improper initialization only occur on XP? No need to hurry; I don't want to disturb you, I'm just curious, and I think all the others are, too. |
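For readers wondering how an uninitialized GPU buffer turns into a NaN result: memory returned by the CUDA allocator is not zeroed, so if partial results are accumulated into a buffer that was never cleared, a stale bit pattern that happens to decode as NaN poisons every subsequent sum, which would explain the "1.#QNAN" fitness values reported further down in this list. The snippet below is a hedged illustration of that failure mode and its fix, not the project's actual code; the buffer name and size are made up.

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t n = 1 << 20;   /* number of partial sums (made-up size) */
        double *d_acc = NULL;

        /* cudaMalloc hands back uninitialized device memory; reading it
           before writing yields whatever bits were left there. */
        cudaMalloc((void **)&d_acc, n * sizeof(double));

        /* The fix: clear the accumulator before any kernel adds into it. */
        cudaMemset(d_acc, 0, n * sizeof(double));

        /* ... launch integration kernels that accumulate into d_acc ... */

        cudaFree(d_acc);
        return 0;
    }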
6)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34310)
Posted 6 Dec 2009 by XJR-Maniac Post: "The bug has been identified and fixed. It is currently going through testing and a new version should be up sometime within the next 48 hours (pending the results of the test; it will most likely be within an hour, but I like to give myself a larger window for unforeseeable problems). The new version also contains new performance enhancements that give a 5-10% decrease in running times." Hello Anthony, first I want to thank you for your much appreciated help! Well done, dude! Can you tell us something more about the bug? It seems that the current CUDA app doesn't like WinXP, no matter whether it's x86 or x64 and no matter how much RAM is installed or available. I was in PM contact with redgoldendragon, and he was kind enough to install Vista x64 on one of his failing WinXP x86 machines; after that, the results of the CUDA app finished successfully! OK, on Vista x64 all of the installed 8GB of RAM can be used, but there is another user, jotun263 (see below), with WinXP x64 who installed up to 10GB of system RAM with no success. Thank you! |
7)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34307)
Posted 6 Dec 2009 by XJR-Maniac Post: OK, the last posts seem to describe another problem, completely different from the one that I and some others are suffering from. I tried a bunch of WUs with CPU usage AND CPU time both reduced to 50%, but my WUs are still failing. I had only MW CUDA running during my tests, but it makes no difference. I did notice another thing that could help narrow things down: if I take a closer look at the users who suffer from invalid results, I see that, apart from salyavin, who is not answering my PM asking him to run a few more WUs, all machines are running WinXP or Win2003, which is essentially WinXP for servers. Even WinXP x64 seems to have the same problem with MW. And it's not about service packs, because there are WinXP machines with both SP2 and SP3 that have invalid results. All the others seem to have different problems, such as 0x1 errors (incorrect function). If not, my theory is invalid. |
8)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34290)
Posted 6 Dec 2009 by XJR-Maniac Post: "I am beginning to think there is something very wrong with 195.62." @Jock: Could you please make your machines visible so we can see what's going on? I have a sneaking suspicion, but I want to be sure and collect some more information before I go public with it. Or can you tell us on which OS you get the failures and what the failures look like? The failures we are talking about are results that finish successfully but come back invalid. There are others with error 0x1 (incorrect function), but those are a different problem. Do you run both MW CUDA and CPU? I had a crash today running both the CUDA and CPU applications simultaneously. |
9)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34270)
Posted 5 Dec 2009 by XJR-Maniac Post: Is this still an issue? My GPUs both have 896MB of RAM, but there are others with more that fail: salyavin with a GTX285/1024 (OK, you don't see anything there because of fast purging). Others with the same GPU and amount of RAM finish successfully: David Glogau, ganja. So I don't think it's about RAM at all, neither system nor GPU. In the meantime, I tried to download a fresh CUDA WU, and while both MW CUDA and MW CPU were running simultaneously, the system started jerking around, the network connection got interrupted from time to time, music started stuttering, and then the system crashed so hard that it didn't even have time to create a memory dump. Needless to say, the CUDA WU failed again. |
10)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34265)
Posted 5 Dec 2009 by XJR-Maniac Post: Bad news: it's not about insufficient system memory. I ran four of the long 4x MW WUs with 8GB of system RAM, only one GPU installed, and all unnecessary services and processes disabled on my Win2003 Server Enterprise Edition box with PAE enabled, and they still finished with "fitness: 1.#QNAN000000000000000". I ran them with the network disabled, so don't be surprised that they have not been reported yet. I have a screen shot of the memory usage here. My next idea was to try an optimized app, but there is no optimized app for CUDA. Why? Is the stock CUDA app so well optimized that it isn't necessary? |
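A side note on that odd-looking value: "1.#QNAN000000000000000" is simply how the Microsoft C runtime of that era formatted a quiet NaN through printf, with the trailing zeros being padding for the requested precision; the digits carry no information beyond "not a number". A tiny, purely illustrative C example:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double fitness = sqrt(-1.0);   /* produces a quiet NaN */

        /* On the pre-2015 Microsoft C runtime this typically prints
           "fitness: 1.#QNAN0"; a higher precision pads with more zeros. */
        printf("fitness: %f\n", fitness);
        return 0;
    }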
11)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34247)
Posted 5 Dec 2009 by XJR-Maniac Post: "I don't know the technical details of how GPU applications use system memory or use it to remap the video memory, I just know that they do. Errors were reported a while ago with the Collatz ATI application from people who were running multiple 1GB-graphics-memory ATI GPUs and had only 1GB of system memory. More recently, someone with only 2GB of memory also got those buffer errors when trying to run two HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application with the r parameter." OK, I understand what you mean. What I always thought was that video memory is mapped BEHIND the physical address range. That's why WinXP 32-bit cannot handle more than 3GB: it cannot use PAE by design. On Windows 2003 Enterprise Server I can enable PAE and turn on memory remapping in the BIOS, so that video memory is remapped above the installed memory, no matter how much is installed. So tomorrow I will give it a shot: move the 4GB to the Win2003 machine with PAE on, so that the full 8GB are available, and try to run MW standalone. |
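For completeness, a sketch of what enabling PAE on Windows Server 2003 looks like. The ARC path below is illustrative and differs per installation; the relevant part is only the /PAE switch appended to the OS line in boot.ini:

    [operating systems]
    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /PAE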
12)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34246)
Posted 5 Dec 2009 by XJR-Maniac Post: "Dude, there's a config file you can create to track operations of WUs, but I've never used it. This is the link:" I'm aware of that, but I don't want to debug the core client, I want to debug the MW application. On that page I couldn't find any flag that could do that. I found the <coproc_debug> flag, but it's not really for debugging MW. Its output was this: [Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_s222_3s_best_1p_01r_41_701094_1259781480_0. There should be a debug flag for the MW app itself. |
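For anyone who wants to reproduce that output: the flag belongs in cc_config.xml in the BOINC data directory. A minimal example for the 6.10.x clients discussed here is shown below; note that it only makes the BOINC client log its GPU scheduling decisions and adds no logging inside the MW application itself.

    <cc_config>
      <log_flags>
        <coproc_debug>1</coproc_debug>
      </log_flags>
    </cc_config>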
13)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34218)
Posted 4 Dec 2009 by XJR-Maniac Post: Hello ganja, can you please post the BIOS version of your GTX260 cards? Thank you. Let me ask the public again: does anyone know whether the MW app can be run in some kind of debug mode? |
14)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34211)
Posted 4 Dec 2009 by XJR-Maniac Post: "I've been following this thread and I'm just tossing up ideas..." Here are my PSUs: an Enermax Pro 82+ 425W, in use for 9 months, and an Enermax Pro 82+ 625W, in use for 1 month. OK, the 425W unit is a little bit at the limit, but the machine with the 625W PSU fails, too. The machine with the 625W PSU has another card, a 9400GT, but as I mentioned before it is only used for the display. According to NVIDIA the 9400GT needs another 50W, so the 625W PSU should be sufficient. The BIOS of both motherboards is flashed to the latest available version. The boards are different models, an Asus P5QL-E and an Asus P5Q Pro Turbo. And hey, it's only Milkyway that has a problem with my hardware! The GPUGRID WUs run nearly 8 hours without interruption and finish valid! So do Seti, Seti Beta, Collatz and Einstein. So who do you think should be doing all this troubleshooting? Me? I'm getting more and more sick of this! OK, I should stop typing, because I'm getting angry right now, and then I usually lose my composure. |
15)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34210)
Posted 4 Dec 2009 by XJR-Maniac Post: "It is possible that this is a memory buffer issue. If so, those with a smaller amount of system memory, or less bandwidth on an older architecture, and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory, a newer system, and/or only a single nVidia GPU core." There's only one GTX260 running per machine. OK, the WinXP box also has a 9400GT, but it's for display use only because it produced errors on both Seti and Collatz. Both machines have 4GB of RAM installed. And before someone starts to complain: yes, I'm aware of the fact that WinXP 32-bit only has access to 3GB! I see that I forgot to mention RAM in my system overview below. But how could a CUDA application be affected by the amount of system memory? OK, a small part runs on the CPU, but is that relevant here? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short ones finish with valid results. |
16)
Message boards :
Number crunching :
The Longer Work Units...
(Message 34208)
Posted 4 Dec 2009 by XJR-Maniac Post: Can you please post the brand and BIOS version of your cards? Thank you. |
17)
Message boards :
Number crunching :
The Longer Work Units...
(Message 34203)
Posted 4 Dec 2009 by XJR-Maniac Post: Nope, I even UNDERCLOCKED my GPU, but it failed all the same. If you'd like to KNOW what's going on, instead of speculating, maybe you would like to read this thread over here: Sudden mass of WU's finishing with Computation Error |
18)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34202)
Posted 4 Dec 2009 by XJR-Maniac Post: After successfully finishing four WUs on GPUGRID, two on each GPU, I'm pretty sure that my GPUs are OK. There is a thread in the GPUGRID forum discussing a problem with failing GTX260 GPUs. It seems that some cards don't like fast Fourier transforms (FFTs), mainly the older ones with 192 shaders and the reference design (single fan). Both of mine are the newer ones with 216 shaders on the 55nm process, and they have two fans. One is a Palit, the other a Gainward, but they look much the same; only the colours are different. The BIOS version of both cards is 62.00.49.00.03. Can someone with a non-failing, two-fan 55nm GTX260/216 please post his/her BIOS version? You can use GPU-Z to read it. Maybe I can find a version that runs on my cards, too. Is there a debug mode for the MW CUDA application with enhanced logging, so one can see what causes the NaN result? It seems that the app isn't crashing: it runs to the end but produces an invalid result. There should be some kind of debug mode, or how do the developers test the apps??? OK, sometimes it seems that there are no tests at all ;-))) Edit: I've read through this thread again, and it seems that, unlike at GPUGRID, here at MW it's not only GTX260 GPUs that fail. There's a GTX280 and Paul's GTX295, too. I can understand why the GTX280 could fail as well, because if I've got it right, a GTX260 is a GTX280 that didn't pass QA due to defective shaders: they disable the defective shader ALUs down to 216, reduce the clocks and memory, and sell it as a GTX260. But where's the link to the 295s? Aren't they dual 275s? So the 275/295 chips are not the same as the 260/280 ones? Who knows. Maybe the day will come when you read that a pear is an apple that didn't make it through QA due to its irregular shape. |
19)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34153)
Posted 3 Dec 2009 by XJR-Maniac Post: "These are the clock speeds for a GTX260 from the nVidia site specs." Thanks Bruce, that's exactly what my GPUs are running at (see below). OK, as long as there is not much help from the people responsible for the project here, and as long as I'm out of new ideas, I'm going to disable CUDA on MW and enable it on Einstein. That's less of a waste than not using my GPUs at all. OK, there are Seti and Collatz, but they don't have work that often ;-))) It's a shame, because I think MW is a very interesting project. To check whether there's something wrong with my GPUs, and to keep my fans spinning, I joined GPUGRID. I'll keep you informed if I find something interesting there. |
20)
Message boards :
Number crunching :
Sudden mass of WU's finishing with Computation Error
(Message 34149)
Posted 3 Dec 2009 by XJR-Maniac Post: Next I tried overclocking (maybe there wouldn't be errors if the WU finished faster), but it was still invalid. What a surprise. Can someone please post the clocks of their GTX260? @David: Can you please take a look at your result files, just to see what the fitness parameter says? Especially on your GTX260. It should be a real number, like the results from Starfire. Thank you! |