Welcome to MilkyWay@home

Posts by XJR-Maniac

1) Message boards : Number crunching : Error while computing again and again (Message 45886)
Posted 30 Jan 2011 by Profile XJR-Maniac
Post:
Yes, you're right. They have all been de_separation_23_3s WUs; the last one crashed just now. So I will wait for the fix to come. In the meantime, there are other projects waiting for my GPUs ;-)))


WU: de_separation_23_3s
GPU: NVIDIA GTX 260

Looks like the problem with GTX2xx cards and the de_separation_23_3s WUs is one Matt is aware of and going to fix in the next version.
Other WUs should run on your GPU.
The only errored WU I could still find in your list is Workunit 228247254, and there an ATI card (HD5850?) finished the WU and is waiting for validation (not crashed).

2) Message boards : Number crunching : Error while computing again and again (Message 45881)
Posted 30 Jan 2011 by Profile XJR-Maniac
Post:
Same issue for me on MW v0.50 with an NVIDIA GTX 260. All tasks run up to 100% and crash at the finish line. I just updated the driver to 266.58 on one machine, to no avail. I checked my wingmen and it seems that MW v0.23 on ATI is crashing, too.

Here's the log:

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
<search_application> milkywayathome separation 0.50 Windows x86 double OpenCL </search_application>
Found 1 platforms
Platform 0 information:
  Platform name:       NVIDIA CUDA
  Platform version:    OpenCL 1.0 CUDA 3.2.1
  Platform vendor:     
  Platform profile:    
  Platform extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll 
Using device 0 on platform 0
Found 2 CL devices
Device GeForce GTX 260 (NVIDIA Corporation:0x10de)
Type:                CL_DEVICE_TYPE_GPU
Driver version:      266.58
Version:             OpenCL 1.0 CUDA
Compute capability:  1.3
Little endian:       CL_TRUE
Error correction:    CL_FALSE
Image support:       CL_TRUE
Address bits:        32
Max compute units:   27
Clock frequency:     1104 Mhz
Global mem size:     939327488
Max mem alloc:       234831872
Global mem cache:    0
Cacheline size:      0
Local mem type:      CL_LOCAL
Local mem size:      16384
Max const args:      9
Max const buf size:  65536
Max parameter size:  4352
Max work group size: 512
Max work item dim:   3
Max work item sizes: { 512, 512, 64 }
Mem base addr align: 2048
Min type align size: 128
Timer resolution:    1000 ns
Double extension:    MW_CL_KHR_FP64
Extensions:          cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 
Found a compute capability 1.3 device. Using -cl-nv-maxrregcount=32 

Compiler flags:
-cl-mad-enable -cl-no-signed-zeros -cl-strict-aliasing -cl-finite-math-only -DUSE_CL_MATH_TYPES=0 -DUSE_MAD=1 -DUSE_FMA=0 -cl-nv-verbose  -cl-nv-maxrregcount=32  -DDOUBLEPREC=1 -DMILKYWAY_MATH_COMPILATION -DNSTREAM=3 -DFAST_H_PROB=1 -DAUX_BG_PROFILE=0 -DUSE_IMAGES=1 -DI_DONT_KNOW_WHY_THIS_DOESNT_WORK_HERE=0  

Build status: CL_BUILD_SUCCESS
Build log: 

: Considering profile 'compute_13' for gpu='sm_13' in 'cuModuleLoadDataEx_4'
: Retrieving binary for 'cuModuleLoadDataEx_4', for gpu='sm_13', usage mode='  --verbose --maxrregcount 32  '
: Considering profile 'compute_13' for gpu='sm_13' in 'cuModuleLoadDataEx_4'
: Control flags for 'cuModuleLoadDataEx_4' disable search path
: Ptx binary found for 'cuModuleLoadDataEx_4', architecture='compute_13'
: Ptx compilation for 'cuModuleLoadDataEx_4', for gpu='sm_13', ocg options='  --verbose --maxrregcount 32  '
ptxas info    : Compiling entry function 'mu_sum_kernel' for 'sm_13'
ptxas info    : Used 32 registers, 800+0 bytes lmem, 48+16 bytes smem, 56 bytes cmem[1], 4 bytes cmem[2], 4 bytes cmem[3], 4 bytes cmem[4], 4 bytes cmem[5], 4 bytes cmem[6]
Kernel work group info:
  Work group size = 512
  Kernel local mem size = 64
  Compile work group size = { 0, 0, 0 }
Group size = 64, per CU = 8, threads per CU = 512
Block size = 13824
Desired = 163
Min sol: 163 13312
Lower n solution: n = 163, x = 13312
Higher n solution: n = 163, x = 13312
Using solution: n = 163, x = 13312
Range:          { nu_steps = 640, mu_steps = 1600, r_steps = 1400 }
Iteration area: 2240000
Chunk estimate: 163
Num chunks:     163
Added area:     13312
Effective area: 2253312
Integration time: 957.124801 s. Average time per iteration = 1495.507502 ms
Kernel work group info:
  Work group size = 512
  Kernel local mem size = 64
  Compile work group size = { 0, 0, 0 }
Group size = 64, per CU = 8, threads per CU = 512
Block size = 13824
Desired = 21
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Min sol: 1 0
Didn't find a solution. Using fallback solution n = 20, x = 0
Using solution: n = 20, x = 0
Range:          { nu_steps = 160, mu_steps = 400, r_steps = 700 }
Iteration area: 280000
Chunk estimate: 21
Num chunks:     20
Added area:     0
Effective area: 280000
Global dimensions not divisible by local
Failed to find good run sizes
Failed to calculate integral 1
12:30:48 (2372): called boinc_finish

</stderr_txt>
]]>


It can't be a BOINC client version issue, because the ATIs are crashing on 6.10.56 while I'm using 6.10.17.

I will set my boxes to NNW (no new work) until further notice.
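
For anyone else seeing the "Global dimensions not divisible by local" / "Failed to find good run sizes" lines in their stderr: as far as I understand OpenCL 1.x, the global work size passed to clEnqueueNDRangeKernel has to be an exact multiple of the local work-group size, which is presumably why the app pads the integration area ("Added area") before launching. A minimal sketch of that padding idea in host-side C, using a hypothetical helper of my own and not the actual separation source:

#include <CL/cl.h>
#include <stdio.h>

/* Round a global work size up to the next multiple of the local size.
 * OpenCL 1.x rejects an NDRange whose global size is not divisible by
 * the local size, which is what "Global dimensions not divisible by
 * local" is complaining about. */
static size_t round_up(size_t global, size_t local)
{
    size_t rem = global % local;
    return (rem == 0) ? global : global + (local - rem);
}

/* Hypothetical helper: launch a 1-D kernel over work_items items,
 * padding the launch so the divisibility rule is satisfied. The kernel
 * itself must then check get_global_id(0) < work_items and return
 * early in the padded threads. */
static cl_int launch_padded(cl_command_queue queue, cl_kernel kernel,
                            size_t work_items, size_t local_size)
{
    size_t global_size = round_up(work_items, local_size);
    printf("Iteration area: %lu, effective area: %lu (added %lu)\n",
           (unsigned long) work_items, (unsigned long) global_size,
           (unsigned long) (global_size - work_items));
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}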
3) Message boards : Number crunching : Invalid results with XP (64-bit) (or maybe processor generation?) (Message 34358)
Posted 7 Dec 2009 by Profile XJR-Maniac
Post:
The bug should be fixed by now. To be sure, please reset MW via the BOINC manager on your WinXP x86 and x64 machines, or detach and reattach, to get the new v0.24 application. That should do the trick. For most of us it did; see this thread over here:

Sudden mass of WU's finishing with Computation Error
4) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34325)
Posted 6 Dec 2009 by Profile XJR-Maniac
Post:
Finished my first valid CUDA WUs on WinXP x86!

Well done, folks and thank you for your participation in this interesting little bug hunt.

Keep on crunchin!

5) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34312)
Posted 6 Dec 2009 by Profile XJR-Maniac
Post:
The memory on the GPU was not being initialized properly. I have to install Visual Studio on Windows, so it may take a little longer than expected.


No problem, GPUGRID will be happy to hear that it will last a little bit longer ;-)))

Can you tell us later why this GPU memory initialization problem only causes trouble on WinXP? Or does the improper initialization only occur on XP?

No need to hurry. I don't want to disturb you, I'm just curious and I think all others are, too.
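
Just so I understand the failure mode (pure speculation on my part, not the actual MilkyWay code): an output buffer that is never cleared before the kernel starts accumulating into it would contain whatever was left in video memory, and a stray garbage bit pattern there can surface as exactly the NaN fitness values we've been seeing. A rough host-side OpenCL sketch of the "forgot to zero it" idea, with a helper name I made up:

#include <CL/cl.h>
#include <stdlib.h>

/* Hypothetical helper: create the result buffer for an accumulation
 * kernel. If zero_it is 0, the buffer holds whatever was previously in
 * that piece of video memory; a kernel doing "out[i] += ..." then adds
 * on top of garbage, and a stray NaN/Inf bit pattern poisons the final
 * sum (e.g. "fitness: 1.#QNAN..."). */
static cl_mem create_result_buffer(cl_context ctx, cl_command_queue queue,
                                   size_t n_doubles, int zero_it,
                                   cl_int *err)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n_doubles * sizeof(cl_double), NULL, err);
    if (*err != CL_SUCCESS || !zero_it)
        return buf;

    /* OpenCL 1.0 has no fill call, so write an explicit block of zeros
     * from the host before the first kernel launch. */
    cl_double *zeros = (cl_double *) calloc(n_doubles, sizeof(cl_double));
    *err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                                n_doubles * sizeof(cl_double),
                                zeros, 0, NULL, NULL);
    free(zeros);
    return buf;
}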
6) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34310)
Posted 6 Dec 2009 by Profile XJR-Maniac
Post:
The bug has been identified and fixed. It is currently going through testing, and a new version should be up sometime within the next 48 hours (pending the results of the test; it will most likely be within an hour, but I like to give myself a larger window for unforeseeable problems). The new version also contains new performance enhancements that give a 5-10% decrease in running times.

Thank you for your patience.


Hello Anthony,

First I want to thank you for your much-appreciated help! Well done, dude!

Can you tell us something more about the bug? It seems that the current CUDA app doesn't like WinXP, no matter whether it's x86 or x64, and no matter how much RAM is installed or available.

I was in PM contact with redgoldendragon, and he was kind enough to install Vista x64 on one of his failing WinXP x86 machines; after that, the CUDA app's results finished successfully!

OK, on Vista x64 all of the installed 8GB of RAM can be used, but there is another user, jotun263 (see below), on WinXP x64 who installed up to 10GB of system RAM with no success.

Thank you!
7) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34307)
Posted 6 Dec 2009 by Profile XJR-Maniac
Post:
OK, the last posts seem to describe another problem which is completely different from the one that I and some others are suffering from.

I tried a bunch of WUs with CPU usage AND CPU time both reduced to 50% but my WUs are still failing. I had only MW CUDA running for my tests but it makes no difference.

But I noticed another thing that could be of use to narrow things down.

If I take a closer look at the users who suffer from invalid results, I see that, apart from salyavin, who is not answering my PM asking him to run a few more WUs, all the machines are running WinXP or Win2003, which is nothing other than WinXP for servers. Even WinXP x64 seems to have the same problem with MW. And it's not about service packs, because there are WinXP machines with both SP2 and SP3 that have invalid results. All the others seem to have different problems, like 0x1 errors (incorrect function). If that's not the case, my theory is invalid.
8) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34290)
Posted 6 Dec 2009 by Profile XJR-Maniac
Post:
I am beginning to think there is something very wrong with 195.62.

I keep losing GPUs: 4 to 3 to 2 after a WU crash. Reboot and now I see only 3. So I tried booting a Vista partition with 185.xx drivers: 4 GPUs, and I can blast FurMark with no issues or GPU failures.

So for my part I am thinking of winding back a few driver versions to see if my WUs pass and I stop losing GPUs after a crash. It's so bad it needs a power-off rather than a reboot to free them back up.

Also, my 260 is pretty much erroring 3 out of 4 units, and that's on 195.62 as well.

So whilst it's not just that driver version, as others are seeing the WU errors on other versions, for me at least 195.62 is not a good place to be.


@Jock: Could you please make your machines visible so we can see what's going on? I have a sneaking suspicion but want to be sure and collect some more information before I shout it to the public. Or can you tell us on which OS you get the failures and what the failures look like?

The failures we are talking about are results that finish successfully but are invalid. There are others with error 0x1 (incorrect function), but those are not our problem. Do you run both MW CUDA and CPU? I had a crash today running the CUDA and CPU applications simultaneously.
9) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34270)
Posted 5 Dec 2009 by Profile XJR-Maniac
Post:
Is this still an issue?

If so please let me know how much Video RAM the graphics card has. The larger work units appear to be using 312MB of Video RAM.


My GPUs both have 896MB of RAM but there are others with more that fail:

salyavin with GTX285/1024

OK, you don't see anything because of fast purging.

Others with the same GPU and amount of RAM finish successfully:

David Glogau
ganja

So I don't think it's a matter of RAM at all, neither system nor GPU.

In the meantime, I downloaded a fresh CUDA WU, and while both MW CUDA and MW CPU were running simultaneously, the system started jerking around, the network connection got interrupted from time to time, music started stuttering, and then the system crashed so hard that it didn't even have time to create a memory dump.

Needless to say, the CUDA WU failed again.
10) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34265)
Posted 5 Dec 2009 by Profile XJR-Maniac
Post:
Bad news: it's not about insufficient system memory. I ran four of the long 4x MW WUs with 8GB of system RAM, only one GPU installed, and all unnecessary services and processes disabled on my Win2003 Server Enterprise Edition with PAE enabled, and they still finished with "fitness: 1.#QNAN000000000000000". I ran them with the network disabled, so don't be surprised that they have not been reported yet.

I have a screen shot of the memory usage here.

My next idea was to try an optimized app, but there is no optimized app for CUDA. Why? Is the stock CUDA app so well optimized that it isn't necessary?
11) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34247)
Posted 5 Dec 2009 by Profile XJR-Maniac
Post:
I don't know the technical details of how GPU applications use system memory or use it to remap the video memory, I just know that they do. Errors were reported a while ago with the Collatz ATI application from people who were running multiple 1GB ATI GPUs but had only 1GB of system memory. More recently, someone with only 2GB of memory also got those buffer errors when trying to run two HD 5970 cards. The solution in both cases was to limit the amount of video memory used by the application with the r parameter.

I know Collatz ATI is a different application from MilkyWay ATI or CUDA and is more dependent on video card memory bandwidth. However, system memory would still need to be used, and when the MilkyWay WU is 4 times longer it possibly uses more than previously.

Because I had noticed that some who reported no problems even with multiple nVidia cards had 8GB or 12GB of system memory, while some of those who were reporting problems had 3GB reported as available, or 4GB available with multiple GPU cores, I reasoned that memory buffer issues may be causing a problem. I could be wrong; I'm just raising the possibility to try and help narrow down what is causing these errors for some while others are unaffected.

Therefore it is possible that those who have a large amount of system memory installed and use a 64-bit Windows operating system may not encounter these errors. Also, graphics card memory needs to be remapped to system memory, so multiple cards could still cause a problem even if one of them is not being used for MilkyWay CUDA processing.


OK, I understand what you mean. What I always thought was that video memory is mapped BEHIND the physical address range; that's why WinXP 32-bit cannot handle more than about 3GB, because it cannot use PAE by design. On Windows 2003 Enterprise Server I can enable PAE and turn on memory remapping in the BIOS, so that video memory is remapped above the installed memory, no matter how much is installed.

So tomorrow I will give it a shot: move the 4GB to the Win2003 machine with PAE on, so that the full 8GB is available, and try to run MW standalone.
12) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34246)
Posted 5 Dec 2009 by Profile XJR-Maniac
Post:
Dude, there's a config file you can create to track operations of WUs, but I've never used it. This is the link:

Client Configuration


I'm aware of that, but I don't want to debug the core client; I want to debug the MW application. On that page I couldn't find any option that could do that. I did find the <coproc_debug> flag, but it's not really for debugging MW. The output of this flag was this:

[Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_s222_3s_best_1p_01r_41_701094_1259781480_0


There should be a debug option for the MW app itself.
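
For the record, in case someone wants to reproduce that log line: <coproc_debug> is a log flag in the BOINC client's cc_config.xml, which lives in the BOINC data directory and is re-read via the manager's "Read config file" option or by restarting the client. It only makes the client log how it assigns GPUs to tasks; it adds nothing from inside the MW application. Roughly like this (my own example, not from the project pages):

<cc_config>
  <log_flags>
    <!-- log how the client assigns coprocessors (GPUs) to tasks;
         client-side only, no extra output from the MW app itself -->
    <coproc_debug>1</coproc_debug>
  </log_flags>
</cc_config>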
13) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34218)
Posted 4 Dec 2009 by Profile XJR-Maniac
Post:

My box is only like 3 or 4 months old...
Asus A6T motherboard, Intel i7 920 at 2.67GHz o/c'd to 3.4GHz, and I have 12GB of DDR3 RAM.
The video cards are both EVGA GTX260-216 Core, factory o/c'd, which I have o/c'd even more.


Hello ganja, can you please post the BIOS version of your GTX260 cards? Thank you.

Let me ask the public again: does anyone know whether the MW app can be run in some kind of debug mode?
14) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34211)
Posted 4 Dec 2009 by Profile XJR-Maniac
Post:
I've been following this thread and I'm just tossing up ideas...

What is the physical setup of the computers that return invalid WUs? Could it be that the power supply is a little light? Is there an issue with the BIOS of the motherboard, so that the revision needs updating?


Here are my PSUs:

Enermax Pro 82+ 425W, in use for 9 months
Enermax Pro 82+ 625W, in use for 1 month

OK, the 425W unit is a little close to the limit, but the machine with the 625W PSU fails, too. The machine with the 625W PSU has another card, a 9400GT, but as I mentioned before it is only there for display use. And according to nVidia the 9400GT needs another 50W, so the 625W PSU should be sufficient.

The BIOS of both motherboards is flashed to the latest available version. The boards are different models: Asus P5QL-E and Asus P5Q Pro Turbo.

And hey, it's only MilkyWay that has a problem with my hardware! The GPUGRID WUs run for nearly 8 hours without interruption and they finish valid! So do Seti, Seti Beta, Collatz and Einstein. So who do you think should be doing all this troubleshooting? Me? I'm getting more and more sick of this!

OK, I should stop typing, because I'm getting angry right now! And when that happens I usually lose my composure.
15) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34210)
Posted 4 Dec 2009 by Profile XJR-Maniac
Post:
It is possible that this is a memory buffer issue. If so, those with a smaller amount of system memory, or less bandwidth on an older architecture, and/or multiple nVidia GPU cores would be more likely to experience the issue than those with a larger amount of memory, a newer system, and/or only a single nVidia GPU core.

I have no experience with nVidia cards; this is only a theory. It would need to be tested by someone who has the errors, either by installing more memory or by running only one nVidia GPU per box and seeing if that fixes it. I suppose you could also test it by swapping an nVidia card that is erroring into another system that has more available memory.


There's only one GTX260 running per machine. OK, the WinXP box has a 9400GT but it's for display usage only because it created errors on both Seti and Collatz.

Both machines have 4GB of RAM installed. And before someone starts to complain: yes, I'm aware of the fact that WinXP 32-bit only has access to 3GB! I see that I forgot to mention RAM in my system overview below.

But how could a CUDA application be affected by the amount of system memory? OK, there's a small part that runs on the CPU, but is that of any relevance? And no matter whether I run the MW app standalone or alongside others, it fails in every situation. Only the short ones finish with valid results.
16) Message boards : Number crunching : The Longer Work Units... (Message 34208)
Posted 4 Dec 2009 by Profile XJR-Maniac
Post:
Can you please post the brand and BIOS version of your cards? Thank you.
17) Message boards : Number crunching : The Longer Work Units... (Message 34203)
Posted 4 Dec 2009 by Profile XJR-Maniac
Post:
Nope, I even UNDERCLOCKED my GPU, but it still failed.

If you'd like to KNOW what's going on, instead of speculating, maybe you'd like to read this thread over here:

Sudden mass of WU's finishing with Computation Error
18) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34202)
Posted 4 Dec 2009 by Profile XJR-Maniac
Post:
After successfully finishing four WUs on GPUGRID, two on each GPU, I'm pretty sure that my GPUs are OK. There is a thread in the GPUGRID forum where a problem with failing GTX260 GPUs is discussed. It seems that some cards don't like fast Fourier transforms (FFT), mainly the older ones with 192 shaders, and also those with the reference design (one fan).

Both of mine are the newer ones with 216 shaders on the 55nm process, and they have two fans. One is a Palit, the other a Gainward, but they look much the same; only the colours are different.

BIOS version of both cards is 62.00.49.00.03


Can someone with a non-failing, two-fan 55nm GTX260/216 post his/her BIOS version, please? You can use GPU-Z to read it. Maybe I can find a version that runs on my cards, too.

Is there a debug mode for the MW CUDA application with enhanced logging, so one can see what causes the NaN result? It seems the app isn't crashing; it runs to the end but creates an invalid result. There should be some kind of debug mode, or how do the developers test the apps??? OK, sometimes it seems that there are no tests at all ;-)))


Edit: I've read through this thread again, and it seems that, unlike at GPUGRID, here at MW it's not only GTX260 GPUs that fail. There's a GTX280 and Paul's GTX295s, too. I can understand why the GTX280 could fail too, because, if I got it right, a GTX260 is a GTX280 that didn't pass QA due to defective shaders. So they disable the defective shader ALUs down to 216, reduce the clocks and memory, and sell it as a GTX260.

But where's the link to the 295s? Aren't they dual 275s? So the 275/295 chips are not the same as the 260/280 ones?

Who knows. Maybe the day will come when you can read that a pear is an apple that didn't make it through QA due to its irregular shape.
19) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34153)
Posted 3 Dec 2009 by Profile XJR-Maniac
Post:
These are the clock speeds for a GTX260 from the nVidia site specs:
Graphics: 576 MHz
Processor: 1242 MHz
Memory: 999 MHz
hope it helps
;-p


Thanks Bruce, those are exactly the clocks my GPUs are running at (see below).

OK, as long as there is not much help from the people responsible for the project here, and as long as I'm out of new ideas, I'm going to disable CUDA on MW and enable it on Einstein. It's less of a waste than not using my GPUs at all. OK, there's Seti and Collatz, but not that often ;-))) It's a shame, because I think MW is a very interesting project.

To check whether there's something wrong with my GPUs, and to keep my fans spinning, I joined GPUGrid. I'll keep you informed if I find something interesting there.
20) Message boards : Number crunching : Sudden mass of WU's finishing with Computation Error (Message 34149)
Posted 3 Dec 2009 by Profile XJR-Maniac
Post:
Next I tried overclocking; maybe there wouldn't be errors if the WU finished faster. Still invalid. What a surprise.

Can someone please post the clocks of his GTX260?

@David: Can you please take a look at your result files, just to see what the fitness parameter says? Especially on your GTX260. It should be a real number like the results from Starfire. Thank you!

