
Posts by Len LE/GE

61) Message boards : Number crunching : Is there a change to no longer use Nvidia GT520 (Message 56315)
Posted 28 Nov 2012 by Len LE/GE
Post:
Hi,

Been using my GT520 for over 6 months to crunch MW GPU WUs; however, over the last few weeks the GPU is no longer doing WUs. Has there been a change in the code, or could I have a problem somewhere? The GT520 works fine in games etc.

Here is my event log
...
28/11/2012 00:22:50 | | NVIDIA GPU 0: GeForce GT 520 (driver version 306.97, CUDA version 5.0, compute capability 2.1, 1024MB, 8381362MB available, 134 GFLOPS peak)
28/11/2012 00:22:50 | Milkyway@Home | Application uses missing NVIDIA GPU
28/11/2012 00:22:50 | | Config: report completed tasks immediately
28/11/2012 00:22:50 | | Config: use all coprocessors
28/11/2012 00:22:50 | Milkyway@Home | URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 297658; resource share 100
28/11/2012 00:22:50 | Milkyway@Home | Sending scheduler request: To fetch work.
28/11/2012 00:22:50 | Milkyway@Home | Requesting new tasks for NVIDIA
28/11/2012 00:22:51 | | App version needs OpenCL but GPU doesn't support it
28/11/2012 00:22:51 | Milkyway@Home | Scheduler request completed: got 16 new tasks
28/11/2012 00:22:51 | Milkyway@Home | [error] App version uses non-existent NVIDIA GPU


BOINC does not find the OpenCL driver for your GPU, and Milkyway does not even see your GPU. "GPU detection will no longer work when BOINC is installed as a service, or protected application execution" — BOINC isn't installed as a service, right? It could be a bug in the BOINC version (I saw some reports about problems with 7.0.28 x64), a problem with the driver installation, or something completely different.
I would try a deinstall/clean/reinstall of the GPU driver, then a newer BOINC version, and see from there.
62) Message boards : News : Nobdy Release 1.02 (Message 56305)
Posted 26 Nov 2012 by Len LE/GE
Post:
Everything was fine until 24.11.2012 11:56:49 UTC:
http://milkyway.cs.rpi.edu/milkyway/results.php?userid=339&offset=0&show_names=0&state=3&appid=

After that, nothing worked anymore:
http://milkyway.cs.rpi.edu/milkyway/results.php?userid=339&offset=0&show_names=0&state=5&appid=

What is going on????


Your links do not work.
Checked your tasks here.
Looked into the stderr output of a few of your tasks that errored out (like 348883100) and found:
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15100): (null)
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15105): (null)
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15105): (null)
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15105): (null)
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15105): (null)
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15105): (null)
Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (15105): (null)
Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Failed to write final checkpoint

Conclusion: a local problem, like a corrupted file system, user rights on that directory changed somehow, etc.
63) Message boards : News : Nobdy Release 1.02 (Message 56272)
Posted 22 Nov 2012 by Len LE/GE
Post:
Hi everyone, I have been running MilkyWay for a year or two now with no real problems, but for the last two days my computer turns itself off after running the program for about five minutes; if I suspend the activity there are no problems. Anyone else having problems or know what's happening? Nick


Sounds like a heat problem. Australian summer ...
Try a tool like HWMonitor to find out about the temps in your box.
64) Message boards : News : Nobdy Release 1.02 (Message 56258)
Posted 21 Nov 2012 by Len LE/GE
Post:
If you're going to the bother of creating an app_info.xml, it's probably easier to download the DLLs under their 'real' names, rather than going for the versioned aliases and renaming them back again.

http://milkyway.cs.rpi.edu/milkyway/download/libgomp_64-1.dll
http://milkyway.cs.rpi.edu/milkyway/download/pthreadGC2_64.dll

Then you can do away with the <open_name> and <copy_file/> lines entirely - you forgot the copy on the second file, anyway.


Good catch on the missing <copy_file/> line.
I did not run nbody for a long time, so it was more like quickly putting some fragments together. :) I could not test without nbody WUs.

The point is that you need to get the versioned DLLs from the download directory when downloading manually; whether you rename them locally is your choice. The ones without a version number are very old (used for nbody v0.40 or v0.60). AFAIR, Matt moved to versioning them at that time, renaming them during download to the users, so he could keep multiple versions online without conflicts. That's why I chose to keep the version numbers locally too and used open/copy at runtime. Less confusion for me in keeping the proper versions together.

I did read the explanation on codeguru. It goes basically in the same direction I was thinking; maybe my English wasn't good enough to make it clear.
You are building (and testing) an exe with a new set of external DLLs and then trying to run it with far older DLLs. This can lead to a whole set of errors because of critical changes between those DLL versions; heap corruption and out-of-bounds memory access would be far up on that list.

That's why I am saying: first make sure to use the same dynamically linked DLLs the exe was built and internally tested with, then see what errors are still left. See (Message 56239): the statically linked exes (Mac and Linux) are returning mostly valid results, while the bulk of the errors is coming from Windows clients with the dynamically linked DLLs.

Only my 2¢ and I hope they find the root of the problem soon.
65) Message boards : News : Nobdy Release 1.02 (Message 56251)
Posted 21 Nov 2012 by Len LE/GE
Post:
The app_info for nbody v0.84 64bit looked something like this:


    <app_info>

    <app><!-- CPU app for N-Body 0.84 mt 64bit -->
    <name>milkyway_nbody</name>
    <user_friendly_name>MilkyWay@Home nbody</user_friendly_name>
    </app>

    <file_info>
    <name>milkyway_nbody_0.84_windows_x86_64__mt.exe</name>
    <executable/>
    </file_info>
    <file_info>
    <name>libgomp_64-1_nbody_0.84.dll</name>
    <executable/>
    </file_info>
    <file_info>
    <name>pthreadGC2_64_nbody_0.84.dll</name>
    <executable/>
    </file_info>

    <app_version>
    <app_name>milkyway_nbody</app_name>
    <version_num>84</version_num>

    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <max_ncpus>4</max_ncpus>
    <cmdline>--nthreads=4</cmdline>

    <file_ref>
    <file_name>milkyway_nbody_0.84_windows_x86_64__mt.exe</file_name>
    <main_program/>
    </file_ref>
    <file_ref>
    <file_name>libgomp_64-1_nbody_0.84.dll</file_name>
    <open_name>libgomp_64-1.dll</open_name>
    <copy_file/>
    </file_ref>
    <file_ref>
    <file_name>pthreadGC2_64_nbody_0.84.dll</file_name>
    <open_name>pthreadGC2_64.dll</open_name>
    </file_ref>

    </app_info>



Replacing v0.84 with v1.02 shouldn't be too hard.
Milkyway_nbody mt needs libgomp_64-1, which needs pthreadGC2_64.
Every nbody mt exe before v0.94 came with its own DLL versions.
The problem is, there are no DLL files for v0.94/v1.00/v1.02; the newest ones in the download directory seem to be for v0.84.
The DLL files without version numbers seem to belong to v0.60 or v0.66, maybe even earlier.

So the questions are:
a) Are DLL files downloaded with the exe? Which version are they, and do they have the proper name (without version number) when downloaded?
b) Are the v0.84 DLLs compatible with the new exe, or did the successful tests (Message 56232) use newer DLL versions which are missing from the download directory?

66) Message boards : Number crunching : Running on ATI & Nvidia in the same rig? (Message 56221)
Posted 17 Nov 2012 by Len LE/GE
Post:
I think it seems to relate to how Milkyway severely limits the total number of WUs you can get at a time. It gives you so many depending on how many (logical) cores your PC has. It's simply full when it tries to bother with getting work for the Nvidia card?


Running the WUs on GPU, there is a 40-WUs-per-GPU limit, so you should get 80 for your 2 GPUs.
You could temporarily set the fetch_minimal_work parameter in your cc_config (see BOINC client configuration) to see if you get WUs for your NVIDIA card then. If you are getting WUs for both cards then, it's about which type of WU fills the cache first (like you are guessing above). In your case, the server should give you at most 40 for ATI and at most 40 for NVIDIA, which is the limit for each card.
67) Message boards : Number crunching : opencl not found (Message 56196)
Posted 14 Nov 2012 by Len LE/GE
Post:
I've got a HD4870, and have been receiving the OpenCL problem since the last Catalyst patch/release. We've put it down to AMD pulling support for OpenCL.


The latest driver version supporting your card and OS seems to be 12.4 (with OpenCL).
Versions 12.6 and later still include OpenCL but need at least a 5xxx series card.
68) Message boards : Number crunching : Running on ATI & Nvidia in the same rig? (Message 56175)
Posted 13 Nov 2012 by Len LE/GE
Post:
You say you are getting WUs for both on SETI.
Did you check your MW project preferences to see if the use of your NVIDIA card is enabled there?
69) Message boards : Number crunching : Happy CPU (Message 56100)
Posted 5 Nov 2012 by Len LE/GE
Post:
He is doing both, CPU and GPU crunching ;)
70) Message boards : Number crunching : Interesting issue... (Message 56094)
Posted 4 Nov 2012 by Len LE/GE
Post:

UPDATE: ... This situation is getting Ridiculous. The SETI Tasks are now reacting in the same way (Remaining Time is Increasing instead of Decreasing). They Run Correctly for a few minutes, and then instead of Decreasing they Increase as the seconds pass by. Only 2 Tasks will RUN at The Same Time (This may be due to my Preference Settings), so when the 2 Tasks began to Increase in Remaining Time, I SUSPENDED them and Started to RUN the other 2. After a few minutes, they too Reversed and began to Count UP instead of Down. However, the PROGRESS (%) is Operating Correctly in either case: Therefore I have RESUMED the Task that has Progressed the highest Percentage, and left the others Suspended; but even by Only RUNNING One Task, it is Still INCREASING in Time Remaining, but it's easier to keep track of. I'll Alternate the Tasks and Run One at a time. Let's hope this works.


You are running a dual-core Conroe-based Celeron with a 512k cache shared between the cores. Using only 1 core gives that core the full cache; if you are using both cores, the cache needs to swap data each time the other core asks for more. That's why WUs run slower if you are running 2 at the same time. BOINC needs a little time to see the slowdown and then corrects the time estimate.
71) Message boards : Number crunching : Please, explain why this task was replicated before deadline (Message 55544)
Posted 11 Sep 2012 by Len LE/GE
Post:
The status of your result changed to validate error.
MW calculations and results need double precision.
Your stream_only_likelihood seems OK for cross-platform comparison, but background_likelihood and search_likelihood are only identical up to the 4th decimal digit, which is far from what is needed.

The validator decided your result needed a second one to compare against, and the WU was sent out again a minute later; this was repeated until 2 results matched within the allowed difference.
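
As an illustration only (not the actual validator code, and the tolerance here is made up), a double-precision comparison of this kind might look like:

```python
def matches(a: float, b: float, rel_tol: float = 1e-10) -> bool:
    """Illustrative check: are two likelihood values equal within a
    relative tolerance? The real validator's allowed difference is
    project-specific; 1e-10 is just a plausible double-precision bound."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

# Agreement only up to the 4th decimal digit fails such a check:
print(matches(-3.14159265358979, -3.14159265358979))  # True
print(matches(-3.1415, -3.1416))                      # False
```
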
72) Message boards : Number crunching : opencl_amd_ati workunits are too small (Message 55513)
Posted 6 Sep 2012 by Len LE/GE
Post:
The WU sizes are the same for all.
AMD GPU, NVIDIA GPU, and CPU all get the same WU size and get verified against each other. If run times on CPU got too long, it would take forever to verify one of your WUs against the result from a CPU. The list of pending WUs would grow dramatically, and so would the database.
The verification procedure itself is more complicated, but this should give you an idea.
73) Message boards : Number crunching : MW@H Computing Failures (Message 55503)
Posted 5 Sep 2012 by Len LE/GE
Post:

<search_application> milkyway_separation 1.00 Windows x86 double </search_application>


You are running the CPU version of MW separation.


Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file


Ignore those; you will see them even when running on an AMD GPU.


Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to move file 'separation_checkpoint_tmp' to 'separation_checkpoint' (6801): Transaction support within the specified file system resource manager is not started or was shutdown due to an error.

Failed to update checkpoint file ('separation_checkpoint_tmp' to 'separation_checkpoint') (2): No such file or directory
Write checkpoint failed
16:54:28 (5448): called boinc_finish


MW writes checkpoints to save the current progress of the calculation. These are used each time BOINC switches back to MW, so MW knows from where to continue the calculation.

This repeated error is what you have to worry about.
Data is first written to a temp file, and then the regular checkpoint file is replaced by the temp file.
A Google search shows that this is often a known bug in the transaction manager of Vista, where the transaction log gets corrupted. I don't know if a patch is available; I have only seen a "Microsoft Fix it 50140" for situations when the error occurs.
There are other cases where this error occurs too, described in the third link.

see
1) http://support.microsoft.com/kb/939399
2) http://serverfault.com/questions/350374/transaction-support-within-the-specified-resource-manager-is-not-started-or-was
3) http://errordecoder.com/system-error-codes/8/code-6801.html
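
For what it's worth, the temp-file-then-rename pattern described above can be sketched like this (a minimal illustration, not the actual separation source; the function name is hypothetical):

```python
import os

def write_checkpoint(path: str, data: bytes) -> None:
    """Write a checkpoint safely: write to a temp file first, then
    replace the real checkpoint in one step. A crash mid-write then
    never leaves a half-written checkpoint behind. The error in the
    log above corresponds to the final rename/replace step failing."""
    tmp_path = path + "_tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes reach the disk
    os.replace(tmp_path, path)    # atomic swap on POSIX and modern Windows
```

If the file system (or, on Vista, its transaction manager) is broken, it is exactly this last replace step that fails, which matches the stderr output quoted above.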
74) Message boards : Number crunching : MW@H Computing Failures (Message 55447)
Posted 2 Sep 2012 by Len LE/GE
Post:
Guessing you have a standard install of BOINC, I would start with a file system check and then run a test of the hard drive.
75) Message boards : Number crunching : App version needs OpenCL but GPU doesn't support it (Message 54963)
Posted 2 Jul 2012 by Len LE/GE
Post:
Can you post the beginning of the log, where it shows what hardware the client has found?
76) Message boards : Number crunching : n-body simulation failures with app_info.xml (Message 54943)
Posted 29 Jun 2012 by Len LE/GE
Post:
It's been a while since I was running nbody, but I think I remember you need to change


<file_ref>
<file_name>libgomp_64-1_nbody_0.84.dll</file_name>
</file_ref>
<file_ref>
<file_name>pthreadGC2_64_nbody_0.84.dll</file_name>
</file_ref>


to something like this


<file_ref>
<file_name>libgomp_64-1_nbody_0.84.dll</file_name>
<open_name>libgomp_64-1.dll</open_name>
</file_ref>
<file_ref>
<file_name>pthreadGC2_64_nbody_0.84.dll</file_name>
<open_name>pthreadGC2_64.dll</open_name>
</file_ref>


This opens the DLLs under their standard names, so you don't have to rename them in your directory (and do it again with every new version).
77) Message boards : Number crunching : Errors in N-Body Simulation (Message 54585)
Posted 1 Jun 2012 by Len LE/GE
Post:
process exited with code 193 (0xc1, -63)

193 = invalid event

and a task log shows:

SIGBUS: bus error

Crashed executable name: milkyway_nbody_0.84_i686-apple-darwin
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Thu May 31 11:17:21 2012

Is that of any help?
78) Message boards : Number crunching : Why is Milky Way Taking all 4 CPU cores for one work unit? (Message 54472)
Posted 20 May 2012 by Len LE/GE
Post:
There are 2 different MW apps:
- separation uses 1 core
- nbody uses multiple cores

Actually you are running both types, so you are seeing the different behaviour you describe. You can choose which one you want to run with a setting on the project preferences page.
79) Message boards : Number crunching : N-Body programs, "Error in Computation" (Message 54359)
Posted 10 May 2012 by Len LE/GE
Post:
N-body on your Win7 system is running w/o errors.
It's on your WinXP system where the errors occur (exit code 1 (0x1)).
80) Message boards : Number crunching : GPU-task not ending (Message 54056)
Posted 17 Apr 2012 by Len LE/GE
Post:
You should update your catalyst driver to 12.3.
There is a known bug in cat < 12.3 related to Tahiti gpus, which could cause the behavior you described.



©2024 Astroinformatics Group