Welcome to MilkyWay@home

Posts by Richard Haselgrove

1) Message boards : Number crunching : anyone running a 2200/2400G? what performance are you getting? (Message 68115)
Posted 8 Feb 2019 by Richard Haselgrove
Post:
Hi everyone,

I've been working with Bill in the SETI@Home thread, and we've tracked down this error to a faulty (greatly inflated) 'GFLOPS Peak' value returned to BOINC by the ATI OpenCL driver. You can see the value - about 43 ExaFLOPS, about 10,000X too big - in the opening lines of your Event Log after startup.

I've submitted a formal bug report to ATI today, and we're working urgently on a hotfix version of BOINC which will trap and subdue the wayward flops value. Watch out for further announcements.
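
For anyone wondering what "trap and subdue" might amount to, here is a minimal sketch of a sanity check on a driver-reported peak figure. It is not the BOINC hotfix: the function name, the ceiling and the fallback value are all hypothetical, chosen only to illustrate the idea.

```python
import math

# Hypothetical sanity check for a driver-reported peak FLOPS figure.
# The ceiling and the fallback are illustrative assumptions, not BOINC's values.
MAX_PLAUSIBLE_FLOPS = 1e15   # assumed ceiling for a single GPU (1 PetaFLOPS)
FALLBACK_FLOPS = 1e11        # assumed conservative default (100 GFLOPS)


def sanitize_peak_flops(reported: float) -> float:
    """Return the reported value if it looks plausible, else a safe fallback."""
    if not math.isfinite(reported) or reported <= 0 or reported > MAX_PLAUSIBLE_FLOPS:
        return FALLBACK_FLOPS
    return reported


print(sanitize_peak_flops(4.3e19))   # an inflated value of the size quoted above
print(sanitize_peak_flops(1.7e12))   # an ordinary value, passed through unchanged
```

A fixed ceiling is the crudest possible test; the point is only that an absurd value should never reach the scheduler unchecked.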
2) Message boards : News : Nbody Release 1.54 (Message 64190)
Posted 18 Dec 2015 by Richard Haselgrove
Post:
N-Body tasks won't be even trying to use your GPU.

Try reading here.
3) Message boards : Number crunching : Run Multiple WU's on Your GPU (Message 64187)
Posted 17 Dec 2015 by Richard Haselgrove
Post:
'Read configuration' works fine for replacing one value with another, or adding a value where none existed before. It doesn't always work for removing a value once it has embedded itself in the system, but in most cases you can just leave it there. It's the way they coded it.
4) Message boards : Number crunching : Error message: App version needs OpenCL but GPU doesn't support it (Message 64181)
Posted 15 Dec 2015 by Richard Haselgrove
Post:
Your computer runs Windows 10. Your working NVidia driver will have been replaced by a cut-down - limited functionality - driver supplied by Microsoft. You don't have any control over this, but you can go to NVidia and download/install the full-feature driver you need.

Expect this to happen again.
5) Message boards : Number crunching : Run Multiple WU's on Your GPU (Message 64179)
Posted 15 Dec 2015 by Richard Haselgrove
Post:
It does actually say, at the very bottom of the Application configuration documentation:

If you remove app_config.xml, or one of its entries, you must reset the project in order to restore the proper values.

Obviously, you wouldn't want to do that while you had any work cached.
6) Message boards : News : New Release- Nbody version 1.52 (Message 64167)
Posted 12 Dec 2015 by Richard Haselgrove
Post:
Guess what, even though it is still marked as not selected in my account :


Yes, it is. You left the final box checked, so in English you answered

"If no work for selected applications is available, accept work from other applications?"

with 'yes' - you accept work from unselected applications, like N-Body. Clear the final check-box if you really don't want them.
7) Message boards : Number crunching : Aborted by User, but not (Message 64128)
Posted 29 Nov 2015 by Richard Haselgrove
Post:
Use CPU (enforced by version 6.10+): no
8) Message boards : Number crunching : Aborted by User, but not (Message 64106)
Posted 16 Nov 2015 by Richard Haselgrove
Post:
Nvidia driver was and is current ver 358.91 dated 9 NOV 2015.

Where did you download/install that from?

Microsoft or NVidia?
9) Message boards : Number crunching : Aborted by User, but not (Message 64101)
Posted 15 Nov 2015 by Richard Haselgrove
Post:
If you probe a little deeper, you can see better diagnostic information.

Your most recent example was task 1342970392. That says additionally:

Client state Aborted by user
Exit status 201 (0xc9) EXIT_MISSING_COPROC

BOINC is failing to see your GPU properly.

Your computer 410714 is running Windows 10, which hasn't really stabilised yet. In particular, Windows 10 has a habit of updating your hardware drivers whether you want it to or not: and the drivers Microsoft supplies may not always include the ecosystems (like OpenCL runtime support) that scientific computing requires.

I suggest your first step might be to replace the current drivers for your NVIDIA GeForce GTX 570 with certified drivers downloaded directly from http://nvidia.com
10) Message boards : Number crunching : app_info.xml to run MW@H GPU WU's for R9 3xx cards (Message 64085)
Posted 10 Nov 2015 by Richard Haselgrove
Post:
Or, what's wrong with this setup? :)

You're talking about, and your BOINC client has found, a file called app_config.xml

But the file contents you have posted are appropriate (more-or-less) for a file called - as the opening tag suggests - app_info.xml

Always refer to the documentation:

Application configuration
Anonymous platform

I think you want the second of those.
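
A quick way to catch this kind of mix-up is to compare the opening tag with the file name. The sketch below assumes only that each file's root element carries the same name as the file (app_config or app_info); the helper name and the sample contents, including the app name, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Root element each file is expected to start with.
EXPECTED_ROOT = {
    "app_config.xml": "app_config",
    "app_info.xml": "app_info",
}


def suggested_filename(xml_text: str):
    """Return the file name whose expected root tag matches, or None."""
    root_tag = ET.fromstring(xml_text).tag
    for filename, tag in EXPECTED_ROOT.items():
        if tag == root_tag:
            return filename
    return None


# Placeholder contents; only the opening tag matters for this check.
sample = "<app_info><app><name>example_app</name></app></app_info>"
print(suggested_filename(sample))
```

It only checks the root tag, so it says nothing about whether the rest of the file is valid for the client.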
11) Message boards : Number crunching : any way to change the data drive? (Message 64058)
Posted 5 Nov 2015 by Richard Haselgrove
Post:
And uninstalling the BOINC programs doesn't delete your data folder.

The whole process - uninstall, move folder, reinstall with manual selection of new location - takes about a minute (once you understand the process), and doesn't even lose tasks in progress.
12) Message boards : News : Fix for stderr.txt Truncation and Validation Errors (Message 63834)
Posted 26 Jul 2015 by Richard Haselgrove
Post:
I just went to the BOINC website .. and the most recent version they show is 7.4.42?? http://boinc.berkeley.edu/download.php

v7.6.6 is still a test version, available via the download all versions page.
13) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63820)
Posted 21 Jul 2015 by Richard Haselgrove
Post:
took a look at my stats
new version is 7.6.6
In progress (48) · Validation pending (0) · Validation inconclusive (28) · Valid (21) · Invalid (4) · Error (2)
too much inconclusives and invalids&errors
reasons and ways to avoid it?

Inconclusives: that is a consequence of the way this project is configured, using Adaptive Replication. They will be validated eventually, and the number of tasks chosen for validation by wingmates will go down as the other errors reduce.

Invalids: the only invalid tasks showing on your account now are from 15 July or earlier, when you were using BOINC v7.6.2
14) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63810)
Posted 17 Jul 2015 by Richard Haselgrove
Post:
Before I go, BOINC v7.6.6 is now available via the Download All page.
15) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63809)
Posted 17 Jul 2015 by Richard Haselgrove
Post:
Well, I ran v7.6.6 for 48 hours (~3,300 tasks) - not a single error at my end, just one "can't validate" because too many wingmates failed, like in yesterday's screenshot.

Then I regressed to v7.6.3, and within an hour got another

17-Jul-2015 12:35:04 [---] [slot] cleaning out slots/11: handle_exited_app()
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/astronomy_parameters.txt
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/boinc_finish_called
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/boinc_task_state.xml
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/init_data.xml
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/separation_checkpoint
17-Jul-2015 12:35:04 [---] [slot] removed file slots/11/stars.txt
17-Jul-2015 12:35:04 [---] [slot] failed to remove file slots/11/stderr.txt: Error 32
17-Jul-2015 12:35:04 [Milkyway@Home] Computation for task de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_10941084_0 finished
17-Jul-2015 12:35:04 [---] [slot] cleaning out slots/7: get_free_slot()
17-Jul-2015 12:35:04 [Milkyway@Home] [slot] assigning slot 7 to de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_10941080_0

Again, I captured both the screenshot and the contents of the orphaned stderr.txt - it was complete, unlike task 1191232622.

OK, I think that provides conclusive evidence of cause and effect - I think my work in this thread is done. Moving on to pastures new.
16) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63807)
Posted 16 Jul 2015 by Richard Haselgrove
Post:
Heading close to 2,000 without error now.

One additional problem at this project: the administrators have set quite a low 'maximum errors' threshold.



Two validate errors together, plus one other glitch, and the whole workunit is killed. Once BOINC v7.6.6 (or its successor) is fully tested and released as 'recommended', I'd suggest you start a push to get as many people as possible to upgrade.
17) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63804)
Posted 15 Jul 2015 by Richard Haselgrove
Post:
Not a single validate error, from over 500 tasks processed under BOINC v7.6.6 since this morning.
18) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63801)
Posted 14 Jul 2015 by Richard Haselgrove
Post:
David has applied a possible fix for this:

client (Win): when read stderr.txt, wait for write lock to be release first.

Apparently, on Win, there is still a write lock on stderr.txt,
and its buffer isn't flushed, until shortly after the app process exits.
This is bizarre, but so be it.

and Rom has built an installer to test it.

I've built a new version of 7.6 with David's latest change to address this issue.

http://boinc.berkeley.edu/dl/boinc_7.6.6_windows_intelx86.exe
http://boinc.berkeley.edu/dl/boinc_7.6.6_windows_x86_64.exe

----- Rom

Those of you who have some experience already with v7.6.2 might like to try this and see how it compares - bearing in mind that at this point it is totally untested. (That's our job!)

I'm clocking off for the night, but I'll switch back tomorrow morning and add to the testing effort.

Edit - additional comment from David:

I checked in a workaround in which the client waits until
stderr.txt is not locked before reading it.
Can people please review this change?
-- David

Windows programmers are invited to look at

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commitdiff;h=f2d690029c6dab9d586a9ba1a2e0af03dc7f3c70
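
For non-Windows programmers, the general shape of a "wait until it isn't locked, then read" workaround can be sketched as below. This is an illustration only, not David's change (that is in the commit linked above): the function name, attempt count and delay are hypothetical, and the sketch simply retries when the open is refused. On Windows, Python reports a sharing violation as PermissionError. Note that a successful open does not prove the writer has flushed its buffer, so this is a simplification of the real problem.

```python
import os
import tempfile
import time


def read_when_unlocked(path, attempts=20, delay=0.1):
    """Read a text file, retrying while the OS refuses access to it."""
    last_error = None
    for _ in range(attempts):
        try:
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                return f.read()
        except PermissionError as exc:
            # Another process still holds the file; wait and try again.
            last_error = exc
            time.sleep(delay)
    raise TimeoutError(f"{path} still locked after {attempts} attempts") from last_error


# Demonstration on an unlocked temporary file.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "stderr.txt")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write("called boinc_finish\n")
    print(read_when_unlocked(demo_path))
```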
19) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63799)
Posted 14 Jul 2015 by Richard Haselgrove
Post:
After intensive work with Keith Myers and others (mainly in the SETI message board thread Stderr Truncations), I think I've finally traced and recorded the full life-cycle of these little beasties.

The easiest starting point is the debris left behind.



The task completed, and for 'some reason' (we'll come back to that later) BOINC couldn't delete one of the files. So it left it for later, and moved to another slot for the next task. In the message log, that looks like

14-Jul-2015 15:49:11 [---] [slot] cleaning out slots/2: handle_exited_app()
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/astronomy_parameters.txt
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/boinc_finish_called
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/boinc_task_state.xml
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/init_data.xml
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/separation_checkpoint
14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/stars.txt
14-Jul-2015 15:49:11 [---] [slot] failed to remove file slots/2/stderr.txt: Error 32
14-Jul-2015 15:49:11 [Milkyway@Home] Computation for task ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9901989_0 finished
14-Jul-2015 15:49:11 [---] [slot] cleaning out slots/2: get_free_slot()
14-Jul-2015 15:49:11 [---] [slot] failed to remove file slots/2/stderr.txt: Error 32
14-Jul-2015 15:49:11 [Milkyway@Home] [slot] failed to clean out dir: unlink() failed
14-Jul-2015 15:49:11 [---] [slot] cleaning out slots/10: get_free_slot()
14-Jul-2015 15:49:11 [Milkyway@Home] [slot] assigning slot 10 to de_80_DR8_Rev_8_5_00004_1434551187_13360920_0

Note that the timestamps match.

According to MSDN, error 32 is

ERROR_SHARING_VIOLATION
32 (0x20)
The process cannot access the file because it is being used by another process.

- BOINC couldn't delete the file, because Milkyway was still writing to it.
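
If you want to look for the same debris in your own message log, a small script along these lines will pick out the failed removals. It is a sketch that assumes the line format shown in the extract above; the regular expression and function name are my own, not anything BOINC provides.

```python
import re

# Matches lines such as:
# 14-Jul-2015 15:49:11 [---] [slot] failed to remove file slots/2/stderr.txt: Error 32
FAILED_REMOVE = re.compile(
    r"^(?P<stamp>\S+ \S+) .*failed to remove file (?P<path>\S+): Error (?P<code>\d+)"
)

ERROR_SHARING_VIOLATION = 32


def sharing_violations(lines):
    """Yield (timestamp, path) for each failed removal caused by error 32."""
    for line in lines:
        match = FAILED_REMOVE.match(line)
        if match and int(match.group("code")) == ERROR_SHARING_VIOLATION:
            yield match.group("stamp"), match.group("path")


log = [
    "14-Jul-2015 15:49:11 [---] [slot] removed file slots/2/stars.txt",
    "14-Jul-2015 15:49:11 [---] [slot] failed to remove file slots/2/stderr.txt: Error 32",
]
for stamp, path in sharing_violations(log):
    print(stamp, path)
```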

On the website, we see task 1187921853: Name ps_modfit_fast_15_3s_136_sim1Jun1_1_1434554402_9901989_0, Received 14 Jul 2015, 14:50:08 UTC - again it matches (my timezone is UTC+1).
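
The timezone arithmetic can be checked in a few lines. The two timestamps are the ones quoted above; the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

local_tz = timezone(timedelta(hours=1))  # UTC+1, as stated above

finished_local = datetime(2015, 7, 14, 15, 49, 11, tzinfo=local_tz)   # message log
received_utc = datetime(2015, 7, 14, 14, 50, 8, tzinfo=timezone.utc)  # website

# Convert the local finish time to UTC and measure the gap to the server's receipt.
finished_utc = finished_local.astimezone(timezone.utc)
gap = received_utc - finished_utc

print(finished_utc.isoformat())
print(gap.total_seconds())
```

The task finished at 14:49:11 UTC and was received 57 seconds later.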

The stderr on the website ends

...
Initial wait: 12 ms
Integration time: 133.964844 s. Average time per iteration = 418.640136 ms
Integral 0 time = 135.042252 s
Running likelihood with 108458 stars

</stderr_txt>

- no final result or call to boinc_finish

But I just had time to copy stderr.txt to another part of my hard disk:



That copy ends

...
Initial wait: 12 ms
Integration time: 133.964844 s. Average time per iteration = 418.640136 ms
Integral 0 time = 135.042252 s
Running likelihood with 108458 stars
Likelihood time = 2.782655 s
<background_integral> 0.000265723224422 </background_integral>
<stream_integral> 209.417694469056580 135.316345272137030 37.756694047809596 </stream_integral>
<background_likelihood> -3.403332787286266 </background_likelihood>
<stream_only_likelihood> -4.236377567130232 -4.667012639515129 -4.359314280913779 </stream_only_likelihood>
<search_likelihood> -3.090730944956150 </search_likelihood>
15:49:09 (6496): called boinc_finish

Again, note that the Integration time, Average time per iteration, and Integral 0 time all match (they vary from task to task), and that the call to boinc_finish timestamp matches the message log.

If BOINC had waited until the last few lines had been appended to stderr.txt, as they later were, before preparing the report for the server, I have every reason to believe this would have been a valid report.
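
The difference between the two endings can be expressed as a simple completeness test. This is a sketch based only on the two excerpts above: it treats the search_likelihood line and the boinc_finish line as the markers of a complete file, which is my assumption and not how the project's validator decides.

```python
def stderr_is_complete(text: str) -> bool:
    """True if the output has both the final result and the finish marker."""
    return "<search_likelihood>" in text and "called boinc_finish" in text


# Tail of the website copy, then the extra lines found in the copy saved to disk.
truncated = (
    "Integral 0 time = 135.042252 s\n"
    "Running likelihood with 108458 stars\n"
)
complete = truncated + (
    "Likelihood time = 2.782655 s\n"
    "<search_likelihood> -3.090730944956150 </search_likelihood>\n"
    "15:49:09 (6496): called boinc_finish\n"
)

print(stderr_is_complete(truncated))
print(stderr_is_complete(complete))
```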

It took at least 3,200 tasks to reach that point (and I think a few of the early ones have already been purged). I'll take a pause from this project for a while, and let the GPU chew on a nice restful GPUGrid task (17 hours with none of this frantic uploading and downloading). But I'll come back and test any fix that David can come up with.
20) Message boards : News : server issues (Message 63665)
Posted 3 Jun 2015 by Richard Haselgrove
Post:
If we are - finally - to pay some attention to the server, could I remind you of three messages where I've posted about the BOINC server code being outdated?

Message 63188 - unfinished web update, corrupts < and > in [ pre ] and [ code ] blocks.
Message 63274 - php warning when 'don't move stickies to top' is selected.
BOINC message 62439 - recent ATI cards aren't recognised as being OpenCL capable.

And you'll know about the connection errors and timeouts since I started drafting the above.

