Posts by Richard Haselgrove

101) Message boards : News : N-Body 1.36 (Message 59674)
Posted 25 Aug 2013 by Richard Haselgrove
Post:
There's no 32-bit version of N-Body for Windows listed on the Applications page, and Bob's host 523191 is 64-bit - unless (unlikely) he's loaded a 32-bit version of BOINC?
102) Message boards : News : N-Body 1.36 (Message 59672)
Posted 25 Aug 2013 by Richard Haselgrove
Post:
Error 0xc0000135 is a generic Windows error code which translates as "The application failed to initialize properly". The commonest cause is a missing DLL, and the commonest answer found by Google is that the DLL in question is part of the Microsoft .NET framework - but that ain't necessarily so: it's more likely to be one of the support DLLs provided from this project's servers.

N-Body 1.36 is running fine on this Windows 7/64 machine as I type.

Client_state.xml says that both plan_class variants (MT and single-threaded) are correctly specified to reference

    <file_ref>
        <file_name>libgomp_64-1_nbody_1.36.dll</file_name>
        <open_name>libgomp_64-1.dll</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>pthreadGC2_64_nbody_1.36.dll</file_name>
        <open_name>pthreadGC2_64.dll</open_name>
        <copy_file/>
    </file_ref>

If either is missing, the download URLs are:

<download_url>http://milkyway.cs.rpi.edu/milkyway/download/libgomp_64-1_nbody_1.36.dll</download_url>
<download_url>http://milkyway.cs.rpi.edu/milkyway/download/pthreadGC2_64_nbody_1.36.dll</download_url>

I've tested both, and both are working currently.

The running MW app, according to Process Explorer, has loaded

C:\Windows\libgomp_64-1.dll
C:\BOINCdata\slots\5\pthreadGC2_64.dll

BOINCdata\slots\ is the correct load location for this machine, given the specifications above and my configuration.

C:\Windows\libgomp_64-1.dll is suspicious, and may be the result of my manual hacking in the early days of N-Body, when the DLLs weren't being correctly specified. But it does appear that BOINC has loaded a full copy of libgomp_64-1.dll into Slot5 as directed, so it seems things are working correctly.
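For anyone cross-checking their own client_state.xml, those two <file_ref> blocks sit inside each N-Body <app_version> entry, roughly like this (a trimmed sketch from memory of the 7.0.x layout - the app_name and the executable's file name are my assumptions, so expect yours to differ slightly):

    <app_version>
        <app_name>milkyway_nbody</app_name>
        <version_num>136</version_num>
        <platform>windows_x86_64</platform>
        <plan_class>mt</plan_class>
        <file_ref>
            <file_name>milkyway_nbody_1.36_windows_x86_64__mt.exe</file_name>
            <main_program/>
        </file_ref>
        <!-- the two DLL <file_ref> blocks quoted above go here -->
    </app_version>

The <copy_file/> tag is what tells the client to copy each DLL into the slot directory under its <open_name>, which is why pthreadGC2_64.dll turns up in BOINCdata\slots\5 rather than being loaded from the project directory.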
103) Message boards : Number crunching : N-Body Search Simulation using all CPUs (Message 59594)
Posted 14 Aug 2013 by Richard Haselgrove
Post:
------------------------------------------------------------------------------
The reason this works (if I've got the description right - it's been a while since I researched this, and I only researched it under Windows) is that whenever you download new work, the N-Body program is reset to use the number of CPUs BOINC was allowed to use at the time of the download.

That's because the <app_version> block downloaded with the new work contains two distinct values:

<max_ncpus> 4 [or 2]
<cmdline>--nthreads 4 [or 2]

The first controls how many cores BOINC will allocate to N-Body, and the second controls how much CPU power N-Body will actually use (they need to be the same for efficient crunching). If you have CPU usage set to 50% when you request new work, both will be set to 2, and you have the slack for other projects that you want. As soon as you download new N-Body work while CPU usage is set to 100%, both numbers will jump up to 4, and you'll be back in your current position.

Can you set this through a cc_config.xml or app_config.xml file you put inside the project directory, so it tells the project to ONLY EVER use the number specified?

No, you can't.

cc_config.xml is global across all projects - no use for single-project stuff.
app_config.xml works - individually - for each separate app supplied by a single project - so it would be exactly the right place to do this, if possible. The <max_cpus> might work, but at the moment there's no way of controlling <cmdline> from app_config.
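For anyone who hasn't met the file, app_config.xml goes in the project folder (projects\milkyway.cs.rpi.edu_milkyway) and in its simplest form looks something like this - a sketch only, and the internal app name milkyway_nbody is my assumption:

    <app_config>
        <app>
            <name>milkyway_nbody</name>
            <max_concurrent>1</max_concurrent>
        </app>
    </app_config>

<max_concurrent> caps how many N-Body tasks run at once, but nothing in the file reaches the --nthreads value, which is the element we'd actually need here.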

I did ask David Anderson to add the feature, for precisely this reason, but he declined - saying (a) he didn't think it would work [I disagree, because I've done the same thing via an app_info, several times on several projects], and (b) because he views app_config as a temporary kludge until greater application control is added to the Manager - that may be the BOINC 8 reference you were half-remembering in the News thread a couple of days ago.

So, for the time being, we're stuck half-way, with an app_config that works within limits: but won't be extended until some vapourware that I haven't seen any sign of yet comes along.

Unless project admins gang up to put some pressure on David to give it a try? I believe the original impetus for app_config.xml came from World Community Grid - maybe it would be worth asking there, if anyone here is also a member there? (I'm not)
------------------------------------------------------------------------------
I tried the 'fetch new work while the number of CPUs is restricted' trick I described with development BOINC v7.2.10 - and it worked fine. But I found I could only get a limited amount of Milkyway work at a time ("You have reached a limit for the number of tasks in progress"), which made it very tedious. Having proved the point (to my personal satisfaction, at least), I gave up, and my laptop is now crunching Milkyway on all cores for a while, then switching away and running four separate tasks from other projects. That doesn't need babysitting, and it all comes out the same in the end.
104) Message boards : Number crunching : N-Body Search Simulation using all CPUs (Message 59553)
Posted 8 Aug 2013 by Richard Haselgrove
Post:
The N-Body app is using all of my CPUs (4). I really only want it to use 1-2. Can I control this?

Not currently. This is a BOINC feature, not a Milkyway feature, and it will vary a bit from BOINC version to version, but the best recipe I've been able to find is:

1. Set 'No New Tasks' for the Milkyway project, and reduce the amount of CPU work (from all projects) that you have cached on your computer as far as possible.

2. As Mikey suggested, change "On multiprocessor systems, use at most {_} % of the processors" to 50% (a local-file equivalent is sketched just after these steps).

3. Allow work fetch for Milkyway, and download as much new N-Body work as you can.

4. Set 'No New Tasks' for Milkyway again.

5. Set "On multiprocessor systems, use at most {_} % of the processors" back to 100%.

You should now have a bunch of Milkyway work on your machine which will compute using two cores, and BOINC should schedule other projects to use the other two cores.

Rinse and repeat....
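For completeness: the "use at most {_} % of the processors" preference can also be flipped locally rather than on the website. It corresponds to <max_ncpus_pct> in global_prefs_override.xml in the BOINC data directory - a sketch only, and note that a local override takes precedence over your web preferences until you remove it:

    <global_preferences>
        <max_ncpus_pct>50.000000</max_ncpus_pct>
    </global_preferences>

Re-read it with the Manager's 'Read local prefs file' option, do the fetch, then set it back to 100.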
------------------------------------------------------------------------------
The reason this works (if I've got the description right - it's been a while since I researched this, and I only researched it under Windows) is that whenever you download new work, the N-Body program is reset to use the number of CPUs BOINC was allowed to use at the time of the download.

That's because the <app_version> block downloaded with the new work contains two distinct values:

<max_ncpus> 4 [or 2]
<cmdline>--nthreads 4 [or 2]

The first controls how many cores BOINC will allocate to N-Body, and the second controls how much CPU power N-Body will actually use (they need to be the same for efficient crunching). If you have CPU usage set to 50% when you request new work, both will be set to 2, and you have the slack for other projects that you want. As soon as you download new N-Body work while CPU usage is set to 100%, both numbers will jump up to 4, and you'll be back in your current position.
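For reference, here's roughly how that pair appears inside the <app_version> block in client_state.xml after a fetch made with all four cores allowed (a trimmed excerpt - other fields omitted, and the app_name is my assumption). A fetch made at 50% would show 2 in both places:

    <app_version>
        <app_name>milkyway_nbody</app_name>
        <plan_class>mt</plan_class>
        <max_ncpus>4.000000</max_ncpus>
        <cmdline>--nthreads 4</cmdline>
    </app_version>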
105) Message boards : News : N-Body 1.36 (Message 59529)
Posted 6 Aug 2013 by Richard Haselgrove
Post:
Do you know if you were awarded credit?

This is a rare occurrence and something this group is looking into.

On the contrary, it's something I see quite regularly - six from the last two days alone:

Validation inconclusive MilkyWay@Home N-Body Simulation tasks for computer 479865

Using standard BOINC terminology, it seems as if - after the task has been processed and reported - the validator randomly decides whether the host is 'reliable' or 'trusted' (not quite sure which applies here). If the host is not reliable or trusted, a second replication is generated and sent out. And when that second copy is returned, it is invariably (in my experience) treated as unreliable as well, leading to a third replication being issued.

Only once the third copy is complete does true validation (a comparison of the results) take place, and tasks which pass the test are granted credit: see for example WU 408707072 (same host), where the three successive replicated tasks have all been granted credit.

Most BOINC projects either require every workunit to be replicated and the results compared, or none of them. Milkyway seems to have an unusual server configuration with optional validation.

Projects which require 100% validation usually send all replicated copies out at the same time - that saves a lot of time (and server storage space) when long-running tasks need to be compared: the serial implementation here has kept two of my inconclusives waiting since 30 July and 15 July respectively, while 'wingmates' (as we call them) slowly catch up.

Edit - it looks as if the scheme you're using is Adaptive Replication.
106) Message boards : Number crunching : Berkeley Boinc Manager for Android Beta test (Message 59439)
Posted 23 Jul 2013 by Richard Haselgrove
Post:
Since it's a BOINC release, the official announcement was on the BOINC front page and in the BOINC 'news' message board:

http://boinc.berkeley.edu/
http://boinc.berkeley.edu/dev/forum_thread.php?id=8519&postid=49954
107) Message boards : News : N-Body 1.18 (Message 59305)
Posted 10 Jul 2013 by Richard Haselgrove
Post:
... you still may have that intermittent stack overflow bug kicking around, recompiling and then having to debug a new app sounds like extra work you really don't need at the moment.

Yup, got one today.

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=520735402
108) Message boards : News : Separation Modified Fit v1.24 (Message 59276)
Posted 8 Jul 2013 by Richard Haselgrove
Post:
Agreed. So in that case (which is what I'm doing) you may as well get rid of app_config since there isn't anything in it you can't do with app_info, and it's one less thing to worry about getting and keeping right. ;-)

<edit> IMHO, they got the precedence backwards simply because app_info is the more 'powerful' tool. But then what do I know? :-D

One thing you can do with app_config that you can't do with app_info is to make changes 'on the fly', and apply them without stopping/restarting BOINC. That makes it a useful prototyping tool, especially on a system which also runs projects which don't checkpoint well.

There's also no equivalent of <max_concurrent> available in app_info.
109) Message boards : News : Separation Modified Fit v1.24 (Message 59273)
Posted 8 Jul 2013 by Richard Haselgrove
Post:
Unfortunately, to work around BOINC limitations in the way it handles multi-threaded CPU applications (nBody) you have to use the anonymous platform currently. Once you do that the app_config file is ignored.

So it's all or nothing in the app_info file.

Actually, not so.

Both app_info and app_config can be used at the same time: if both are present, and both contain values for what is fundamentally the same parameter (e.g. <coproc><count> in app_info is the same value as <gpu_versions><gpu_usage> in app_config), then the value in app_config takes precedence. (source: Josef W. Segur)
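To make that concrete, the two fragments in question look roughly like this - a sketch only, with the app name and GPU type purely illustrative. In app_info.xml, inside the relevant <app_version>:

    <coproc>
        <type>ATI</type>
        <count>0.5</count>
    </coproc>

and the equivalent in app_config.xml:

    <app>
        <name>milkyway</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.05</cpu_usage>
        </gpu_versions>
    </app>

Both say "each task uses half a GPU" (i.e. run two at once, with <cpu_usage> being the CPU fraction reserved per GPU task); with both files present, it's the app_config value the client actually uses.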

The thing you can't do is to run one Milkyway app (say, N-Body) under app_info, but use stock settings and automatic updating for other apps from the same project. If you wanted to run both N-Body and a GPU app here, you would have to write an app_info.xml file which defines the files and parameters for both applications.
110) Message boards : News : Nbody 1.04 (Message 59132)
Posted 27 Jun 2013 by Richard Haselgrove
Post:
Is it ktm32.dll or ktmw32.dll? The error message says the first but your message board link has the second.

James

ktmw32.dll is a Microsoft Windows support file.

ktm32.dll is a typing error.
111) Message boards : News : N-Body 1.18 (Message 58882)
Posted 15 Jun 2013 by Richard Haselgrove
Post:
Interesting WU, which crashed the 1.18 (mt) Win-x64 application on my very very stable mobile Core i7 CPU "720QM":

-1073741571 (0xffffffffc00000fd) Unknown error number

http://support.microsoft.com/kb/315937 - 0xc00000fd is 'stack overflow' (-1073741571 is simply the signed 32-bit form of 0xc00000fd).
112) Message boards : News : N-Body 1.18 (Message 58864)
Posted 14 Jun 2013 by Richard Haselgrove
Post:
So, serious question, what happens when this occurs on your system? Does it just run the GPU task, and leave the other 3.96 CPUs completely idle?

Yes. That was part of my first "extended run" MT task, and I wanted to finish it and report it to see what happened next; and since there only appeared to this human to be a few more minutes to run, I suspended all other CPU work hoping that the MT task would run. It didn't, until I suspended the GPU task as well.

I'll have to run some more tasks - may be difficult this weekend - and get it into the same state with more debug log flags active - I think I'm going to need both priority and work_fetch, in addition to that cpu_sched.

@ jdzukley, extended logging as I've just suggested would be helpful, if you're familiar with setting that up in time - but unfortunately, I've got to go out now, so won't be around to guide you.
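For anyone wanting to follow along, those flags go in cc_config.xml in the BOINC data directory - roughly this (my sketch; it can be re-read without restarting BOINC via the read-config option in the Manager's Advanced menu):

    <cc_config>
        <log_flags>
            <cpu_sched_debug>1</cpu_sched_debug>
            <work_fetch_debug>1</work_fetch_debug>
            <priority_debug>1</priority_debug>
        </log_flags>
    </cc_config>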
113) Message boards : News : N-Body 1.18 (Message 58859)
Posted 14 Jun 2013 by Richard Haselgrove
Post:
There are a number of issues here, the most serious being that the MT task, when it changes status from High Priority to normal, then goes into "Wait" and holds all remaining CPU tasks hostage.

My observation too, and I think it ties in with my conversation with Jacob Klein.

Back in 2009, when I was doing similar tests and having similar conversations with the 'AQUA' project, we were still using BOINC v6 versions with separate CPU and GPU scheduling, each resource maintaining a separate 'debt' level with respect to other projects.

Now, I'm testing with BOINC v7.0.64, which maintains a single common 'REC' for each project, which is used to schedule both CPU and GPU tasks.

I'm seeing this failure to run MT and GPU tasks together after a long - many hours - exclusive occupation of all CPUs by the MT task running in high priority. This means the REC for MW (as a whole) will be high, and runtime prio will be low: BOINC will be reluctant to schedule MW for a while to come. This may result in a different set of scheduling decisions from those seen by Jacob, who - with very little MW running under his belt - will be in the opposite 'low REC, high prio' state. I've seen MT + GPU running properly in that state, too.

In order to help the bug-hunt here, here's the message log for the event I posted to boinc_alpha:

10/06/2013 13:53:11 | | [cpu_sched_debug] Request CPU reschedule: periodic CPU scheduling
10/06/2013 13:53:11 | | [cpu_sched_debug] schedule_cpus(): start
10/06/2013 13:53:11 | Milkyway@Home | [cpu_sched_debug] thrashing prevention: mark de_nbody_100k_chisq_alt_40913_1366886102_708586_1 as deadline miss
10/06/2013 13:53:11 | SETI@home | [cpu_sched_debug] scheduling 10ja12ab.22792.19492.13.12.3_0 (coprocessor job, FIFO) (prio -1.377456)
10/06/2013 13:53:11 | SETI@home | [cpu_sched_debug] reserving 1.000000 of coproc NVIDIA
10/06/2013 13:53:11 | Milkyway@Home | [cpu_sched_debug] scheduling de_nbody_100k_chisq_alt_40913_1366886102_708586_1 (CPU job, priority order) (prio -0.245088)
10/06/2013 13:53:11 | | [cpu_sched_debug] enforce_schedule(): start
10/06/2013 13:53:11 | | [cpu_sched_debug] preliminary job list:
10/06/2013 13:53:11 | SETI@home | [cpu_sched_debug] 0: 10ja12ab.22792.19492.13.12.3_0 (MD: no; UTS: yes)
10/06/2013 13:53:11 | Milkyway@Home | [cpu_sched_debug] 1: de_nbody_100k_chisq_alt_40913_1366886102_708586_1 (MD: no; UTS: no)
10/06/2013 13:53:11 | | [cpu_sched_debug] final job list:
10/06/2013 13:53:11 | SETI@home | [cpu_sched_debug] 0: 10ja12ab.22792.19492.13.12.3_0 (MD: no; UTS: yes)
10/06/2013 13:53:11 | Milkyway@Home | [cpu_sched_debug] 1: de_nbody_100k_chisq_alt_40913_1366886102_708586_1 (MD: no; UTS: no)
10/06/2013 13:53:11 | SETI@home | [cpu_sched_debug] scheduling 10ja12ab.22792.19492.13.12.3_0
10/06/2013 13:53:11 | Milkyway@Home | [cpu_sched_debug] avoid MT overcommit: skipping de_nbody_100k_chisq_alt_40913_1366886102_708586_1
10/06/2013 13:53:11 | | [cpu_sched_debug] using 0.04 out of 4 CPUs
10/06/2013 13:53:11 | NumberFields@home | [cpu_sched_debug] wu_DS-16x121_Grp514181of819200_0 sched state 1 next 1 task state 9
10/06/2013 13:53:11 | NumberFields@home | [cpu_sched_debug] wu_DS-16x121_Grp523637of819200_0 sched state 1 next 1 task state 9
10/06/2013 13:53:11 | Milkyway@Home | [cpu_sched_debug] de_nbody_100k_chisq_alt_40913_1366886102_708586_1 sched state 1 next 1 task state 9
10/06/2013 13:53:11 | NumberFields@home | [cpu_sched_debug] wu_DS-16x121_Grp526885of819200_0 sched state 1 next 1 task state 9
10/06/2013 13:53:11 | SETI@home | [cpu_sched_debug] 10ja12ab.22792.19492.13.12.3_0 sched state 2 next 2 task state 1
10/06/2013 13:53:11 | | [cpu_sched_debug] enforce_schedule: end

I'm assuming that, somewhere in the switchover from debt-based to REC-based scheduling, a code path was introduced which overlooks the 'allow minimal overcommits for GPUs' rule (here, the MT task's four threads plus the GPU task's 0.04 CPUs come to 4.04 - within the #CPUs+1 allowance - yet the MT task was skipped). It will be interesting to see if this has persisted into v7.1.15 and later. I haven't seen any alterations in that area, but Jacob has been watching that part of the code more closely than I have over the last few weeks.
114) Message boards : News : N-Body 1.18 (Message 58849)
Posted 14 Jun 2013 by Richard Haselgrove
Post:
Richard,

I'm not trying to hijack a thread here, but... I wanted to point out that the scheduling policy may not be bugged here.

There's a basic "Job Scheduling" section within the BOINC documentation, found here: http://boinc.berkeley.edu/trac/wiki/ClientSched
Those 4 bullet points drive the main scheduling.

But I think, additionally (per the emails below from April, where I asked David Anderson about it a bit)... the scheduler makes sure never to commit more than #ncpus when running a multi-thread (mt) task. Conversely, it won't schedule an mt task if the resulting cpu usage would be more than #ncpus.
------------------------------------------------------------------------------------
Date: Tue, 2 Apr 2013 14:49:34 -0700
From: davea@ssl.berkeley.edu
To: boinc_alpha@ssl.berkeley.edu
Subject: Re: [boinc_alpha] Using app_config.xml <cpu_usage>2</cpu_usageresults in underloading/overloading CPU

These are both consistent with the current job-scheduling policy:

1) If a multi-thread job is running, the scheduler won't run a CPU job
if doing so would exceed #CPUs.

2) the scheduler will run GPU jobs until the CPU load is #CPUs+1
(but not beyond that).

There is a rationale for both of these, though they are both
open to debate.

-- David
------------------------------------------------------------------------------------

I have no query with policy (1) in that email - that seems to be working as designed and intended.

But I think the current v7.0.64 scheduler is failing to obey policy (2) - the relationship between MT and GPU tasks.

I've been following this one for quite a while: very few projects supply MT apps, so it doesn't get much attention in alpha testing - the ClientSched document you linked doesn't even mention MT jobs, let alone co-scheduling with GPUs, and the change from SVN to GIT has made old code changes unsearchable. But I'm pretty certain I had a hand in:
------------------------------------------------------------------------------------
Revision: 97ee3a38f265653d6b16bd5611df3ece4b2eef91
Author: David Anderson <davea@ssl.berkeley.edu>
Date: 22/09/2009 00:23:40
Message:
- client: tweak CPU scheduling policy to avoid running multithread apps overcommitted.
Actually: allow overcommitment but only a fractional CPU
(so that, e.g., we can run a GPU app and a 4-CPU app on a 4-CPU host)


svn path=/trunk/boinc/; revision=19126
----
Modified: checkin_notes
Modified: client/cpu_sched.cpp
------------------------------------------------------------------------------------
At the moment, we can only do the highlighted bit some of the time.
115) Message boards : News : N-Body 1.18 (Message 58804)
Posted 12 Jun 2013 by Richard Haselgrove
Post:
And another mt stuck, this time at 46.415% after one minute precisely. Every test batch has one of these problem mt jobs wanting all the resources to enable completion.

It's not the batch or the WUs wanting anything, it's the bug in the BOINC client scheduler - it's being too assertive when trying to avoid over-committing the CPUs.
116) Message boards : News : N-Body 1.18 (Message 58797)
Posted 12 Jun 2013 by Richard Haselgrove
Post:
yes, but on my computer for mt dark tasks estimated at less than 10 minutes, 6 of 12 CPUs are parked for the entire duration of the task. Also, look at run time versus CPU time +/- equal in the results file. Why does the above group of tasks always have this condition! Bottom line the above referenced group of tasks are executing very ineffectively, perhaps with correct results, and are reserving 1100% more resources meaning only 1 CPU is required, and 12 CPUs are reserved for the entire run time!

also note that actual run time is most often 5 to 8 times (*) > original estimated run time. In other words, if MT was really working on the above group, the original estimate is ok.

I think I may have a better idea of this 'dark' task problem now.

I was running ps_nbody_06_06_dark_1371001451_18919 when I took this screenshot.


[screenshot: BOINC Manager plus Process Explorer windows - originally embedded here with a direct link]

BOINC Manager (in the background) shows that the task has been running for 3 minutes already, but it's still showing 0% progress - and it's only using one thread (left-hand Process Explorer window).

A couple of minutes later, it went into full multi-threaded mode, and progress jumped from 0% to 100% in a couple of seconds. Finally, there was a further 5 minutes or so of running after 100% had been reached, during which again only one thread was active.

Other types of task seem to spend far less time (both absolutely and relatively) in the single-threaded plateaux at 0% and 100%, and far longer in the multi-threaded stage where the progress percentage counter is incrementing steadily. Hope that helps the programmers track down what's going on.
117) Message boards : News : N-Body 1.18 (Message 58779)
Posted 12 Jun 2013 by Richard Haselgrove
Post:
I've now watched an N-Body task through being pre-empted by BOINC to allow other projects' tasks to run.

During suspension (with the app left in memory), one thread - perhaps a watchdog or timer thread - continued to show a Cycles Delta ~300K (trivial), but the worker threads were completely dormant.

When BOINC gave the task the resume instruction, all four (in my case) worker threads resumed, with the normal 2 billion Cycles Delta.

IMHO, the application's internal multi-threading behaviour is working properly (though I'll try to watch the final wrap-up at the end of this 100k_chisq_alt task, because I suspect the threads may finish out of synch, and wind down CPU usage to a single 'housekeeper' thread in the final stages.)

I think the remaining issues are the initial runtime estimates for the tasks, and some rough edges in BOINC's scheduling of MT tasks: that hasn't been given much operational testing since BOINC v7 was launched, which is the primary reason why I'm here. Let's discuss that when Travis has got the server back on an even keel with respect to the GPU applications.

BTW, I was recently reminded of a thread I started, and Travis contributed to, with regard to the difficulties of predicting runtimes in advance. I don't think much has happened on this issue in the intervening 18 months.

http://www.setiusa.us/showthread.php?2415-BOINC-DA-admits-CreditNew-is-really-a-random-number-generator
118) Message boards : News : N-Body 1.18 (Message 58725)
Posted 11 Jun 2013 by Richard Haselgrove
Post:
Can someone please let me know if the process is still running after it gets stuck at 98%?

Thanks,

Jake

Rather depends what you - and they - mean by 'get stuck'.

What I saw at 99.710% was a perfectly normal "no longer needs to run in high priority" (not in danger of missing deadline), so BOINC gave it a rest and gave other projects' work a chance to run, to balance resource share. I didn't explicitly check that all threads had suspended when BOINC told it to get out of the way - I will do next time - but I didn't notice any of the replacements running slow.
119) Message boards : News : N-Body 1.18 (Message 58699)
Posted 11 Jun 2013 by Richard Haselgrove
Post:
That is strange. I am pulling this run down.

Jake

Would you still like my attempt at http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=372074737, since it isn't in the affected group?

Better to abort it before it starts on the 3.5 year odyssey, if you don't want it, but I'm happy to let it run.
120) Message boards : News : N-Body 1.18 (Message 58658)
Posted 10 Jun 2013 by Richard Haselgrove
Post:
Continued Observations: So far AS I HAVE OBSERVED only 1 dark mt job has utilized all 12 cores. All of the short jobs - estimated at less than 10 minutes - have many cores "parked". The one dark mt job that used all cores had an estimated time in the 0'000 hours, and took say 45 minutes to run...

Picking one of these messages at random for an observation.

Watching a few MT tasks (short ones) run through to completion. They seem to reach 100% progress, then stay 'Running' at 100% for a while.

Checking with Process Explorer, what seems to be happening is that most of the threads finish whatever their job was, and just one thread is still chugging away - I'm wondering if this might be what jdzukley is seeing?

The big trick with multithreaded programming is to give all the threads the same amount of work to do, so they all finish together (or to keep doing some sort of thread synchronisation to keep them in step as the run progresses). This is particularly true in the BOINC MT environment where the CPUs which have finished their allotted work aren't released back into the pool for re-assignment until the last laggard slowcoach has finished.

