Welcome to MilkyWay@home

Posts by alanb1951

21) Message boards : Number crunching : Thread to report issues after server migration (Message 76525)
Posted 3 Nov 2023 by alanb1951
Post:

P.S. I note that the new master URL has -new added, but the web site doesn't -- is that the long-term plan or will the master URL end up changing back to not having -new in it?


milkyway-new.cs.rpi.edu is the master address but using just milkyway.cs.rpi.edu will also reach milkyway-new. This is the way the security team here at RPI set it up.

Thanks for the clarification!

Back to the problem at hand: I see that the certificate issues are fairly well documented in this thread by now :-) -- if your security folks are saying there's nothing wrong [because it works for Windows and for browsers on Linux] please inform them otherwise :-)

Cheers - Al.

P.S. I will not be patching my certificate store :-)
22) Message boards : Number crunching : Thread to report issues after server migration (Message 76493)
Posted 2 Nov 2023 by alanb1951
Post:
Kevin,

I see Nick has tried the latest available client, but I'm using the latest repository clients (which are a tad older!) and given his experience I'm not going to spend ages working out how to build a client :-)

On client 7.20.2 I see the following at the end of a connection attempt (after it has redirected to https on port 443):

Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  TLSv1.2 (OUT), TLS header, Unknown (21):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  TLSv1.3 (OUT), TLS alert, unknown CA (560):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  SSL certificate problem: unable to get local issuer certificate
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  Closing connection 16
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] HTTP error: SSL peer certificate or SSH remote key was not OK


Similar on client 7.20.5 too.

Hope this helps.

Cheers - Al.
23) Message boards : Number crunching : Thread to report issues after server migration (Message 76484)
Posted 2 Nov 2023 by alanb1951
Post:
Kevin,

As has been noted in the News thread, it appears that Linux systems can't re-attach...

Trying to amend the URL offered by BOINC Manager results in a "Please try again later" message. Not helpful :-)

So I used boinccmd to attach instead, and that seemed to do something. However, looking in BOINC Manager after that shows the project identified by URL rather than by name, and the status is reported as "Scheduler request pending. Project initialization" (with a Communication deferred time appended).

If I try an update, it offers "Fetching scheduler list" then reports "Project communication failed"...

Checking in /var/lib/boinc-client, I have what seems to be a valid account_milkyway-new.cs.rpi.edu_milkyway.xml but
master_milkyway-new.cs.rpi.edu_milkyway.xml is empty.

Hope that helps :-)

Cheers - Al.

P.S. I note that the new master URL has -new added, but the web site doesn't -- is that the long-term plan or will the master URL end up changing back to not having -new in it?
24) Message boards : Number crunching : What's Everybody Doing with their Double Precision These Days? (Message 76407)
Posted 3 Oct 2023 by alanb1951
Post:
it's not 64-bit but it is science work :-)

According to this post "BRP7 requires and uses fp64 / double precision in some kernels". And according to this, it's not just a small part of the computation.
OOPS! Thanks for the correction!

Cheers - Al.
25) Message boards : Number crunching : What's Everybody Doing with their Double Precision These Days? (Message 76405)
Posted 2 Oct 2023 by alanb1951
Post:
Regarding Einstein work -- if you were only subscribed to the Gamma Ray Search you may not realize that that project has officially finished, so no more work! If that's the case, try the MeerKAT pulsar search instead (BRP7) -- it's not 64-bit but it is science work :-)

[Edit - I see Link posted about that while I was composing this -- ah well...]

As for WCG -- OPNG work tends to be a bit on/off (and they sometimes seem to restrict work availability over weekends to try to avoid possible upload/download issues when there's no-one there to look into it). Also, they had some bad OPNG batches which caused a fairly long time-out through August and into September :-(

It's worth exploring the forums for Einstein and WCG to keep an eye on what's going on :-)

Cheers - Al.
26) Message boards : Number crunching : Is there a way to split total available cores over multple tasks here? (Message 76310)
Posted 22 Jul 2023 by alanb1951
Post:
Link - I saw your reply and it made me wonder if the client I was testing on had got confused at some point...

So I restarted the one I used for my experiments and repeated some of the simple tests, and it now seems to respect --nthreads if it is present, but it is still using the count of available CPUs (based on avg_ncpus if that is present) when --nthreads is absent. That is how I would've expected it to behave, despite my earlier experimental observations to the contrary :-)

I am at a loss to explain the apparent change away from the [unexpected] behaviour I observed earlier :-) -- I think the moral might be to always include avg_ncpus and have it equal to the --nthreads value "just in case"... As mentioned elsewhere, that also has the useful side-effect of keeping the client scheduler properly informed as to CPUs in use!
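For anyone wanting to apply that moral, the shape of the app_config.xml is simply this (the 3s are just my preferred value -- adjust to taste):

```xml
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
</app_config>
```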

Thanks for making me have another look; I wasn't that happy with what I thought I'd found :-)

Cheers - Al.

P.S. I had already noticed that BOINC Manager seems to report the CPU count that was current when tasks were downloaded (until after a restart...), but I'd been monitoring thread usage by looking at stderr.txt in the task's slot directory and with a system tool, which always showed one more thread than the number OpenMP reported in the stderr file. There always seems to be an apparently idle extra thread, even on my system where there's never been an app_config file - I presume it's the checkpoint handler and/or some sort of "watchdog".

[Edited to add reminder about avg_ncpus and the client scheduler]
27) Message boards : Number crunching : Is there a way to split total available cores over multple tasks here? (Message 76306)
Posted 22 Jul 2023 by alanb1951
Post:
The nthreads parameter is ONLY for MT tasks. It sets the maximum number of threads per task to use. For the OP who wanted to run on 15 threads total for the host, that works out as 3 tasks in total using 5 threads each.
You can drop the ncpus value entirely as it is ignored by the nthreads parameter setting.
I seemed to recall some posts about N-body not actually using the --nthreads parameter but a cursory search didn't find anything so I conducted an experiment on one of my systems with enough cores to make it a proper test.

I normally allow 9 CPU threads for BOINC on the system I tested -- its normal app_config.xml is
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
  <project_max_concurrent>2</project_max_concurrent>
</app_config>

I reduced my queue size to make sure I didn't get swamped with new work, removed the avg_ncpus line and re-read the config files. It immediately suspended one of the two N-body tasks it had been running, as if the client's scheduler now thought tasks used all 9 threads :-) -- the task that carried on running continued to use three threads... I looked at client_state.xml and the app_version section for nbody now included an avg_ncpus value of 9 (which explained the scheduler behaviour!)

It then fetched one new work unit, which apparently wanted 9 threads so I suspended that one before it started, restored the normal app_config.xml and got two tasks running again...

On another (smaller) machine I tried an app_config.xml file with avg_ncpus less than the total allowed to BOINC and without the --nthreads command-line option. That happily ran tasks utilizing the required number of threads, despite the absence of --nthreads!

So that left the question of how the executable decides on the thread count... Every task that starts up gets a file called init_data.xml in its slot directory; that file contains a lot of information from various places, including an ncpus value which appears to be the same as the avg_ncpus value in the app_version data at the time the task is started (or restarted after a BOINC restart). It seems likely that the N-body app digs the thread-count out of that file.
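To illustrate that last point, here's a rough sketch (in Python, and emphatically not the actual N-body code, which is C/C++ using the BOINC API) of how an application could dig ncpus out of init_data.xml:

```python
import xml.etree.ElementTree as ET

def ncpus_from_init_data(xml_text, default=1):
    """Return the <ncpus> value from an init_data.xml document.

    init_data.xml is written by the BOINC client into each task's
    slot directory; <ncpus> reflects the avg_ncpus in force when
    the task was (re)started."""
    root = ET.fromstring(xml_text)
    node = root.find("ncpus")
    if node is None or node.text is None:
        return default
    # The client writes this as a floating-point value (e.g. "3.000000")
    return int(float(node.text))

# A cut-down example of the sort of thing the client writes:
sample = """<app_init_data>
<ncpus>3.000000</ncpus>
</app_init_data>"""

print(ncpus_from_init_data(sample))  # 3
```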

I have no doubt that other OpenMP programs may well respect a thread-count parameter of some form, but it certainly appears that N-body doesn't :-)

Cheers - Al.
28) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76298)
Posted 20 Jul 2023 by alanb1951
Post:
Aurum,

Al, "instructions retired per second" is the secret sauce that makes this work so well. I had taken your word for it before and have all my computers running 3-thread nbodys. Thanks much for doing this useful work and explaining it so thoroughly.
Glad to have been of help!

I once tried to install PERF to measure instructions retired but it gave me four ways to install it and I didn't get it working. It's the only Linux program I've ever seen suggest multiple ways to install.
Regarding installing perf (and other kernel-specific tools): I'm on Ubuntu rather than Mint, and all I needed to do was tell Synaptic Package Manager (which I use rather than running apt or dpkg from a console!) to install linux-tools-generic... This makes sure that whenever a new kernel version is installed via Software Updater a new version of perf (and friends) gets pulled in to match the new kernel -- the older versions will remain until the corresponding kernels are uninstalled.

The alternative is to install explicitly by kernel version -- that can be a bit of a pain :-) as one ought to match both the version and the flavour (which is generic in my case and yours...)

I'm not familiar enough with Mint to know whether it structures its packages in the same way; sorry about that...

CPDN has issued a slug of WAH but it's windoze only. Their new guy is planning on putting out much OpenIFS for Linux this fall:
https://www.cpdn.org/forum_thread.php?id=9149&postid=69160#69160
Yup, I've been following that saga and am looking forward to running some proper 64-bit CPDN work...

Cheers - Al.
29) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76297)
Posted 20 Jul 2023 by alanb1951
Post:
mikey - for information:

"PERF" as in Ithena Measurements Perf tasks? If so it's like any other Boinc project you choose what kind of tasks you want to run, Perf is for Windows only while Ooni tasks are Linux only and the Cnode tasks are for both I think. There is also a Project Ithena computation that has Hex tasks https://comp.ithena.net/usr/
Nope - we're talking Linux, and perf is one of a set of kernel-specific system tools; it offers various different ways of looking at system performance...

Cheers - Al.
31) Message boards : Number crunching : validation inconclusive on some tasks (Message 76289)
Posted 19 Jul 2023 by alanb1951
Post:
Mikey is right about the usual meaning of Validation Inconclusive, and about the way it sends out the tasks one at a time...

A number of the tasks that still show up in your tasks report are Separation tasks, some/all of which may never get cleared out because of the way they shut Separation down -- you may have spotted that and allowed for it when counting tasks, in which case apologies for mentioning it!

The workunit you posted about (960542719) has now validated and is quite interesting in that it drew my attention to how MilkyWay flags tasks that fail to validate. The tale it tells is thus:

  • Initial wingman aborted it about 90 minutes after receiving it;
  • your task (922770178) returned and waited (reporting either Validation Inconclusive or, perhaps[1], Pending Validation until the _2 task returned);
  • the _2 task returned and didn't match well with yours when validated (definitely Inconclusive now!);
  • a _3 task was sent out and returned - it was a good enough match to _2 that _2 and _3 were declared valid and yours was rejected.


I note that it has marked your task as Validate error; many projects use that tag for tasks whose results the validator can't understand well enough to attempt validation at all[2], marking basic failures to match as Invalid. There didn't seem to be anything blatantly wrong with what your task returned (though it was a long way off the results for the two that validated), so I guess MW uses that label for all types of validation failure...

I note also that you have a small number of other N-body tasks that got Validate errors (with the same sort of mismatch of results...) I wonder if that has something to do with your allowing 15 CPU threads and the system sometimes losing [partial] track of what it's doing (for instance, a missed thread synchronization might do that...) It's unlikely to be a hardware issue -- I run a similar system (but under Linux) and it has never had an N-body task that failed to validate (at 3 threads per task and only 11 or 12 threads allowed to BOINC in total...)

I've also read your messages about N-body in other threads, and note the advice shared there -- hope you can get it sorted out properly soon!

Cheers - Al.

[1] If it decides your task is a candidate for validation without a matching wingman but the validator then disagrees, it should be tagged Inconclusive at once, whereas if it has already decided you need a wingman before the validator gets a look in it should be tagged Pending. MW has a strange validator, so it may not always do what might be expected :-)

[2] For example, WCG tags mismatched validations as "Invalid" and results the validator can't understand as "Error" (making them indistinguishable from any other sort of error) whilst Einstein tags mismatched validations as "Completed, marked as invalid" and results the validator can't understand as "Validation Error" -- there is enough information passed between the validator and the database to tell the two cases apart, but the web interface has to pay attention to it...

32) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76264)
Posted 13 Jul 2023 by alanb1951
Post:
(Aurum: it took me a moment to realize you were quoting from one of my earlier posts - sorry about the delay in responding.)

I found that there was a slight degradation for each thread up to three or four, then it went downhill quite fast -- while I was working out an optimum for one of my systems, I found 2 three-thread tasks worked the CPUs harder than 1 six-thread task did
How do you define "works harder?"
"Works harder" is based on instructions retired per second rather than total run time. The latter is not as useful as a throughput statistic because no two N-body tasks are guaranteed to execute [roughly] the same number of instructions...

I measured instructions/second for a task over a prolonged period to ensure I was likely to catch multiple iterations; I set the checkpoint interval quite high to avoid accidentally sampling a task during a checkpoint, as such activity is likely to involve a lot of context switches in a short time interval, and a lot of the CPU usage at that point is O/S rather than application...

If I wanted to test two N-body tasks running together, I'd sample one of them for a specified time then sample the other for the same amount of time. In general, if the tasks had the same number of threads they would perform in a reasonably similar fashion. By only allowing the same total number of threads, I was able to observe that there was a large enough improvement to make the smaller tasks more practical (if I intended to give that many threads to N-body...)

(Of course, I repeated the tests on several different tasks, and had to discard one or two tests because the tasks finished mid-test! The run-time estimates for N-body are not very accurate...)

It is likely that part (if not all) of the apparent improvement is down to threads for a specific task being somewhat less likely to get out of sync if there are fewer of them; if threads are being juggled around, caches and TLBs may be affected and more instructions will be likely to stall.
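For what it's worth, the arithmetic itself is trivial -- I was using perf to sample instruction counts over a fixed interval and dividing. Something along these lines (the CSV field layout is what my perf version emits with -x,; treat that as an assumption, and the numbers in the example are made up):

```python
# Counts come from something like:
#   perf stat -e instructions -x, -p <PID> -- sleep 60
def instructions_per_second(perf_csv_line, interval_secs):
    """With `perf stat -x,`, the first CSV field is the counter
    value and the third is the event name (field layout assumed --
    check your perf version)."""
    fields = perf_csv_line.strip().split(",")
    count, event = fields[0], fields[2]
    if event != "instructions":
        raise ValueError(f"unexpected event: {event}")
    return int(count) / interval_secs

# A made-up 60-second sample:
line = "540000000000,,instructions,60000000000,100.00,,"
print(instructions_per_second(line, 60))  # 9000000000.0
```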

Incidentally, I never allow BOINC to use more than 75% of CPU threads; most of my systems have enough stuff going on in the background that leaving about 25% of threads free seems an effective level. Of course, that's my systems; others may find different settings are better (especially if they have completely different hardware platforms).

And what CPU did you test this on? TIA
The only systems I have [at present] that are likely to offer more than two or three threads to N-body are the Ryzen 3700X and a Ryzen 5600H I mentioned in my earlier post -- I tested the 3-versus-6 scenario on both with no other BOINC tasks running, and also looked [briefly] at the effect of various workloads on a single 3-thread task. Both systems have 2x16GB RAM; the 3700X has total power limited to 80W (which doesn't seem to slow it up much, if at all!)

I suspect the outcome might be [slightly] different on Intel non-server CPUs, and likewise for any server chipsets; unfortunately, I don't have any of the latter for testing :-) Throughput may also depend on the number of memory channels and how memory is installed -- I've seen that discussed elsewhere (probably at WCG, but I may be misremembering...)

By the way, I haven't seen any CPDN work or WCG ARP1 work in a while -- I rather suspect that when either of those makes an appearance the memory will take an extra hit and N-body tasks will run less efficiently at that point :-)

Hope that answers your questions.

Cheers - Al.
33) Message boards : Number crunching : Will N-Body projects all use multiple CPUs? (Message 76254)
Posted 13 Jul 2023 by alanb1951
Post:
Aurum,

Note that I'm not a MilkyWay researcher or technician, but I'll have a go at this...

Is n-body a legitimate multi-CPU project or is it just multiple WUs in one package?
It is a multi-threaded application, using OpenMP. Given that (unlike Separation) it only produces one result, there are not multiple WUs in one package. And it'll quite happily send a task for the same WU to systems that allocate different numbers of threads (including one!) :-)

Does n-body use all allocated CPUs for the entire run?
I think it only uses one thread during set-up (the starting phase that takes about 30 seconds on most of my systems); it seems to use all allocated threads after that.

Or does it start using all CPUs and then as parts finish CPUs go idle and wasted until there's just one left running at the end?
It seems to try to apportion the required computational effort across all threads, but there are all sorts of reasons why seemingly identical blocks of work might take different amounts of time -- some of the time, threads will be idling if they are out of sync at key points. This becomes more noticeable as more threads are given to a single task, as it becomes more likely that the O/S will interrupt a (random?) thread to perform some necessary activity of its own :-)
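A toy model of that last point, if it helps: with barrier-style synchronization each step costs the maximum of the per-thread times, so the more threads there are, the more likely one straggler stalls all the rest. The numbers below are made up purely for illustration:

```python
import random

def step_time(nthreads, base=1.0, jitter=0.2, steps=1000, seed=42):
    """Average cost of a barrier-synchronised step: each step takes
    as long as its slowest thread, so the expected wait grows with
    the thread count (a toy model, not a measurement)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += max(base + rng.uniform(0, jitter) for _ in range(nthreads))
    return total / steps

# More threads -> longer average step (more time lost waiting):
print(step_time(3) < step_time(6) < step_time(12))  # True
```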

Hope that goes some way to answering your questions.

Cheers - Al.
34) Questions and Answers : Macintosh : Why is there no N-body application for Mac's? (Message 76238)
Posted 11 Jul 2023 by alanb1951
Post:
Mikey,

This is exactly where I wish Richard Hassellgrove and his group of software testers could get with the different Projects and get some basic across the Projects tech support, probably not free though, to help with things like this. Einstein for example has Mac apps for all the different versions from a Cheese Grater Mac to the new M2 cpu. Maybe they could even provide their apps, admin to admin, and let MilkyWay for example work on changing it so it works here.
Yup, that would be wonderful, but a lot [if not all] of the BOINC development and support is now on a volunteer basis -- I wonder if Richard may be a volunteer himself :-) (By the way, he is currently trying to help the CPDN folks with a recalcitrant credit problem...)

As for getting help from the rest of the community, they'd need some project that has multi-threaded code that runs on Apple Silicon, and they'd probably have to find an expert programmer from somewhere. None of the Einstein apps available on Apple Silicon are multi-threaded as far as I'm aware... :-(

You're probably familiar with "Good, quick, cheap - any two!" (or similar); the reality in many cases is that one only gets really good in the absence of both quick and cheap! Sadly, the goal in a lot of places is to do things as cheaply as possible nowadays; this is especially true in most academic environments that don't have huge research budgets (even Einstein have lost staff that were not replaced...). And even if there is a willingness to recruit, most programmers can earn far better money in a non-academic/research environment, especially in specialist cases[1].

I try not to be pessimistic about the future of distributed computing, but sometimes it's quite difficult!

Cheers - Al.

[1] I wonder who'll pick up the Apple BOINC client if/when Charlie Fenton is no longer willing/able to look after it...
35) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76234)
Posted 11 Jul 2023 by alanb1951
Post:
James,

I will just run one project at a time and forget the rest of the problems that come with running more than one or a project that requires virtual box. If it is not plug and play, I will just ignore it.

Fair enough, it's your choice to make, but if you change your mind...

Mikey's information was good (apart from the Windows-specific path!).

You have a recent BOINC client so if it did a standard install, there will be a directory /var/lib/boinc (which may be a link to /var/lib/boinc-client in some cases!) One of the subdirectories in there will be called projects and the various projects each have a separate directory in there. Most will have a directory name that should identify the project :-)

Should you change your mind and feel the need for more guidance, let us know what sort of work mix you'd like to achieve and someone can probably offer further assistance. And if you're feeling adventurous and want to find out more about app_config.xml on your own you could always look at https://boinc.berkeley.edu/wiki/Client_configuration [if you haven't already done so :-)]

Good luck and happy crunching.

Cheers - Al.

P.S. On my Ryzen 3700X I run a mix of WCG (selected sub-projects, each separately managed via app_config.xml), Einstein (GPU work only) and MilkyWay N-body, with TN-Grid as a fallback for times when WCG is short of work. I found it easy to set up and tune, and it rarely gives me any trouble unless there's a mass shortage of work :-)
36) Questions and Answers : Macintosh : Why is there no N-body application for Mac's? (Message 76225)
Posted 10 Jul 2023 by alanb1951
Post:
Please note - I am an end user, not a member of the MilkyWay team, so this is a personal opinion...

My guess is it won't happen soon, if ever :-( -- I can think of several reasons why Apple Silicon is not seen as a priority.

Firstly, there may not be a viable long-term version of OpenMP available for M1/M2 (&c) [similar to the lack of guaranteed long-term availability of OpenCL...] If a major rewrite is necessary to use another multi-threading mechanism, that is not likely to happen in house given that those running MilkyWay are not necessarily programmers first and foremost.

(If in doubt about OpenMP on Apple, put "OpenMP Apple Silicon" into a search engine and see what you get back; much of it is not reassuring...)

Secondly, even if OpenMP is viable, the availability of hardware and "expert" staff time to port to Apple Silicon might be seen as a negative.

Thirdly, even if there is the will to get past points one and two above, there then needs to be a prolonged period of testing to ensure that the new application produces results that are at least in the right general region -- otherwise, think what would happen if a pair of Apples both returned bad science! (That isn't a dig at Apple kit, by the way; there have been examples in other projects where some hardware and/or O/S-specific libraries produce significantly different results; more often seen with GPU code, but...)

I no longer have any Apple kit [for various reasons not relevant here], but I do think it'll be a shame if Apple Silicon kit finds itself shut out of various projects because Apple places [accidental?] barriers in the way of non-commercial software development; the kit seems impressively fast and there are lots of willing users out there (WCG and Einstein users are baying for proper Apple Silicon apps, for instance!)

Cheers - Al.

P.S. I no longer have any Windows kit either :-)
37) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76224)
Posted 10 Jul 2023 by alanb1951
Post:
James,

TL;DR -- use app_config.xml

[Otherwise...] A couple of points about managing work, one about managing BOINC work in general and one specific to MilkyWay N-body...

The general one first because it plays into the overall situation - it's about how much of your system BOINC is allowed to use, and how:

A system needs one or two spare processor threads all the time to manage the system, handle I/O and do whatever else might be needed. I note that you seem to have constrained your system to leave one processor thread free; I'd be very inclined to free up at least one more processor thread because there's quite a lot going on besides your user program(s) (BOINC or otherwise) and every time one of those things needs to get a turn it has to suspend another process if there's a shortage of free processors :-)

There's another issue too; certain sorts of work place enough overheads on L3 cache and RAM access to result in processor threads running idle for lack of data. There are projects where it is often recommended that one uses at most half the available CPU threads, and I've had work from at least one project where an 8-thread processor was more productive running 2 tasks than 4 :-) (In certain cases there might be other non-calculation related reasons for delay, but this is getting a bit long already!)

There's also the matter of managing how much work is run at once if one has multiple BOINC projects active. This is best managed by using an app_config.xml file in each project folder to provide a project_max_concurrent constraint as to how many tasks can run for that project at the same time. (There's plenty of discussion about app_config.xml elsewhere...)

A couple of my systems are Ryzens (a 3700X and a 5600H) and I've found I get the best throughput if I only use 75% of the CPUs (12/16 for the 3700X, 9/12 for the 5600H) -- to be fair, I have a couple of home-grown BOINC monitoring daemons running on each system (so I like to keep a thread clear for when one of those needs to run), and I'm running a mix of CPU tasks from WCG and MilkyWay, and I have never experimented with an all-MilkyWay mix, but the principle is the same.

Now, regarding N-body (and, possibly, other multi-threaded BOINC project applications out there), assigning too many threads to one task may not be a good thing...

By default, a single N-Body task will use all CPU threads up to the limit set by your "Use at most ??% of the CPUs" or 16, whichever is less. Unfortunately, the more threads assigned to a single task, the more likely it is that one or more threads will be suspended at any given time to let the BOINC client in or to allow the O/S to do its thing, and as there are regular points within the N-body application where it needs to ensure all threads are [more or less] in sync there will be times when threads are waiting, not computing.

You'll get more throughput if you actually use an app_config.xml file in the MilkyWay project folder to constrain the number of threads allocated to a single task. I found that there was a slight degradation for each thread up to three or four, then it went downhill quite fast -- while I was working out an optimum for one of my systems, I found 2 three-thread tasks worked the CPUs harder than 1 six-thread task did, and I now run one 3-thread N-body at a time, intermixed with other BOINC stuff.
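For completeness, the shape of the file I'm describing is as follows -- it goes in the MilkyWay project directory, then tell the client to re-read config files. These values match the "one 3-thread task" setup I mentioned; tune for your own kit:

```xml
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>
```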

That remark about reducing the number of threads is equally applicable on Windows systems, by the way, and there's a fair amount of discussion about using app_config.xml files for this purpose over in Number Crunching.

Hope this helps.

Cheers - Al.
38) Message boards : Number crunching : N-Body tune initial replication value (Message 76122)
Posted 2 Jul 2023 by alanb1951
Post:
Mikey described BOINC's built-in Adaptive Replication mechanism, which appears to be in use for N-Body and was in use for Separation. Around the time of the big server crash, both N-Body and Separation had difficulties with Adaptive Replication (possibly caused by delays in how long it was taking initial results to come back?) and I don't think I have seen an N-Body task pass the adaptive replication test since (if they ever did before!), not that I've been checking every task :-)

Typical requirements for a project to use Adaptive Replication are that there are a multitude of tasks that have almost exactly the same parameters (so statistical variations are allowed for) or that a result doesn't have to have extremely high precision anyway. Some WCG projects may fit the latter, and I suspect MilkyWay projects use the former approach!

MilkyWay projects use a back-end toolkit called [I think] Toolkit for Asynchronous Optimization. If I understood correctly what little of the code I tried looking at, the customized BOINC validator insists on a wingman (even in AR-permitted conditions) if a result is for a parameter grouping that hasn't already validated a work unit -- I'm guessing that there are situations in which every workunit ends up needing a wingman anyway, and it may be that N-Body is in that situation.

If that really is the situation, it might be better if they altered the project configuration to not use Adaptive Replication (if that doesn't break the custom validator!)
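For anyone unfamiliar with the mechanism, the stock BOINC decision is roughly along these lines (a much-simplified sketch of my understanding of the docs, not the project's actual validator code):

```python
import random

def needs_wingman(host_error_rate, threshold=0.05, rng=random.random):
    """Simplified adaptive replication: trusted hosts (low estimated
    error rate) get single-replica workunits most of the time; even
    then a random fraction is double-checked so the error-rate
    estimate stays honest. (A sketch of the idea, not BOINC's code.)"""
    if host_error_rate > threshold:
        return True                 # untrusted host: always send a wingman
    return rng() < host_error_rate  # trusted host: occasional spot-check

# An untrusted host always needs a wingman:
print(needs_wingman(0.5))  # True
```

The MilkyWay twist, as far as I can tell, is the extra "has this parameter grouping validated before?" condition bolted on in the custom validator.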

Hope this is of interest - Al.
39) Message boards : News : New Poll Regarding GPU Application of N-Body (Message 75728)
Posted 19 Jun 2023 by alanb1951
Post:
We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now.
And someone would probably have to commit to amending the OpenCL code and/or the OpenCL-related code in the CPU-based part of the application whenever there was a relevant change to the science code in the "CPU-only" application code :-)

I get the impression that this application isn't one of those simple ones where the GPU-based calculations are more or less unchangeable and can be completely controlled via parameter settings. If all it is doing is (for instance) FFTs or simple optimization of a matrix, the only issue would be "Can it be made efficient enough to make it worth doing?" However, if it would either entail lots of shuffling data around on the GPU or frequent movement of data to and from the GPU between GPU-worthy sections of computation that might be a completely different matter!

And, of course, if part of making it efficient enough entails "messing" with support libraries or adding hacks to facilitate using the GPU for one task whilst another one is doing CPU-intensive stuff, that would have to be done very carefully, especially if end users might be using their GPUs for more than one BOINC application -- I recall an issue over at WCG a while back which was to do with something in that area :-(

Without an expert eye on the code, we can't know what performance issues (global memory usage, bandwidth between motherboard and GPU, et cetera) there might be. Whilst I share the hope that it might be possible to do something GPU-wise, I would be unsurprised if an [unbiased] expert outsider decides it isn't worth it...

Cheers - Al.
40) Message boards : News : Separation Project Coming To An End (Message 75497)
Posted 13 Jun 2023 by alanb1951
Post:
Tom,

It's always a bit sad to see a project come to an end, but it's also good to know the goals have been achieved. Well done to all concerned!

Regarding actual shut-down, I'm inclined to agree with Ian&Steve C's comment on the timeline in his first post. Has the current Separation run already reached a point where it's effectively work for work's sake, or is there still serious value in processing the rest of this batch? It really has to be your call as to when to stop generating new work :-)

If it's a matter of trying to get the current run to converge as quickly as possible, I'll stick around until that happens (as, I suspect, will many others); otherwise I might as well switch to N-Body only at once, resuming GPU work if and only if you have to do re-runs based on the review process.

Regarding GPU users finding other things to do, I suspect quite a few of us will just give more time to Einstein@home :-)

Good luck with the paper! I'll continue to watch progress on all projects here with great interest...

Cheers - Al.

P.S. there is still the occasional post on the SETI@home site asking when that is going to restart -- expect the same to happen here :-)



©2024 Astroinformatics Group