Welcome to MilkyWay@home

Posts by alanb1951

21) Message boards : Number crunching : Thread to report issues after server migration (Message 76525)
Posted 3 Nov 2023 by alanb1951
Post:

P.S. I note that the new master URL has -new added, but the web site doesn't -- is that the long-term plan or will the master URL end up changing back to not having -new in it?


milkyway-new.cs.rpi.edu is the master address but using just milkyway.cs.rpi.edu will also reach milkyway-new. This is the way the security team here at RPI set it up.

Thanks for the clarification!

Back to the problem at hand: I see that the certificate issues are fairly well documented in this thread by now :-) -- if your security folks are saying there's nothing wrong [because it works for Windows and for browsers on Linux] please inform them otherwise :-)

Cheers - Al.

P.S. I will not be patching my certificate store :-)
22) Message boards : Number crunching : Thread to report issues after server migration (Message 76493)
Posted 2 Nov 2023 by alanb1951
Post:
Kevin,

I see Nick has tried the latest available client, but I'm using the latest repository clients (which are a tad older!) and given his experience I'm not going to spend ages working out how to build a client :-)

On client 7.20.2 I see the following at the end of a connection attempt (after it has redirected to https on port 443):

Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  TLSv1.2 (OUT), TLS header, Unknown (21):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  TLSv1.3 (OUT), TLS alert, unknown CA (560):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  SSL certificate problem: unable to get local issuer certificate
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info:  Closing connection 16
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] HTTP error: SSL peer certificate or SSH remote key was not OK


Similar on client 7.20.5 too.

Hope this helps.

Cheers - Al.
23) Message boards : Number crunching : Thread to report issues after server migration (Message 76484)
Posted 2 Nov 2023 by alanb1951
Post:
Kevin,

As has been noted in the News thread, it appears that Linux systems can't re-attach...

Trying to amend the URL offered by BOINC Manager results in a "Please try again later" message. Not helpful :-)

So I used boinccmd to attach instead, and that seemed to do something. However, looking in BOINC Manager after that shows the project identified by URL rather than by name, and the status is reported as "Scheduler request pending. Project initialization" (with a Communication deferred time appended).

If I try an update, it offers "Fetching scheduler list" then reports "Project communication failed"...

Checking in /var/lib/boinc-client, I have what seems to be a valid account_milkyway-new.cs.rpi.edu_milkyway.xml but
master_milkyway-new.cs.rpi.edu_milkyway.xml is empty.

Hope that helps :-)

Cheers - Al.

P.S. I note that the new master URL has -new added, but the web site doesn't -- is that the long-term plan or will the master URL end up changing back to not having -new in it?
24) Message boards : Number crunching : What's Everybody Doing with their Double Precision These Days? (Message 76407)
Posted 3 Oct 2023 by alanb1951
Post:
it's not 64-bit but it is science work :-)

According to this post "BRP7 requires and uses fp64 / double precision in some kernels". And according to this, it's not just a small part of the computation.
OOPS! Thanks for the correction!

Cheers - Al.
25) Message boards : Number crunching : What's Everybody Doing with their Double Precision These Days? (Message 76405)
Posted 2 Oct 2023 by alanb1951
Post:
Regarding Einstein work -- if you were only subscribed to the Gamma Ray Search you may not realize that that project has officially finished, so no more work! If that's the case, try the MeerKAT pulsar search instead (BRP7) -- it's not 64-bit but it is science work :-)

[Edit - I see Link posted about that while I was composing this -- ah well...]

As for WCG -- OPNG work tends to be a bit on/off (and they sometimes seem to restrict work availability over weekends to try to avoid possible upload/download issues when there's no-one there to look into it). Also, they had some bad OPNG batches which caused a fairly long time-out through August and into September :-(

It's worth exploring the forums for Einstein and WCG to keep an eye on what's going on :-)

Cheers - Al.
26) Message boards : Number crunching : Is there a way to split total available cores over multple tasks here? (Message 76310)
Posted 22 Jul 2023 by alanb1951
Post:
Link - I saw your reply and it made me wonder if the client I was testing on had got confused at some point...

So I restarted the one I used for my experiments and repeated some of the simple tests, and it now seems to respect --nthreads if it is present, but it is still using the count of available CPUs (based on avg_ncpus if that is present) when --nthreads is absent. That is how I would've expected it to behave, despite my earlier experimental observations to the contrary :-)

I am at a loss to explain the apparent change away from the [unexpected] behaviour I observed earlier :-) -- I think the moral might be to always include avg_ncpus and have it equal to the --nthreads value "just in case"... As mentioned elsewhere, that also has the useful side-effect of keeping the client scheduler properly informed as to CPUs in use!
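For anyone wanting to apply that moral, the shape of the app_config.xml is simply this (the 3s are just my preferred value -- adjust to taste):

```xml
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
</app_config>
```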

Thanks for making me have another look; I wasn't that happy with what I thought I'd found :-)

Cheers - Al.

P.S. I had already noticed that BOINC Manager seems to report the CPU count that was current when tasks were downloaded (until after a restart...), but I'd been monitoring thread usage by looking at stderr.txt in the task's slot directory and with a system tool, which always showed one more thread than the number OpenMP reported in the stderr file. There always seems to be an apparently idle extra thread, even on my system where there's never been an app_config file - I presume it's the checkpoint handler and/or some sort of "watchdog".

[Edited to add reminder about avg_ncpus and the client scheduler]
27) Message boards : Number crunching : Is there a way to split total available cores over multple tasks here? (Message 76306)
Posted 22 Jul 2023 by alanb1951
Post:
The nthreads parameter is ONLY for MT tasks. It sets the maximum number of threads per task to use. For the OP who wanted to run on 15 threads total for the host, that works out as 3 tasks in total using 5 threads each.
You can drop the ncpus value entirely as it is ignored by the nthreads parameter setting.
I seemed to recall some posts about N-body not actually using the --nthreads parameter but a cursory search didn't find anything so I conducted an experiment on one of my systems with enough cores to make it a proper test.

I normally allow 9 CPU threads for BOINC on the system I tested -- its normal app_config.xml is
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
  <project_max_concurrent>2</project_max_concurrent>
</app_config>

I reduced my queue size to make sure I didn't get swamped with new work, removed the avg_ncpus line and re-read the config files. It immediately suspended one of the two N-body tasks it had been running, as if the client's scheduler now thought tasks used all 9 threads :-) -- the task that carried on running continued to use three threads... I looked at client_state.xml and the app_version section for nbody now included an avg_ncpus value of 9 (which explained the scheduler behaviour!)

It then fetched one new work unit, which apparently wanted 9 threads so I suspended that one before it started, restored the normal app_config.xml and got two tasks running again...

On another (smaller) machine I tried an app_config.xml file with avg_ncpus less than the total allowed to BOINC and without the --nthreads command-line option. That happily ran tasks utilizing the required number of threads, despite the absence of --nthreads!

So that left the question of how the executable decides on the thread count... Every task that starts up gets a file called init_data.xml in its slot directory; that file contains a lot of information from various places, including an ncpus value which appears to be the same as the avg_ncpus value in the app_version data at the time the task is started (or restarted after a BOINC restart). It seems likely that the N-body app digs the thread-count out of that file.
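To illustrate that last point, here's a rough sketch (in Python, and emphatically not the actual N-body code, which is C/C++ using the BOINC API) of how an application could dig ncpus out of init_data.xml:

```python
import xml.etree.ElementTree as ET

def ncpus_from_init_data(xml_text, default=1):
    """Return the <ncpus> value from an init_data.xml document.

    init_data.xml is written by the BOINC client into each task's
    slot directory; <ncpus> reflects the avg_ncpus in force when
    the task was (re)started."""
    root = ET.fromstring(xml_text)
    node = root.find("ncpus")
    if node is None or node.text is None:
        return default
    # The client writes this as a floating-point value (e.g. "3.000000")
    return int(float(node.text))

# A cut-down example of the sort of thing the client writes:
sample = """<app_init_data>
<ncpus>3.000000</ncpus>
</app_init_data>"""

print(ncpus_from_init_data(sample))  # 3
```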

I have no doubt that other OpenMP programs may well respect a thread-count parameter of some form, but it certainly appears that N-body doesn't :-)

Cheers - Al.
28) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76298)
Posted 20 Jul 2023 by alanb1951
Post:
Aurum,

Al, "instructions retired per second" is the secret sauce that makes this work so well. I had taken your word for it before and have all my computers running 3-thread nbodys. Thanks much for doing this useful work and explaining it so thoroughly.
Glad to have been of help!

I once tried to install PERF to measure instructions retired but it gave me four ways to install it and I didn't get it working. It's the only Linux program I've ever seen suggest multiple ways to install.
Regarding installing perf (and other kernel-specific tools): I'm on Ubuntu rather than Mint, and all I needed to do was tell Synaptic Package Manager (which I use rather than running apt or dpkg from a console!) to install linux-tools-generic... This makes sure that whenever a new kernel version is installed via Software Updater a new version of perf (and friends) gets pulled in to match the new kernel -- the older versions will remain until the corresponding kernels are uninstalled.

The alternative is to install explicitly by kernel version -- that can be a bit of a pain :-) as one ought to match both the version and the flavour (which is generic in my case and yours...)

I'm not familiar enough with Mint to know whether it structures its packages in the same way; sorry about that...

CPDN has issued a slug of WAH but it's windoze only. Their new guy is planning on putting out much OpenIFS for Linux this fall:
https://www.cpdn.org/forum_thread.php?id=9149&postid=69160#69160
Yup, I've been following that saga and am looking forward to running some proper 64-bit CPDN work...

Cheers - Al.
29) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76297)
Posted 20 Jul 2023 by alanb1951
Post:
mikey - for information:

"PERF" as in Ithena Measurements Perf tasks? If so it's like any other Boinc project you choose what kind of tasks you want to run, Perf is for Windows only while Ooni tasks are Linux only and the Cnode tasks are for both I think. There is also a Project Ithena computation that has Hex tasks https://comp.ithena.net/usr/
Nope - we're talking Linux, and perf is one of a set of kernel-specific system tools; it offers various different ways of looking at system performance...

Cheers - Al.
31) Message boards : Number crunching : validation inconclusive on some tasks (Message 76289)
Posted 19 Jul 2023 by alanb1951
Post:
Mikey is right about the usual meaning of Validation Inconclusive, and about the way it sends out the tasks one at a time...

A number of the tasks that still show up in your tasks report are Separation tasks, some/all of which may never get cleared out because of the way they shut Separation down -- you may have spotted that and allowed for it when counting tasks, in which case apologies for mentioning it!

The workunit you posted about (960542719) has now validated and is quite interesting in that it drew my attention to how MilkyWay flags tasks that fail to validate. The tale it tells is thus:

  • Initial wingman aborted it about 90 minutes after receiving it;
  • your task (922770178) returned and waited (reporting either Validation Inconclusive or, perhaps[1], Pending Validation until the _2 task returned);
  • the _2 task returned and didn't match well with yours when validated (definitely Inconclusive now!);
  • a _3 task was sent out and returned - it was a good enough match to _2 that _2 and _3 were declared valid and yours was rejected.


I note that it has marked your task as Validate error; many projects use that tag for tasks whose results the validator can't understand well enough to attempt validation at all[2], marking basic failures to match as Invalid. There didn't seem to be anything blatantly wrong with what your task returned (though it was a long way off the results for the two that validated), so I guess MW uses that label for all types of validation failure...

I note also that you have a small number of other N-body tasks that got Validate errors (with the same sort of mismatch of results...) I wonder if that has something to do with your allowing 15 CPU threads and the system sometimes losing [partial] track of what it's doing (for instance, a missed thread synchronization might do that...) It's unlikely to be a hardware issue -- I run a similar system (but under Linux) and it has never had an N-body task that failed to validate (at 3 threads per task and only 11 or 12 threads allowed to BOINC in total...)

I've also read your messages about N-body in other threads, and note the advice shared there -- hope you can get it sorted out properly soon!

Cheers - Al.

[1] If it decides your task is a candidate for validation without a matching wingman but the validator then disagrees, it should be tagged Inconclusive at once, whereas if it has already decided you need a wingman before the validator gets a look in it should be tagged Pending. MW has a strange validator, so it may not always do what might be expected :-)

[2] For example, WCG tags mismatched validations as "Invalid" and results the validator can't understand as "Error" (making them indistinguishable from any other sort of error) whilst Einstein tags mismatched validations as "Completed, marked as invalid" and results the validator can't understand as "Validation Error" -- there is enough information passed between the validator and the database to tell the two cases apart, but the web interface has to pay attention to it...

32) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76264)
Posted 13 Jul 2023 by alanb1951
Post:
(Aurum: it took me a moment to realize you were quoting from one of my earlier posts - sorry about the delay in responding.)

I found that there was a slight degradation for each thread up to three or four, then it went downhill quite fast -- while I was working out an optimum for one of my systems, I found 2 three-thread tasks worked the CPUs harder than 1 six-thread task did
How do you define "works harder?"
"Works harder" is based on instructions retired per second rather than total run time. The latter is not as useful as a throughput statistic because no two N-body tasks are guaranteed to execute [roughly] the same number of instructions...

I measured instructions/second for a task over a prolonged period to ensure I was likely to catch multiple iterations; I set the checkpoint interval quite high to avoid accidentally sampling a task during a checkpoint, as such activity is likely to involve a lot of context switches in a short time interval, and a lot of the CPU usage at that point is O/S rather than application...

If I wanted to test two N-body tasks running together, I'd sample one of them for a specified time then sample the other for the same amount of time. In general, if the tasks had the same number of threads they would perform in a reasonably similar fashion. By only allowing the same total number of threads, I was able to observe that there was a large enough improvement to make the smaller tasks more practical (if I intended to give that many threads to N-body...)

(Of course, I repeated the tests on several different tasks, and had to discard one or two tests because the tasks finished mid-test! The run-time estimates for N-body are not very accurate...)

It is likely that part (if not all) of the apparent improvement is down to threads for a specific task being somewhat less likely to get out of sync if there are fewer of them; if threads are being juggled around, caches and TLBs may be affected and more instructions will be likely to stall.
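For what it's worth, the arithmetic itself is trivial -- I was using perf to sample instruction counts over a fixed interval and dividing. Something along these lines (the CSV field layout is what my perf version emits with -x,; treat that as an assumption, and the numbers in the example are made up):

```python
# Counts come from something like:
#   perf stat -e instructions -x, -p <PID> -- sleep 60
def instructions_per_second(perf_csv_line, interval_secs):
    """With `perf stat -x,`, the first CSV field is the counter
    value and the third is the event name (field layout assumed --
    check your perf version)."""
    fields = perf_csv_line.strip().split(",")
    count, event = fields[0], fields[2]
    if event != "instructions":
        raise ValueError(f"unexpected event: {event}")
    return int(count) / interval_secs

# A made-up 60-second sample:
line = "540000000000,,instructions,60000000000,100.00,,"
print(instructions_per_second(line, 60))  # 9000000000.0
```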

Incidentally, I never allow BOINC to use more than 75% of CPU threads; most of my systems have enough stuff going on in the background that leaving about 25% of threads free seems an effective level. Of course, that's my systems; others may find different settings are better (especially if they have completely different hardware platforms).

And what CPU did you test this on? TIA
The only systems I have [at present] that are likely to offer more than two or three threads to N-body are the Ryzen 3700X and a Ryzen 5600H I mentioned in my earlier post -- I tested the 3-versus-6 scenario on both with no other BOINC tasks running, and also looked [briefly] at the effect of various workloads on a single 3-thread task. Both systems have 2x16GB RAM; the 3700X has total power limited to 80W (which doesn't seem to slow it up much, if at all!)

I suspect the outcome might be [slightly] different on Intel non-server CPUs, and likewise for any server chipsets; unfortunately, I don't have any of the latter for testing :-) Throughput may also depend on the number of memory channels and how memory is installed -- I've seen that discussed elsewhere (probably at WCG, but I may be misremembering...)

By the way, I haven't seen any CPDN work or WCG ARP1 work in a while -- I rather suspect that when either of those makes an appearance the memory will take an extra hit and N-body tasks will run less efficiently at that point :-)

Hope that answers your questions.

Cheers - Al.
33) Message boards : Number crunching : Will N-Body projects all use multiple CPUs? (Message 76254)
Posted 13 Jul 2023 by alanb1951
Post:
Aurum,

Note that I'm not a MilkyWay researcher or technician, but I'll have a go at this...

Is n-body a legitimate multi-CPU project or is it just multiple WUs in one package?
It is a multi-threaded application, using OpenMP. Given that (unlike Separation) it only produces one result, there are not multiple WUs in one package. And it'll quite happily send a task for the same WU to systems that allocate different numbers of threads (including one!) :-)

Does n-body use all allocated CPUs for the entire run?
I think it only uses one thread during set-up (the starting phase that takes about 30 seconds on most of my systems); it seems to use all allocated threads after that.

Or does it start using all CPUs and then as parts finish CPUs go idle and wasted until there's just one left running at the end?
It seems to try to apportion the required computational effort across all threads, but there are all sorts of reasons why seemingly identical blocks of work might take different amounts of time -- some of the time, threads will be idling if they are out of sync at key points. This becomes more noticeable as more threads are given to a single task, as it becomes more likely that the O/S will interrupt a (random?) thread to perform some necessary activity of its own :-)
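A toy model of that last point, if it helps: with barrier-style synchronization each step costs the maximum of the per-thread times, so the more threads there are, the more likely one straggler stalls all the rest. The numbers below are made up purely for illustration:

```python
import random

def step_time(nthreads, base=1.0, jitter=0.2, steps=1000, seed=42):
    """Average cost of a barrier-synchronised step: each step takes
    as long as its slowest thread, so the expected wait grows with
    the thread count (a toy model, not a measurement)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += max(base + rng.uniform(0, jitter) for _ in range(nthreads))
    return total / steps

# More threads -> longer average step (more time lost waiting):
print(step_time(3) < step_time(6) < step_time(12))  # True
```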

Hope that goes some way to answering your questions.

Cheers - Al.
34) Questions and Answers : Macintosh : Why is there no N-body application for Mac's? (Message 76238)
Posted 11 Jul 2023 by alanb1951
Post:
Mikey,

This is exactly where I wish Richard Hassellgrove and his group of software testers could get with the different Projects and get some basic across the Projects tech support, probably not free though, to help with things like this. Einstein for example has Mac apps for all the different versions from a Cheese Grater Mac to the new M2 cpu. Maybe they could even provide their apps, admin to admin, and let MilkyWay for example work on changing it so it works here.
Yup, that would be wonderful, but a lot [if not all] of the BOINC development and support is now on a volunteer basis -- I wonder if Richard may be a volunteer himself :-) (By the way, he is currently trying to help the CPDN folks with a recalcitrant credit problem...)

As for getting help from the rest of the community, they'd need some project that has multi-threaded code that runs on Apple Silicon, and they'd probably have to find an expert programmer from somewhere. None of the Einstein apps available on Apple Silicon are multi-threaded as far as I'm aware... :-(

You're probably familiar with "Good, quick, cheap - any two!" (or similar); the reality in many cases is that one only gets really good in the absence of both quick and cheap! Sadly, the goal in a lot of places is to do things as cheaply as possible nowadays; this is especially true in most academic environments that don't have huge research budgets (even Einstein have lost staff that were not replaced...). And even if there is a willingness to recruit, most programmers can earn far better money in a non-academic/research environment, especially in specialist cases[1].

I try not to be pessimistic about the future of distributed computing, but sometimes it's quite difficult!

Cheers - Al.

[1] I wonder who'll pick up the Apple BOINC client if/when Charlie Fenton is no longer willing/able to look after it...
35) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76234)
Posted 11 Jul 2023 by alanb1951
Post:
James,

I will just run one project at a time and forget the rest of the problems that come with running more than one or a project that requires virtual box. If it is not plug and play, I will just ignore it.

Fair enough, it's your choice to make, but if you change your mind...

Mikey's information was good (apart from the Windows-specific path!).

You have a recent BOINC client so if it did a standard install, there will be a directory /var/lib/boinc (which may be a link to /var/lib/boinc-client in some cases!) One of the subdirectories in there will be called projects and the various projects each have a separate directory in there. Most will have a directory name that should identify the project :-)

Should you change your mind and feel the need for more guidance, let us know what sort of work mix you'd like to achieve and someone can probably offer further assistance. And if you're feeling adventurous and want to find out more about app_config.xml on your own you could always look at https://boinc.berkeley.edu/wiki/Client_configuration [if you haven't already done so :-)]

Good luck and happy crunching.

Cheers - Al.

P.S. On my Ryzen 3700X I run a mix of WCG (selected sub-projects, each separately managed via app_config.xml), Einstein (GPU work only) and MilkyWay N-body, with TN-Grid as a fallback for times when WCG is short of work. I found it easy to set up and tune, and it rarely gives me any trouble unless there's a mass shortage of work :-)
36) Questions and Answers : Macintosh : Why is there no N-body application for Mac's? (Message 76225)
Posted 10 Jul 2023 by alanb1951
Post:
Please note - I am an end user, not a member of the MilkyWay team, so this is a personal opinion...

My guess is it won't happen soon, if ever :-( -- I can think of several reasons why Apple Silicon is not seen as a priority.

Firstly, there may not be a viable long-term version of OpenMP available for M1/M2 (&c) [similar to the lack of guaranteed long-term availability of OpenCL...] If a major rewrite is necessary to use another multi-threading mechanism, that is not likely to happen in house given that those running MilkyWay are not necessarily programmers first and foremost.

(If in doubt about OpenMP on Apple, put "OpenMP Apple Silicon" into a search engine and see what you get back; much of it is not reassuring...)

Secondly, even if OpenMP is viable, the availability of hardware and "expert" staff time to port to Apple Silicon might be seen as a negative.

Thirdly, even if there is the will to get past points one and two above, there then needs to be a prolonged period of testing to ensure that the new application produces results that are at least in the right general region -- otherwise, think what would happen if a pair of Apples both returned bad science! (That isn't a dig at Apple kit, by the way; there have been examples in other projects where some hardware and/or O/S-specific libraries produce significantly different results; more often seen with GPU code, but...)

I no longer have any Apple kit [for various reasons not relevant here], but I do think it'll be a shame if Apple Silicon kit finds itself shut out of various projects because Apple places [accidental?] barriers in the way of non-commercial software development; the kit seems impressively fast and there are lots of willing users out there (WCG and Einstein users are baying for proper Apple Silicon apps, for instance!)

Cheers - Al.

P.S. I no longer have any Windows kit either :-)
37) Questions and Answers : Unix/Linux : 15 CPUs cause running tasks to stop running (Message 76224)
Posted 10 Jul 2023 by alanb1951
Post:
James,

TL;DR -- use app_config.xml

[Otherwise...] A couple of points about managing work, one about managing BOINC work in general and one specific to MilkyWay N-body...

The general one first because it plays into the overall situation - it's about how much of your system BOINC is allowed to use, and how:

A system needs one or two spare processor threads all the time to manage the system, handle I/O and do whatever else might be needed. I note that you seem to have constrained your system to leave one processor thread free; I'd be very inclined to free up at least one more processor thread because there's quite a lot going on besides your user program(s) (BOINC or otherwise) and every time one of those things needs to get a turn it has to suspend another process if there's a shortage of free processors :-)

There's another issue too; certain sorts of work place enough overheads on L3 cache and RAM access to result in processor threads running idle for lack of data. There are projects where it is often recommended that one uses at most half the available CPU threads, and I've had work from at least one project where an 8-thread processor was more productive running 2 tasks than 4 :-) (In certain cases there might be other non-calculation related reasons for delay, but this is getting a bit long already!)

There's also the matter of managing how much work is run at once if one has multiple BOINC projects active. This is best managed by using an app_config.xml file in each project folder to provide a project_max_concurrent constraint as to how many tasks can run for that project at the same time. (There's plenty of discussion about app_config.xml elsewhere...)

A couple of my systems are Ryzens (a 3700X and a 5600H) and I've found I get the best throughput if I only use 75% of the CPUs (12/16 for the 3700X, 9/12 for the 5600H) -- to be fair, I have a couple of home-grown BOINC monitoring daemons running on each system (so I like to keep a thread clear for when one of those needs to run), and I'm running a mix of CPU tasks from WCG and MilkyWay, and I have never experimented with an all-MilkyWay mix, but the principle is the same.

Now, regarding N-body (and, possibly, other multi-threaded BOINC project applications out there), assigning too many threads to one task may not be a good thing...

By default, a single N-Body task will use all CPU threads up to the limit set by your "Use at most ??% of the CPUs" or 16, whichever is less. Unfortunately, the more threads assigned to a single task, the more likely it is that one or more threads will be suspended at any given time to let the BOINC client in or to allow the O/S to do its thing, and as there are regular points within the N-body application where it needs to ensure all threads are [more or less] in sync there will be times when threads are waiting, not computing.

You'll get more throughput if you actually use an app_config.xml file in the MilkyWay project folder to constrain the number of threads allocated to a single task. I found that there was a slight degradation for each thread up to three or four, then it went downhill quite fast -- while I was working out an optimum for one of my systems, I found 2 three-thread tasks worked the CPUs harder than 1 six-thread task did, and I now run one 3-thread N-body at a time, intermixed with other BOINC stuff.
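For completeness, the shape of the file I'm describing is as follows -- it goes in the MilkyWay project directory, then tell the client to re-read config files. These values match the "one 3-thread task" setup I mentioned; tune for your own kit:

```xml
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
  <project_max_concurrent>1</project_max_concurrent>
</app_config>
```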

That remark about reducing the number of threads is equally applicable on Windows systems, by the way, and there's a fair amount of discussion about using app_config.xml files for this purpose over in Number Crunching.

Hope this helps.

Cheers - Al.
38) Message boards : Number crunching : N-Body tune initial replication value (Message 76122)
Posted 2 Jul 2023 by alanb1951
Post:
Mikey described BOINC's built-in Adaptive Replication mechanism, which appears to be in use for N-Body and was in use for Separation. Around the time of the big server crash, both N-Body and Separation had difficulties with Adaptive Replication (possibly caused by delays in how long it was taking initial results to come back?) and I don't think I have seen an N-Body task pass the adaptive replication test since (if they ever did before!), not that I've been checking every task :-)

Typical requirements for a project to use Adaptive Replication are that there are a multitude of tasks that have almost exactly the same parameters (so statistical variations are allowed for) or that a result doesn't have to have extremely high precision anyway. Some WCG projects may fit the latter, and I suspect MilkyWay projects use the former approach!

MilkyWay projects use a back-end toolkit called [I think] Toolkit for Asynchronous Optimization. If I understood correctly what little of the code I tried looking at, the customized BOINC validator insists on a wingman (even in AR-permitted conditions) if a result is for a parameter grouping that hasn't already validated a work unit -- I'm guessing that there are situations in which every workunit ends up needing a wingman anyway, and it may be that N-Body is in that situation.

If that really is the situation, it might be better if they altered the project configuration to not use Adaptive Replication (if that doesn't break the custom validator!)
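For anyone unfamiliar with the mechanism, the stock BOINC decision is roughly along these lines (a much-simplified sketch of my understanding of the docs, not the project's actual validator code):

```python
import random

def needs_wingman(host_error_rate, threshold=0.05, rng=random.random):
    """Simplified adaptive replication: trusted hosts (low estimated
    error rate) get single-replica workunits most of the time; even
    then a random fraction is double-checked so the error-rate
    estimate stays honest. (A sketch of the idea, not BOINC's code.)"""
    if host_error_rate > threshold:
        return True                 # untrusted host: always send a wingman
    return rng() < host_error_rate  # trusted host: occasional spot-check

# An untrusted host always needs a wingman:
print(needs_wingman(0.5))  # True
```

The MilkyWay twist, as far as I can tell, is the extra "has this parameter grouping validated before?" condition bolted on in the custom validator.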

Hope this is of interest - Al.
39) Message boards : News : New Poll Regarding GPU Application of N-Body (Message 75728)
Posted 19 Jun 2023 by alanb1951
Post:
We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now.
And someone would probably have to commit to amending the OpenCL code and/or the OpenCL-related code in the CPU-based part of the application whenever there was a relevant change to the science code in the "CPU-only" application code :-)

I get the impression that this application isn't one of those simple ones where the GPU-based calculations are more or less unchangeable and can be completely controlled via parameter settings. If all it is doing is (for instance) FFTs or simple optimization of a matrix, the only issue would be "Can it be made efficient enough to make it worth doing?" However, if it would either entail lots of shuffling data around on the GPU or frequent movement of data to and from the GPU between GPU-worthy sections of computation that might be a completely different matter!

And, of course, if part of making it efficient enough entails "messing" with support libraries or adding hacks to facilitate using the GPU for one task whilst another one is doing CPU-intensive stuff, that would have to be done very carefully, especially if end users might be using their GPUs for more than one BOINC application -- I recall an issue over at WCG a while back which was to do with something in that area :-(

Without an expert eye on the code, we can't know what performance issues (global memory usage, bandwidth between motherboard and GPU, et cetera) there might be. Whilst I share the hope that it might be possible to do something GPU-wise, I would be unsurprised if an [unbiased] expert outsider decides it isn't worth it...

Cheers - Al.
40) Message boards : News : Separation Project Coming To An End (Message 75497)
Posted 13 Jun 2023 by alanb1951
Post:
Tom,

It's always a bit sad to see a project come to an end, but it's also good to know the goals have been achieved. Well done to all concerned!

Regarding actual shut-down, I'm inclined to agree with Ian&Steve C's comment on the timeline in his first post. Has the current Separation run already reached a point where it's effectively work for work's sake, or is there still serious value in processing the rest of this batch? It really has to be your call as to when to stop generating new work :-)

If it's a matter of trying to get the current run to converge as quickly as possible, I'll stick around until that happens (as, I suspect, will many others); otherwise I might as well switch to N-Body only at once, resuming GPU work if and only if you have to do re-runs based on the review process.

Regarding GPU users finding other things to do, I suspect quite a few of us will just give more time to Einstein@home :-)

Good luck with the paper! I'll continue to watch progress on all projects here with great interest...

Cheers - Al.

P.S. there is still the occasional post on the SETI@home site asking when that is going to restart -- expect the same to happen here :-)



©2024 Astroinformatics Group