21)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76615)
Posted 14 Nov 2023 by alanb1951 Post:
Quote: Today I got a Workunit 962969273 aborted by project. What does that mean?
It means the server decided your result wasn't needed and, as you hadn't started processing it when the server was next contacted, it told your client to abort it! This usually happens when a task is issued because a prior task goes to "No Reply" state but then returns late (often by a day or more!) -- that appears to be the case with your example.
Cheers - Al. |
22)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76579)
Posted 6 Nov 2023 by alanb1951 Post: The certificate issue seems to be resolved now -- I was able to get MW up and running again on my Linux systems from around 15:45 UTC on 2023-11-06. It even worked as I wanted once I remembered that I needed to restore/rebuild my app_config.xml files to cut the number of threads per task :-) Thank you Kevin and the RPI techs. Cheers - Al. |
23)
Message boards :
News :
Migrating MilkyWay@home to a New Server
(Message 76578)
Posted 6 Nov 2023 by alanb1951 Post: The certificate issue seems to be resolved now -- I was able to get MW up and running again on my Linux systems from around 15:45 UTC on 2023-11-06. Thank you Kevin and the RPI techs. Cheers - Al. |
24)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76537)
Posted 4 Nov 2023 by alanb1951 Post: Nick, Thanks for your effort, but I rather think we're talking past one another (probably my fault) rather than communicating... What the details I saw on the Firefox certificate stuff had me pondering was whether all MW servers are actually sending out the same certificate chain! What I read from your last reply is that if the browser can find certificates for the relevant intermediate and root CA issuers in its own store it won't bother to look at the rest of the chain sent by the [MW] server... The alternatives I can think of would be either
|
25)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76534)
Posted 3 Nov 2023 by alanb1951 Post: Nick -
Quote: It's not a patch. The cert store holds certificates that you trust.
I use the certificate bundle supplied by Ubuntu (which appears to be based on the Mozilla bundle [1]!), so as far as I'm concerned, adding to it is akin to patching (though the remark was tongue-in-cheek...) -- I tend to avoid altering something that shouldn't need modifying :-)
Quote: Web browsers come by default with many trusted root certificates. Which is why your browser isn't complaining about the new site.
Does this mean that the valid certificate chain Firefox is supposed to have downloaded [from the MW site, I presumed] is actually a concoction constructed by Firefox? If so, fair enough (but a badly worded Firefox certificate information page!) and I'd be [vaguely] interested in how it does it [2]. If that is the case, that would certainly explain why I couldn't make sense of the two very different certificate chains I could see!
Cheers - Al.
P.S. I make no claims to being an SSL/TLS guru, so excuse my [apparently] limited understanding :-)
[1] According to the package manager, ca-certificates "Contains the certificate authorities shipped with Mozilla's browser to allow SSL-based applications to check for the authenticity of SSL connections."
[2] I presume that all the Linux software relevant here (BOINC, openssl, Firefox) ends up using a recent libssl version - BOINC seems to use libcurl as an intermediary to libssl3, openssl uses libssl3 and the Firefox snap seems to be a static build... |
26)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76529)
Posted 3 Nov 2023 by alanb1951 Post:
Quote: Back to the problem at hand: I see that the certificate issues are fairly well documented in this thread by now :-) -- if your security folks are saying there's nothing wrong [because it works for Windows and for browsers on Linux] please inform them otherwise :-)
Thanks, Kevin. The "wait until the technicians return on Monday" issue affects other BOINC projects too (CPDN and WCG to name but two), and it's understandable [1]. Good luck with the other fixing up. Although the [limited] parts of the web site I've used seem o.k., I've seen some of the other bug reports with hosts of PHP errors [yuk!]
Cheers - Al.
[1] Having worked in a University Computing Service at one time I have seen that at first hand, both before and after the introduction of 24/7 "on call"... |
27)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76525)
Posted 3 Nov 2023 by alanb1951 Post:
Thanks for the clarification! Back to the problem at hand: I see that the certificate issues are fairly well documented in this thread by now :-) -- if your security folks are saying there's nothing wrong [because it works for Windows and for browsers on Linux] please inform them otherwise :-) Cheers - Al. P.S. I will not be patching my certificate store :-) |
28)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76493)
Posted 2 Nov 2023 by alanb1951 Post: Kevin, I see Nick has tried the latest available client, but I'm using the latest repository clients (which are a tad older!) and, given his experience, I'm not going to spend ages working out how to build a client :-)
On client 7.20.2 I see the following at the end of a connection attempt (after it has redirected to https on port 443):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: TLSv1.2 (OUT), TLS header, Unknown (21):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: TLSv1.3 (OUT), TLS alert, unknown CA (560):
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: SSL certificate problem: unable to get local issuer certificate
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: Closing connection 16
Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] HTTP error: SSL peer certificate or SSH remote key was not OK
Similar on client 7.20.5 too. Hope this helps.
Cheers - Al. |
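When a client log is long, it can help to pull out only the lines that point at the TLS failure. The following is a minimal sketch, assuming the log keeps the "timestamp | project URL | message" layout seen in the excerpt above; the function name `find_tls_errors` and the marker list are illustrative choices, not part of BOINC.

```python
from typing import List, Tuple

# Substrings taken from the log excerpt in the post; illustrative, not exhaustive.
TLS_MARKERS = ("TLS alert", "SSL certificate problem", "HTTP error")


def find_tls_errors(log_text: str) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for lines that mention a TLS/SSL failure."""
    hits = []
    for line in log_text.splitlines():
        # Split into at most three fields: timestamp, project URL, message.
        parts = [part.strip() for part in line.split(" | ", 2)]
        if len(parts) != 3:
            continue  # not a client log line in the assumed layout
        timestamp, _project_url, message = parts
        if any(marker in message for marker in TLS_MARKERS):
            hits.append((timestamp, message))
    return hits


# The five log lines quoted in the post.
SAMPLE = "\n".join([
    "Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: TLSv1.2 (OUT), TLS header, Unknown (21):",
    "Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: TLSv1.3 (OUT), TLS alert, unknown CA (560):",
    "Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: SSL certificate problem: unable to get local issuer certificate",
    "Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] [ID#1] Info: Closing connection 16",
    "Thu 02 Nov 2023 20:54:23 GMT | http://milkyway-new.cs.rpi.edu/milkyway/ | [http] HTTP error: SSL peer certificate or SSH remote key was not OK",
])

for stamp, text in find_tls_errors(SAMPLE):
    print(stamp, "->", text)
```

Matching on substrings keeps the sketch simple; a different client version may word these messages differently, so the marker list would need adjusting.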
29)
Message boards :
Number crunching :
Thread to report issues after server migration
(Message 76484)
Posted 2 Nov 2023 by alanb1951 Post: Kevin, As has been noted in the News thread, it appears that Linux systems can't re-attach... Trying to amend the URL offered by BOINC Manager results in a "Please try again later" message. Not helpful :-)
So I used boinccmd to attach instead, and that seemed to do something. However, looking in BOINC Manager after that shows the project identified by URL rather than by name, and the status is reported as "Scheduler request pending. Project initialization" (with a Communication deferred time appended). If I try an update, it offers "Fetching scheduler list" then reports "Project communication failed"...
Checking in /var/lib/boinc-client, I have what seems to be a valid account_milkyway-new.cs.rpi.edu_milkyway.xml but master_milkyway-new.cs.rpi.edu_milkyway.xml is empty.
Hope that helps :-)
Cheers - Al.
P.S. I note that the new master URL has -new added, but the web site doesn't -- is that the long-term plan or will the master URL end up changing back to not having -new in it? |
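The check described in the post above (a seemingly valid account file next to an empty master file) can be scripted. This is a minimal sketch, assuming the two file names follow the account_<key>.xml / master_<key>.xml pattern quoted in the post; the helper name `attach_file_status` is hypothetical, and reading /var/lib/boinc-client may need elevated permissions on some systems.

```python
from pathlib import Path
from typing import Dict


def attach_file_status(data_dir: str, project_key: str) -> Dict[str, str]:
    """Report whether the account and master files are missing, empty or present."""
    status = {}
    for prefix in ("account", "master"):
        path = Path(data_dir) / f"{prefix}_{project_key}.xml"
        if not path.exists():
            status[prefix] = "missing"
        elif path.stat().st_size == 0:
            status[prefix] = "empty"      # the symptom reported for the master file
        else:
            status[prefix] = "present"
    return status


# Example call using the directory and project key from the post:
# print(attach_file_status("/var/lib/boinc-client", "milkyway-new.cs.rpi.edu_milkyway"))
```

An "empty" master file alongside a "present" account file would match what the post reports; the sketch only looks at file sizes and does not validate the XML.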
30)
Message boards :
Number crunching :
What's Everybody Doing with their Double Precision These Days?
(Message 76407)
Posted 3 Oct 2023 by alanb1951 Post: OOPS! Thanks for the correction!
Quote: it's not 64-bit but it is science work :-)
Cheers - Al. |
31)
Message boards :
Number crunching :
What's Everybody Doing with their Double Precision These Days?
(Message 76405)
Posted 2 Oct 2023 by alanb1951 Post: Regarding Einstein work -- if you were only subscribed to the Gamma Ray Search you may not realize that that project has officially finished, so no more work! If that's the case, try the MeerKAT pulsar search instead (BRP7) -- it's not 64-bit but it is science work :-) [Edit - I see Link posted about that while I was composing this -- ah well...]
As for WCG -- OPNG work tends to be a bit on/off (and they sometimes seem to restrict work availability over weekends to try to avoid possible upload/download issues when there's no-one there to look into them). Also, they had some bad OPNG batches which caused a fairly long time-out through August and into September :-(
It's worth exploring the forums for Einstein and WCG to keep an eye on what's going on :-)
Cheers - Al. |
32)
Message boards :
Number crunching :
Is there a way to split total available cores over multple tasks here?
(Message 76310)
Posted 22 Jul 2023 by alanb1951 Post: Link - I saw your reply and it made me wonder if the client I was testing on had got confused at some point... So I restarted the one I used for my experiments and repeated some of the simple tests, and it now seems to respect --nthreads if it is present, but it is still using the count of available CPUs (based on avg_ncpus if that is present) when --nthreads is unavailable. That seems to be how I would've expected it to behave, despite my earlier experimental observations to the contrary :-)
I am at a loss to explain the apparent change in behaviour away from the [unexpected] behaviour I observed :-) -- I think the moral might be to always include avg_ncpus and have it equal to the --nthreads value "just in case"... As mentioned elsewhere, it will also have the useful side-effect of keeping the client scheduler properly informed as to CPUs in use!
Thanks for making me have another look; I wasn't that happy with what I thought I'd found :-)
Cheers - Al.
P.S. I had already noticed that BOINC Manager seems to report the CPUs count that was current when tasks were downloaded (until after a restart...) but I'd been monitoring thread usage by looking at stderr.txt in the task's slot directory and with a system tool which always showed one more thread than the number OpenMP reported in the stderr file. There always seems to be an apparently idle thread, even on my system where there's never been an app_config file - I presume it's the checkpoint handler and/or some sort of "watchdog".
[Edited to add reminder about avg_ncpus and the client scheduler] |
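The "moral" in the post above, keeping avg_ncpus equal to the --nthreads value, is easy to get wrong when editing by hand. Here is a minimal sketch that generates an app_config.xml from a single thread count so the two values cannot drift apart. The element names, app name and plan class follow the app_config.xml quoted in message 76306; the function `build_app_config` is a hypothetical helper, and `ET.indent` assumes Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_app_config(threads_per_task: int, max_concurrent: int) -> str:
    """Build app_config.xml text with avg_ncpus tied to the --nthreads value."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = "milkyway_nbody"
    ET.SubElement(version, "plan_class").text = "mt"
    # Both settings come from the same argument, so they always agree.
    ET.SubElement(version, "avg_ncpus").text = str(threads_per_task)
    ET.SubElement(version, "cmdline").text = f"--nthreads {threads_per_task}"
    ET.SubElement(root, "project_max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


print(build_app_config(threads_per_task=3, max_concurrent=2))
```

After writing the result into the project directory, the client still has to re-read its config files (or be restarted) before the change takes effect, as the posts describe.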
33)
Message boards :
Number crunching :
Is there a way to split total available cores over multple tasks here?
(Message 76306)
Posted 22 Jul 2023 by alanb1951 Post:
Quote: The nthreads parameter is ONLY for MT tasks. It sets the maximum number of threads per task to use. For the OP who wanted to run on 15 threads total for the host, that works out as 3 tasks in total using 5 threads each.
I seemed to recall some posts about N-body not actually using the --nthreads parameter, but a cursory search didn't find anything, so I conducted an experiment on one of my systems with enough cores to make it a proper test.
I normally allow 9 CPU threads for BOINC on the system I tested -- its normal app_config.xml is
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
  <project_max_concurrent>2</project_max_concurrent>
</app_config>
I reduced my queue size to make sure I didn't get swamped with new work, removed the avg_ncpus line and re-read the config files. It immediately suspended one of the two N-body tasks it had been running, as if the client's scheduler now thought tasks used all 9 threads :-) -- the task that carried on running continued to use three threads... I looked at client_state.xml and the app_version section for nbody now included an avg_ncpus value of 9 (which explained the scheduler behaviour!)
It then fetched one new work unit, which apparently wanted 9 threads, so I suspended that one before it started, restored the normal app_config.xml and got two tasks running again...
On another (smaller) machine I tried an app_config.xml file with avg_ncpus less than the total allowed to BOINC and without the command-line parameter. That happily ran tasks utilizing the required number of threads, despite the absence of --nthreads! So that left the question of how the executable decides on the thread count...
Every task that starts up gets a file called init_data.xml in its slots directory entry; that file contains a lot of information from various places, including an ncpus value which appears to be the same as the avg_ncpus value in the app_version data at the time the task is started (or restarted on a BOINC restart?) It seems likely that the N-body app digs the thread-count out of that file. I have no doubt that other OpenMP programs may well respect a thread-count parameter of some form, but it certainly appears that N-body doesn't :-) Cheers - Al. |
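To see which thread count a running task was handed, one can read the ncpus value out of init_data.xml in the task's slot directory, as the post above describes. This is a minimal sketch under the assumption that the file parses as well-formed XML and contains an element named ncpus somewhere in the tree; the helper `read_ncpus` and the sample document are hypothetical, not taken from a real slot directory.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_ncpus(init_data_xml: str) -> Optional[float]:
    """Return the first ncpus value found in the XML text, or None if absent."""
    root = ET.fromstring(init_data_xml)
    # iter() searches the whole tree, so the element's exact position is not assumed.
    node = next(root.iter("ncpus"), None)
    if node is None or node.text is None:
        return None
    return float(node.text)


# Hypothetical, heavily trimmed stand-in for a real init_data.xml.
sample = "<app_init_data><ncpus>3.000000</ncpus></app_init_data>"
print(read_ncpus(sample))
```

For a real task, the text would come from the init_data.xml file in that task's slot directory; whether the N-body app itself reads this value is the post's inference, not something this sketch confirms.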
34)
Questions and Answers :
Unix/Linux :
15 CPUs cause running tasks to stop running
(Message 76298)
Posted 20 Jul 2023 by alanb1951 Post: Aurum,
Quote: Al, "instructions retired per second" is the secret sauce that makes this work so well. I had taken your word for it before and have all my computers running 3-thread nbodys. Thanks much for doing this useful work and explaining it so thoroughly.
Glad to have been of help!
Quote: I once tried to install PERF to measure instructions retired but it gave me four ways to install it and I didn't get it working. It's the only Linux program I've ever seen suggest multiple ways to install.
Regarding installing perf (and other kernel-specific tools): I'm on Ubuntu rather than Mint, and all I needed to do was tell Synaptic Package Manager (which I use rather than running apt or dpkg from a console!) to install linux-tools-generic... This makes sure that whenever a new kernel version is installed via Software Updater a new version of perf (and friends) gets pulled in to match the new kernel -- the older versions will remain until the corresponding kernels are uninstalled.
The alternative is to install explicitly by kernel version -- that can be a bit of a pain :-) as one ought to match both the version and the flavour (which is generic in my case and yours)... I'm not familiar enough with Mint to know whether it structures its packages in the same way; sorry about that...
Quote: CPDN has issued a slug of WAH but it's windoze only. Their new guy is planning on putting out much OpenIFS for Linux this fall:
Yup, I've been following that saga and am looking forward to running some proper 64-bit CPDN work...
Cheers - Al. |
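The two installation routes mentioned in the post above (the linux-tools-generic metapackage versus a package matched to the exact kernel version and flavour) can be illustrated by deriving both package names from the kernel release string. This is a sketch assuming Ubuntu-style release strings of the form version-flavour; the helper `perf_package_names` and the example release string in the comment are hypothetical, and other distributions may name their packages differently.

```python
import platform
from typing import Dict, Optional


def perf_package_names(kernel_release: Optional[str] = None) -> Dict[str, str]:
    """Derive candidate linux-tools package names from a kernel release string."""
    release = kernel_release or platform.release()
    # Ubuntu-style releases end in the flavour, e.g. "<version>-generic".
    _version, _sep, flavour = release.rpartition("-")
    return {
        "metapackage": f"linux-tools-{flavour}",   # tracks kernel updates
        "exact": f"linux-tools-{release}",          # pinned to one kernel
    }


# With no argument the running kernel's release string is used.
print(perf_package_names())
```

Installing the metapackage is the route the post recommends, since it follows kernel updates automatically; the exact name is only needed when pinning to one kernel.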
35)
Questions and Answers :
Unix/Linux :
15 CPUs cause running tasks to stop running
(Message 76297)
Posted 20 Jul 2023 by alanb1951 Post: mikey - for information:
Quote: "PERF" as in Ithena Measurements Perf tasks? If so it's like any other Boinc project you choose what kind of tasks you want to run, Perf is for Windows only while Ooni tasks are Linux only and the Cnode tasks are for both I think. There is also a Project Ithena computation that has Hex tasks https://comp.ithena.net/usr/
Nope - we're talking Linux, and perf is one of a set of kernel-specific system tools; it offers various different ways of looking at system performance...
Cheers - Al. |
37)
Message boards :
Number crunching :
validation inconclusive on some tasks
(Message 76289)
Posted 19 Jul 2023 by alanb1951 Post: Mikey is right about the usual meaning of Validation Inconclusive, and about the way it sends out the tasks one at a time... A number of the tasks that still show up in your tasks report are Separation tasks, some/all of which may never get cleared out because of the way they shut Separation down -- you may have spotted that and allowed for it when counting tasks, in which case apologies for mentioning it! The workunit you posted about (960542719) has now validated and is quite interesting in that it drew my attention to how MilkyWay flags tasks that fail to validate. The tale it tells is thus:
|
38)
Questions and Answers :
Unix/Linux :
15 CPUs cause running tasks to stop running
(Message 76264)
Posted 13 Jul 2023 by alanb1951 Post: (Aurum: it took me a moment to realize you were quoting from one of my earlier posts - sorry about the delay in responding.)
"Works harder" is based on instructions retired per second rather than total run time. The latter is not as useful as a throughput statistic because no two N-body tasks are guaranteed to execute [roughly] the same number of instructions...
Quote: I found that there was a slight degradation for each thread up to three or four, then it went downhill quite fast -- while I was working out an optimum for one of my systems, I found 2 three-thread tasks worked the CPUs harder than 1 six-thread task did
Quote: How do you define "works harder?"
I measured instructions/second for a task over a prolonged period to ensure I was likely to catch multiple iterations; I set the checkpoint interval quite high to avoid accidentally sampling a task during a checkpoint, as such activity is likely to involve a lot of context switches in a short time interval, and a lot of the CPU usage at that point is O/S rather than application...
If I wanted to test two N-body tasks running together, I'd sample one of them for a specified time then sample the other for the same amount of time. In general, if the tasks had the same number of threads they would perform in a reasonably similar fashion. By only allowing the same total number of threads, I was able to observe that there was a large enough improvement to make the smaller tasks more practical (if I intended to give that many threads to N-body...)
(Of course, I repeated the tests on several different tasks, and had to discard one or two tests because the tasks finished mid-test! The run-time estimates for N-body are not very accurate...)
It is likely that part (if not all) of the apparent improvement is down to threads for a specific task being somewhat less likely to get out of sync if there are fewer of them; if threads are being juggled around, caches and TLBs may be affected and more instructions will be likely to stall.
Incidentally, I never allow BOINC to use more than 75% of CPU threads; most of my systems have enough stuff going on in the background that leaving about 25% of threads free seems an effective level. Of course, that's my systems; others may find different settings are better (especially if they have completely different hardware platforms).
Quote: And what CPU did you test this on? TIA
The only systems I have [at present] that are likely to offer more than two or three threads to N-body are the Ryzen 3700X and a Ryzen 5600H I mentioned in my earlier post -- I tested the 3-versus-6 scenario on both with no other BOINC tasks running, and also looked [briefly] at the effect of various workloads on a single 3-thread task. Both systems have 2x16GB RAM; the 3700X has total power limited to 80W (which doesn't seem to slow it up much, if at all!)
I suspect the outcome might be [slightly] different on Intel non-server CPUs, and likewise for any server chipsets; unfortunately, I don't have any of the latter for testing :-) Throughput may also depend on the number of memory channels and how memory is installed -- I've seen that discussed elsewhere (probably at WCG, but I may be misremembering...)
By the way, I haven't seen any CPDN work or WCG ARP1 work in a while -- I rather suspect that when either of those makes an appearance the memory will take an extra hit and N-body tasks will run less efficiently at that point :-)
Hope that answers your questions.
Cheers - Al. |
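The comparison described in the post above boils down to two small calculations: an instructions-per-second rate for each sampled task, summed over tasks running side by side, and a thread budget that leaves about a quarter of the CPU threads free. The sketch below shows both; the function names are hypothetical and the numbers in the example call are made up purely to exercise the code, not measurements from the post.

```python
import math
from typing import Iterable, Tuple


def instructions_per_second(instructions: int, seconds: float) -> float:
    """Rate for one sampled task: instructions retired divided by sample time."""
    if seconds <= 0:
        raise ValueError("sample duration must be positive")
    return instructions / seconds


def aggregate_rate(samples: Iterable[Tuple[int, float]]) -> float:
    """Sum the rates of tasks running together; samples are (instructions, seconds)."""
    return sum(instructions_per_second(count, secs) for count, secs in samples)


def boinc_thread_budget(total_threads: int, fraction: float = 0.75) -> int:
    """Threads to allow BOINC, rounded down, leaving the rest for the system."""
    return math.floor(total_threads * fraction)


# Made-up sample values, only to show the call shape.
print(aggregate_rate([(6_000, 2.0), (3_000, 1.0)]))
print(boinc_thread_budget(16))
```

To compare two configurations that use the same total thread count, one would compute `aggregate_rate` for each from real counter readings (for example from perf) and see which is larger.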
39)
Message boards :
Number crunching :
Will N-Body projects all use multiple CPUs?
(Message 76254)
Posted 13 Jul 2023 by alanb1951 Post: Aurum, Note that I'm not a MilkyWay researcher or technician, but I'll have a go at this...
Quote: Is n-body a legitimate multi-CPU project or is it just multiple WUs in one package?
It is a multi-threaded application, using OpenMP. Given that (unlike Separation) it only produces one result, there are not multiple WUs in one package. And it'll quite happily send a task for the same WU to systems that allocate different numbers of threads (including one!) :-)
Quote: Does n-body use all allocated CPUs for the entire run?
I think it only uses one thread during set-up (the starting phase that takes about 30 seconds on most of my systems); it seems to use all allocated threads after that.
Quote: Or does it start using all CPUs and then as parts finish CPUs go idle and wasted until there's just one left running at the end?
It seems to try to apportion the required computational effort across all threads, but there are all sorts of reasons why seemingly identical blocks of work might take different amounts of time -- some of the time, threads will be idling if they are out of sync at key points. This becomes more noticeable as more threads are given to a single task, as it becomes more likely that the O/S will interrupt a (random?) thread to perform some necessary activity of its own :-)
Hope that goes some way to answering your questions.
Cheers - Al. |
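The idling described in the post above can be pictured with a toy model: if every thread must reach a sync point before any can continue, the block takes as long as the slowest thread, and the others wait. The sketch below computes the fraction of thread-time spent waiting under that assumption. It is an illustration of the reasoning only, not a description of how the N-body app schedules its work, and `idle_fraction` is a hypothetical helper.

```python
from typing import Sequence


def idle_fraction(thread_times: Sequence[float]) -> float:
    """Fraction of total thread-time spent waiting at a sync point.

    Toy model: the block lasts as long as the slowest thread, so each faster
    thread idles for the difference.
    """
    if not thread_times:
        raise ValueError("need at least one thread time")
    slowest = max(thread_times)
    if slowest <= 0:
        raise ValueError("thread times must be positive")
    busy = sum(thread_times)
    available = slowest * len(thread_times)
    return 1.0 - busy / available


# Perfectly balanced threads never wait; one delayed thread makes the rest idle.
print(idle_fraction([1.0, 1.0, 1.0]))
print(idle_fraction([1.0, 2.0]))
```

In this model a single interrupted thread holds up all the others at the sync point, which is one way to read the post's remark that the effect grows as more threads are given to a task.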
40)
Questions and Answers :
Macintosh :
Why is there no N-body application for Mac's?
(Message 76238)
Posted 11 Jul 2023 by alanb1951 Post: Mikey,
Quote: This is exactly where I wish Richard Hassellgrove and his group of software testers could get with the different Projects and get some basic across the Projects tech support, probably not free though, to help with things like this. Einstein for example has Mac apps for all the different versions from a Cheese Grater Mac to the new M2 cpu. Maybe they could even provide their apps, admin to admin, and let MilkyWay for example work on changing it so it works here.
Yup, that would be wonderful, but a lot [if not all] of the BOINC development and support is now on a volunteer basis -- I wonder if Richard may be a volunteer himself :-) (By the way, he is currently trying to help the CPDN folks with a recalcitrant credit problem...)
As for getting help from the rest of the community, they'd need some project that has multi-threaded code that runs on Apple Silicon, and they'd probably have to find an expert programmer from somewhere. None of the Einstein apps available on Apple Silicon are multi-threaded as far as I'm aware... :-(
You're probably familiar with "Good, quick, cheap - any two!" (or similar); the reality in many cases is that one only gets really good in the absence of both quick and cheap! Sadly, the goal in a lot of places is to do things as cheaply as possible nowadays; this is especially true in most academic environments that don't have huge research budgets (even Einstein have lost staff that were not replaced...). And even if there is a willingness to recruit, most programmers can earn far better money in a non-academic/research environment, especially in specialist cases [1].
I try not to be pessimistic about the future of distributed computing, but sometimes it's quite difficult!
Cheers - Al.
[1] I wonder who'll pick up Apple BOINC client if/when Charlie Fenton is no longer willing/able to look after it... |