Welcome to MilkyWay@home

Posts by alanb1951

1) Message boards : Cafe MilkyWay : WCG Friends (Message 74847)
Posted 23 Dec 2022 by alanb1951
Post:
For information -- OPN1 has now officially joined OPNG in being suspended whilst setting up work for the new target(s) is being done by the scientists. There's a thread about it in the News forum...

So we now have to wait for both CPU and GPU work for OPN; my Pi4 is unhappy because its other BOINC project (TN-Grid) is effectively off at present because of recurrent file server issues (which are, apparently, beyond the control of the TN-Grid folks...) but my GPUs are working here and at Einstein...

Cheers - Al.
2) Message boards : Cafe MilkyWay : WCG Friends (Message 74845)
Posted 21 Dec 2022 by alanb1951
Post:
I'm not convinced yet they actually know how to run a Boinc Project

Think there is a growing number that would agree with that. No work today it seems.


Before I comment on the above, an observation -- I wonder how many people realize that the new WCG team is probably just the Jurisica Lab MCM team (with dependency on outside agencies for some aspects of hardware and networking...) -- I speculate that Krembil aren't actually pouring support in for Igor and company now they realize what they've "bought into"...

And now, a couple of comments on the quotes...

It's not really a BOINC project - it's a pre-BOINC project that got "massaged" to fit it into the BOINC universe rather than there being a complete "reboot" [like Apple do on hardware changes!] way back then... And a lot of the initial problems Jurisica Lab had were with the IBM/WCG stuff and the user-facing components (forums, web-site login, web-site missing components, inter-database communications[1]) - unfortunately, experience and relevant expertise are needed to solve such things, and with IBM out of the picture where's the experience?...

The upload/download issues were/are probably a product of not having optimal infrastructure and [of course] wouldn't show up as a crisis until the system came under stress - given the change of physical platform(s) this is another "learn by experience" situation, and whilst it's unfortunate it's also understandable -- the fact that it has taken so long to resolve (and may still not be fully fixed) is likely to be down [in part] to dependency on an outside agency (SHARCNET?) for network stuff, as some changes may require said agency to do work that won't happen instantly on demand...

[And now all we need is a certain regular WCG forum poster to climb in to point out that the above is all about symptoms, not causes, and that the real problem is WCG's failure to communicate :-) ...]

Regarding available work -- I've been getting a steady flow of MCM work since the recovery from the situation that took several critical services off-line (there's a News thread about that at WCG...), but will concede that OPN1 and ARP1 work has been less available. I suspect that in both those cases it may have as much to do with [bi-directional?] data-flow between the scientists and WCG as anything else; we already know that there's no OPNG work because the scientists are prepping up for a new target (or targets) for the GPU version, and we don't know how much more CPU work there might be for the existing target... Mikey, if I recall you're only asking for ARP1 (and HST1?) so at the moment you are out of luck :-(

Looking at wingman returns for ARP1 and MCM1 I see far more "No Reply" or "Not Started by Deadline"[2] tasks than I would normally expect for tasks with a 6-day deadline; it's not so prevalent with OPN1 as that uses Adaptive Replication so a lot of tasks don't need wingmen in the first place... I wonder if any existing restrictions on "unreliable" systems getting new work in bulk got lifted for the migration and haven't been put back.

So, overall, I don't think the WCG situation is "terminal" but my jury is out on whether things are getting to a point where they might be able to announce official restart (rather than the "still testing" mode that a lot of folks don't seem to realize still applies!) - I suspect there are probably a few months more before that'll be realistic. And I have had some experience of being in a team of one [or two if I was lucky] trying to balance a 48-hour working day with some modicum of personal life, so now I know somewhat more about how non-standard the WCG set-up was/is I'm perhaps a lot more forgiving than many others... (I wonder how many of the more vociferous complainants on the WCG forums can even code/program, let alone have worked as DBAs and/or SysAdmins!)

Cheers - Al.

[1] Whilst all the standard BOINC-specific stuff resides where BOINC expects to find it, the forums have their own system, as does a lot of the stuff about user statistics. It appears that a lot of that resides in a completely separate database (possibly carried over from pre-BOINC days?...)

[2] WCG doesn't flag Not Started by Deadline explicitly on the user web pages and the available API feeds, but such tasks are easily recognized by having an Error reply (with nothing but the client version listed) returned close to the deadline.
3) Message boards : Number crunching : Daily graphs of server_status (Message 74553)
Posted 24 Oct 2022 by alanb1951
Post:
...
Tom already said the problem is not enough memory in the current Server but the IT people are in charge of moving the stuff over to the new Server they already have ready and waiting

Hmmm, waiting for what?


Presumably the long running Nbody WU’s have exacerbated the problem ?
The two projects have separate validators, so the only effect one might have on the other is memory demands. However, it's not N-Body that's having severe backlog problems, if the state of my current tasks in progress or waiting for validation is anything to go by...

As at about 20:00 UTC on 24th October a typical _0 N-Body task seems to take about 20 hours to pass through the validator (which doesn't allow it to validate without a wing-man for some reason...) and have a new task sent out for a second opinion. Whilst that isn't brilliant, it's not anywhere near as bad as the situation for Separation tasks!...

As at about the same time as the above, a typical _0 Separation task seems to take over 6 days to pass through the validator (whether it ends up self-validating or not) -- if it doesn't validate without a wingman, there will be two further opinions sought, but when _1 comes back the transitioner will note that 3 results are required so it will spin off another task without troubling the validator[1], and that's pretty quick!

It looks as if it adds about a day to the processing time for each million extra tasks awaiting validation, and bearing in mind that (in theory) more than 10% of initial tasks should end up needing a second (and third!) opinion, clearing out the backlog should produce a reasonable amount of available work, albeit more slowly... (And, of course, tasks that return with an error state should produce new tasks without needing to engage the work unit generator, the same as second retries...) It has been suggested that it might be a good idea to turn off the generation of new work for Separation for a while -- I'd second that suggestion!
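As a rough illustration of that rule of thumb, here is a minimal Python sketch. It assumes the delay really is linear at about one day per million tasks awaiting validation, which is only an observation from the post and not a measured server property; the function name and the sample backlog sizes are made up for illustration.

```python
# Rough estimate only: assumes validation delay grows linearly with backlog,
# at about one day per million tasks awaiting validation (an observation,
# not a measured server property).
DAYS_PER_MILLION = 1.0  # assumed rate taken from the rule of thumb above


def estimated_validation_delay_days(backlog_tasks):
    """Return the estimated wait (in days) before a result reaches the validator."""
    return backlog_tasks / 1_000_000 * DAYS_PER_MILLION


if __name__ == "__main__":
    for backlog in (1_000_000, 3_000_000, 6_000_000):  # hypothetical backlog sizes
        days = estimated_validation_delay_days(backlog)
        print(f"{backlog:>9,} tasks waiting -> about {days:.1f} days")
```

On that assumption, the "over 6 days" figure for Separation would correspond to a backlog somewhere around six million results.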

Cheers - Al.

[1] Unless the MilkyWay team has done something really strange to the core BOINC stuff, I don't think time-out or error-driven retries and "third opinion" tasks should go anywhere near the work-unit generator so they should not be held up if the validator is not involved (unlike "second opinion" tasks). If Tom knows otherwise, I'd be interested to know what they did!

P.S. "Waiting for what?" -- as mentioned elsewhere, they seem to be waiting for the IT people to sort out the migration...
4) Message boards : Number crunching : "job cache full" log file message (Message 74455)
Posted 15 Oct 2022 by alanb1951
Post:
I've asked for tasks 3 times in the past hour or so. Nothing shows up on the client. For two of the three requests, the Event Log came back with "job cache full". Have never seen this before. Suggestions?
It means what it says :-) -- your client reckons it has enough work already downloaded to keep your system occupied for the amount of time configured for your "Store at least N days of work" setting...
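The idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the real BOINC client work-fetch logic, which is considerably more involved; the function name, the way work is spread over CPUs and the sample numbers are all assumptions.

```python
# Simplified illustration, NOT the actual BOINC client work-fetch code.
# Assumption: the client compares the estimated remaining run time of the
# tasks it already holds with the time it has been asked to keep buffered.
SECONDS_PER_DAY = 86_400


def cache_is_full(estimated_task_seconds, usable_cpus, store_at_least_days):
    """Return True if queued work is estimated to cover the requested buffer."""
    # Total estimated work is spread across the CPUs BOINC may use.
    buffered_seconds = sum(estimated_task_seconds) / usable_cpus
    return buffered_seconds >= store_at_least_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical host: 4 usable CPUs, "Store at least 0.5 days of work".
    queue = [6 * 3600] * 8  # eight tasks, each estimated at six hours
    print(cache_is_full(queue, usable_cpus=4, store_at_least_days=0.5))
```

Note that the decision depends on the estimated run times, so erratic estimates from any attached project can make the cache look full sooner than expected.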

As your computers are not visible on this site I've no idea which MilkyWay applications you might be running, or how much work you already have pending. However, I note that you are also active at Einstein@home, where the estimates of how long tasks might take can be every bit as erratic as the N-Body estimates are here -- could it be that Einstein work is filling the cache first (the delay between server contacts is shorter at Einstein!)?

Whatever the cause, it is probably better to be told your cache is full than it is to have so much work that a lot of it might not meet the deadlines if the estimated run times are out, so I wouldn't worry about it too much :-)

Cheers - Al.

P.S. I'm just another user...
5) Message boards : Number crunching : Validation Pending too many tasks (Message 74441)
Posted 13 Oct 2022 by alanb1951
Post:
I believe when you get sent over 200 _0 GPU tasks it doesn't help the situation of clearing the queue, In fact it makes it longer
    I believe the best way to clear the weighting backlog is to:
    Stop creating new tasks allow the queue(s) to clear
    Allow new tasks to be created to allow work to be validated



It would also help if they could put any task with a non _0 as next to be sent out instead of at the end of the now very long queue, maybe put any task with a non _0 in its name in their own folder and tell the Server to send those out before going back to the normal folder to send out tasks
There is an option to the feeder that allows for "priority" tasks to get precedence, but priority is apparently only assigned to work units that have overdue results (which, I think, includes "Not started by deadline" as well as "No Reply") so it wouldn't apply here.

However, there is also an option to use priority then work-unit id -- theoretically that would help clear out the older units first, so it looks like a sensible default for projects that don't need to prioritize new work[1]. If that's already engaged here, it doesn't seem to be the solution to clearing this backlog, although it ought to push out non _0 tasks as early as possible...
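To make the "priority, then work-unit id" ordering concrete, here is a small Python sketch. It only illustrates the sort order; it is not the feeder's own code, and the record fields and sample task names are hypothetical.

```python
# Illustration of "priority, then work-unit id" ordering. The field names and
# sample rows are hypothetical; this is not the BOINC feeder's own code.
from typing import NamedTuple


class UnsentResult(NamedTuple):
    name: str
    priority: int
    workunit_id: int


def feeder_order(results):
    """Sort by highest priority first, then by oldest (lowest) work-unit id."""
    return sorted(results, key=lambda r: (-r.priority, r.workunit_id))


if __name__ == "__main__":
    queue = [
        UnsentResult("wu_900_0", priority=0, workunit_id=900),
        UnsentResult("wu_120_1", priority=0, workunit_id=120),
        UnsentResult("wu_450_2", priority=1, workunit_id=450),
    ]
    for result in feeder_order(queue):
        print(result.name)
```

With equal priority, the retry for the older work unit (id 120) goes out ahead of the brand-new _0 task for work unit 900, which is the effect being asked for.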

[Edit - just seen Speedy51's post, which effectively points to the same place!]

Cheers - Al.

[1] In general, using adaptive replication (as at MilkyWay) might suggest that getting as many work units as possible processed as quickly as possible is the goal, in which case prioritizing retries that are not due to time-outs may not be appropriate (or necessary if there isn't already a huge backlog!...)
6) Message boards : Number crunching : Very Long WU's (Message 74397)
Posted 10 Oct 2022 by alanb1951
Post:
N-Body tasks have a 12 day deadline so even with long run times there shouldn't be any "not started by deadline" errors. The only reasons I can think of that would make one run out of time is a large BOINC queue and not having one's PC run close to 24/7.

Well, the N-Body queue is, as always for months, constantly 100 large.
The PC is running 24/7 - but I did stop it for about 8 hours of maintenance.
The 12 day deadline can get a bit short, since the run times have lately increased significantly. Sometimes up to 27 hours.
So, I guess decreasing the numbers of tasks in the queue should do it.

It took my systems a while to cut down the number of N-Body tasks when they started getting bigger, but eventually they seem to have got the hang of it; however, I suspect my configuration is very different to yours :-)

On one system I've been running nothing but N-body since WCG went on hiatus in February, and I eventually tuned it down from 0.5 days work cached to 0.2 days work cached and that seems to be enough to keep the number of tasks under 10... That system runs one at a time on 3 out of 4 "CPUs".

On the other system that is running N-Body I also run various GPU projects and have re-enabled WCG -- that caches 0.6 days work and the mix of WCG and N-Body seems to keep the N-Body count below 10 whilst allowing the maximum numbers of WCG tasks I'm prepared to accept at once. That system runs one N-Body at a time on 3 of the 12 "CPUs" I allow to BOINC on an 8 core/16 thread processor.

So it can be tamed :-)

Or "somebody" should/could perhaps extend the deadline? That way giving the old chunk of metal a decent chance ...

As AndreyOR says, there's no reason a system should be having these problems in the first place -- whilst I understand that there may be problems for folks who attach to projects in what some folks have been known to call "fire and forget" mode, it seems unreasonable to expect deadlines to be extended to cope with users who either can't (Science United?; other BOINC projects with huge jobs?) or won't (for whatever reason) change configuration. The only excuse I can think of for a large queue nowadays is flaky or irregular internet access :-)

Good luck with your system "tuning" -- hopefully things will improve.

Cheers - Al.
7) Message boards : News : News General (Message 74322)
Posted 30 Sep 2022 by alanb1951
Post:
What about recompiling server?


The binaries will all need to be recompiled on the new server hardware. I have been keeping a list of some of the changes people have requested to our server software, but if you have suggestions feel free to share them here! When we recompile, it makes sense to do things like change the task pool sizes, remove the sleep loop in the transitioner, etc.

Hell, yes. This please!
I thought the standard transitioner only sleeps if it tries to do a pass and can't find anything to do! (I'm willing to be proven wrong...)

Cheers - Al.
8) Message boards : News : Server Maintenance 12:00 PM ET (16:00 UTC) 9/23/2022 (Message 74290)
Posted 27 Sep 2022 by alanb1951
Post:
[As it's too late to edit the previous post...]

Work downloading seems to have just started again :-). However, the number of work units waiting for validation is still climbing :-(.

I wonder how long it will be before it all bungs up again (or maybe even crashes [again?])

Cheers - Al.
9) Message boards : News : Server Maintenance 12:00 PM ET (16:00 UTC) 9/23/2022 (Message 74288)
Posted 27 Sep 2022 by alanb1951
Post:
Regarding the high counts for N-body...

Over the last few days there has been an enormous backlog of stuff waiting for validation! Bearing in mind that at present the N-body tasks are sent out with initial quorum 1 (like Separation tasks) but don't ever seem to validate without a wingman (unlike Separation!) a very large proportion of the backlog would be N-body task initial result returns, which would promptly generate a retry when eventually validated :-)

So a day or so after the validation backlog started to grow the number of tasks waiting to be sent started to shoot up as well, and it seems to have settled out around the 110,000 to 120,000 mark...

Until they work out why there's such a huge validation backlog, it's likely to stay as it is... I'm wondering if it's something to do with the extra code in the validator that backs off interesting results (using something called Toolkit for Asynchronous Optimization, I believe)

Cheers - Al.

P.S. The comments about sluggishness (in various guises) are still applicable -- this post is being made the first time I've managed to get at the server since some time yesterday, and my client has only just managed to report old results (again, stalled since yesterday...) And as for trying to download work -- not a chance!
10) Message boards : Number crunching : Very Long WU's (Message 74183)
Posted 15 Sep 2022 by alanb1951
Post:
What’s with the sudden influx of long de_nbody work units? I have an 8 core machine and am used to seeing WU’s that complete in 6 or 8 minutes running on all cores at once. Suddenly I am getting large numbers of WU’s with run times of 5 to 8 hours. That’s quite a change. What gives?
Someone else raised the same point in the thread "NBody tasks taking much longer ..." just under a week ago. And I've been noticing these longer tasks since the middle of last month...

We'd need the project scientist to give a proper explanation, but the long-running tasks admit to being long-running before they start, and seem to get credited accordingly so it looks like expected behaviour for certain work units!

Cheers - Al.
11) Message boards : Number crunching : Validation inconclusive (Message 74133)
Posted 9 Sep 2022 by alanb1951
Post:
@Nuadormrac (and Peter too...)

For information regarding the phases of validation...

The validator has two duties -- when a result is returned it checks it for obvious errors (which can be marked Invalid immediately) and either validates the work unit at once (see below) or marks the result as needing verification. If multiple results are required for verification, the validator gets invoked again once there are enough results to perform a full verification. (Oversimplification!)

As Peter mentioned, sometimes it only needs one result -- it uses a mechanism called Adaptive Replication. Once a user has passed a pre-defined count [20, I think] of consecutive successful results for a specific application, the validator does a little calculation which (by default) will result in about 90% of tasks for said user being passed without needing any wingmen. So until you've racked up enough consecutive successfully validated Separation tasks it will always go to Validation Inconclusive (and if you get a bad result it'll clear the count...)
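A minimal Python sketch of that behaviour follows. It is not BOINC's implementation: the threshold of 20 is the uncertain figure quoted above, the fixed 90% chance stands in for the "little calculation", and both function names are invented for illustration.

```python
# Sketch of the Adaptive Replication idea described above, NOT BOINC's code.
# Assumptions: a threshold of 20 consecutive valid results (the post's
# "20, I think") and a fixed 90% chance of skipping the wingman after that.
import random

CONSECUTIVE_VALID_THRESHOLD = 20  # assumed value
TRUST_PROBABILITY = 0.9           # assumed "about 90%" figure


def needs_wingman(consecutive_valid, rng=random):
    """Return True if the result should wait for a second opinion."""
    if consecutive_valid < CONSECUTIVE_VALID_THRESHOLD:
        return True  # host not yet trusted: always Validation Inconclusive
    # Trusted host: still spot-check roughly 10% of results.
    return rng.random() >= TRUST_PROBABILITY


def update_streak(consecutive_valid, result_was_valid):
    """A valid result extends the streak; a bad one clears it."""
    return consecutive_valid + 1 if result_was_valid else 0
```

The reset in update_streak is why a single bad result sends every subsequent task back to Validation Inconclusive until the streak has been rebuilt.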

In theory, both applications at MilkyWay use Adaptive Replication, but it [currently] seems to be broken for N-body, so there'll always be a wingman [eventually]. However, it works for Separation, but for some reason the replication count for the wingman case seems to be three (instead of the two used for N-body). I presume the count is set higher because of the way the results are compared for verification purposes; it doesn't require an exact match...

Cheers - Al.

P.S. When MilkyWay had the problems after the disk crash earlier this year, it was quite common to see work units stuck with one or more results at Validation Inconclusive when there should've been enough results to complete validation and declare a canonical result. This was because another part of the system got so bottlenecked that it wasn't calling the validator for the verification phase!
12) Questions and Answers : Web site : No update (Message 74043)
Posted 7 Aug 2022 by alanb1951
Post:
A test to try -- this is what I'd do on a linux system, and I presume it's possible on a Windows client...

Find your boinc_client [or boinc?] directory on the machine in question, and locate a file called sched_request_milkyway.cs.rpi.edu_milkyway.xml

Near the top of that file should be a line something like
  <hostid>123456</hostid>

The number is the host ID that the particular system is reporting to the server...
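If you would rather not read through the file by eye, a short Python sketch like the one below can pull the value out. The file name is the one given above; the directory used here is only an assumption and differs between installations, and the function name is made up for this example.

```python
# Minimal sketch: pull the <hostid> value out of a scheduler request file.
# The file name comes from the post; the directory is an assumption and
# varies between installations.
import re
from pathlib import Path

HOSTID_PATTERN = re.compile(r"<hostid>\s*(\d+)\s*</hostid>")


def extract_hostid(xml_text):
    """Return the host ID as an int, or None if no <hostid> element is found."""
    match = HOSTID_PATTERN.search(xml_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Assumed location; adjust to wherever your BOINC data directory lives.
    data_dir = Path("/var/lib/boinc-client")
    request_file = data_dir / "sched_request_milkyway.cs.rpi.edu_milkyway.xml"
    if request_file.exists():
        print(extract_hostid(request_file.read_text(errors="replace")))
    else:
        print(f"{request_file} not found; check your BOINC data directory")
```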

If that matches what you're expecting, you're no nearer an answer; however, if it doesn't match you have some useful information...

By the way, the server tasks information for the host ID you originally provided still hasn't shown any progress; if you really are requesting as host ID 931664 something might be broken at the MW end ;-(

Cheers - Al.
13) Questions and Answers : Unix/Linux : Maybe off topic but...... (Message 73975)
Posted 21 Jul 2022 by alanb1951
Post:
Yes, it helps a little, but when I type in, sudo nano /etc/hosts, doesn't it then ask for my password?
If so, what do I tell it? I never setup a su password.
Thanks!
Your own user password is required there...

By the way, if you have a lot of stuff to do as root and are confident that you won't have any disasters (or you are willing to re-install from scratch!) you can get a root session on most Linux systems by doing either sudo su or sudo -i. That gets you a root session within your existing session, starting at the root of the filestore. You can then navigate around to the various places where changes are needed and modify the files with nano or another editor of your choice... When you've finished, simply enter exit.

And make use of ls -l to check ownership and permissions on files in the directories you work in -- editing existing files shouldn't disturb those properties, but creating new files may require that they don't belong to root after all!

With apologies if some of this is telling you stuff you knew already...

Cheers - Al.
14) Message boards : Number crunching : Validation inconclusive (Message 73974)
Posted 21 Jul 2022 by alanb1951
Post:
Uh oh. In 2038 Boinc will cease to be.
https://en.wikipedia.org/wiki/Year_2038_problem


I would think before then they can find a way to make it work
I can't decide whether Peter was trying to make a joke or not :-). However, this one is probably less of a problem than Y2K, and a lot of diligence avoided most of the problems that could've caused... (And I suspect we're more likely to have had a mass extinction event by 2038 than this being a problem...)

In this case the only real issue is how the Operating System returns date/time information. As per the Wikipedia item, some O/S flavours return a 32-bit integer number of seconds since the "UNIX Epoch", and that would be problematic! Other systems may return a floating-point number of days or seconds since some base date; as long as those return a double rather than a float, there's already no problem in getting date/time data after the 2038 barrier. More recent Linux versions with 64-bit system libraries already return a 64-bit integer instead, so no problems there either. It then depends on how the system libraries make the date/time available to applications -- all it needs is for routines not to be constrained to 32-bit (or less) precision...
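For anyone who wants to see where the 32-bit limit actually falls, here is a short Python sketch. It simply adds the largest signed 32-bit value, in seconds, to the UNIX Epoch; the function name is invented for this example.

```python
# Shows where a signed 32-bit count of seconds since the UNIX Epoch runs out.
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def last_32bit_moment():
    """Return the last instant representable as signed 32-bit epoch seconds."""
    return EPOCH + timedelta(seconds=INT32_MAX)


if __name__ == "__main__":
    print(last_32bit_moment().isoformat())  # 2038-01-19T03:14:07+00:00
```

One second later a signed 32-bit counter wraps to a negative value, whereas a 64-bit counter of seconds has room for hundreds of billions of years.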

The remaining issue would be whether applications are coded to meet the standards expected by various system libraries -- for instance, using a time_t variable rather than a native C data type for a date/time numeric value (and competent programmers use those size-agnostic variable types for exactly this sort of reason!) Any applications coded thus would need a recompilation against newer libraries (if that hadn't already happened!)

BOINC is written in C++ and uses the appropriate variable declaration standards. So, as no client software is likely to still be 32-bit by then, where's the problem? The server side might have been a bit more interesting if the database had issues with such dates, but MySQL shouldn't be an issue regarding dates, and the conversion to/from system standard date representations seems solid. So, again, recompile if necessary, and re-link with the latest libraries!

Now, if you want an example of a seemingly unavoidable Y2K-type incident, consider what happens to a lot of [older model?] GPS systems when the (10-bit) week number rolls over once every 20 years or so and there's no defensive coding in the device to deal with that.
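The "every 20 years or so" figure can be checked with a little arithmetic, sketched here in Python. The GPS epoch of 6 January 1980 is not mentioned above and is supplied here as background; the function name is made up for illustration.

```python
# A 10-bit week counter wraps every 2**10 = 1024 weeks.
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)  # start of GPS week 0 (background fact, not from the post)
WEEKS_PER_CYCLE = 2**10       # 10-bit week number


def rollover_dates(count):
    """Return the first `count` dates on which the 10-bit week number wraps."""
    cycle = timedelta(weeks=WEEKS_PER_CYCLE)
    return [GPS_EPOCH + cycle * n for n in range(1, count + 1)]


if __name__ == "__main__":
    print(f"cycle length: about {WEEKS_PER_CYCLE * 7 / 365.25:.1f} years")
    for rollover in rollover_dates(2):
        print(rollover.isoformat())
```

A receiver that stores only the 10-bit week number, with no defensive logic, cannot tell which cycle it is in and may report a date roughly 19.6 years in the past after a rollover.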

Cheers - Al.
15) Questions and Answers : Unix/Linux : Maybe off topic but...... (Message 73969)
Posted 20 Jul 2022 by alanb1951
Post:
Having Administrator status isn't the same as it would be in Windows :-)

If the files it suggests you edit (or create) are system files, they will probably belong to the root user and not be editable by any other user. So when, for example, I want to edit /etc/hosts I would do
sudo nano /etc/hosts
If you were already using sudo I'm at a loss to explain it...

Someone else might chip in with suggestions for a GUI-based text editor if you don't like using a "terminal" application. I don't use one, so I can't offer one :-)

Hope this helps...

Cheers - Al.
16) Message boards : Cafe MilkyWay : WCG Friends (Message 73886)
Posted 22 Jun 2022 by alanb1951
Post:
The WCG forums and web site are now back on-line [at https://www.worldcommunitygrid.org]. Web site functionality doesn't seem to be 100% complete, but some of that is because defaults for things like device lists are "confused" by the length of the hiatus!

We are promised "some exciting news soon" -- I presume that means the restart of work flow...

Cheers - Al.

P.S. The link in Mikey's post above redirects to the old link, so save 3 characters :-)...
17) Questions and Answers : Web site : Old WU listed in Tasks (Message 73858)
Posted 18 Jun 2022 by alanb1951
Post:
In scrolling through my WU task listings on the website, I have a bunch (over 700) completed tasks (Valid & Invalid) listed from January 2021. There is a jump from March 2022 back to January 2021 in the Valid listings.
Every so often, someone notices these and asks about them...

In late January/early February 2021 the generated Work Unit IDs got so large that the database couldn't handle them, and although Tom Donlon did his best to get a graceful recovery a certain amount of data got orphaned for one reason or another... Here's a post where Tom comments on the issue.

Hope that answers your question :-)

Cheers - Al.
18) Message boards : Number crunching : Validation inconclusive (Message 73834)
Posted 14 Jun 2022 by alanb1951
Post:
When n body settles out, what will it hover around? About 100,000 in the unsent? I can't remember....
I believe the "cushion" for N-body is 1,000 units... (The cushion for Separation is 10,000, as can be observed at present.)

Cheers - Al.
19) Message boards : Number crunching : Cross-Project Certificate Zero Data (Message 73798)
Posted 5 Jun 2022 by alanb1951
Post:
The stats site that the certificate generator uses as shipped (BOINC Combined Statistics [at boinc.netsoft-online.com]) was totally off the net when I checked it on reading these messages. It is now back again and the cross-project certificate generator now appears to work as expected...

Cheers - Al.
20) Message boards : Number crunching : Validation inconclusive (Message 73788)
Posted 3 Jun 2022 by alanb1951
Post:
Seems to have stopped validating again, my Simulation backlog has been at the same number for days now.
I think you'll find that if you look at the oldest and newest items in the Pending Validation category you'll be losing some at the old end and getting some new ones that balance it out...

The server is still handing out tasks for work units that first went out around 11th/12th April, and a proportion of those didn't return a result before Tom's "purge"... Those will [unfortunately] require a second opinion and that goes on the back of the queue again ;-(

I haven't had a single N-Body PV item clear yet, but I only started doing N-Body just before the "purge" so my oldest PV entries are dated 18th April... I reckon those will start to disappear in about another week (and it is likely to take another fortnight or so to actually clear the entire backlog...)

Cheers - Al.



©2023 Astroinformatics Group