
Posts by alanb1951

1) Message boards : Number crunching : bad argument #0 to 'calculateEps2' (Expected 3 or 6 arguments) (Message 77014)
Posted 18 days ago by alanb1951
Post:
It looks as if orbit-fitting tasks in the new data feed (04_03_2024 in the task name) also have the problem that the previous feed (03_29_2024) had. (And like the previous data-set, the task names don't have "orbit_fitting" in them this time, which might hint at what the previous problem was!)

I still have some retries for earlier dates and those are viable.

Cheers - Al.
2) Message boards : News : Admin Updates Discussion (Message 77013)
Posted 18 days ago by alanb1951
Post:
I've not checked every failed Orbit task, but the large number I did check were all from the 03_29_2024 data feed and all seemed to have task names without "orbit_fitting" -- oops! :-)

The same seems to be the case for the 04_03_2024 data set... However, I am still processing retries for earlier, valid tasks.

Cheers - Al.

[Edited to fix a date typo (I often mess up U.S.-style dates...)]

[Edited again to note that I still get some older, viable, tasks...]
3) Message boards : News : Admin Updates Discussion (Message 76901)
Posted 9 Feb 2024 by alanb1951
Post:
Kevin, your post https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=5069&postid=76876

mentions that some users' old Separation tasks were cleared.

What were the criteria?

Why haven't all Separation tasks been cleared from the database?

I've still got 2800 Separation tasks hanging on in Valid, Invalid and Error categories.

Keith,

Those Separation tasks that haven't been cleared have been "orphaned" (i.e., their workunit records no longer exist); there was some discussion about this earlier in this thread, including a possible solution.

Cheers - Al.
4) Questions and Answers : Web site : Server Status Page (Message 76882)
Posted 6 Feb 2024 by alanb1951
Post:
The Server Status page does not reflect correct numbers in the Work Status portion at the upper right when compared to the Tasks by Application at the lower left. Tasks Unsent vs Tasks Ready to Send

Bill F

The standard Server Status page caches the information used to display the Tasks by Application section to reduce the amount of database activity needed -- it won't refresh for about an hour, after which it refreshes the next time someone accesses the page!

I think it used to refresh the Work Status part of the page separately, but the PHP I found on GitHub seems to cache that as well, so I'm [now] at a loss to explain the discrepancy...

The Work Status part of the page only does simple counts against the results table using the various result status codes, which is a lot cheaper, so it isn't cached!
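
For anyone curious, the caching involved is roughly the pattern below -- a sketch only, not the project's actual server_status.php code; the cache path, the one-hour lifetime and the database connection details are illustrative assumptions.

    <?php
    // Sketch of the time-based caching described above -- not the real server_status.php.
    // Cache path, lifetime and connection details are placeholders.
    $cache_file = "../cache/tasks_by_application.json";
    $cache_lifetime = 3600; // roughly the hour-long lag seen on the page

    if (file_exists($cache_file) && (time() - filemtime($cache_file)) < $cache_lifetime) {
        // Fresh enough: serve the cached per-application counts, no database work needed
        $counts = json_decode(file_get_contents($cache_file), true);
    } else {
        // Stale: redo the expensive per-application aggregation and re-cache it
        $db = new mysqli("localhost", "boincadm", "password", "milkyway");
        $res = $db->query("SELECT appid, COUNT(*) AS n FROM result GROUP BY appid");
        $counts = [];
        while ($row = $res->fetch_assoc()) { $counts[$row['appid']] = $row['n']; }
        file_put_contents($cache_file, json_encode($counts));
    }
    ?>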

Cheers - Al.

[Edited after a re-check on the recent PHP sources...]
5) Message boards : News : Admin Updates Discussion (Message 76861)
Posted 31 Jan 2024 by alanb1951
Post:
Kevin,

Not really a Server Update issue, but I wasn't sure where to post this...

Could you please put something on the end of the Updated GPU Requirements thread in the Number crunching forum and then lock it?

Every so often someone will post in there about their GPU not being supported, and it might be a way of letting them know the current position re GPUs! :-)

Cheers - Al.
6) Message boards : Number crunching : Updated GPU Requirements (Currently not supporting GPU tasks) (Message 76860)
Posted 31 Jan 2024 by alanb1951
Post:
So Nvidia Quadro K42000 in SLI mode and higher specs than listed is not supported? Your loss.
As the Separation project has finished, there is no longer any need for GPUs here (at present); so it isn't really their loss, is it?

Ah, well... :-)

Cheers - Al.

P.S. perhaps this thread should get a message to that effect from the MW admin and then be locked!
7) Message boards : Number crunching : Milkyway CPU usage reduced to zero, other processes after high cpu/ram usage (Message 76840)
Posted 29 Jan 2024 by alanb1951
Post:
There's good information above about giving your MW NBody tasks a better chance of not seeming to stall. In particular, freeing up at least one CPU thread for basic systems tasks is key, especially on older, slower(?) machines... (Your Einstein tasks will not be hindered in the same way because they aren't using OpenMP to provide multi-threading!)

Note that if you alter the number of "CPUs" available via the BOINC Manager but don't restart BOINC, all NBody tasks in your buffer will still be expecting to use the previous number of available cores... That part of the task configuration is managed in such a way that it cannot be reset [per existing task] unless the client is restarted :-)
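
For reference, cutting the number of threads per N-Body task (as opposed to changing the global CPU count) is normally done with an app_config.xml in the project directory -- a sketch is below. The milkyway_nbody application name, the mt plan class and the --nthreads option are my best recollection, so treat them as assumptions and check them against your own client; the count of 4 is just an example. Restart the client afterwards so buffered tasks pick it up.

    <app_config>
        <app_version>
            <app_name>milkyway_nbody</app_name>
            <plan_class>mt</plan_class>
            <avg_ncpus>4</avg_ncpus>
            <cmdline>--nthreads 4</cmdline>
        </app_version>
    </app_config>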

Cheers - Al.
8) Message boards : News : Admin Updates Discussion (Message 76838)
Posted 29 Jan 2024 by alanb1951
Post:
They're not gone from my 2 hosts; maybe it will happen soon...
Those are from 2021; it looks like Kevin will have to look at those separately. The tasks exist, but not the WUs, so it might be a bit more complicated.
There is a fairly simple script for SysAdmins in the source repository that looks as if it would do the trick if it is present on the MW site.

Its name is delete_orphan_results.php :-)
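
For anyone wondering what "orphaned" means in database terms, a rough sketch is below -- emphatically not the contents of delete_orphan_results.php, and the connection details are placeholders. An orphan is simply a result row whose workunitid no longer matches any row in the workunit table.

    <?php
    // Illustration only -- not the actual delete_orphan_results.php script.
    // result.workunitid and workunit.id are standard BOINC schema fields;
    // connection details are placeholders.
    $db = new mysqli("localhost", "boincadm", "password", "milkyway");

    $orphans = $db->query(
        "SELECT result.id FROM result
           LEFT JOIN workunit ON result.workunitid = workunit.id
          WHERE workunit.id IS NULL"
    );

    while ($row = $orphans->fetch_assoc()) {
        // A real cleanup would delete these rows; just listing them is the safe version.
        echo "orphaned result: " . $row['id'] . "\n";
    }
    ?>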

Cheers - Al.
9) Message boards : News : Admin Updates Discussion (Message 76795)
Posted 22 Jan 2024 by alanb1951
Post:
Link -- sorry if my wording wasn't clear enough; I'll try to clarify...

However, patching like that wouldn't work to get rid of the orphaned Separation results because the related workunits aren't there any longer
Sure they are; just click on any Separation WU and you get a list of tasks for that WU. So they are still in the database, including all results (they are in std_err, not separate files); the corresponding IDs are valid (why shouldn't they be? the ID only becomes invalid when the WU is purged from the db), but they are not purged for the same reason as the N-Body WUs "validated" by Kevin: no canonical result. This WU, for example, can be purged, but not all those without a canonical result.
I should perhaps have defined "orphan"... My remark was about Separation tasks that have low task numbers and workunit numbers such as 2141411706 -- good luck finding anything other than "Unable to handle request: can't find workunit" in those cases! :-) I didn't regard Separation tasks from after the mass WU/task renumbering that was needed in early 2021 as orphaned; their [parent] WUs are usually still present! (Most of my left-over Separation tasks are from 2021!)
If the strange valid tags were the result of a database hack (either explicit or using something in the Admin toolkit [which would be broken if it did that!]), it will be interesting to see what happens when the third result comes in :-) It shouldn't struggle to pick a canonical result, but...
You mean for Separation or N-Body? Separation will be stuck waiting for validation, while the validator for N-Body will simply do its job (unless something is completely broken because of the hack/forced validation).
In this case I was talking about NBody; sorry if that wasn't clear from the context :-)

Cheers - Al.
10) Message boards : News : Admin Updates Discussion (Message 76791)
Posted 22 Jan 2024 by alanb1951
Post:
Link, interesting comments...

An example from elsewhere... At the time of writing, WCG seems to be having problems getting retries issued in some circumstances; eventually the transitioner seems to notice there has been no activity and the retry tasks get sent out. It takes 6 days for that to happen, which just happens to be the deadline length...
That happens, however, at the deadline of any of the completed tasks, not the new ones; they don't have a deadline yet, they get one when they are sent out. When that happens, my guess is that the validator will put them back into the inconclusive state.
The transitioner code I looked at has a backstop mechanism whereby, in some code paths, it sets a safety-net time for when it should next look at the workunit[1]. So it presumably doesn't matter whether the retry has been sent out or is still stuck at Waiting to be sent with no deadline - the transitioner will look at the workunit anyway and will act the same as it does when it sees certain sorts of non-success returns (where it seems to push a retry into the feeder at once rather than just queueing a request...)
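
To make that backstop idea concrete, here's a sketch based on the standard BOINC schema's workunit.transition_time field -- not the actual transitioner code, and the connection details are placeholders.

    <?php
    // Sketch only -- the real transitioner is a C++ daemon. The backstop boils down to:
    // look at every workunit whose transition_time has passed, whether or not any
    // client event (a report, a timeout, etc.) triggered it.
    $db = new mysqli("localhost", "boincadm", "password", "milkyway");

    $due = $db->query("SELECT id FROM workunit WHERE transition_time < " . time());

    while ($wu = $due->fetch_assoc()) {
        // For each due workunit the daemon re-checks its results, queues any retry
        // that's needed, and records a fresh (safety-net) transition_time.
        echo "workunit due for a look: " . $wu['id'] . "\n";
    }
    ?>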

alanb1951 wrote:
It seems that the MilkyWay validator can mark the first result Valid without calling it the canonical result (and hence not awarding a credit score or invoking the assimilator!)
Here is a workunit which even has two completed-and-validated 0.00-credit results at the moment, plus one task in progress: 963764114
Both results are different, so actually they are inconclusive. I don't think the validator marked them as valid; it was more likely Kevin trying to get rid of all Separation tasks by marking all inconclusive results as valid in the hope they would be purged from the database after that. Well, they are still there, more than 48 hours after becoming valid, so that didn't work, I guess.
You may be right about how that happened; it makes more sense than the validator doing it :-), and would suggest that the "Is there any point in only sending out one initial task?" issue remains (i.e. the first result is extremely unlikely to go valid at once, rather than Inconclusive)[2]...

However, patching like that wouldn't work to get rid of the orphaned Separation results because the related workunits aren't there any longer and [as far as I can tell] the purge system operates on WUs, not individual results :-) -- I fear that the only way to be rid of them is to explicitly hack out all traces of results that have very high result-IDs and which don't have a valid workunit-ID value, and that's not a task I'd want to do without shutting down all BOINC activity for as long as it takes to do a full backup and the "hack" (which sounds familiar from comments back when Separation was shut down.)

If the strange valid tags were the result of a database hack (either explicit or using something in the Admin toolkit [which would be broken if it did that!]), it will be interesting to see what happens when the third result comes in :-) It shouldn't struggle to pick a canonical result, but...

Cheers - Al

P.S. I hope we aren't "talking past one another"...

[1] One situation where this happens is if a retry is requested when there are no other tasks still out in the field for a workunit; if there are tasks out there, it usually seems to leave the existing "next look" time in place (and there are other cases that will keep a shorter wait time on record, if I recall correctly...)

[2] That said, I don't know whether turning off the BOINC Adaptive Replication status for the application might break the TAO logic.
11) Message boards : Number crunching : 300+ n body tasks validated with no credit (Message 76781)
Posted 21 Jan 2024 by alanb1951
Post:
I'm severely tempted to stop crunching until this gets resolved.....

Any ideas?
It looks like the wing men also completed validation with no credit.

For information: this has already been raised in the News/Admin Updates Discussion thread and Kevin acknowledged it...

I've had a look through my results currently listed as valid (ignoring the large number of Separation orphans that are still listed!) and made a post about it in that thread; it may or may not be of interest... (It's quite long, so I've not re-posted it here.)

For what it's worth, I haven't [yet] seen any of these tasks that also have a validated wingman with no credit, so I can't tell whether the possible explanation for zero credit still holds when verification should have been possible; I'll have another look in a day or two to see if that changes, as I'd be interested in having a look at one or more such WUs :-)

Cheers - Al.
12) Message boards : News : Admin Updates Discussion (Message 76780)
Posted 21 Jan 2024 by alanb1951
Post:
Regarding validated results without credit...

I've just sifted through [most of] my NBody results that are listed as Valid and I note that the ones with no credit don't seem to have a canonical result yet (the validator could pass them to the assimilator if they did!), and they all have an unsent task or a wingman in progress. I also have tasks Pending Verification (listed as Validation inconclusive here), and those all have an unsent task and say "pending" in the credit column instead (as one might expect...)

It seems that the MilkyWay validator can mark the first result Valid without calling it the canonical result (and hence not awarding a credit score or invoking the assimilator!) Given the use of the Toolkit for Asynchronous Optimization, this may actually be intentional[1]. I think that when the previous runaway WU generation problems happened the Adaptive Replication wasn't working properly for NBody (I never saw a task that didn't get wingmen!), so it may not have shown the same behaviour back then, leaving the first result Pending rather than Valid!
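
In database terms the odd state looks something like the sketch below -- standard BOINC schema fields, nothing MilkyWay-specific, and the connection details are placeholders.

    <?php
    // Sketch only: results marked valid whose parent workunit has no canonical result
    // yet (and hence no credit granted). Field names are standard BOINC schema;
    // connection details are placeholders.
    $db = new mysqli("localhost", "boincadm", "password", "milkyway");

    $rows = $db->query(
        "SELECT result.id, result.granted_credit
           FROM result
           JOIN workunit ON workunit.id = result.workunitid
          WHERE result.validate_state = 1       -- VALIDATE_STATE_VALID
            AND workunit.canonical_resultid = 0 -- no canonical result chosen yet"
    );

    echo $rows->num_rows . " valid results still waiting for a canonical result\n";
    ?>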

It may well sort itself out as tasks get their confirmation results validated... I have just seen two results I returned nearly a fortnight ago that eventually got a wingman to return something on the 20th; both of those have a credit score now, though they haven't been fully assimilated (purged) yet.

How long it takes the others to catch up may depend on other aspects such as how many other WUs have tasks (initial or retry) waiting to go out, and the choice of WUs considered by the feeder on each pass; hopefully it won't take the length of the deadline interval (12 days?) to get round to forcing the retries into the feeder[2], but I fear that it might :-(

Cheers - Al.

[1] The TAO puts an extra layer of decision into the validation process, and [from a cursory investigation during the previous crash and runaway] it seems it can decide whether retries are necessary or not based on the outcome of previous workunits. (I'm willing to be told otherwise by someone at MW or by Travis Desell...)

[2] An example from elsewhere... At the time of writing, WCG seems to be having problems getting retries issued in some circumstances; eventually the transitioner seems to notice there has been no activity and the retry tasks get sent out. It takes 6 days for that to happen, which just happens to be the deadline length...
13) Message boards : News : Admin Updates Discussion (Message 76768)
Posted 19 Jan 2024 by alanb1951
Post:
I am not sure why 3 million workunits were generated. The cap was set pretty low, to 1000 (now 10,000), but it just got ignored, and I still haven't found the reason why.
It happened in the past, after server maintenance, that lots of N-Body tasks were generated during the maintenance, but then it dropped back to 1000 as the tasks were processed. This time, however, the work generators maintain the 3 million ready-to-send WUs; that's why I asked. 1000 actually worked pretty well, AFAICT: enough to always get new work when asking for it, while resend tasks went out just a few minutes after they were created. But 10,000 should work too, I guess. If you can get the work generators to follow the limit, the issue will clear itself; no need to abort anything.

I think there may be an issue in the base MilkyWay WU generator that could cause runaway WU creation if there were a transitioner backlog. If the new build is using the original generator, any fixes applied might have been lost :-(

I did a bit of a code dive at the time of the previous manifestation of this issue, and I posted about it (without going into too much technical detail) in a thread called Server Trouble. I also sent Tom a private message with details about a possible solution based on how recent versions of the example BOINC WU generator code made sure that transitioner backlogs would not cause a problem. I have no idea whether any fix applied bore any resemblance to what I highlighted :-)
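
For anyone interested, the basic cushion check in the example generator boils down to something like the sketch below -- logic only, since the real daemon is C++; the extra bookkeeping that actually guards against a transitioner backlog is omitted, and make_jobs() is a hypothetical helper. Connection details are placeholders.

    <?php
    // Logic sketch of the cushion pattern, not actual BOINC daemon code.
    // server_state = 2 is RESULT_SERVER_STATE_UNSENT in the standard schema.
    $CUSHION = 10000; // whatever cap the project sets
    $db = new mysqli("localhost", "boincadm", "password", "milkyway");

    while (true) {
        $n = $db->query("SELECT COUNT(*) AS n FROM result WHERE server_state = 2")
                ->fetch_assoc()['n'];

        if ($n >= $CUSHION) {
            sleep(10);                // plenty of work queued already, do nothing
        } else {
            make_jobs($CUSHION - $n); // hypothetical helper that creates new workunits
        }
    }
    ?>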

Just a thought...

Cheers - Al.
14) Message boards : News : Commenting on Recent Issues with the server (Message 76706)
Posted 15 Dec 2023 by alanb1951
Post:
Kevin,

A quick thought regarding "Admin only" threads -- if there's a facility that lets Admins lock threads you could create a thread with an initial message then pin it and lock it. When you need to add to it, unlock, post your next message and lock it again. This would, of course, need testing in case a locked thread can't be unlocked again :-)

If someone manages to post to the thread anyway, just delete what they post (possibly after copying it to a more appropriate place?) The first message would, as suggested elsewhere, include information on where to discuss the topic of that message, so such action seems fair :-)

Cheers - Al.
15) Message boards : Number crunching : Thread to report issues after server migration (Message 76702)
Posted 14 Dec 2023 by alanb1951
Post:
Those PHP warnings are to do with the website changes, as the application data was reloaded somewhere along the way without retaining the old data (which would've needed an application version update, I believe); as a result the PHP can't find app version information for the older items. This came up earlier in this thread and in the News section server migration thread too. Kevin indicated as much in his reply to the second post I linked...

It's irritating, but once all the older results have been assimilated/purged it shouldn't happen any longer :-)

Cheers - Al.
16) Message boards : Number crunching : Thread to report issues after server migration (Message 76645)
Posted 29 Nov 2023 by alanb1951
Post:
Conan,

As suggested by Michael Setzer II in the News forum, you could add the following to your /etc/hosts file so that your client(s) can still get at the server using the now-withdrawn name. You should be able to run down your work to a point where you can take action to restore the original site name!

128.113.126.54 milkyway-new.cs.rpi.edu


This was working for me until the various BOINC scheduler services were taken offline some time on the evening of 28th November (confirmed by the server status page); uploads are still getting through but then sit at "Ready to report" and, of course, downloads aren't possible at present (no scheduler...)

Hope this helps in some way.

Cheers - Al
17) Message boards : Number crunching : Thread to report issues after server migration (Message 76615)
Posted 14 Nov 2023 by alanb1951
Post:
Today I got a Workunit 962969273 aborted by project. What does that mean?

It means the server decided your result wasn't needed, and as you hadn't started processing it when the server was next contacted, it told your client to abort it!

This usually happens when a task is issued because a prior task goes to "No Reply" state but then returns late (often by a day or more!) -- that appears to be the case with your example.

Cheers - Al.
18) Message boards : Number crunching : Thread to report issues after server migration (Message 76579)
Posted 6 Nov 2023 by alanb1951
Post:
The certificate issue seems to be resolved now -- I was able to get MW up and running again on my Linux systems from around 15:45 UTC on 2023-11-06. It even worked as I wanted once I remembered that I needed to restore/rebuild my app_config.xml files to cut the number of threads per task :-)

Thank you Kevin and the RPI techs.

Cheers - Al.
19) Message boards : News : Migrating MilkyWay@home to a New Server (Message 76578)
Posted 6 Nov 2023 by alanb1951
Post:
The certificate issue seems to be resolved now -- I was able to get MW up and running again on my Linux systems from around 15:45 UTC on 2023-11-06.

Thank you Kevin and the RPI techs.

Cheers - Al.
20) Message boards : Number crunching : Thread to report issues after server migration (Message 76537)
Posted 4 Nov 2023 by alanb1951
Post:
Nick,

Thanks for your effort, but I rather think we're talking past one another (probably my fault) rather than communicating... What the details I saw in the Firefox certificate information had me pondering was whether all MW servers are actually sending out the same certificate chain!

What I read from your last reply is that if the browser can find certificates for the relevant intermediate and root CA issuers in its own store it won't bother to look at the rest of the chain sent by the [MW] server... The alternatives I can think of would be either

  1. Firefox sees the broken certificate and takes what action it can to work round it;
  2. the MW web server and whatever server(s) BOINC (and openssl) are hitting are actually returning different certificate chains!
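
(For the record, the standard way to see exactly which chain a given server sends is a one-liner like the following -- the host name here is just the project's web host, so substitute whichever server you want to check:

    openssl s_client -connect milkyway.cs.rpi.edu:443 -servername milkyway.cs.rpi.edu -showcerts </dev/null

but as I say below, I'm not going to pursue it.)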


Again, thanks for your efforts, but I don't think there's really any point in my pursuing this any further -- I'm not going to start digging with a network monitor, as I'm not that bothered :-) -- we've all combined to identify and report the problem, and it will probably be sorted out on Monday!

Cheers - Al.


