Welcome to MilkyWay@home

Admin Updates Discussion

Profile Kevin Roux
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Aug 22
Posts: 82
Credit: 2,853,631
RAC: 7,036
Message 76711 - Posted: 15 Dec 2023, 19:53:02 UTC
Last modified: 15 Dec 2023, 19:56:52 UTC

Hello everyone,

This thread will be used to discuss posts made in Admin Updates.

Thanks,
Kevin
ID: 76711 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,299,838
RAC: 2,590
Message 76762 - Posted: 18 Jan 2024, 11:10:50 UTC

Cancelled all of the "Ready to send" separation workunits
I guess you need to cancel separation tasks "waiting for assimilation" now, so they can finally be removed from our results lists.

Regarding one other change that appeared after the server maintenance: is ~3 million ready-to-send N-Body tasks the new target for the work generators? With that, it will take up to two weeks before the _1 and any additional resend tasks make it through that pile (when we had 1.5 million, they needed around 5-7 days).
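
The two-week figure above follows from simple queue arithmetic. Here is a rough sketch, assuming the unsent queue is served first-in, first-out and that overall throughput stays what it was when 1.5 million tasks took 5-7 days to clear. The function name `drain_days` is made up for illustration.

```python
# Rough FIFO estimate: a resend task queued behind `backlog` unsent tasks
# waits about backlog / throughput days before it is sent out.

def drain_days(backlog: int, tasks_per_day: float) -> float:
    """Days until a task at the back of a FIFO queue reaches the front."""
    return backlog / tasks_per_day

# Throughput implied by the post: ~1.5 million tasks cleared in 5-7 days.
slow = 1_500_000 / 7   # tasks per day if the old backlog took 7 days
fast = 1_500_000 / 5   # tasks per day if the old backlog took 5 days

low = drain_days(3_000_000, fast)
high = drain_days(3_000_000, slow)
print(f"3 million unsent tasks: roughly {low:.0f}-{high:.0f} days")
# prints: 3 million unsent tasks: roughly 10-14 days
```

Under those assumptions, doubling the backlog simply doubles the wait, which matches the "up to two weeks" estimate.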

And since N-Body always seems to need two results to validate, wouldn't it make sense to set minimum quorum, and with it initial replication, to 2? Or are there any WUs that validate with only one result?
ID: 76762 · Rating: 0
Profile Finn the Human
Joined: 23 Dec 18
Posts: 23
Credit: 10,213,119
RAC: 0
Message 76763 - Posted: 18 Jan 2024, 21:18:36 UTC

I haven't had any tasks validated for the past two days, and I now have 80+ marked validation inconclusive because of that. Maybe we should prioritize sending WUs that need validation?
Everything stays
But it still changes
Ever so slightly
Daily and nightly
In little ways
When everything stays...

ID: 76763 · Rating: 0
Profile Kevin Roux
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Aug 22
Posts: 82
Credit: 2,853,631
RAC: 7,036
Message 76764 - Posted: 18 Jan 2024, 21:42:52 UTC - in response to Message 76762.  
Last modified: 23 Jan 2024, 7:09:44 UTC

Link wrote:
I guess you need to cancel separation tasks "waiting for assimilation" now, so they can finally be removed from our results lists.

Regarding one other change that appeared after the server maintenance: is ~3 million ready-to-send N-Body tasks the new target for the work generators? With that, it will take up to two weeks before the _1 and any additional resend tasks make it through that pile (when we had 1.5 million, they needed around 5-7 days).

And since N-Body always seems to need two results to validate, wouldn't it make sense to set minimum quorum, and with it initial replication, to 2? Or are there any WUs that validate with only one result?

I am not sure why 3 million workunits were generated. The cap was set pretty low at 1000 (now 10,000), but it was simply ignored, and I still haven't found the reason why. I will try to go in and remove/cancel these workunits so validations can be done.
ID: 76764 · Rating: 0
Profile mikey
Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 0
Message 76766 - Posted: 19 Jan 2024, 1:07:47 UTC - in response to Message 76764.  

Link wrote:
I guess you need to cancel separation tasks "waiting for assimilation" now, so they can finally be removed from our results lists.

Regarding one other change that appeared after the server maintenance: is ~3 million ready-to-send N-Body tasks the new target for the work generators? With that, it will take up to two weeks before the _1 and any additional resend tasks make it through that pile (when we had 1.5 million, they needed around 5-7 days).

And since N-Body always seems to need two results to validate, wouldn't it make sense to set minimum quorum, and with it initial replication, to 2? Or are there any WUs that validate with only one result?

Kevin Roux wrote:
I am not sure why 3 million workunits were generated. The cap was set pretty low at 1000 (now 10,000), but it was simply ignored, and I still haven't found the reason why. I will try to go in and remove/cancel these workunits so validations can be done.

I obviously have no clue how familiar you are with the BOINC server-side code, but apparently settings live in SEVERAL places in the code instead of just one. For example, an admin at a different project tried to change the credit given out for a task and found it was hard-coded in at least 3 different sections. I'm NOT asking for a credit award change; I'm just using it as an example.

Also, I don't know if you know it, but there is a BOINC admin email group where you can put questions to other BOINC project admins, who may have already been through what you are (I'm assuming) still learning about.
ID: 76766 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,299,838
RAC: 2,590
Message 76767 - Posted: 19 Jan 2024, 10:22:34 UTC - in response to Message 76764.  
Last modified: 19 Jan 2024, 10:26:34 UTC

Kevin Roux wrote:
I am not sure why 3 million workunits were generated. The cap was set pretty low at 1000 (now 10,000), but it was simply ignored, and I still haven't found the reason why.
It has happened in the past that lots of N-Body tasks were generated during server maintenance, but then the number dropped to 1000 as the tasks were processed. This time, however, the work generators are maintaining the 3 million ready-to-send WUs; that's why I asked. 1000 actually worked pretty well AFAICT: enough to always get new work when asking for it, while resend tasks went out just a few minutes after they were created. But 10,000 should work too, I guess. If you can get the work generators to follow the limit, the issue will clear itself; no need to abort anything.
ID: 76767 · Rating: 0
alanb1951

Joined: 16 Mar 10
Posts: 213
Credit: 108,362,921
RAC: 4,460
Message 76768 - Posted: 19 Jan 2024, 12:10:32 UTC - in response to Message 76767.  
Last modified: 19 Jan 2024, 12:12:05 UTC

Kevin Roux wrote:
I am not sure why 3 million workunits were generated. The cap was set pretty low at 1000 (now 10,000), but it was simply ignored, and I still haven't found the reason why.
Link wrote:
It has happened in the past that lots of N-Body tasks were generated during server maintenance, but then the number dropped to 1000 as the tasks were processed. This time, however, the work generators are maintaining the 3 million ready-to-send WUs; that's why I asked. 1000 actually worked pretty well AFAICT: enough to always get new work when asking for it, while resend tasks went out just a few minutes after they were created. But 10,000 should work too, I guess. If you can get the work generators to follow the limit, the issue will clear itself; no need to abort anything.

I think there may be an issue in the base MilkyWay WU generator that could cause runaway WU creation if there was a transitioner backlog. If the new build is using the original generator, any fixes applied might have been lost :-(

I did a bit of a code dive at the time of the previous manifestation of this issue, and I posted about it (without going into too much technical detail) in a thread called Server Trouble. I also sent Tom a private message with details about a possible solution, based on how recent versions of the example BOINC WU generator code made sure that transitioner backlogs would not cause a problem. I have no idea whether any fix applied bore any resemblance to what I highlighted :-)
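
To illustrate the failure mode described above, here is a toy simulation. It is not BOINC or MilkyWay code: the `Server` class, the `run` function, and all the numbers are made up for the sketch. It assumes a generator that tops up the queue whenever the unsent count is below a cushion, and a transitioner that is stalled for a while, so newly created workunits do not yet show up as unsent results.

```python
from dataclasses import dataclass

@dataclass
class Server:
    unsent: int = 0    # results ready to send
    pending: int = 0   # workunits created but not yet transitioned into results

def transition(server: Server, capacity: int) -> None:
    """Transitioner turns up to `capacity` pending workunits into unsent results."""
    moved = min(server.pending, capacity)
    server.pending -= moved
    server.unsent += moved

def run(wait_for_transitioner: bool, ticks: int = 100, cushion: int = 1000,
        batch: int = 500, stalled_ticks: int = 50) -> int:
    """Return the number of workunits created over `ticks` generator passes."""
    server = Server()
    created = 0
    for tick in range(ticks):
        # The transitioner is stalled (backlogged) for the first `stalled_ticks` passes.
        transition(server, 0 if tick < stalled_ticks else 10**9)
        if wait_for_transitioner and server.pending > 0:
            continue  # earlier batch is not visible as results yet; do not add more
        if server.unsent < cushion:
            server.pending += batch
            created += batch
    return created

print("naive generator:  ", run(wait_for_transitioner=False))  # 25000
print("guarded generator:", run(wait_for_transitioner=True))   # 1000
```

In this sketch the naive generator overshoots a cushion of 1000 by a factor of 25, because it keeps seeing an empty unsent queue while the transitioner is stalled. The guarded variant stops at the cushion because it waits until its previous batch has been transitioned before creating more.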

Just a thought...

Cheers - Al.
ID: 76768 · Rating: 0
John154

Joined: 2 Aug 20
Posts: 1
Credit: 14,982,887
RAC: 56,580
Message 76769 - Posted: 19 Jan 2024, 20:05:12 UTC

It seems that all of my pending jobs have now been marked as validated with no credit. Is this a temporary problem? (e.g. https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=963887023)
ID: 76769 · Rating: 0
Profile Kevin Roux
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Aug 22
Posts: 82
Credit: 2,853,631
RAC: 7,036
Message 76771 - Posted: 19 Jan 2024, 20:26:10 UTC - in response to Message 76769.  

John154 wrote:
It seems that all of my pending jobs have now been marked as validated with no credit. Is this a temporary problem? (e.g. https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=963887023)


I will take a look at that.
ID: 76771 · Rating: 0
Eden_H [E.R.]

Joined: 27 Jun 14
Posts: 3
Credit: 3,445,901
RAC: 0
Message 76772 - Posted: 20 Jan 2024, 0:21:56 UTC - in response to Message 76769.  
Last modified: 20 Jan 2024, 0:29:15 UTC

Same issue here: all of yesterday's and today's WUs validated with 0.00 credits.
ID: 76772 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,299,838
RAC: 2,590
Message 76773 - Posted: 20 Jan 2024, 10:08:29 UTC
Last modified: 20 Jan 2024, 10:13:30 UTC

Same here. Interestingly, the resend tasks waiting in the queue have not been canceled; they are still "unsent" and not "didn't need", as is usual when a WU validates late, after a resend task has been created but before it was sent out.
ID: 76773 · Rating: 0
Profile mikey
Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 0
Message 76774 - Posted: 20 Jan 2024, 11:47:23 UTC - in response to Message 76772.  

Eden_H [E.R.] wrote:
Same issue here: all of yesterday's and today's WUs validated with 0.00 credits.

Mine too, but there is a new thing, I THINK: it now says "initial replication 2".

name de_nbody_11_02_2023_v183_pal5__data__3_1705426859_64425
application Milkyway@home N-Body Simulation
created 16 Jan 2024, 17:51:23 UTC
minimum quorum 1
initial replication 2

task 932928572
computer 857711
sent 19 Jan 2024, 7:50:31 UTC
time reported 19 Jan 2024, 18:27:19 UTC
status Completed and validated
run time (sec) 24,333.81
CPU time (sec) 45,806.73
credit 0.00
application Milkyway@home N-Body Simulation v1.83 (mt) windows_x86_64
ID: 76774 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,299,838
RAC: 2,590
Message 76775 - Posted: 20 Jan 2024, 12:22:23 UTC - in response to Message 76774.  

mikey wrote:
Mine too, but there is a new thing, I THINK: it now says "initial replication 2".
No, that's not new. For some reason, "initial replication" always increases by one here when a returned result becomes inconclusive; for the old separation tasks you can even find an initial replication of 3 if two results had already been returned. WUs for which the first result has not been returned yet still have an initial replication of 1.
ID: 76775 · Rating: 0
Profile Finn the Human
Joined: 23 Dec 18
Posts: 23
Credit: 10,213,119
RAC: 0
Message 76777 - Posted: 20 Jan 2024, 22:35:33 UTC

Well, I'll probably stop crunching for a bit till this gets sorted.
Everything stays
But it still changes
Ever so slightly
Daily and nightly
In little ways
When everything stays...

ID: 76777 · Rating: 0
Eden_H [E.R.]

Joined: 27 Jun 14
Posts: 3
Credit: 3,445,901
RAC: 0
Message 76778 - Posted: 21 Jan 2024, 1:54:59 UTC - in response to Message 76777.  

Finn the Human wrote:
Well, I'll probably stop crunching for a bit till this gets sorted.

Agreed. We're donating, not squandering, especially with the current cost of electricity.
ID: 76778 · Rating: 0
alanb1951

Joined: 16 Mar 10
Posts: 213
Credit: 108,362,921
RAC: 4,460
Message 76780 - Posted: 21 Jan 2024, 8:22:37 UTC

Regarding validated results without credit...

I've just sifted through [most of] my NBody results that are listed as Valid, and I note that the ones with no credit don't seem to have a canonical result yet (the validator could pass them to the assimilator if they did!) and they all have an unsent task or a wingman in progress. I also have tasks Pending Verification (listed as Validation inconclusive here), and those all have an unsent task and say "pending" in the credit column instead (as one might expect...)

It seems that the MilkyWay validator can mark the first result Valid without calling it the canonical result (and hence not awarding a credit score or invoking the assimilator!) Given the use of the Toolkit for Asynchronous Optimization, this may actually be intentional [1]. I think that when the previous runaway WU generation problems happened, the Adaptive Replication wasn't working properly for NBody (I never saw a task that didn't get wingmen!), so it may not have shown the same behaviour back then, leaving the first result Pending rather than Valid!

It may well sort itself out as tasks get their confirmation results validated... I have just seen two results I returned nearly a fortnight ago that eventually got a wingman to return something on the 20th; both of those have a credit score now, and they hadn't been fully assimilated (purged) yet.

How long it takes the others to catch up may depend on other aspects, such as how many other WUs have tasks (initial or retry) waiting to go out and the choice of WUs considered by the feeder on each pass; hopefully it won't take the length of the deadline interval (12 days?) to get round to forcing the retries into the feeder [2], but I fear that it might :-(

Cheers - Al.

[1] The TAO puts an extra layer of decision into the validation process, and [from a cursory investigation during the previous crash and runaway] it seems it can decide whether retries are necessary or not based on the outcome of previous workunits. (I'm willing to be told otherwise by someone at MW or by Travis Desell...)

[2] An example from elsewhere... At the time of writing, WCG seems to be having problems getting retries issued in some circumstances; eventually the transitioner seems to notice there has been no activity and the retry tasks get sent out. It takes 6 days for that to happen, which just happens to be the deadline length...
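
The distinction drawn in this post, between a result being listed as Valid and a workunit actually having a canonical result, can be sketched roughly as follows. This is a simplified model and not the MilkyWay or TAO validator: the function names, the use of plain strings as result outputs, and the effective quorum of 2 are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

def canonical_result(outputs: Sequence[str], min_quorum: int) -> Optional[str]:
    """Return the output shared by at least `min_quorum` results, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= min_quorum else None

def credit_column(outputs: Sequence[str], min_quorum: int) -> str:
    """Credit is only granted once the workunit has a canonical result."""
    found = canonical_result(outputs, min_quorum) is not None
    return "granted" if found else "pending"

# One result back, still waiting for a wingman: no canonical result yet.
print(credit_column(["a"], min_quorum=2))            # pending
# Two results that disagree: inconclusive, a further task is needed.
print(credit_column(["a", "b"], min_quorum=2))       # pending
# A third result agrees with the first: canonical result found.
print(credit_column(["a", "b", "a"], min_quorum=2))  # granted
```

In this model a result cannot earn credit until enough matching results exist, which is consistent with the observation that the 0.00-credit tasks all still have an unsent task or a wingman in progress.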
ID: 76780 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,299,838
RAC: 2,590
Message 76782 - Posted: 21 Jan 2024, 10:19:09 UTC - in response to Message 76777.  

Finn the Human wrote:
Well, I'll probably stop crunching for a bit till this gets sorted.
Probably not necessary IMHO. Results returned now seem to become inconclusive as usual, so they should validate one day; the more we crunch, the sooner this will happen.

The _1+ tasks needed for proper validation of the 0-credit tasks have also not been canceled after the "validation", so that should sort itself out as well once they are sent out and returned (unless the servers decide they are not needed before sending them out, but that's nothing we can change by not crunching for Milkyway anyway; only the project admin can fix it).
ID: 76782 · Rating: 0
xii5ku

Joined: 1 Jan 17
Posts: 37
Credit: 111,047,395
RAC: 34,892
Message 76784 - Posted: 21 Jan 2024, 15:26:11 UTC - in response to Message 76780.  
Last modified: 21 Jan 2024, 15:57:31 UTC

alanb1951 wrote:
It seems that the MilkyWay validator can mark the first result Valid without calling it the canonical result (and hence not awarding a credit score or invoking the assimilator!)
Here is a workunit which even has two completed-and-validated 0.00-credit results at the moment, plus one task in progress: 963764114

PS,
here is a timeline of various server_status.php data, if it helps in any way: https://grafana.kiska.pw/d/boinc/boinc?orgId=1&var-project=milkyway@home&from=1704585600000&to=now
ID: 76784 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 624
Credit: 19,299,838
RAC: 2,590
Message 76785 - Posted: 21 Jan 2024, 16:23:27 UTC - in response to Message 76784.  

alanb1951 wrote:
An example from elsewhere... At the time of writing, WCG seems to be having problems getting retries issued in some circumstances; eventually the transitioner seems to notice there has been no activity and the retry tasks get sent out. It takes 6 days for that to happen, which just happens to be the deadline length...
That happens, however, at the deadline of any of the completed tasks, not the new ones; those don't have a deadline yet, as they only get one when they are sent out. When that happens, my guess is that the validator will put them back into the inconclusive state.


alanb1951 wrote:
It seems that the MilkyWay validator can mark the first result Valid without calling it the canonical result (and hence not awarding a credit score or invoking the assimilator!)
xii5ku wrote:
Here is a workunit which even has two completed-and-validated 0.00-credit results at the moment, plus one task in progress: 963764114
Both results are different, so they are actually inconclusive. I don't think the validator marked them as valid; more likely it was Kevin trying to get rid of all separation tasks by marking all inconclusive results as valid, in the hope that they would be purged from the database after that. Well, they are still there more than 48 hours after becoming valid, so I guess that didn't work.
ID: 76785 · Rating: 0
Eden_H [E.R.]

Joined: 27 Jun 14
Posts: 3
Credit: 3,445,901
RAC: 0
Message 76786 - Posted: 21 Jan 2024, 19:32:14 UTC - in response to Message 76782.  

Link wrote:
... but that's nothing we can change by not crunching for Milkyway anyway; only the project admin can fix it).


Well, by not crunching we can express our disappointment (and motivate project admin to fix the issue).
ID: 76786 · Rating: 0

©2024 Astroinformatics Group