Welcome to MilkyWay@home

Validator Outage



Message boards : News : Validator Outage

muca
Joined: 21 May 20
Posts: 2
Credit: 9,839,709
RAC: 27,167
Message 71223 - Posted: 8 Oct 2021, 8:30:06 UTC - in response to Message 71148.  
Last modified: 8 Oct 2021, 8:43:32 UTC

@Tom: It seems they are still coming...
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=203048065
Is it possible to cancel them on the server side somehow?
ID: 71223 · Rating: 0
Frank
Joined: 2 Nov 10
Posts: 12
Credit: 915,706,801
RAC: 2,441,593
Message 71243 - Posted: 13 Oct 2021, 16:01:06 UTC

These "Invalids" are really becoming irritating. I am encountering about 450 per day, and each is consuming 1.625 times the normal compute time. Are we making any progress on eliminating the 7-WU tasks?
I won't even bitch about the "Error While Computing" tasks. They are not errors while computing, since no computing is ever done; they are really initialization errors. I don't care what they are, I just want them to go away.
On the Invalids, I am fairly certain that tasks that end up as 7 WUs are sent as 4 WUs. The assessment of 7 WUs is made by the clients' computers. Maybe what we need is a subroutine in initialization that tests the WU count before computation starts and aborts the run if it is not 4. I don't like creating "workarounds" as fixes, but it would be better than what we have today.
ID: 71243 · Rating: 0
Joseph Stateson
Joined: 18 Nov 08
Posts: 240
Credit: 1,300,845,657
RAC: 1,102,933
Message 71244 - Posted: 13 Oct 2021, 18:15:34 UTC - in response to Message 71243.  

Quoting Frank (Message 71243):
These "Invalids" are really becoming irritating. I am encountering about 450 per day, and each is consuming 1.625 times the normal compute time. Are we making any progress on eliminating the 7-WU tasks?
I won't even bitch about the "Error While Computing" tasks. They are not errors while computing, since no computing is ever done; they are really initialization errors. I don't care what they are, I just want them to go away.
On the Invalids, I am fairly certain that tasks that end up as 7 WUs are sent as 4 WUs. The assessment of 7 WUs is made by the clients' computers. Maybe what we need is a subroutine in initialization that tests the WU count before computation starts and aborts the run if it is not 4. I don't like creating "workarounds" as fixes, but it would be better than what we have today.


The separation source file at "milkywayathome_client/separation/separation_main.c"

has the following line of code: "mw_printf("<number_WUs> %d </number_WUs>\n", ap.totalWUs);"

Putting the following under it would abort all work units that report 7 WUs:
if (ap.totalWUs == 7)
{
    exit(EXIT_FAILURE);
}

This would avoid crunching the 7-WU tasks, but you would end up with a lot of "error" results, which might cause the daily quota to be exceeded. The number of GPUs can always be faked to get more work, but I suspect it is better if the project guru fixes the problem.
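A slightly more general form of the same guard can be sketched as below. Rather than hard-coding 7, it compares the reported count against the expected bundle size. The helper name wu_count_is_expected and its expected_wus parameter are hypothetical and not part of the client source; the value 4 for a normal bundle is taken from this thread, not from the code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of milkywayathome_client.
 * Returns true when the reported work-unit count matches the expected
 * bundle size. The caller decides how to abort when it does not. */
static bool wu_count_is_expected(int total_wus, int expected_wus)
{
    assert(expected_wus > 0); /* a bundle must contain at least one WU */
    return total_wus == expected_wus;
}
```

In separation_main.c this might be used as "if (!wu_count_is_expected(ap.totalWUs, 4)) { exit(EXIT_FAILURE); }", assuming every valid task really is a 4-WU bundle. The same caveat about piling up "error" results applies.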

It has been over a year since I last built the client
https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4551#69402

It appears there have been a few changes since then.
https://github.com/Milkyway-at-home/milkywayathome_client
ID: 71244 · Rating: 0
Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 145
Credit: 60,115,648
RAC: 126,343
Message 71245 - Posted: 13 Oct 2021, 20:28:51 UTC

Turns out there are a few WUs with 4x the number of parameters that got stuck in the parent population for some runs. Absolutely no idea how that happened, but I should be able to make it so that it truncates that down to the normal number when it generates new WUs.

I can explain a bit more when I implement a fix, might try to get that done tonight. We'll see.

Technical bits below:

When the XML file (technically, it's the thing that writes the XML file) computes the number of parameters in a job, it only considers the number of parameters in the first WU and ignores the number in the other WUs of the bundle. In this case, if the first WU has the wrong number (say, 4x as many parameters), it will estimate incorrectly. This means that the number of parameters that the XML file estimates is (4x4)x26 instead of (4)x26, so you get 416 instead of the expected 104.

However, when the XML generator calculates the number of WUs, it takes the total number of parameters (actually (1x4)x26 + (3x1)x26 = 182, because the other 3 bundled WUs are normal) and divides that by 26, which gives 7. That is why the XML file thinks that there are 7 WUs but 416 parameters, when 416 parameters should actually correspond to 16 bundled WUs.
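The two counts described above can be reproduced with a short sketch. The function names and the array layout are hypothetical and only mirror the arithmetic in this post; they are not the server's actual code.

```c
#include <assert.h>

#define PARAMS_PER_WU 26 /* parameters in one normal WU, per this post */

/* Parameter count as the XML writer appears to estimate it: the size of
 * the first WU is assumed for every WU in the bundle. */
static int estimated_params(int first_wu_params, int bundle_size)
{
    assert(first_wu_params > 0 && bundle_size > 0);
    return first_wu_params * bundle_size;
}

/* WU count derived from the true total number of parameters. */
static int derived_wu_count(const int *wu_params, int bundle_size)
{
    int total = 0;
    for (int i = 0; i < bundle_size; i++) {
        total += wu_params[i];
    }
    return total / PARAMS_PER_WU;
}
```

With bundle sizes {104, 26, 26, 26}, estimated_params gives 104 x 4 = 416 while derived_wu_count gives 182 / 26 = 7, which is the mismatch seen in the invalid tasks. A healthy bundle {26, 26, 26, 26} gives 104 and 4.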

I should figure out why these 182-parameter WUs happen, but I'll have to do that later and just try to implement a patch right now. This only became noticeable because the faulty WUs happened to return good likelihood scores (not sure if those are legitimate or not -- the first 26 parameters look realistic, but the other 26x3 are just scraped from who knows what in memory... could be that the program only looks at the first 26 when calculating the likelihood of that WU, so that run would actually be good.) Alternatively I could just try to figure out how to purge those members from the population (or just remove the extra 3x26 parameters), which should in theory stop the bad WUs from being generated.

I know that it's later than you all would have liked, but know that I've heard your pleas and I'm trying to do something about it.
ID: 71245 · Rating: 0
Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 145
Credit: 60,115,648
RAC: 126,343
Message 71246 - Posted: 13 Oct 2021, 21:31:55 UTC
Last modified: 14 Oct 2021, 3:40:19 UTC

I've gone through and truncated all of the faulty population members. Nothing seems broken as far as I can tell. Over the next few days, you should hopefully see a decrease in the number of invalid runs, since all WUs generated from this point forward should have the correct size -- if not, I'll have to take some more drastic measures.

Technical Bits Below:

Curiously, the faulty members were all in the 0th spot in the population. I think that's because when generating the new WUs from the parents, it takes the size of the parent with the lower id number -- so, if these weird WUs aren't in the 0th place, their children won't have the wrong size, and any replacements will be guaranteed to remove that wrong-size WU (although this is just a guess, I could be totally wrong on this one). Also, curiously, the runs that are currently up were the only runs to have been affected by this. The server code hasn't changed at all since I ran previous runs (except for the validator, which changed after this problem began), so I'm not sure what's up with that. But the same issue was present in every single one of them.
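The guess above, together with the truncation fix, can be illustrated with a small sketch. Everything here is hypothetical: the member_t struct, the idea that the id is the position in the population, and both function names are assumptions made for illustration, not the server's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

#define PARAMS_PER_WU 26 /* normal number of parameters, per this thread */

/* Hypothetical population member: only the fields needed here. */
typedef struct {
    int id;          /* assumed to be the position in the population */
    size_t n_params; /* number of parameters stored for this member */
} member_t;

/* Guessed rule: a child takes the size of the parent with the lower id. */
static size_t child_size(const member_t *a, const member_t *b)
{
    assert(a != NULL && b != NULL);
    return (a->id < b->id) ? a->n_params : b->n_params;
}

/* Truncate an oversized member to the normal size.
 * Returns 1 if the member was changed, 0 otherwise. */
static int truncate_member(member_t *m)
{
    assert(m != NULL);
    if (m->n_params > PARAMS_PER_WU) {
        m->n_params = PARAMS_PER_WU;
        return 1;
    }
    return 0;
}
```

Under this rule an oversized member at id 0 passes its size to every child it helps produce, while an oversized member at a higher id never does. That would be consistent with the faulty members only surviving in the 0th spot, though it remains a guess.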

I'll keep an eye on things and see if the 0th spot gets updated with an oversized WU again. Hopefully this is all resolved, but maybe not.

Thanks for your patience with this, I have a thousand projects all going on so it's not always easy for me to get around to this one!
ID: 71246 · Rating: 0
Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 145
Credit: 60,115,648
RAC: 126,343
Message 71247 - Posted: 14 Oct 2021, 17:50:15 UTC
Last modified: 14 Oct 2021, 17:50:53 UTC

The validation error rate is slowly falling (the number of invalid results is down 15% in the last 4 hours). Currently the server is at about a 3.2% invalid rate, down from 3.5% 4 hours ago. Hopefully this trend continues.

I expect that in a few days to a week we will see things return to normal as all of the faulty WUs pass through the system.
ID: 71247 · Rating: 0
Septimus
Joined: 8 Nov 11
Posts: 12
Credit: 108,629
RAC: 4,921
Message 71250 - Posted: 15 Oct 2021, 6:52:10 UTC - in response to Message 71247.  

Thanks, errors down, had 3 yesterday, 2 were due to the 7 WU problem, 1 presumably genuine.
ID: 71250 · Rating: 0
Max_Pirx
Joined: 13 Dec 17
Posts: 17
Credit: 1,237,649,722
RAC: 1,520,370
Message 71251 - Posted: 15 Oct 2021, 17:33:58 UTC

Yep, invalid tasks are going down for me too. From 640 a couple of days ago down to 380 at the moment. Thanks for sorting this out and cheers!
ID: 71251 · Rating: 0
Frank
Joined: 2 Nov 10
Posts: 12
Credit: 915,706,801
RAC: 2,441,593
Message 71252 - Posted: 17 Oct 2021, 14:35:12 UTC

Tom Donlon, you done good.
In the last 24 hours I have completed 11,000 tasks and experienced 2 Invalids (7 WU) and 0 Errors While Computing.
ID: 71252 · Rating: 0
Joseph Stateson
Joined: 18 Nov 08
Posts: 240
Credit: 1,300,845,657
RAC: 1,102,933
Message 71253 - Posted: 17 Oct 2021, 14:57:01 UTC

Really appreciate the effort to fix this!
Out of 17,000+ only 5 in last 24 hours.
ID: 71253 · Rating: 0

©2021 Astroinformatics Group