Welcome to MilkyWay@home

Separation Validator Updates/Brief Server Outage(s)

Message boards : News : Separation Validator Updates/Brief Server Outage(s)
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70965 - Posted: 15 Jul 2021, 18:41:01 UTC

Hello Everyone,

I will be updating the separation validator starting at 3 PM ET. The server will go down for a short time and then come back up. If the new validator causes problems, the server will go back down again so I can revert to the old validator. I will be monitoring the situation and would appreciate input on any workunits that fail validation after the new validator goes live.

The server may go down/back up a few times during this process. Thanks for your patience. I'll keep you all posted on the status of things as they happen.

Best,
Tom
ID: 70965 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70966 - Posted: 15 Jul 2021, 19:04:49 UTC - in response to Message 70965.  

The first server outage is finished, and the validator started successfully. I'm monitoring the validator now to see if it throws any errors or goes down unexpectedly.
ID: 70966 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70968 - Posted: 15 Jul 2021, 19:18:00 UTC

I've already seen some results successfully validated by the new validator. Fingers crossed things are working.
ID: 70968 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 542,918,282
RAC: 151,810
Message 70971 - Posted: 16 Jul 2021, 0:09:52 UTC

All my validate errors today are from BEFORE the validator changeover. Keeping my fingers crossed the high invalid rate is cured.
ID: 70971 · Rating: 0
Siran d'Vel'nahr
Joined: 1 Jul 08
Posts: 88
Credit: 25,079,058
RAC: 0
Message 70972 - Posted: 16 Jul 2021, 9:14:14 UTC

Greetings,

My RAC is going back up and my Invalid number is decreasing. I believe you found the sweet spot, Tom. :-)

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO - L L & P _\\//
USS Vre'kasht NCC-33187
Winders 10 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 70972 · Rating: 0
Cavalary
Joined: 23 Aug 11
Posts: 33
Credit: 11,062,253
RAC: 0
Message 70973 - Posted: 17 Jul 2021, 19:37:31 UTC

Seeing 16 more validate errors since this was up, and 2 more inconclusives with wingman results...
ID: 70973 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70974 - Posted: 17 Jul 2021, 19:55:38 UTC

Unfortunately, it won't remove validate errors entirely, just the ones that are due to computer precision problems involving the gap in the data. Other validate errors will still happen at their normal rate.

Based on the server stats, it looked like this fix cut the validate error rate to about 1/10 of what it was before.
ID: 70974 · Rating: 0
paris
Joined: 26 Apr 08
Posts: 87
Credit: 64,801,496
RAC: 0
Message 70975 - Posted: 17 Jul 2021, 20:40:01 UTC

Looks promising. My invalid count has gone way down and my RAC is climbing again. Thank you for your work on this.


Plus SETI Classic = 21,082 WUs
ID: 70975 · Rating: 0
Bill F
Joined: 4 Jul 09
Posts: 86
Credit: 16,758,291
RAC: 4,085
Message 70976 - Posted: 18 Jul 2021, 2:03:13 UTC

Similar results here: my invalid rate has dropped significantly. I am one happy cruncher.

Thank you
Bill F
In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.


ID: 70976 · Rating: 0
Jimbocous
Joined: 7 Mar 20
Posts: 22
Credit: 104,945,714
RAC: 12,072
Message 70977 - Posted: 18 Jul 2021, 3:02:15 UTC

Much better here as well. Thanks for the fix!
ID: 70977 · Rating: 0
Cavalary
Joined: 23 Aug 11
Posts: 33
Credit: 11,062,253
RAC: 0
Message 70978 - Posted: 18 Jul 2021, 14:59:55 UTC - in response to Message 70974.  

Unfortunately, it won't remove validate errors entirely, just the ones that are due to computer precision problems involving the gap in the data. Other validate errors will still happen at their normal rate.

Based on the server stats, it looked like this fix cut the validate error rate to about 1/10 of what it was before.

Well, had very few invalids before, and now the number is still significant, as my RAC graph shows.
I also just saw a few WUs take an unusually long time, 75% more than normal in terms of actual CPU time:
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=258544929
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=257964475
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=257740404
This last one being validated, and having the regular number of credits, so it'd seem that the flops are considered to be standard.
ID: 70978 · Rating: 0
Siran d'Vel'nahr
Joined: 1 Jul 08
Posts: 88
Credit: 25,079,058
RAC: 0
Message 70979 - Posted: 18 Jul 2021, 16:27:55 UTC

The only invalids I have now are 13 from January and February. All recent ones are gone. Woohoo! :-)

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO - L L & P _\\//
USS Vre'kasht NCC-33187
Winders 10 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 70979 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70980 - Posted: 18 Jul 2021, 17:35:33 UTC - in response to Message 70978.  

Hmm, I've seen other reports of some separation WUs taking very long to crunch on certain CPUs. I wasn't able to find any common theme in them, though. It could be something to do with the type of CPU being used or some setting on the users' machines.

If it keeps being a problem I can take a look at it, but it doesn't seem like it's every user, and it doesn't seem like it's a high fraction of WUs for the users who are reporting it.
ID: 70980 · Rating: 0
Cavalary
Joined: 23 Aug 11
Posts: 33
Credit: 11,062,253
RAC: 0
Message 70981 - Posted: 20 Jul 2021, 1:27:16 UTC

Here's another long one: https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=259505534

And speaking of CPUs, could it be that I'm still being hit by this, with many invalids and my RAC going down, because I run CPU only while most results come from GPUs? If a WU is sent to me and to someone who runs it on a GPU, and there's a small but sufficiently significant difference to require a 3rd result, that 3rd one is likely to also come from a GPU. The wingman results are then more likely to be close to each other and leave mine out.
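
The scenario described here can be sketched in a few lines of Python. This is only an illustration of the hypothesis, not the project's actual validator: the function name `agreeing_results`, the tolerance, the host labels, and the likelihood values are all made up for the example.

```python
from itertools import combinations

def agreeing_results(results, tol=1e-9):
    """Return the labels of the first pair of results that match within tol, else None.

    Hypothetical stand-in for a quorum check; not the real validator logic.
    """
    for (name_a, val_a), (name_b, val_b) in combinations(results.items(), 2):
        if abs(val_a - val_b) <= tol:
            return {name_a, name_b}
    return None

# Made-up values: the CPU result differs slightly from the GPU results.
first_two = {"cpu_host": -3.1234567, "gpu_host_1": -3.1234600}
print(agreeing_results(first_two))  # no agreeing pair yet, so a 3rd result is needed

with_third = dict(first_two, gpu_host_2=-3.1234600)
print(agreeing_results(with_third))  # the two GPU hosts agree; the CPU host is left out
```

If the third result also comes from a GPU and matches the first GPU result, the pair that agrees is the GPU pair, which is the outcome the question above is asking about.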
ID: 70981 · Rating: 0
Nigel Garvey

Joined: 2 Apr 11
Posts: 14
Credit: 4,527,461
RAC: 0
Message 70982 - Posted: 20 Jul 2021, 11:47:23 UTC

I have two validation errors showing at the moment. Both are a waste of a couple of hours of a healthy machine's CPU time. It seems to be macOS vs. Windows and Linux rather than CPU vs. GPU.
NG
ID: 70982 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70983 - Posted: 20 Jul 2021, 15:44:07 UTC
Last modified: 20 Jul 2021, 15:46:16 UTC

On average, validate errors should be ~1% of the server's total WU load. This can be due to a few different things, such as problems with individual machines, bad drivers, mismatches between certain types of machines, bad spots in the optimization likelihood surface, and bugs in the program. This is a challenge with running the server: you want the tolerance for validation to be lenient enough that small numerical errors don't cause a good WU to be thrown out, but you want the tolerance to be stringent enough that you aren't letting faulty WUs through. Based on this philosophy, it would actually be bad if there were zero validate errors on the server, since it would mean that we are probably letting too many shady WUs through.

Apologies if it means that you end up wasting time on crunching WUs that end up having validate errors. If a substantial fraction of the WUs on your machine have validate errors, that may point to a problem with your machine and not MilkyWay@home, since the global server validate error rate is only ~1.5%. It is common for Linux vs. Windows vs. macOS machines to mismatch on validation, since machines running different OSs will have different drivers, which will probably treat computer precision differently.

I just want to be clear that this fix doesn't mean that validate errors will go away entirely -- it just means that people shouldn't be seeing the 20% error rates that we had before the validator update. A small number of validate errors is unavoidable.
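
To make the tolerance trade-off concrete, here is a minimal Python sketch. It is not the actual separation validator: the function name `results_agree`, the tolerance values, and the likelihood numbers are assumptions chosen only to show how a lenient or strict tolerance changes the outcome.

```python
import math

def results_agree(a, b, rel_tol=1e-10, abs_tol=1e-12):
    """Return True if two reported likelihoods match within tolerance.

    Hypothetical comparison; the real validator's criteria may differ.
    """
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

reference = -3.123456789012           # made-up likelihood from one host
wingman_ok = reference * (1 + 1e-13)  # tiny rounding-level difference
wingman_bad = reference * (1 + 1e-6)  # much larger discrepancy

print(results_agree(reference, wingman_ok))                 # True
print(results_agree(reference, wingman_bad))                # False
print(results_agree(reference, wingman_bad, rel_tol=1e-5))  # True
```

The last call shows the risk of being too lenient: with a loose enough tolerance, a result that is off by one part in a million is accepted as valid.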
ID: 70983 · Rating: 0
Cavalary
Joined: 23 Aug 11
Posts: 33
Credit: 11,062,253
RAC: 0
Message 70984 - Posted: 21 Jul 2021, 2:04:52 UTC - in response to Message 70983.  

With no changes to my computer and the validate errors starting around June 25, when I gather they started for others as well, it seems unlikely to be a problem with my machine, but the fact is that my RAC is down by some 40% from what it used to be, with validate errors accounting for it. Ah well, we'll see how it goes.
ID: 70984 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70985 - Posted: 21 Jul 2021, 3:12:51 UTC

The current separation runs perform division by very small numbers in order to calculate the likelihood. The new validator and the gap fix have tried to account for that, but if your machine handles division by very small numbers differently from other machines, then unfortunately the validate errors may be unavoidable. The good news is that, if that's the case, the errors should go away when we put up new runs, and your RAC should return to normal.
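
As a rough illustration of why small denominators are sensitive, the Python sketch below applies the same tiny absolute perturbation to denominators of different sizes. It is not the separation likelihood code: the function name `quotient_gap` and the perturbation of 1e-18, standing in for a rounding difference between two machines, are assumptions for the example.

```python
def quotient_gap(denominator, perturbation=1e-18):
    """Absolute and relative gap between 1/d and 1/(d + perturbation)."""
    a = 1.0 / denominator
    b = 1.0 / (denominator + perturbation)
    return abs(a - b), abs(a - b) / abs(a)

# The same perturbation matters more and more as the denominator shrinks.
for d in (1.0, 1e-6, 1e-15):
    abs_gap, rel_gap = quotient_gap(d)
    print(f"d={d:g}  abs gap={abs_gap:.3e}  rel gap={rel_gap:.3e}")
```

For a denominator near 1 the perturbation is lost entirely in double precision, while for a denominator near 1e-15 the two quotients differ by roughly 0.1%, which is the kind of gap that could push two otherwise healthy machines apart.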
ID: 70985 · Rating: 0
Spatzthecat

Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 0
Message 70986 - Posted: 21 Jul 2021, 9:37:10 UTC - in response to Message 70984.  

With no changes to my computer and the validate errors starting around June 25, when I gather they started for others as well, it seems unlikely to be a problem with my machine, but the fact is that my RAC is down by some 40% from what it used to be, with validate errors accounting for it. Ah well, we'll see how it goes.


The validate errors will account for some of your 40% reduction in RAC, but there is also a disparity in credit: the old units awarded 227 points and the new ones award 230. Considering that the new work units take longer than the 3 extra points make up for, there is a deficit on every work unit. This has resulted in a shortfall of about 250K per day in my situation, not counting the failed units.
ID: 70986 · Rating: 0
Nigel Conway

Joined: 11 Feb 14
Posts: 4
Credit: 321,527
RAC: 0
Message 70989 - Posted: 21 Jul 2021, 18:26:58 UTC

Work unit https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=134898225 has multiple "Completed, validation inconclusive" results.
ID: 70989 · Rating: 0

©2024 Astroinformatics Group