Separation Validator Updates/Brief Server Outage(s)
Author | Message |
---|---|
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hello Everyone, I will be updating the separation validator starting at 3PM ET. The server will go down for a short time and then come back up. If the new validator causes problems, the server will go down again so I can revert to the old validator. I will be monitoring the situation and would appreciate input on any workunits that fail validation after the new validator goes live. The server may go down/back up a few times during this process. Thanks for your patience. I'll keep you all posted as things happen. Best, Tom |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
The first server outage is finished, and the validator started successfully. Monitoring the validator now to see if it throws any errors or comes down unexpectedly. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I've already seen the new validator successfully validate some results. Fingers crossed things are working. |
Joined: 24 Jan 11 Posts: 716 Credit: 557,681,468 RAC: 30,597 |
All my validate errors today are from BEFORE the validator changeover. Keeping my fingers crossed the high invalid rate is cured. |
Joined: 1 Jul 08 Posts: 88 Credit: 25,079,058 RAC: 0 |
Greetings, My RAC is going back up and my Invalid number is decreasing. I believe you found the sweet spot Tom. :-) Have a great day! :) Siran CAPT Siran d'Vel'nahr XO - L L & P _\\// USS Vre'kasht NCC-33187 Winders 10 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Joined: 23 Aug 11 Posts: 35 Credit: 11,920,560 RAC: 20,402 |
Seeing 16 more validate errors since this was up, and 2 more inconclusives with wingman results... |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Unfortunately it won't completely remove validate errors entirely, just the ones that are due to computer precision problems involving the gap in the data. Other validate errors will still happen at their normal rate. Based on the server stats, it looked like this fix cut the validate error rate to about 1/10 of what it was before. |
Joined: 26 Apr 08 Posts: 87 Credit: 64,801,496 RAC: 0 |
Looks promising. My invalid count has gone way down and my RAC is climbing again. Thank you for your work on this. Plus SETI Classic = 21,082 WUs |
Joined: 4 Jul 09 Posts: 99 Credit: 17,506,391 RAC: 2,645 |
Similar results: my invalid rate has dropped significantly. I am one happy cruncher. Thank you Bill F In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic; There was no expiration date. |
Joined: 7 Mar 20 Posts: 22 Credit: 106,550,624 RAC: 11,746 |
Much better here as well. Thanks for the fix! |
Joined: 23 Aug 11 Posts: 35 Credit: 11,920,560 RAC: 20,402 |
"Unfortunately it won't completely remove validate errors entirely, just the ones that are due to computer precision problems involving the gap in the data. Other validate errors will still happen at their normal rate." Well, I had very few invalids before, and now the number is still significant, as my RAC graph shows. I also just saw a few WUs take an unusually long time, 75% more than normal in actual CPU time: https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=258544929 https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=257964475 https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=257740404 The last one was validated and received the regular number of credits, so it would seem the flops are considered standard. |
Joined: 1 Jul 08 Posts: 88 Credit: 25,079,058 RAC: 0 |
The only invalids I have now are 13 from January and February. All recent ones are gone. Woohoo! :-) |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hmm, I've seen other reports of some separation WUs taking a very long time to crunch on certain CPUs, but I wasn't able to find any common theme in them. It could have something to do with the type of CPU being used or some setting on the users' machines. If it continues to be a problem I can take a look at it, but it doesn't seem to affect every user, and it doesn't seem to affect a high fraction of WUs for the users who are reporting it. |
Joined: 23 Aug 11 Posts: 35 Credit: 11,920,560 RAC: 20,402 |
Here's another long one: https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=259505534 And speaking of CPUs, could it be that I'm still being hit by this, with many invalids and my RAC going down, because I run CPU only while most results come from GPUs? That is, if a WU is sent to me and to someone who runs it on a GPU, and there's a small but significant enough difference to require a third result, that third one is likely to come from a GPU as well, so the two GPU results are more likely to be close to each other and leave mine out. |
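The quorum mechanism speculated about in this post can be sketched as a simplified model. The assumptions are ours, not the project's: each result is reduced to a single hypothetical number, two results "agree" if they fall within a relative tolerance, the first agreeing pair wins, and any result outside that pair's tolerance is marked invalid. `Result`, `find_canonical`, the host names, the values, and `rel_tol` are all hypothetical; this is not BOINC's or MilkyWay@home's actual validator code.

```python
import math
from dataclasses import dataclass


@dataclass
class Result:
    host: str
    value: float  # hypothetical: one number summarizing the returned result


def find_canonical(results, rel_tol):
    """Simplified quorum check (an assumption, not BOINC's or MilkyWay@home's code).

    Returns (agreeing_pair, invalid_hosts). If no two results agree yet,
    returns (None, []), i.e. the workunit stays "validation inconclusive"
    and another copy would be sent out.
    """
    for i in range(len(results)):
        for j in range(i + 1, len(results)):
            a, b = results[i], results[j]
            if math.isclose(a.value, b.value, rel_tol=rel_tol):
                # Everything that disagrees with the agreeing pair is marked invalid.
                invalid = [r.host for r in results
                           if not math.isclose(r.value, a.value, rel_tol=rel_tol)]
                return (a.host, b.host), invalid
    return None, []


# Hypothetical values: one CPU result and two GPU results of the same workunit.
first_two = [Result("cpu_host", -3.1234567), Result("gpu_host_1", -3.1234571)]
print(find_canonical(first_two, rel_tol=1e-7))   # no agreement yet: inconclusive

all_three = first_two + [Result("gpu_host_2", -3.1234572)]
print(find_canonical(all_three, rel_tol=1e-7))   # GPU pair agrees; CPU result is out
```

In this model the outcome depends only on which results happen to sit close together, not on the device type itself; whether CPU and GPU results actually cluster this way on MilkyWay@home is not established in this thread.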
Joined: 2 Apr 11 Posts: 14 Credit: 4,527,461 RAC: 0 |
I have two validation errors showing at the moment. Both are a waste of a couple of hours of a healthy machine's CPU time. It seems to be macOS vs. Windows and Linux rather than CPU vs. GPU. NG |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
On average, validate errors should be ~1% of the server's total WU load. They can have several causes, such as problems with individual machines, bad drivers, mismatches between certain types of machines, bad spots in the likelihood surface being optimized, and bugs in the program. This is a challenge with running the server: you want the validation tolerance to be lenient enough that small numerical errors don't cause a good WU to be thrown out, but stringent enough that faulty WUs don't get through. By that logic, it would actually be bad if there were zero validate errors on the server, since it would probably mean we are letting too many shady WUs through. Apologies if this means you end up wasting time crunching WUs that later fail validation. If a substantial fraction of the WUs on your machine have validate errors, that may point to a problem with your machine rather than with MilkyWay@home, since the global server validate error rate is only ~1.5%. It is common for Linux, Windows, and MacOS machines to mismatch on validation, since machines running different OSs have different drivers, which will probably handle numerical precision differently. I just want to be clear that this fix doesn't mean validate errors will go away entirely; it just means people shouldn't be seeing the 20% error rates we had before the validator update. A small number of validate errors is unavoidable. |
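The lenient-versus-stringent trade-off described above can be illustrated with a minimal sketch, assuming validation boils down to comparing two returned numbers with a relative tolerance. The values, the `agree` helper, and the three tolerances are hypothetical; the actual separation validator's tolerance is not given in this thread.

```python
import math

# Hypothetical likelihood values; not real MilkyWay@home outputs.
good_a = -3.123456789      # host A
good_b = -3.123456790      # host B: same answer, last-digit numerical noise
faulty = -3.1235           # host C: a genuinely wrong result


def agree(x, y, rel_tol):
    # Relative tolerance: |x - y| <= rel_tol * max(|x|, |y|)
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=0.0)


for rel_tol in (1e-12, 1e-8, 1e-4):
    print(f"rel_tol={rel_tol:g}: "
          f"noisy pair agrees={agree(good_a, good_b, rel_tol)}, "
          f"faulty passes={agree(good_a, faulty, rel_tol)}")
```

With these made-up numbers, the strictest tolerance throws out the good but slightly noisy pair, the loosest lets the faulty result through, and only the middle value does both jobs; a real validator faces the same tension with noisier data.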
Joined: 23 Aug 11 Posts: 35 Credit: 11,920,560 RAC: 20,402 |
With no changes to my computer and the validate errors starting around June 25, when I gather they started for others as well, it seems unlikely to be a problem with my machine, but fact is that my RAC is down by some 40% from what it used to be, with validate errors accounting for it. Ah well, we'll see how it goes. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
The current separation runs divide by very small numbers to calculate the likelihood. The new validator and the gap fix try to account for that, but if your machine handles division by very small numbers differently from other machines, the validate errors may unfortunately be unavoidable. The good news is that, if that's the case, the problem should go away when we put up new runs, and your RAC should return to normal. |
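To illustrate why very small denominators are fragile, here is a generic floating-point example; it is not the separation likelihood code, and all numbers are made up. When a small denominator comes from subtracting two nearly equal values, a one-bit rounding difference between hosts (plausible with different compilers, math libraries, or instruction sets, which is an assumption here) gets magnified by roughly the ratio of the operands to the denominator.

```python
# Two hosts compute the same intermediate y, but one ends up a single unit
# in the last place higher.
x = 1.0000002
y_a = 1.0000001
y_b = y_a + 2.0**-52        # next double above y_a (spacing of doubles in [1, 2))

den_a = x - y_a             # ~1e-7: a very small denominator
den_b = x - y_b

q_a = 1.0 / den_a
q_b = 1.0 / den_b

rel_in = abs(y_b - y_a) / abs(y_a)     # relative difference of the inputs
rel_out = abs(q_b - q_a) / abs(q_a)    # relative difference after the division

print(f"input relative difference:  {rel_in:.1e}")
print(f"output relative difference: {rel_out:.1e}")
print(f"amplification:              {rel_out / rel_in:.1e}")
```

Here a difference of about 2e-16 in the inputs grows to about 2e-9 in the quotient, roughly a factor of 1e7, which is the kind of gap a validator tolerance then has to absorb.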
Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
"With no changes to my computer and the validate errors starting around June 25, when I gather they started for others as well, it seems unlikely to be a problem with my machine, but fact is that my RAC is down by some 40% from what it used to be, with validate errors accounting for it. Ah well, we'll see how it goes." The validate errors will account for some of your 40% reduction in RAC, but there is also a disparity: the old unit awarded 227 points and the new one awards 230, and since the new work unit takes longer, the 3 extra points leave a deficit on every work unit. In my case this has resulted in a shortfall of about 250K per day, not counting the failed units. |
Joined: 11 Feb 14 Posts: 4 Credit: 321,527 RAC: 0 |
Workunit https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=134898225 has multiple results marked "Completed, validation inconclusive". |
©2025 Astroinformatics Group