Welcome to MilkyWay@home

Validator Outage

Message boards : News : Validator Outage
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71137 - Posted: 21 Sep 2021, 16:13:00 UTC

Hey Everyone,

MilkyWay@home is currently experiencing an outage for the Separation Validator. I brought it back up once, and then it crashed again. I am trying to bring it back up. In the meantime, connections to the download/upload servers may stop and start intermittently as I work on the server.

Thanks for your patience, I will keep you updated as things change.

Tom
ID: 71137 · Rating: 0
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71138 - Posted: 21 Sep 2021, 16:54:15 UTC

I've gotten to the root of the problem. One of the WUs with 7 sub-tasks (see https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4762) entered the validator, which crashed because the list of parameters is too long to fit in a MySQL string in the database. Every time I bring the validator up, it does come up, and then it crashes because it runs into the same problematic WU that is waiting to be validated.

I have no idea why WUs with bundles of 7 tasks are even getting sent out, but in the meantime I may be able to clear out this particular WU and get the validator running again. If there's another 7-subtask WU down the line (very possible based on conversations I've had), I expect that the validator will crash again. I think I can patch a fix into the validator to avoid these overly-long parameter WUs, but I can't do that until I get back to NY, which means that the separation validator may be down until at least tomorrow.

In the meantime, if you are stuck waiting for the validator, you can always crunch N-Body WUs or temporarily switch to another project. We appreciate your volunteer time very much, and thank you for letting me know that there was a problem with the validator to begin with.
ID: 71138 · Rating: 0
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71141 - Posted: 21 Sep 2021, 19:56:48 UTC

I cancelled the 35k jobs that had 7 bundled tasks, but the validator is still stuck on the same WU. If you run into a little lost credit (should be a fraction of a percent) it's probably because I killed those jobs.
ID: 71141 · Rating: 0
Profile Temporelo
Joined: 2 Sep 09
Posts: 4
Credit: 15,268,335
RAC: 0
Message 71142 - Posted: 22 Sep 2021, 0:00:25 UTC - in response to Message 71137.  

Thank you for your dedication, it is appreciated!
ID: 71142 · Rating: 0
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71143 - Posted: 22 Sep 2021, 19:30:12 UTC
Last modified: 22 Sep 2021, 19:48:36 UTC

I think that I've fixed the validator outage. The validator will now just skip over WUs with parameters that are too large to fit in the database. Unfortunately, it will have to mark them as invalid, so this may lower RAC by ~1%. The next step is figuring out why those over-bundled WUs happen in the first place and stopping that.

Thanks for your patience!

EDIT: The validator is returning expected results for Separation as of 19:33 UTC today. The validator has a backlog of a couple million WUs to crunch through, so you will probably see a spike in credit over a little while as those are validated.
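
The skip behaviour described in this post can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual validator code: `MAX_PARAM_CHARS`, `WorkUnit` and `triage` are hypothetical names, and the real size limit depends on the column type in the project's database schema.

```python
from dataclasses import dataclass

# Hypothetical limit; the real value depends on the database column type.
MAX_PARAM_CHARS = 2048


@dataclass
class WorkUnit:
    wu_id: int
    params: str  # parameter list as it would be stored in the database


def triage(work_units, limit=MAX_PARAM_CHARS):
    """Split work units into those safe to validate and those to mark invalid.

    Oversized parameter strings are set aside before any database write,
    so a single bad work unit cannot crash the whole validation loop.
    """
    to_validate, to_invalidate = [], []
    for wu in work_units:
        if len(wu.params) > limit:
            to_invalidate.append(wu)
        else:
            to_validate.append(wu)
    return to_validate, to_invalidate


if __name__ == "__main__":
    # 4 x 26 and 7 x 26 parameters, each written as a 4-character token.
    batch = [WorkUnit(1, "0.5 " * 104), WorkUnit(2, "0.5 " * 182)]
    ok, skipped = triage(batch, limit=500)
    print([wu.wu_id for wu in ok], [wu.wu_id for wu in skipped])
```

Checking the length up front trades a small amount of lost credit for a validator that keeps running, which matches the trade-off described above.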
ID: 71143 · Rating: 0
bluestang

Joined: 13 Oct 16
Posts: 112
Credit: 1,174,293,644
RAC: 0
Message 71144 - Posted: 22 Sep 2021, 20:16:34 UTC

Thanks for the updates and quick solution!
ID: 71144 · Rating: 0
Max_Pirx

Joined: 13 Dec 17
Posts: 46
Credit: 2,421,362,376
RAC: 0
Message 71145 - Posted: 22 Sep 2021, 20:23:39 UTC

Cheers for working out a solution so quickly. Nice work!
ID: 71145 · Rating: 0
muca

Joined: 21 May 20
Posts: 3
Credit: 22,194,487
RAC: 43
Message 71147 - Posted: 24 Sep 2021, 7:48:13 UTC

One difference I have found: all invalid WUs have
<number_WUs> 7 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>

but validated WUs have
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>

Maybe that's the cause?
ID: 71147 · Rating: 0
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71148 - Posted: 24 Sep 2021, 14:42:22 UTC

Yeah, that's how I identified which jobs to kill. The validator does it a little differently, but it's always the jobs with no. WUs >= 7 that will invalidate.
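
A minimal sketch of that kind of check, assuming the task output contains the `<number_WUs>` tag exactly as quoted earlier in this thread. The helper names `bundle_size` and `is_over_bundled` are hypothetical, and the validator itself is said to do this differently.

```python
import re

# Tag name taken from the task output quoted in this thread.
NUMBER_WUS = re.compile(r"<number_WUs>\s*(\d+)\s*</number_WUs>")

# Threshold from the thread: bundles of 7 or more sub-tasks get invalidated.
MAX_BUNDLE = 7


def bundle_size(task_output):
    """Return the bundle size recorded in a task's output, or None if absent."""
    match = NUMBER_WUS.search(task_output)
    return int(match.group(1)) if match else None


def is_over_bundled(task_output, threshold=MAX_BUNDLE):
    """True when the recorded bundle size is at or above the threshold."""
    size = bundle_size(task_output)
    return size is not None and size >= threshold


if __name__ == "__main__":
    sample = (
        "<number_WUs> 7 </number_WUs>\n"
        "<number_params_per_WU> 26 </number_params_per_WU>"
    )
    print(bundle_size(sample), is_over_bundled(sample))
```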
ID: 71148 · Rating: 0
Brent

Joined: 16 Mar 10
Posts: 12
Credit: 22,284,745
RAC: 0
Message 71149 - Posted: 24 Sep 2021, 16:31:02 UTC

I am getting a lot of Invalid tasks (see below) with significant loss of computer time:

Task | Work unit | Computer | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
341345645 | 188500118 | 705499 | 23 Sep 2021, 20:01:50 UTC | 24 Sep 2021, 16:02:57 UTC | Validate error | 7,469.61 | 7,421.86 | --- | Milkyway@home Separation v1.46 windows_x86_64
340954782 | 188147268 | 705499 | 23 Sep 2021, 12:48:50 UTC | 24 Sep 2021, 8:51:49 UTC | Validate error | 7,371.70 | 7,371.70 | --- | Milkyway@home Separation v1.46 windows_x86_64
340492122 | 189425464 | 705499 | 23 Sep 2021, 3:35:34 UTC | 23 Sep 2021, 23:53:35 UTC | Validate error | 7,478.13 | 7,459.58 | --- | Milkyway@home Separation v1.46 windows_x86_64
339950653 | 183681961 | 705499 | 22 Sep 2021, 16:35:36 UTC | 23 Sep 2021, 12:38:40 UTC | Validate error | 7,390.94 | 7,390.94 | --- | Milkyway@home Separation v1.46 windows_x86_64
339881594 | 190781418 | 705499 | 22 Sep 2021, 15:19:27 UTC | 23 Sep 2021, 11:30:45 UTC | Validate error | 7,372.12 | 7,367.42 | --- | Milkyway@home Separation v1.46 windows_x86_64
339786777 | 190693366 | 705499 | 22 Sep 2021, 13:30:24 UTC | 23 Sep 2021, 9:25:52 UTC | Validate error | 7,366.25 | 7,366.25 | --- | Milkyway@home Separation v1.46 windows_x86_64
339227791 | 190181707 | 705499 | 22 Sep 2021, 3:02:01 UTC | 22 Sep 2021, 22:57:53 UTC | Validate error | 7,684.13 | 7,490.67 | --- | Milkyway@home Separation v1.46 windows_x86_64
338697271 | 189713157 | 705499 | 21 Sep 2021, 16:59:41 UTC | 22 Sep 2021, 12:42:17 UTC | Validate error | 7,405.41 | 7,391.31 | --- | Milkyway@home Separation v1.46 windows_x86_64
337266954 | 188484311 | 705499 | 20 Sep 2021, 15:13:35 UTC | 21 Sep 2021, 11:33:58 UTC | Validate error | 7,380.12 | 7,378.81 | --- | Milkyway@home Separation v1.46 windows_x86_64
337121921 | 188350085 | 705499 | 20 Sep 2021, 12:37:19 UTC | 21 Sep 2021, 8:38:04 UTC | Validate error | 7,356.50 | 7,346.25 | --- | Milkyway@home Separation v1.46 windows_x86_64
337108544 | 188337337 | 705499 | 20 Sep 2021, 12:27:06 UTC | 21 Sep 2021, 8:06:11 UTC | Validate error | 7,347.52 | 7,347.52 | --- | Milkyway@home Separation v1.46 windows_x86_64
336975085 | 188213534 | 705499 | 20 Sep 2021, 10:01:26 UTC | 21 Sep 2021, 5:51:30 UTC | Validate error | 7,498.39 | 7,462.39 | --- | Milkyway@home Separation v1.46 windows_x86_64
ID: 71149 · Rating: 0
Profile Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71150 - Posted: 24 Sep 2021, 20:34:54 UTC

According to the server, these validation errors are accounting for ~5% of all jobs. That's a lot higher than I'd like. Next week I plan on spending some time digging through the WU generator code to try and fix this issue.
ID: 71150 · Rating: 0
Brent

Joined: 16 Mar 10
Posts: 12
Credit: 22,284,745
RAC: 0
Message 71152 - Posted: 25 Sep 2021, 2:09:26 UTC - in response to Message 71150.  

Thank you, Tom, for looking into this issue.
ID: 71152 · Rating: 0
Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 71153 - Posted: 25 Sep 2021, 6:23:24 UTC - in response to Message 71149.  
Last modified: 25 Sep 2021, 6:25:11 UTC

I am also getting a lot of invalid jobs. Although I have a much lower RAC, I seem to be at about 10% invalid.

The one common thread I notice is that they are all single-CPU tasks; those using multiple CPUs (4 in my case) seem unaffected.
ID: 71153 · Rating: 0
Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 71159 - Posted: 27 Sep 2021, 9:51:56 UTC - in response to Message 71153.  
Last modified: 27 Sep 2021, 10:23:11 UTC

Looks to me like all WUs processed on my 32-bit machine are getting rejected as invalid. I am running 3 as a test. One just got validated, so seemingly it's not that. One just came back invalid, so a 50% failure rate so far; it will be interesting to see the last one of the three.
ID: 71159 · Rating: 0
Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 71161 - Posted: 27 Sep 2021, 15:13:49 UTC - in response to Message 71159.  

The last one is also invalid, which gives a 66% failure rate. I won't be doing any more; it wastes far too much computer time.
ID: 71161 · Rating: 0
starman

Joined: 9 Mar 09
Posts: 2
Credit: 10,137,164
RAC: 0
Message 71168 - Posted: 28 Sep 2021, 21:00:10 UTC - in response to Message 71137.  

Is this the reason I am getting computation errors when running MilkyWay, or do I have other issues?
ID: 71168 · Rating: 0
Profile Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,008,816
RAC: 86,801
Message 71169 - Posted: 29 Sep 2021, 1:12:37 UTC - in response to Message 71168.  

I'd say you have other issues. Weird error messages in your failed tasks. As if you were moving tasks around in the BOINC directory while they were being crunched.

Not what this thread is about.
ID: 71169 · Rating: 0
Frank

Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71195 - Posted: 30 Sep 2021, 21:45:35 UTC - in response to Message 71169.  

Keith Myers posted (message 71189) to this thread and related that the heart of the Milkyway app may be somewhat mushy and could cause both validation and "errors while computing" problems. In response to his posting he received "Not what this thread is about".
We need to pay attention to what Keith says. He's only been doing this for 25 years and may be the strongest expert we have. The connection between the validation problems, the errors, and the poor coding will probably become evident when the problems are fixed.
I am not really happy about either of the problems: I encounter about 200 Invalids every day along with about 30 Computer Errors.
If the 7 w/u tasks are messing up validation, how is it that we are getting more 7 w/u tasks all the time? Something must be happening between the creation of the task and the start of its execution. Could Internet noise be causing the corruption of the task?
The "errors while computing" show that the executing computer has tried to run an unknown command. It seems that the "unknown command" is not due to an app problem; otherwise, all tasks would error after experiencing the first one. Did you know that Internet noise causes most transfer errors? It drives me nuts on my computers, cell phones and smart TVs (some folks say that is more like a putt than a drive).
I really want these problems to be fixed, soon, so lots of luck.
ID: 71195 · Rating: 0
Profile Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,008,816
RAC: 86,801
Message 71196 - Posted: 1 Oct 2021, 0:37:10 UTC

Almost all the inconclusive validation errors are due to small differences in the computed results on different platforms, caused by differences in how the CPU or GPU handles the small FFT calculations and rounding.

You are most likely to validate against another host using the same platform you compute the task on. AMD against AMD. Nvidia against Nvidia. Android against Android. Intel against Intel etc. etc.

The devs could relax the validator mechanism, but if they go too far, they let results slip past their validation threshold and the science falls down. They have stated that up to 10% is acceptable and of course would like it to be more on the order of 1%.
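
The trade-off between a strict and a relaxed validator can be illustrated with a tolerance-based comparison. This is a generic sketch, not the project's validator: the function name `results_agree` and the tolerance values are assumptions chosen for illustration only.

```python
import math


def results_agree(a, b, rel_tol=1e-9):
    """Return True if two result vectors match within a relative tolerance."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))


if __name__ == "__main__":
    # Two hosts report values that differ only in the last few digits.
    host_a = [-3.0123456789012, 0.25]
    host_b = [-3.0123456789100, 0.25]
    print(results_agree(host_a, host_b, rel_tol=1e-9))   # relaxed threshold
    print(results_agree(host_a, host_b, rel_tol=1e-13))  # strict threshold
```

A looser `rel_tol` lets more cross-platform pairs validate, at the cost of accepting results that differ by more; a tighter one does the opposite.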

The parameters that are passed to the application are simple text values and they really should implement the parameter set as binary so they could do a CRC check against the file to guarantee the parameter set isn't being corrupted during internet transmission.
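
A minimal sketch of what such a scheme could look like, assuming the parameters are plain floating-point values. The blob layout (little-endian doubles followed by a CRC-32) and the names `pack_params` and `unpack_params` are hypothetical and are not the project's file format.

```python
import struct
import zlib


def pack_params(params):
    """Pack floats as little-endian doubles followed by a CRC-32 of the payload."""
    payload = struct.pack(f"<{len(params)}d", *params)
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_params(blob):
    """Verify the trailing CRC-32 and return the floats; raise on corruption."""
    if len(blob) < 4 or (len(blob) - 4) % 8 != 0:
        raise ValueError("malformed parameter blob")
    payload = blob[:-4]
    (expected,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("parameter set failed CRC check")
    return list(struct.unpack(f"<{len(payload) // 8}d", payload))


if __name__ == "__main__":
    blob = pack_params([0.5, -1.25, 3.0])
    print(unpack_params(blob))
    corrupted = bytes([blob[0] ^ 0x01]) + blob[1:]  # flip one bit
    try:
        unpack_params(corrupted)
    except ValueError as err:
        print("rejected:", err)
```

A CRC could equally be computed over a text parameter file; the binary layout here simply follows the suggestion above.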

Just had to laugh at your Internet Noise complaint as I am currently listening to the blaring digital noise soup on the shortwave bands from all our modern digital devices leaking everything from DC to light out into the electromagnetic spectrum. I can remember listening to the shortwaves back in the 70's and it was quiet as a cave back before all the modern electronics entered our lives. Every power supply was linear and quiet as can be except maybe for some slight 60Hz hum. Now all our power supplies are switched mode and splatter crap everywhere in the bands.

It is a wonder that any radio telescope can hear anything over our own din.
ID: 71196 · Rating: 0
Jim1348

Joined: 9 Jul 17
Posts: 100
Credit: 16,967,906
RAC: 0
Message 71209 - Posted: 5 Oct 2021, 15:15:21 UTC - in response to Message 71196.  

I can remember listening to the shortwaves back in the 70's and it was quiet as a cave back before all the modern electronics entered our lives. Every power supply was linear and quiet as can be except maybe for some slight 60Hz hum. Now all our power supplies are switched mode and splatter crap everywhere in the bands.
But Radio Moscow was in the middle of 40 meters. I wouldn't recognize the place now.
ID: 71209 · Rating: 0

©2024 Astroinformatics Group