Welcome to MilkyWay@home

Validator Outage


Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 61,362,156
RAC: 148,160
50 million credit badge · 2 year member badge
Message 71137 - Posted: 21 Sep 2021, 16:13:00 UTC

Hey Everyone,

MilkyWay@home is currently experiencing an outage for the Separation Validator. I brought it back up once, and then it crashed again. I am trying to bring it back up. In the meantime, connections to the download/upload servers may stop and start intermittently as I work on the server.

Thanks for your patience, I will keep you updated as things change.

Tom

Tom Donlon
Message 71138 - Posted: 21 Sep 2021, 16:54:15 UTC

I've gotten to the root of the problem. One of the WUs with 7 sub-tasks (see https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4762) entered the validator, which crashed because the list of parameters is too long to fit in a MySQL string in the database. Every time I try to bring the validator up, it does actually come up, and then crashes because it runs into the same problematic WU that is trying to get validated.

I have no idea why WUs with bundles of 7 tasks are even getting sent out, but in the meantime I may be able to clear out this particular WU and get the validator running again. If there's another 7-subtask WU down the line (very possible based on conversations I've had), I expect that the validator will crash again. I think I can patch a fix into the validator to avoid these overly-long parameter WUs, but I can't do that until I get back to NY, which means that the separation validator may be down until at least tomorrow.

In the meantime, if you are stuck waiting for the validator, you can always crunch N-Body WUs or temporarily switch to another project. We appreciate your volunteer time very much, and thank you for letting me know that there was a problem with the validator to begin with.

Tom Donlon
Message 71141 - Posted: 21 Sep 2021, 19:56:48 UTC

I cancelled the 35k jobs that had 7 bundled tasks, but the validator is still stuck on the same WU. If you run into a little lost credit (should be a fraction of a percent) it's probably because I killed those jobs.

Temporelo
Joined: 2 Sep 09
Posts: 4
Credit: 14,131,954
RAC: 40,528
10 million credit badge · 12 year member badge
Message 71142 - Posted: 22 Sep 2021, 0:00:25 UTC - in response to Message 71137.  

Thank you for your dedication, it is appreciated!

Tom Donlon
Message 71143 - Posted: 22 Sep 2021, 19:30:12 UTC
Last modified: 22 Sep 2021, 19:48:36 UTC

I think that I've fixed the validator outage. The validator will now just skip over WUs with parameters that are too large to fit in the database. Unfortunately, it will have to mark them as invalid, so this may lower RAC by ~1%. The next step is figuring out why those over-bundled WUs happen in the first place and stopping that.

Thanks for your patience!

EDIT: The validator is returning expected results for Separation as of 19:33 UTC today. The validator has a backlog of a couple million WUs to crunch through, so you will probably see a spike in credit over a little while as those are validated.
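
For anyone curious what "skip over WUs with parameters that are too large" might look like, here is a minimal Python sketch. It is not the actual validator code: the function names, the comma-separated serialization, and the 500-character limit are all assumptions made up for illustration.

```python
# Hypothetical limit on the stored parameter string; the real database
# limit is not stated in this thread.
MAX_PARAM_CHARS = 500


def serialize_params(params):
    """Join parameters into one comma-separated text string (assumed format)."""
    return ",".join(repr(p) for p in params)


def partition_workunits(workunits, max_chars=MAX_PARAM_CHARS):
    """Split (wu_id, params) pairs into storable entries and skipped IDs.

    A work unit is skipped (and would be marked invalid) when its
    serialized parameter string is longer than max_chars, instead of
    letting the oversized string reach the database.
    """
    storable, skipped = [], []
    for wu_id, params in workunits:
        text = serialize_params(params)
        if len(text) > max_chars:
            skipped.append(wu_id)
        else:
            storable.append((wu_id, text))
    return storable, skipped
```

With 26 parameters per sub-task, a 4-bundle carries 104 values and a 7-bundle carries 182, so a length check like this lets the first through and skips the second rather than crashing on it.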

bluestang
Joined: 13 Oct 16
Posts: 106
Credit: 1,051,731,537
RAC: 106,223
1 billion credit badge · 5 year member badge
Message 71144 - Posted: 22 Sep 2021, 20:16:34 UTC

Thanks for the updates and quick solution!

Max_Pirx
Joined: 13 Dec 17
Posts: 17
Credit: 1,248,926,912
RAC: 1,536,366
1 billion credit badge · 3 year member badge
Message 71145 - Posted: 22 Sep 2021, 20:23:39 UTC

Cheers for working out a solution so quickly. Nice work!

muca
Joined: 21 May 20
Posts: 2
Credit: 10,035,056
RAC: 27,192
10 million credit badge · 1 year member badge
Message 71147 - Posted: 24 Sep 2021, 7:48:13 UTC

One difference I have found: all invalid WUs have
<number_WUs> 7 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>

but validated WUs have
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>

Maybe that's the cause?

Tom Donlon
Message 71148 - Posted: 24 Sep 2021, 14:42:22 UTC

Yeah, that's how I identified which jobs to kill. The validator does it a little differently, but it's always the jobs with number_WUs >= 7 that get invalidated.
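
As a rough illustration of that kind of check, the Python sketch below pulls the number_WUs value out of task text like the snippet quoted above and flags bundles of 7 or more. The helper names are made up for this example, and it is not the script actually used on the server.

```python
import re

# Matches the <number_WUs> tag as it appears in the task output quoted above.
TAG_RE = re.compile(r"<number_WUs>\s*(\d+)\s*</number_WUs>")

# Threshold taken from this thread: bundles of 7 or more get invalidated.
BUNDLE_LIMIT = 7


def bundle_size(text):
    """Return the number_WUs value found in text, or None if the tag is absent."""
    match = TAG_RE.search(text)
    return int(match.group(1)) if match else None


def is_overbundled(text, limit=BUNDLE_LIMIT):
    """True when the task text declares at least `limit` bundled WUs."""
    size = bundle_size(text)
    return size is not None and size >= limit
```

Running is_overbundled on the stderr text of a task is one way to tell in advance whether it falls into the group that will not validate.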

Brent
Joined: 16 Mar 10
Posts: 12
Credit: 4,071,257
RAC: 26,865
3 million credit badge · 11 year member badge
Message 71149 - Posted: 24 Sep 2021, 16:31:02 UTC

I am getting a lot of Invalid tasks (see below) with significant loss of computer time:

Task | Work unit | Computer | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
341345645 | 188500118 | 705499 | 23 Sep 2021, 20:01:50 UTC | 24 Sep 2021, 16:02:57 UTC | Validate error | 7,469.61 | 7,421.86 | --- | Milkyway@home Separation v1.46 windows_x86_64
340954782 | 188147268 | 705499 | 23 Sep 2021, 12:48:50 UTC | 24 Sep 2021, 8:51:49 UTC | Validate error | 7,371.70 | 7,371.70 | --- | Milkyway@home Separation v1.46 windows_x86_64
340492122 | 189425464 | 705499 | 23 Sep 2021, 3:35:34 UTC | 23 Sep 2021, 23:53:35 UTC | Validate error | 7,478.13 | 7,459.58 | --- | Milkyway@home Separation v1.46 windows_x86_64
339950653 | 183681961 | 705499 | 22 Sep 2021, 16:35:36 UTC | 23 Sep 2021, 12:38:40 UTC | Validate error | 7,390.94 | 7,390.94 | --- | Milkyway@home Separation v1.46 windows_x86_64
339881594 | 190781418 | 705499 | 22 Sep 2021, 15:19:27 UTC | 23 Sep 2021, 11:30:45 UTC | Validate error | 7,372.12 | 7,367.42 | --- | Milkyway@home Separation v1.46 windows_x86_64
339786777 | 190693366 | 705499 | 22 Sep 2021, 13:30:24 UTC | 23 Sep 2021, 9:25:52 UTC | Validate error | 7,366.25 | 7,366.25 | --- | Milkyway@home Separation v1.46 windows_x86_64
339227791 | 190181707 | 705499 | 22 Sep 2021, 3:02:01 UTC | 22 Sep 2021, 22:57:53 UTC | Validate error | 7,684.13 | 7,490.67 | --- | Milkyway@home Separation v1.46 windows_x86_64
338697271 | 189713157 | 705499 | 21 Sep 2021, 16:59:41 UTC | 22 Sep 2021, 12:42:17 UTC | Validate error | 7,405.41 | 7,391.31 | --- | Milkyway@home Separation v1.46 windows_x86_64
337266954 | 188484311 | 705499 | 20 Sep 2021, 15:13:35 UTC | 21 Sep 2021, 11:33:58 UTC | Validate error | 7,380.12 | 7,378.81 | --- | Milkyway@home Separation v1.46 windows_x86_64
337121921 | 188350085 | 705499 | 20 Sep 2021, 12:37:19 UTC | 21 Sep 2021, 8:38:04 UTC | Validate error | 7,356.50 | 7,346.25 | --- | Milkyway@home Separation v1.46 windows_x86_64
337108544 | 188337337 | 705499 | 20 Sep 2021, 12:27:06 UTC | 21 Sep 2021, 8:06:11 UTC | Validate error | 7,347.52 | 7,347.52 | --- | Milkyway@home Separation v1.46 windows_x86_64
336975085 | 188213534 | 705499 | 20 Sep 2021, 10:01:26 UTC | 21 Sep 2021, 5:51:30 UTC | Validate error | 7,498.39 | 7,462.39 | --- | Milkyway@home Separation v1.46 windows_x86_64

Tom Donlon
Message 71150 - Posted: 24 Sep 2021, 20:34:54 UTC

According to the server, these validation errors are accounting for ~5% of all jobs. That's a lot higher than I'd like. Next week I plan on spending some time digging through the WU generator code to try and fix this issue.

Brent
Message 71152 - Posted: 25 Sep 2021, 2:09:26 UTC - in response to Message 71150.  

Thank you, Tom, for looking into this issue.

Septimus
Joined: 8 Nov 11
Posts: 12
Credit: 172,115
RAC: 6,872
100 thousand credit badge · 9 year member badge
Message 71153 - Posted: 25 Sep 2021, 6:23:24 UTC - in response to Message 71149.  
Last modified: 25 Sep 2021, 6:25:11 UTC

I am also getting a lot of invalid jobs. Although I have a much lower RAC, about 10% of my jobs seem to be invalid.

The one common thread I notice is that they are all single-CPU tasks; those using multiple CPUs (4 in my case) seem unaffected.

Septimus
Message 71159 - Posted: 27 Sep 2021, 9:51:56 UTC - in response to Message 71153.  
Last modified: 27 Sep 2021, 10:23:11 UTC

Looks to me like all WUs processed on my 32-bit machine are getting rejected as invalid. I am running 3 as a test. One just got validated, so seemingly it's not that. One was just marked invalid, so that's a 50% failure rate so far. It will be interesting to see the last one of the three.

Septimus
Message 71161 - Posted: 27 Sep 2021, 15:13:49 UTC - in response to Message 71159.  

The last one is also invalid, which gives a 66% failure rate. I won't be doing any more; it wastes far too much computer time.

starman
Joined: 9 Mar 09
Posts: 2
Credit: 7,977,985
RAC: 292
5 million credit badge · 12 year member badge
Message 71168 - Posted: 28 Sep 2021, 21:00:10 UTC - in response to Message 71137.  

Is this the reason I am getting computation errors when running MilkyWay, or do I have other issues?

Keith Myers
Joined: 24 Jan 11
Posts: 474
Credit: 356,631,800
RAC: 264,072
300 million credit badge · 10 year member badge · extraordinary contributions badge
Message 71169 - Posted: 29 Sep 2021, 1:12:37 UTC - in response to Message 71168.  

I'd say you have other issues. Weird error messages in your failed tasks. As if you were moving tasks around in the BOINC directory while they were being crunched.

Not what this thread is about.

Frank
Joined: 2 Nov 10
Posts: 13
Credit: 934,132,155
RAC: 2,490,737
500 million credit badge · 10 year member badge
Message 71195 - Posted: 30 Sep 2021, 21:45:35 UTC - in response to Message 71169.  

Keith Myers posted (message 71189) to this thread and related that the heart of the MilkyWay app may be somewhat mushy and could cause both validation and "errors while computing" problems. In response to his posting he received "Not what this thread is about".
We need to pay attention to what Keith says. He's only been doing this for 25 years and may be the strongest expert we have. The connection between the validation and the errors and the poor coding will probably become evident when the problems are fixed.
I am not really happy about either of the problems. I encounter about 200 Invalids every day along with about 30 Computer Errors.
If the 7-WU tasks are messing up validation, how is it that we are getting more 7-WU tasks all the time? Something must be happening between the creation of the task and the start of its execution. Could Internet noise be causing the corruption of the task?
The "errors while computing" indicate that the executing computer has tried to run an unknown command. It seems that "unknown command" is not due to an app problem; otherwise, all tasks would error after experiencing the first one. Did you know that Internet noise causes most transfer errors? It drives me nuts on my computers, cell phones and smart TVs (some folks say that is more like a putt than a drive).
I really want these problems to be fixed, soon, so lots of luck.

Keith Myers
Message 71196 - Posted: 1 Oct 2021, 0:37:10 UTC

Almost all the inconclusive validation errors are due to small differences in the computed results on different platforms, caused by differences in how the CPU or GPU handles the small FFT calculations and rounding.

You are most likely to validate against another host using the same platform you compute the task on. AMD against AMD. Nvidia against Nvidia. Android against Android. Intel against Intel etc. etc.

The devs could relax the validator mechanism, but if they go too far, they let results slip past their validation threshold and the science falls down. They have stated that up to 10% is acceptable and of course would like it to be more on the order of 1%.
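
To make the tolerance trade-off concrete, here is a minimal Python sketch of a fuzzy comparison between two hosts' results. It is only an illustration: the function name and the default tolerance are assumptions, not the project's actual validator settings.

```python
import math


def results_agree(a, b, rel_tol=1e-9):
    """Return True when two result lists match element by element.

    rel_tol is a hypothetical relative tolerance. Raising it accepts more
    cross-platform rounding differences, but also lets genuinely wrong
    results through.
    """
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))
```

With the default tolerance, results that differ only in the last few bits agree, while a difference in the third decimal place does not unless rel_tol is loosened to something like 1e-2.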

The parameters that are passed to the application are simple text values and they really should implement the parameter set as binary so they could do a CRC check against the file to guarantee the parameter set isn't being corrupted during internet transmission.
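
A minimal Python sketch of that kind of check, using the standard library's zlib.crc32, might look like this. The framing (payload followed by a 4-byte big-endian checksum) and the function names are assumptions for illustration, not anything the project uses today.

```python
import zlib


def pack_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (assumed framing)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def unpack_checked(blob: bytes) -> bytes:
    """Return the payload if its CRC-32 matches, otherwise raise ValueError."""
    if len(blob) < 4:
        raise ValueError("blob too short to contain a checksum")
    payload, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: parameter set was corrupted")
    return payload
```

The checksum is computed over bytes, so the same check would also work on the existing text parameters once they are encoded, for example with str.encode().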

Just had to laugh at your Internet Noise complaint as I am currently listening to the blaring digital noise soup on the shortwave bands from all our modern digital devices leaking everything from DC to light out into the electromagnetic spectrum. I can remember listening to the shortwaves back in the 70's and it was quiet as a cave back before all the modern electronics entered our lives. Every power supply was linear and quiet as can be except maybe for some slight 60Hz hum. Now all our power supplies are switched mode and splatter crap everywhere in the bands.

It is a wonder that any radio telescope can hear anything over our own din.

Jim1348
Joined: 9 Jul 17
Posts: 89
Credit: 14,157,468
RAC: 804
10 million credit badge · 4 year member badge
Message 71209 - Posted: 5 Oct 2021, 15:15:21 UTC - in response to Message 71196.  

I can remember listening to the shortwaves back in the 70's and it was quiet as a cave back before all the modern electronics entered our lives. Every power supply was linear and quiet as can be except maybe for some slight 60Hz hum. Now all our power supplies are switched mode and splatter crap everywhere in the bands.
But Radio Moscow was in the middle of 40 meters. I wouldn't recognize the place now.

©2021 Astroinformatics Group