Welcome to MilkyWay@home

Validator Outage

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,411
RAC: 22,500
Message 71295 - Posted: 1 Nov 2021, 11:06:35 UTC - in response to Message 71292.  

I will take a day to look at the source code again when I am recovered from my surgery and we are finished working on our NSF proposal. I am hoping that it is something that can be fixed on the server side of things, and the Separation client code doesn't need to be rebuilt.


I too hope you recover quickly from your surgery!! Maybe during that time you can reach out to the BOINC admins at the other projects for some quick advice on where to look, i.e. the Separation side or the BOINC side of things. Maybe they can also help with the setting that requires a 10-minute back-off when getting new GPU tasks. I'm hoping you can work this in between all the 'other' work stuff you will be doing as well. Good luck on your surgery!!!

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71296 - Posted: 1 Nov 2021, 17:02:22 UTC
Last modified: 1 Nov 2021, 17:12:55 UTC

Thank you everyone! I just got my wisdom teeth removed this morning so I'm not sure how much I will be working for a little while. Nothing serious!

Unfortunately I have so many projects that this one often gets pushed to low priority, which I definitely understand is frustrating for you. I'd like to be able to do everything quickly, but it's not usually possible. I'll do my best to squeeze this in soon, though.

Thanks for being understanding.

paris
Joined: 26 Apr 08
Posts: 87
Credit: 64,801,496
RAC: 0
Message 71300 - Posted: 1 Nov 2021, 23:19:05 UTC

Thank you, Tom Donlon, for all that you do. It is appreciated.


Plus SETI Classic = 21,082 WUs

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 2
Message 71303 - Posted: 3 Nov 2021, 10:15:15 UTC

Invalid and errored tasks are on the increase again: 560 invalid in 24 hrs, 480 error.
This really needs to be sorted, Tom.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71305 - Posted: 3 Nov 2021, 15:10:55 UTC

Cleared out the bad WU. Apologies.

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 2
Message 71306 - Posted: 3 Nov 2021, 15:52:12 UTC - in response to Message 71305.  

Thank you Tom.
I hope you are feeling much better.

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71307 - Posted: 5 Nov 2021, 2:26:43 UTC

Tom,
We need you to unstick the software again. Today I encountered about 60 7-WU tasks. They were all sent today and run today. On average they spent 6 hours on my computers. So they are fresh, which indicates to me that you have a stuck one.
And no, I haven't forgotten my war against Errors While Computing. Did you know that Rosetta at Home is experiencing a bunch of flawed tasks, even as we speak? Milkyway and Rosetta are both BOINC users. Makes one wonder whether the malfeasant software might live in the BOINC servers. I would like to see the end of Errors While Computing; they imply my computers erred. It ain't true.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71308 - Posted: 5 Nov 2021, 14:33:01 UTC

I took a look and there aren't any stuck WUs. They might just be leftover from the last stuck one.

It could totally be that there's a bug somewhere in the BOINC code, but it's also just as reasonable that the issue is with the custom Milkyway code. Good to know that we're all struggling together with these things.

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 2
Message 71320 - Posted: 6 Nov 2021, 16:38:19 UTC

Hello Tom,
1367 Errors

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,445,895
RAC: 36,619
Message 71321 - Posted: 6 Nov 2021, 17:07:46 UTC - in response to Message 71320.  

Hello Tom,
1367 Errors

As a user who'd like to help if he can, may I ask whether those are marked as Error or Invalid (or a mixture)? And are they Separation or NBody?? I can't tell because your computers are currently hidden...

The issue Tom is trying to deal with in this thread is only Separation tasks being marked Invalid because of malformed tasks being sent out.

If you're suffering errors rather than being hit by huge numbers of those tasks that will go Invalid, the community might be able to help if they can see what sort of errors you are getting; and given that there aren't lots of other people sailing in reporting bad tasks going Invalid at the moment, there's cause to wonder...

Cheers - Al.

P.S. I have monitoring set up on my systems to spot the "7 work unit" tasks as they arrive (so I can choose to abort them before they start) - I haven't seen one in several days (and would be quick to report any I saw that weren't late retries!...)

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 2
Message 71322 - Posted: 6 Nov 2021, 17:53:54 UTC - in response to Message 71321.  

Hi Al,
All of my hosts crunch Separation Units.
My current level of wasted units: Invalid 209, Error 1355.
I have experienced over 7500 Invalid recently, but this is quite high for Errors.

Cheers

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,445,895
RAC: 36,619
Message 71323 - Posted: 6 Nov 2021, 18:36:53 UTC - in response to Message 71322.  

Hi Al,
All of my hosts crunch Separation Units.
My current level of wasted units: Invalid 209, Error 1355.
I have experienced over 7500 Invalid recently, but this is quite high for Errors.

Cheers

O.K. -- thanks for clarifying...

Without knowing how many systems are involved, and how many of those Errors are tasks not returned in time (I think MW logs both User Aborted and Server Aborted as Error...), it's difficult to comment further on Errors. If some of them are genuine errors (e.g. GPU code compilation errors, device errors, invalid memory accesses...) whatever is causing those might also be contributing to the Invalid count by returning results that actually are Invalid!

At risk of telling you something you already know (in which case my apologies!), there are currently two reasons a task can be flagged Invalid. The case that can (and presumably, eventually, will) be solved is the tasks that seem to contain 7 sub-tasks instead of 4 - Tom has arranged that those are automatically flagged Invalid -- there's potential bad science in there!

The other case is because as the results converge(?) on a final solution some of the calculations may be dealing with very small values and different bits of hardware will produce subtly different results due to rounding errors, truncations to zero, and so forth. If such issues happen early enough in a particular sub-task the results from different hardware (or different compilers) can diverge by enough to introduce uncertainty as to which results to select as "canonical" -- in such cases, someone is going to end up with Invalid results...
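
A minimal sketch of the effect described above, not MilkyWay@home's actual validator: adding the same numbers in a different order can give different floating-point results, which is one way different hardware or compilers might drift apart. The `naive_sum` and `results_agree` helpers and the tolerance value are hypothetical, chosen only for illustration.

```python
import math


def naive_sum(values):
    """Add values left to right with ordinary double-precision arithmetic."""
    total = 0.0
    for v in values:
        total += v
    return total


def results_agree(a, b, rel_tol=1e-12):
    """Hypothetical check: two results match if within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


values = [1e16, 1.0, -1e16, 1.0]

forward = naive_sum(values)            # the first 1.0 is rounded away next to 1e16
reordered = naive_sum(sorted(values))  # same numbers, different order
exact = math.fsum(values)              # correctly rounded reference sum

print(forward, reordered, exact)
print(results_agree(forward, reordered))
```

Here the left-to-right sum comes out as 1.0, the reordered sum as 0.0, and the correctly rounded sum as 2.0, so a tolerance check of this kind would reject the pair even though both runs did "the same" arithmetic.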

If you've got a consistent source of Error flags that isn't due to time-outs, you could always start a thread about it in Number Crunching, and someone will certainly pitch in to help...

Happy crunching - Al.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,445,895
RAC: 36,619
Message 71325 - Posted: 7 Nov 2021, 18:43:04 UTC
Last modified: 7 Nov 2021, 19:10:08 UTC

Tom,

There seem to be some more 7-WU tasks out there; I've just aborted tasks from the workunits listed below; all of them were initially created within the last 24 hours...

Workunit 234295593 -- de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1109369

Workunit 234328098 -- de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1140632

Workunit 234395293 -- de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1204374

Workunit 235146445 -- de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1915344

Fortunately, these constitute less than 3% of the total tasks I've received so far today!

[Edit - two more!...]
Workunit 235070057 -- de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1842523

Workunit 235140512 -- de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1909654
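
As a rough sketch for anyone wanting to group reports like the ones above: the workunit names appear to share a common layout. Assuming (this is not confirmed anywhere in the thread) that the last two underscore-separated fields are a Unix timestamp for the search run and a sequence number, the names can be split as follows. `parse_wu_name` is a hypothetical helper, not part of BOINC or MilkyWay@home.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_wu_name(name):
    """Split a workunit name into (run, started, sequence).

    Assumption: the last two '_' fields are a Unix timestamp and a sequence number.
    """
    run, stamp, seq = name.rsplit("_", 2)
    started = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return run, started, int(seq)


# Names copied from the reports above.
reported = [
    "de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1109369",
    "de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1140632",
    "de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1204374",
    "de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1915344",
    "de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1842523",
    "de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1909654",
]

# Count how many reported workunits belong to each run prefix.
per_run = Counter(parse_wu_name(n)[0] for n in reported)
for run, count in per_run.items():
    print(run, count)
```

All six names carry the same stamp field (1636154231), so under that assumption they would come from runs started at the same moment, split evenly between the two run prefixes.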

Cheers - Al.

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 2
Message 71326 - Posted: 7 Nov 2021, 19:00:41 UTC

Invalid 760, Errors 1640

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 2
Message 71328 - Posted: 8 Nov 2021, 10:39:38 UTC

Hello Tom,
Things are just getting worse.

Invalid 1122, Errors 1717

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71329 - Posted: 8 Nov 2021, 16:08:40 UTC

Hi All,

I was away from my computers yesterday and wasn't able to fix things until now. Cleared out two WUs.

Also, I've noticed some users getting hostile about this issue in the message boards and my DMs. I would like to remind you that Separation is a small part of my job, and it is not the focus of my PhD thesis. Unfortunately I don't always have the time to drop everything and dig through Separation code. Right now, all of my time is going into proposals to keep myself and MilkyWay@home funded. When things calm down, I will take the time and find a solution.

I understand that many of you have a financial interest in crunching WUs, and that the longer that I take to fix the problem the less money that you make. Apologies for the delays.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71332 - Posted: 8 Nov 2021, 16:19:29 UTC

I went through and purged all unsent jobs from the DB that had 7 WUs. Hopefully this decreases your validate error count rapidly instead of just waiting for things to work themselves out.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,445,895
RAC: 36,619
Message 71348 - Posted: 11 Nov 2021, 18:33:27 UTC

Tom,

Sorry to be the bearer of bad news yet again... I've just cleared out the three tasks below, all for new work units:

Workunit 239227782
name 	de_modfit_84_bundle4_4s_south4s_gapfix_1636388182_2840024
created 	11 Nov 2021, 8:25:19 UTC
  Task 409067936
  Created 11 Nov 2021, 17:39:20 UTC, Sent 11 Nov 2021, 17:49:36 UTC

Workunit 239471045
name 	de_modfit_84_bundle4_4s_south4s_gapfix_1636388182_3069694
created 	11 Nov 2021, 13:29:40 UTC
  Task 409067359
  Created 11 Nov 2021, 17:38:27 UTC, Sent 11 Nov 2021, 17:49:36 UTC

Workunit 239471509
name 	de_modfit_84_bundle4_4s_south4s_gapfix_1636388182_3070149
created 	11 Nov 2021, 13:30:03 UTC
  Task 409067360
  Created 11 Nov 2021, 17:38:27 UTC, Sent 11 Nov 2021, 17:49:36 UTC


Is there anything any of us can do to help you resolve this (other than just drawing your attention to them, hopefully early enough to let you stop them quickly)? I (for one) would be willing to look at code, but I'm on the wrong side of the Atlantic to just drop in :-)

Cheers - Al.

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71351 - Posted: 12 Nov 2021, 16:31:47 UTC

Tom,
The 7 WUs are back. Back consuming 1.62 times the energy required to run the normal 4 WU tasks and wasting 1.62 times the computing time. It is a serious problem.
I know you will "unstick" the hung 7 WU task and all will be well for a couple of days. That isn't a fix; it is a workaround. And it is not fair to you; you have to spend a bunch of time mucking around in the software to keep the system running, again and again.
What must we do to get a solution to this problem? It has to be fixed. We can't trivialize this problem as an inconvenience we can tolerate. Some users might tolerate it, but many will not (including me).
Save the Milkyway!! Is that cosmic or what?

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71352 - Posted: 12 Nov 2021, 16:43:48 UTC

Cleared out 2 bad WUs. I may have an idea for a patchy fix that could be implemented quickly. Might take a look today and try it if I get around to it. If I do, the server will go up and down a few times later.

©2024 Astroinformatics Group