Welcome to MilkyWay@home

Reverting Change to Remove Unreliable Hosts

Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67207 - Posted: 5 Mar 2018, 20:15:32 UTC

Hey Everyone,

I am going to revert the change that enabled the built-in BOINC option to use reliable hosts. It seems to be having unintended consequences for the usability of the project for some users.

In the future, Sidd and I will look into manually removing the worst offenders who are sending back erroring workunits.

For anyone affected, I apologize if you haven't been able to crunch for us recently.

Jake
TimeRanger

Joined: 31 Oct 10
Posts: 83
Credit: 38,632,375
RAC: 0
Message 67216 - Posted: 6 Mar 2018, 10:26:31 UTC - in response to Message 67207.  

I was kind of hoping that once those offenders were removed, the number of WUs in our cache could be increased to carry us through any server downtime. I know that my machine isn't anywhere near the fastest, but with any outage of more than a few hours I start to run out of work.
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67221 - Posted: 6 Mar 2018, 15:02:16 UTC

Sadly, the issue could not be resolved using the current BOINC defaults for removing unreliable hosts. Enabling BOINC's reliable host detection actually limited the workunits given to hosts that were trying to crunch both CPU and GPU applications. The server would, seemingly at random, choose one or the other to compute on, but not both. This was causing issues for a lot of users, so we had to turn off the option.

I am going to look into implementing a custom version of reliable host detection in the near future to help reduce invalid workunits caused by too many hosts returning nothing but errors. It might be a while before I get it up and running though.

Jake
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67222 - Posted: 7 Mar 2018, 19:45:33 UTC

Hey Everyone,

I finally figured out how to start banning hosts with really high error rates. I am going to be spending some time working with Sidd and other MilkyWay@home developers to determine a good way to decide who deserves to be banned and how long they should be banned for.

Jake
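
The post above does not say how a "really high error rate" would be measured. As a minimal sketch only, assuming a hypothetical per-host tally of recent results and made-up thresholds (`min_results`, `max_error_rate`), one way to flag a host might look like this. None of these names come from BOINC or from the MilkyWay@home server code.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    """Hypothetical tally of a host's recently returned results."""
    host_id: int
    errored: int  # results that came back with an error
    total: int    # all results returned in the window


def error_rate(stats: HostStats) -> float:
    """Fraction of returned results that errored (0.0 if nothing returned)."""
    return stats.errored / stats.total if stats.total else 0.0


def should_suspend(stats: HostStats,
                   min_results: int = 20,
                   max_error_rate: float = 0.9) -> bool:
    """Flag a host only when there is enough evidence and almost all of it is errors.

    Both thresholds are illustrative assumptions, not project policy.
    """
    if stats.total < min_results:
        return False  # too few results to judge; avoids punishing new hosts
    return error_rate(stats) >= max_error_rate
```

Requiring a minimum number of results before judging is one way to avoid suspending a host that has only just joined or has hit a short rough patch.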
mikey
Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 0
Message 67224 - Posted: 8 Mar 2018, 11:29:58 UTC - in response to Message 67222.  

To me, sending them one to five WUs per day, which they then trash, would be sufficient to establish that the host is still unreliable. If it returns a valid WU, then send it a handful more WUs, and if it returns those as valid, open the pipe and welcome it back. This does mean being pretty sure that the host IS NOT reliable in the first place, though, and those criteria are up to you guys. To me some are pretty obvious, others are more iffy.

I don't think any of us want a PC banned that's going through a 'rough patch', or that crashed while the user is working on it, but none of us want to go back to having a wingman PC that hasn't returned a valid WU in 6 months or more. I personally wouldn't mind having a few WUs like that, but pages of them would not be a good thing.
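
The retest idea above can be sketched as a small state machine. This is only an illustration under stated assumptions: the state names, the quota numbers, and the rule that any error during probation sends a host back to the suspended state are all hypothetical, and nothing here reflects how the project server actually works.

```python
from enum import Enum


class HostState(Enum):
    SUSPENDED = "suspended"  # gets only a few probe WUs per day
    PROBATION = "probation"  # returned a valid probe; gets a handful more
    TRUSTED = "trusted"      # normal quota restored ("open the pipe")


# Illustrative daily quotas; the real numbers would be a project decision.
DAILY_QUOTA = {
    HostState.SUSPENDED: 5,
    HostState.PROBATION: 10,
    HostState.TRUSTED: 80,
}


def daily_quota(state: HostState) -> int:
    """How many WUs a host in this state may receive per day."""
    return DAILY_QUOTA[state]


def next_state(state: HostState, valid: int, errored: int) -> HostState:
    """Advance a host by one day, given that day's valid and errored results."""
    if state is HostState.SUSPENDED:
        # A single valid result is enough to earn a larger test batch.
        return HostState.PROBATION if valid > 0 else HostState.SUSPENDED
    if state is HostState.PROBATION:
        if errored > 0:
            return HostState.SUSPENDED  # assumption: any error restarts the test
        if valid > 0:
            return HostState.TRUSTED    # the whole batch came back valid
        return HostState.PROBATION      # nothing returned yet; keep waiting
    return state  # trusted hosts are handled by the normal error-rate check
```

For example, a suspended host that returns one valid WU out of five moves to probation, and a clean batch the next day restores its normal quota.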
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67226 - Posted: 8 Mar 2018, 15:51:57 UTC

That's a good idea. I agree that we want to have some automated way to retest hosts who have been sending errors.

Additionally, I have added a way for people to see if their hosts are currently suspended. That way, if people are actively attempting to fix their hosts, we can work with them to unsuspend the host while they work.

Jake
macgeyer

Joined: 2 Mar 18
Posts: 9
Credit: 457,043,383
RAC: 0
Message 67228 - Posted: 8 Mar 2018, 16:46:59 UTC - in response to Message 67226.  

Don't forget that a beginner can't use the forum because of the credit limit. If their computer starts out with errors, it's impossible to get in contact with you; I had this problem last week.
mmonnin

Joined: 2 Oct 16
Posts: 167
Credit: 1,008,062,758
RAC: 3,336
Message 67238 - Posted: 10 Mar 2018, 18:15:37 UTC

All or nearly all errors on these hosts.

https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=512556
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=606991
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=191911
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=737902
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=718669

