Reverting Change to Remove Unreliable Hosts

Message boards : News : Reverting Change to Remove Unreliable Hosts

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 67207 - Posted: 5 Mar 2018, 20:15:32 UTC

Hey Everyone,

I am going to revert the change that enabled BOINC's built-in "use reliable hosts" option. It seems to be having unintended consequences for the usability of the project for some users.

In the future, Sidd and I will look into manually removing the worst offenders who are sending back erroring workunits.

To anyone affected, I apologize if you haven't been able to crunch for us recently.

Jake

Profile TimeRanger
Send message
Joined: 31 Oct 10
Posts: 75
Credit: 30,586,609
RAC: 26,654

Message 67216 - Posted: 6 Mar 2018, 10:26:31 UTC - in response to Message 67207.

I was kind of hoping that once those offenders were removed, the number of WUs in our cache could be increased to carry us through any server downtime. I know that my machine isn't anywhere near the fastest, but with any outage longer than a few hours I start to run out of work.

Profile Jake Weiss

Message 67221 - Posted: 6 Mar 2018, 15:02:16 UTC

Sadly, the issue was something that could not be resolved using the current BOINC defaults for removing unreliable hosts. Enabling the BOINC reliable host detection actually limited the workunits given to hosts that were trying to crunch both CPU and GPU applications. The server would, seemingly at random, choose one or the other to compute on, but not both. This was causing issues for a lot of users, so we had to turn off the option.

I am going to look into implementing a custom version of reliable host detection in the near future to help reduce invalid workunits caused by too many hosts returning nothing but errors. It might be a while before I get it up and running though.

Jake

Profile Jake Weiss

Message 67222 - Posted: 7 Mar 2018, 19:45:33 UTC

Hey Everyone,

I finally figured out how to start banning hosts with really high error rates. I am going to be spending some time working with Sidd and other MilkyWay@home developers to determine a good way to decide who deserves to be banned and how long they should be banned for.

Jake
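
For readers wondering what an error-rate check of this kind might look like, here is a minimal sketch. It is not the project's actual code: the `HostStats` record, the `should_suspend` function, and both thresholds are hypothetical and chosen only for illustration. The idea is simply to require a minimum number of returned results before judging a host, so that a machine with a handful of bad tasks is not flagged.

```python
from dataclasses import dataclass


@dataclass
class HostStats:
    """Hypothetical per-host summary over some look-back window."""
    host_id: int
    total_results: int  # results the host returned in the window
    error_results: int  # how many of those ended in an error


def should_suspend(stats, min_results=20, max_error_rate=0.9):
    """Return True if the host looks unreliable enough to suspend.

    Both thresholds are assumptions for this sketch, not project policy.
    """
    if stats.total_results < min_results:
        # Too little evidence to judge the host either way.
        return False
    error_rate = stats.error_results / stats.total_results
    return error_rate >= max_error_rate


if __name__ == "__main__":
    hosts = [
        HostStats(host_id=1, total_results=100, error_results=100),
        HostStats(host_id=2, total_results=5, error_results=5),
        HostStats(host_id=3, total_results=100, error_results=10),
    ]
    for h in hosts:
        print(h.host_id, should_suspend(h))
```

How long a flagged host stays suspended is a separate policy question that this sketch does not try to answer.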

Profile mikey
Joined: 8 May 09
Posts: 2182
Credit: 231,022,148
RAC: 208,392

Message 67224 - Posted: 8 Mar 2018, 11:29:58 UTC - in response to Message 67222.

To me, sending them one to five WUs per day, which they then trash, would be sufficient to establish that the host is still unreliable. If it returns a valid WU, then send it a handful more, and if it returns those as valid, then open the pipe and welcome it back. This does mean being pretty sure that the host IS NOT reliable in the first place, though, and those criteria are up to you guys. To me, some are pretty obvious; others are more iffy.

I don't think any of us want some PC banned that's going through a 'rough patch' or crashed while the user is working on it, but none of us want to go back to having a wingman PC that hasn't returned a valid WU in 6 months or more. I personally wouldn't mind having a few WUs like that, but pages of them would not be a good thing.
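
The retest idea in this post can be sketched as a small state machine. This is only an illustration under stated assumptions, not anything the project has implemented: the level names, the `DAILY_QUOTA` values, the `PROBATION_STREAK` length, and the `next_level` function are all hypothetical. A suspended host gets a trickle of work, one valid result moves it to probation with a slightly larger quota, and a run of valid results on probation restores it fully.

```python
from enum import Enum


class Level(Enum):
    SUSPENDED = "suspended"  # trickle of work, just enough to retest
    PROBATION = "probation"  # a handful more while the host proves itself
    TRUSTED = "trusted"      # no special limit


# Hypothetical daily workunit limits per level (None means no special limit).
DAILY_QUOTA = {Level.SUSPENDED: 1, Level.PROBATION: 5, Level.TRUSTED: None}

# Hypothetical number of consecutive valid results needed to leave probation.
PROBATION_STREAK = 5


def next_level(level, valid_streak, result_valid):
    """Return (new_level, new_valid_streak) after one returned result."""
    if not result_valid:
        if level is Level.TRUSTED:
            # One bad result does not demote a trusted host in this sketch;
            # deciding to suspend it is assumed to happen elsewhere.
            return Level.TRUSTED, 0
        return Level.SUSPENDED, 0
    if level is Level.SUSPENDED:
        # First valid result: move to probation and start counting.
        return Level.PROBATION, 0
    if level is Level.PROBATION:
        valid_streak += 1
        if valid_streak >= PROBATION_STREAK:
            return Level.TRUSTED, valid_streak
        return Level.PROBATION, valid_streak
    return Level.TRUSTED, valid_streak + 1


if __name__ == "__main__":
    level, streak = Level.SUSPENDED, 0
    for outcome in [False, True, True, True, True, True, True]:
        level, streak = next_level(level, streak, outcome)
        print(outcome, level.value, DAILY_QUOTA[level])
```

Any error during probation sends the host back to the trickle quota, which matches the point above that a host that keeps trashing its daily workunits has shown it is still unreliable.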

Profile Jake Weiss

Message 67226 - Posted: 8 Mar 2018, 15:51:57 UTC

That's a good idea. I agree that we want to have some automated way to retest hosts that have been sending errors.

Additionally, I have added a way for people to see if their hosts are currently suspended. That way, if people are actively attempting to fix their hosts, we can work with them to unsuspend the host while they work.

Jake

macgeyer
Joined: 2 Mar 18
Posts: 6
Credit: 1,241,834
RAC: 0

Message 67228 - Posted: 8 Mar 2018, 16:46:59 UTC - in response to Message 67226.

Don't forget that a beginner can't use the forum because of the credit limit. If a new user's computer starts out returning errors, it's impossible to get in contact with you. I had this problem last week.

mmonnin
Joined: 2 Oct 16
Posts: 102
Credit: 81,199,642
RAC: 35,480

Message 67238 - Posted: 10 Mar 2018, 18:15:37 UTC

All or nearly all errors on these hosts.

https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=512556
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=606991
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=191911
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=737902
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=718669



Copyright © 2018 AstroInformatics Group