Welcome to MilkyWay@home

Reducing Workunits to Unreliable Hosts

Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 66988 - Posted: 22 Jan 2018, 15:37:37 UTC

Hey Everyone,

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know.

Thank you all for your continued support.

Jake
ID: 66988 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,940,055
RAC: 22,560
Message 66997 - Posted: 23 Jan 2018, 11:08:01 UTC - in response to Message 66988.  

Hey Everyone,

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know.

Thank you all for your continued support.

Jake


Thank you very much. Hopefully the credits will flow more quickly now.
ID: 66997 · Rating: 0
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 66999 - Posted: 23 Jan 2018, 15:29:39 UTC - in response to Message 66988.  
Last modified: 23 Jan 2018, 15:42:39 UTC

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know...

Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks. Even though the host has a history of returning a lot of errors, does it take time for the server to "learn" that it's unreliable?
ID: 66999 · Rating: 0
ritterm
Message 67009 - Posted: 27 Jan 2018, 14:54:57 UTC - in response to Message 66999.  

Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks...

And it continues...
ID: 67009 · Rating: 0
mikey
Message 67010 - Posted: 28 Jan 2018, 12:11:50 UTC - in response to Message 67009.  

Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks...


And it continues...


What's strange is that they have 2 PCs: one works and the other doesn't work at all. It has the 2-second error problem.
ID: 67010 · Rating: 0
ritterm
Message 67011 - Posted: 29 Jan 2018, 4:23:36 UTC

Another otherwise likely valid result of mine was invalidated due to unreliable wingmen (628802 and 761112).
ID: 67011 · Rating: 0
Jake Weiss
Message 67012 - Posted: 29 Jan 2018, 19:40:30 UTC

Hey Everyone,

I wanted to give this a couple days to see if the server just had to learn who was unreliable. Seems that's not the case. I will do a little more research into configuring the server better throughout the week and run some more tests.

Jake
ID: 67012 · Rating: 0
jaczar
Joined: 25 Jan 14
Posts: 1
Credit: 17,492,399
RAC: 0
Message 67031 - Posted: 5 Feb 2018, 5:20:13 UTC

One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals.
ID: 67031 · Rating: 0
ritterm
Message 67032 - Posted: 5 Feb 2018, 16:11:24 UTC - in response to Message 67031.  

One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals.

Just so there's no confusion, what does this have to do with unreliable hosts? For your hosts, I see some user aborts and errors in N-body tasks. However, no massive and continuing computation errors like those pointed out earlier in this thread.
ID: 67032 · Rating: 0
ritterm
Message 67033 - Posted: 6 Feb 2018, 14:04:24 UTC - in response to Message 67012.  

I will do a little more research into configuring the server better throughout the week and run some more tests...

Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that aren't double precision.
ID: 67033 · Rating: 0
Jake Weiss
Message 67034 - Posted: 6 Feb 2018, 18:23:41 UTC

Nothing new to report yet. I've tried changing a few things on the server config side, but none of it seems to make a difference. Hopefully I will find the right knob to tweak soon.

Jake
ID: 67034 · Rating: 0
mikey
Message 67035 - Posted: 6 Feb 2018, 20:54:05 UTC - in response to Message 67033.  

I will do a little more research into configuring the server better throughout the week and run some more tests...


Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that aren't double precision.


Maybe they could blacklist some then? Not the actual host, but the model of GPU that can't do double precision. That should mean the host can't get work. Another project I crunch for bans any GPU with less than 2 GB of RAM. I realize that may be a bit easier, but it's a start.
ID: 67035 · Rating: 0
Cliff
Joined: 28 Nov 14
Posts: 51
Credit: 86,696,721
RAC: 0
Message 67037 - Posted: 8 Feb 2018, 2:08:01 UTC - in response to Message 67034.  

Hi Jake,
While you're at it, can you find out what's causing the database to crash every now and then?
Regards,
Cliff.
--
Been there Done That, still no Damn T-Shirt
ID: 67037 · Rating: 0
Jake Weiss
Message 67045 - Posted: 9 Feb 2018, 19:44:41 UTC

Hey everyone,

I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end.

Jake
ID: 67045 · Rating: 0
mikey
Message 67051 - Posted: 10 Feb 2018, 11:40:37 UTC - in response to Message 67045.  

Hey everyone,

I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end.

Jake


Thanks, I hope that helps!

In the meantime, maybe you can see if this PC is affected by it:
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=763376

It got 128 tasks and trashed them ALL!!
ID: 67051 · Rating: 0
Jake Weiss
Message 67057 - Posted: 10 Feb 2018, 19:54:35 UTC

Says you have zero workunits in progress. That's a good sign!

Jake
ID: 67057 · Rating: 0
mmonnin
Joined: 2 Oct 16
Posts: 162
Credit: 1,004,365,004
RAC: 16,506
Message 67059 - Posted: 11 Feb 2018, 0:57:36 UTC
Last modified: 11 Feb 2018, 0:58:10 UTC

All timeouts. That could have just been a failed hard drive or a PC that was turned off. The ones that have computational errors are the problem clients.
ID: 67059 · Rating: 0
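
The distinction drawn in the post above (timeouts versus computation errors) can be sketched in a few lines. This is only an illustration: the outcome labels, the function names, and the 0.5 threshold are hypothetical and are not taken from BOINC or from the MilkyWay@home server configuration.

```python
from collections import Counter

# Hypothetical outcome labels for illustration; these are not BOINC's own codes.
COMPUTE_ERROR = "compute_error"
TIMEOUT = "timeout"
VALID = "valid"


def compute_error_rate(outcomes):
    """Fraction of returned tasks that ended in a computation error.

    Timed-out tasks are excluded: they were never returned, so they say
    nothing about whether the host computes correctly.
    """
    counts = Counter(outcomes)
    returned = counts[VALID] + counts[COMPUTE_ERROR]
    if returned == 0:
        return None  # no evidence either way
    return counts[COMPUTE_ERROR] / returned


def looks_unreliable(outcomes, max_error_rate=0.5):
    """Flag a host only when its returned work is mostly compute errors."""
    rate = compute_error_rate(outcomes)
    return rate is not None and rate > max_error_rate


# A host whose tasks all timed out (for example, a PC that was turned off).
offline_host = [TIMEOUT] * 128
# A host that returns nothing but errored tasks.
failing_host = [COMPUTE_ERROR] * 80

print(looks_unreliable(offline_host), looks_unreliable(failing_host))
```

Under these assumptions the first host is not flagged and the second is, which matches the point that the clients returning computation errors are the ones to worry about.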
Jake Weiss
Message 67072 - Posted: 12 Feb 2018, 16:49:18 UTC

Hey Everyone,

I've done a little digging to see if unreliable hosts are getting fewer workunits now, and that does seem to be the case. I have found several faulty computers that currently have 0 workunits in progress, where they had 100+ as of last week.

I am going to consider this issue resolved for now unless anyone objects.

Jake
ID: 67072 · Rating: 0
ritterm
Message 67073 - Posted: 12 Feb 2018, 17:19:29 UTC - in response to Message 67072.  

I am going to consider this issue resolved for now unless anyone objects...

It definitely looks like we're making progress. I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC.
ID: 67073 · Rating: 0
ritterm
Message 67074 - Posted: 12 Feb 2018, 19:03:37 UTC - in response to Message 67073.  

I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC.

Well, it's still being sent new work in spite of a 100% error rate. However, it's getting fewer and fewer new tasks so maybe it will be completely shut down in another couple of days.

Good work, Jake. Thanks for chasing that down for us.
ID: 67074 · Rating: 0
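
The "fewer and fewer new tasks" pattern described in the last post is what one would expect from a per-host quota that shrinks with each bad result and recovers with good ones. The thread does not say which mechanism the server actually uses, so the following is only a sketch under stated assumptions: the quota drops by one for every errored task and doubles for every valid one, and the function names, the cap of 80, and the floor of 1 are all illustrative.

```python
def next_quota(quota, result_ok, max_quota=80, min_quota=1):
    """Return a host's quota after one reported result (hypothetical rule)."""
    if result_ok:
        # Good results restore trust quickly, up to the cap.
        return min(max_quota, quota * 2)
    # Each bad result removes one task from the quota, down to the floor.
    return max(min_quota, quota - 1)


def simulate(results, start=80):
    """Apply next_quota to a sequence of results and record the quota over time."""
    quota = start
    history = []
    for ok in results:
        quota = next_quota(quota, ok)
        history.append(quota)
    return history


# A host with a 100% error rate: the quota shrinks step by step.
failing = simulate([False] * 100)
print(failing[0], failing[-1])  # 79 1

# The same host after it is fixed: three valid results in a row.
recovering = simulate([False] * 100 + [True] * 3)
print(recovering[-3:])  # [2, 4, 8]
```

With a floor of 1, a rule like this throttles a failing host rather than cutting it off entirely, and a repaired host regains work after a handful of valid results.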