Reducing Workunits to Unreliable Hosts

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 66988 - Posted: 22 Jan 2018, 15:37:37 UTC

Hey Everyone,

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know.

Thank you all for your continued support.

Jake

Profile mikey
Joined: 8 May 09
Posts: 2183
Credit: 231,023,850
RAC: 200,304

Message 66997 - Posted: 23 Jan 2018, 11:08:01 UTC - in response to Message 66988.

Hey Everyone,

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know.

Thank you all for your continued support.

Jake


Thank you very much, hopefully the credits will flow more quickly now.

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 66999 - Posted: 23 Jan 2018, 15:29:39 UTC - in response to Message 66988.
Last modified: 23 Jan 2018, 15:42:39 UTC

I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know...

Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks. Even though the host has a history of returning a lot of errors, does it take time for the server to "learn" that it's unreliable?
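On the question of whether a server needs time to "learn" that a host is unreliable: the sketch below shows one simple way a running error-rate estimate could be kept per host. It is an illustration only. The function names, the 0.1 weight, and the 0.5 threshold are assumptions made for this example and are not taken from the project's scheduler.

```python
def update_error_rate(current_rate: float, task_failed: bool, weight: float = 0.1) -> float:
    """Blend the newest task outcome into a running error-rate estimate.

    Hypothetical model: an exponentially weighted average where `weight`
    controls how quickly new results override the host's history.
    """
    outcome = 1.0 if task_failed else 0.0
    return (1.0 - weight) * current_rate + weight * outcome


def tasks_until_flagged(start_rate: float, threshold: float, weight: float = 0.1) -> int:
    """Count the consecutive failures needed before the estimate exceeds `threshold`."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    if not 0.0 < weight <= 1.0:
        raise ValueError("weight must be in (0, 1]")
    rate = start_rate
    count = 0
    while rate <= threshold:
        rate = update_error_rate(rate, task_failed=True, weight=weight)
        count += 1
    return count


if __name__ == "__main__":
    # A host with a clean history that then fails every task.
    print(tasks_until_flagged(start_rate=0.0, threshold=0.5, weight=0.1))
```

With these assumed numbers, a host that fails every task crosses the 50% threshold after 7 tasks, so under this kind of model a batch of 80 errors would be more than enough for the estimate to catch up.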

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 67009 - Posted: 27 Jan 2018, 14:54:57 UTC - in response to Message 66999.

Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks...

And it continues...

Profile mikey
Joined: 8 May 09
Posts: 2183
Credit: 231,023,850
RAC: 200,304

Message 67010 - Posted: 28 Jan 2018, 12:11:50 UTC - in response to Message 67009.

Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks...


And it continues...


What's strange is they have 2 PCs: one works and one doesn't work at all. It's got the 2-second error problem.

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 67011 - Posted: 29 Jan 2018, 4:23:36 UTC

Another otherwise likely valid result of mine was invalidated due to unreliable wingmen (628802 and 761112).

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 67012 - Posted: 29 Jan 2018, 19:40:30 UTC

Hey Everyone,

I wanted to give this a couple days to see if the server just had to learn who was unreliable. Seems that's not the case. I will do a little more research into configuring the server better throughout the week and run some more tests.

Jake

jaczar
Joined: 25 Jan 14
Posts: 1
Credit: 11,555,700
RAC: 31,990

Message 67031 - Posted: 5 Feb 2018, 5:20:13 UTC

One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals.

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 67032 - Posted: 5 Feb 2018, 16:11:24 UTC - in response to Message 67031.

One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals.

Just so there's no confusion, what does this have to do with unreliable hosts? For your hosts, I see some user aborts and errors in N-body tasks. However, no massive and continuing computation errors like those pointed out earlier in this thread.

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 67033 - Posted: 6 Feb 2018, 14:04:24 UTC - in response to Message 67012.

I will do a little more research into configuring the server better throughout the week and run some more tests...

Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that aren't double precision.

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 67034 - Posted: 6 Feb 2018, 18:23:41 UTC

Nothing new to report yet. I've tried changing a few things on the server config side, but none of it seems to make a difference. Hopefully I will find the right knob to tweak soon.

Jake

Profile mikey
Joined: 8 May 09
Posts: 2183
Credit: 231,023,850
RAC: 200,304

Message 67035 - Posted: 6 Feb 2018, 20:54:05 UTC - in response to Message 67033.

I will do a little more research into configuring the server better throughout the week and run some more tests...


Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that aren't double precision.


Maybe they could blacklist some, then? Not the actual host, but the model of GPU that can't do double precision. That should mean the host can't get work. Another project I crunch for bans any GPU with less than 2 GB of RAM. I realize that may be a bit easier, but it's a start.

Profile Cliff
Joined: 28 Nov 14
Posts: 51
Credit: 82,399,668
RAC: 119,985

Message 67037 - Posted: 8 Feb 2018, 2:08:01 UTC - in response to Message 67034.

Hi Jake,
While you're at it, can you find out what's causing the database to crash every now and then?
____________
Regards,
Cliff.
--
Been there Done That, still no Damn T-Shirt

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 67045 - Posted: 9 Feb 2018, 19:44:41 UTC

Hey everyone,

I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end.
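The post does not describe the bug itself, so the following is only a guess at its general shape, not the scheduler's actual code. One common way a check can misbehave at priority 0 is when a threshold of 0 is mistaken for "option not set". The names `RELIABLE_ON_PRIORITY`, `needs_reliable_host_buggy`, and `needs_reliable_host_fixed` are hypothetical and exist only for this illustration.

```python
from typing import Optional

# Hypothetical threshold: workunits at or above this priority should only
# go to hosts considered reliable.
RELIABLE_ON_PRIORITY = 0


def needs_reliable_host_buggy(priority: int, threshold: int = RELIABLE_ON_PRIORITY) -> bool:
    # Bug pattern: a threshold of 0 is falsy, so it is treated as "disabled"
    # and the reliability test is skipped entirely.
    return bool(threshold) and priority >= threshold


def needs_reliable_host_fixed(priority: int, threshold: Optional[int] = RELIABLE_ON_PRIORITY) -> bool:
    # Fix pattern: None means "disabled"; 0 is a perfectly valid threshold.
    return threshold is not None and priority >= threshold


if __name__ == "__main__":
    print(needs_reliable_host_buggy(priority=0))               # check skipped
    print(needs_reliable_host_fixed(priority=0))               # check applied
    print(needs_reliable_host_buggy(priority=1, threshold=1))  # workaround: use 1
```

If the real bug resembles this pattern, moving both the threshold and the workunits to priority 1 would sidestep it without touching the check, which is consistent with the workaround described in the post.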

Jake

Profile mikey
Joined: 8 May 09
Posts: 2183
Credit: 231,023,850
RAC: 200,304

Message 67051 - Posted: 10 Feb 2018, 11:40:37 UTC - in response to Message 67045.

Hey everyone,

I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end.

Jake


Thanks, I hope that helps!

In the meantime, maybe you can see if this PC is affected by it:
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=763376

It got 128 tasks and trashed them ALL!!

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 67057 - Posted: 10 Feb 2018, 19:54:35 UTC

Says you have zero workunits in progress. That's a good sign!

Jake

mmonnin
Joined: 2 Oct 16
Posts: 102
Credit: 81,199,642
RAC: 32,003

Message 67059 - Posted: 11 Feb 2018, 0:57:36 UTC
Last modified: 11 Feb 2018, 0:58:10 UTC

All timeouts. That could have just been a failed hard drive or a PC that was turned off. The ones that have computational errors are the problem clients.

Profile Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 501
Credit: 34,647,251
RAC: 415

Message 67072 - Posted: 12 Feb 2018, 16:49:18 UTC

Hey Everyone,

I've done a little digging to see if unreliable hosts are getting fewer workunits now, and that does seem to be the case. I have found several faulty computers that currently have 0 workunits in progress where they had 100+ as of last week.

I am going to consider this issue resolved for now unless anyone objects.

Jake

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 67073 - Posted: 12 Feb 2018, 17:19:29 UTC - in response to Message 67072.

I am going to consider this issue resolved for now unless anyone objects...

It definitely looks like we're making progress. I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC.

Profile ritterm
Joined: 16 Jun 08
Posts: 92
Credit: 365,629,434
RAC: 0

Message 67074 - Posted: 12 Feb 2018, 19:03:37 UTC - in response to Message 67073.

I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC.

Well, it's still being sent new work in spite of a 100% error rate. However, it's getting fewer and fewer new tasks so maybe it will be completely shut down in another couple of days.

Good work, Jake. Thanks for chasing that down for us.
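The "fewer and fewer new tasks" behavior described above is what a shrinking per-host quota would look like. The sketch below is a simplified, hypothetical model of that idea and not the scheduler's real algorithm. The halving rule, the 0.5 error-rate cutoff, and the function name `simulate_quota` are assumptions; the starting value of 80 is borrowed from the batch size mentioned earlier in the thread.

```python
from typing import List, Sequence


def simulate_quota(
    start_quota: int,
    daily_error_rates: Sequence[float],
    bad_rate: float = 0.5,
    max_quota: int = 80,
) -> List[int]:
    """Return the quota the host is given on each day.

    Hypothetical rule: after a day whose error rate exceeds `bad_rate`,
    the quota is halved (never below 1); otherwise it doubles, capped
    at `max_quota`.
    """
    quotas: List[int] = []
    quota = start_quota
    for rate in daily_error_rates:
        quotas.append(quota)
        if rate > bad_rate:
            quota = max(1, quota // 2)
        else:
            quota = min(max_quota, quota * 2)
    return quotas


if __name__ == "__main__":
    # A host that fails every task for five days in a row.
    print(simulate_quota(80, [1.0] * 5))  # [80, 40, 20, 10, 5]
```

Under this model a host with a 100% error rate is never cut off outright; it is throttled toward a single task per day and can recover once it starts returning valid results.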

Copyright © 2018 AstroInformatics Group