Reducing Workunits to Unreliable Hosts

Author	Message
Jake Weiss (Volunteer moderator, Project developer, Project tester, Project scientist) Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0	Message 66988 - Posted: 22 Jan 2018, 15:37:37 UTC
Hey Everyone,
I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know.
Thank you all for your continued support.
Jake

mikey Joined: 8 May 09 Posts: 3323 Credit: 521,095,239 RAC: 40,090	Message 66997 - Posted: 23 Jan 2018, 11:08:01 UTC - in response to Message 66988.
Quoting Message 66988: "Hey Everyone, I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know. Thank you all for your continued support. Jake"
Thank you very much. Hopefully the credits will flow more quickly now.

ritterm Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0	Message 66999 - Posted: 23 Jan 2018, 15:29:39 UTC - in response to Message 66988. Last modified: 23 Jan 2018, 15:42:39 UTC
Quoting Message 66988: "I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know..."
Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks. Even though the host has a history of returning a lot of errors, does it take time for the server to "learn" that it's unreliable?

ritterm	Message 67009 - Posted: 27 Jan 2018, 14:54:57 UTC - in response to Message 66999.
Quoting Message 66999: "Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks..."
And it continues...

mikey	Message 67010 - Posted: 28 Jan 2018, 12:11:50 UTC - in response to Message 67009.
Quoting Message 67009: "Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks... And it continues..."
What's strange is that they have two PCs: one works and the other doesn't work at all. It has the 2-second error problem.

ritterm	Message 67011 - Posted: 29 Jan 2018, 4:23:36 UTC
Another otherwise likely valid result of mine was invalidated due to unreliable wingmen (628802 and 761112).

Jake Weiss	Message 67012 - Posted: 29 Jan 2018, 19:40:30 UTC
Hey Everyone,
I wanted to give this a couple of days to see if the server just had to learn who was unreliable. It seems that's not the case. I will do a little more research into configuring the server better throughout the week and run some more tests.
Jake

jaczar Joined: 25 Jan 14 Posts: 1 Credit: 17,492,399 RAC: 0	Message 67031 - Posted: 5 Feb 2018, 5:20:13 UTC
One of my PCs is not getting any credit for HOST AVERAGE work done; another seems to be working OK. The OK one also posts USER AVERAGE totals, which seem to include the activity of BOTH of my Milkyway machines, as well as what look like correct HOST AVERAGE totals.

ritterm	Message 67032 - Posted: 5 Feb 2018, 16:11:24 UTC - in response to Message 67031.
Quoting Message 67031: "One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals."
Just so there's no confusion, what does this have to do with unreliable hosts? For your hosts, I see some user aborts and errors in N-body tasks. However, there are no massive and continuing computation errors like those pointed out earlier in this thread.

ritterm	Message 67033 - Posted: 6 Feb 2018, 14:04:24 UTC - in response to Message 67012.
Quoting Message 67012: "I will do a little more research into configuring the server better throughout the week and run some more tests..."
Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that don't support double precision.

Jake Weiss	Message 67034 - Posted: 6 Feb 2018, 18:23:41 UTC
Nothing new to report yet. I've tried changing a few things on the server config side, but none of it seems to make a difference. Hopefully I will find the right knob to tweak soon.
Jake

mikey	Message 67035 - Posted: 6 Feb 2018, 20:54:05 UTC - in response to Message 67033.
Quoting Message 67033: "I will do a little more research into configuring the server better throughout the week and run some more tests... Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that aren't double precision."
Maybe they could blacklist some then? Not the actual host, but the GPU models that can't do double precision. That should mean the host can't get work. Another project I crunch for bans any GPU with less than 2 GB of RAM. I realize that may be a bit easier, but it's a start.

Cliff Joined: 28 Nov 14 Posts: 51 Credit: 86,696,721 RAC: 0	Message 67037 - Posted: 8 Feb 2018, 2:08:01 UTC - in response to Message 67034.
Hi Jake,
While you're at it, can you find out what's causing the database to crash every now and then?
Regards,
Cliff.
--
Been there Done That, still no Damn T-Shirt

Jake Weiss	Message 67045 - Posted: 9 Feb 2018, 19:44:41 UTC
Hey everyone,
I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end.
Jake

mikey	Message 67051 - Posted: 10 Feb 2018, 11:40:37 UTC - in response to Message 67045.
Quoting Message 67045: "Hey everyone, I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end. Jake"
Thanks, I hope that helps! In the meantime, maybe you can see if this PC is affected by it: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=763376
It got 128 tasks and trashed them ALL!!

Jake Weiss	Message 67057 - Posted: 10 Feb 2018, 19:54:35 UTC
Says you have zero workunits in progress. That's a good sign!
Jake

mmonnin Joined: 2 Oct 16 Posts: 167 Credit: 1,006,552,031 RAC: 36,373	Message 67059 - Posted: 11 Feb 2018, 0:57:36 UTC Last modified: 11 Feb 2018, 0:58:10 UTC
All timeouts. That could have just been a failed hard drive or a PC that was turned off. The ones that have computational errors are the problem clients.

Jake Weiss	Message 67072 - Posted: 12 Feb 2018, 16:49:18 UTC
Hey Everyone,
I've done a little digging to see if unreliable hosts are getting fewer workunits now, and that does seem to be the case. I have found several faulty computers that currently have 0 workunits in progress, where they had 100+ as of last week. I am going to consider this issue resolved for now unless anyone objects.
Jake

ritterm	Message 67073 - Posted: 12 Feb 2018, 17:19:29 UTC - in response to Message 67072.
Quoting Message 67072: "I am going to consider this issue resolved for now unless anyone objects..."
It definitely looks like we're making progress. I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC.

ritterm	Message 67074 - Posted: 12 Feb 2018, 19:03:37 UTC - in response to Message 67073.
Quoting Message 67073: "I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC."
Well, it's still being sent new work in spite of a 100% error rate. However, it's getting fewer and fewer new tasks, so maybe it will be completely shut down in another couple of days. Good work, Jake. Thanks for chasing that down for us.
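
Editorial sketch: the tapering described in Message 67074 (a failing host still gets work, but less each time) is the kind of behavior one would expect from a per-host quota that shrinks while a host keeps returning errors. The thread does not say which server options were changed or how the scheduler computes this, so everything below is an assumption for illustration only: the function names, the halving rule, the 50% error threshold, and the quota of 80 (picked to echo the 80-task batches mentioned earlier) are hypothetical and are not taken from the BOINC server code.

```python
def next_quota(quota, errors, returned,
               max_quota=80, min_quota=1, max_error_rate=0.5):
    """Return tomorrow's task quota for a host (hypothetical rule).

    quota: tasks the host was allowed today
    errors: how many of the returned tasks errored
    returned: how many tasks the host returned today
    """
    if returned == 0:
        # Nothing came back (for example, timeouts), so leave the quota alone.
        return quota
    error_rate = errors / returned
    if error_rate > max_error_rate:
        # Too many errors: halve the quota, but never drop below the floor,
        # so a repaired host can still earn its way back.
        return max(min_quota, quota // 2)
    # Mostly good results: double the quota, up to the ceiling.
    return min(max_quota, quota * 2)


def simulate(days, error_rate, start_quota=80):
    """Return the number of tasks sent per day to a host with a fixed error rate."""
    quota = start_quota
    history = []
    for _ in range(days):
        history.append(quota)
        errors = round(quota * error_rate)
        quota = next_quota(quota, errors, returned=quota)
    return history


if __name__ == "__main__":
    print(simulate(7, 1.0))   # host that errors on every task
    print(simulate(3, 0.0))   # host that returns only good results
```

Under these assumed numbers, a host with a 100% error rate would be sent 80, 40, 20, 10, 5, 2, and then 1 task over seven days, while a healthy host stays at 80. A floor of one task is a design choice that lets a fixed machine recover without manual intervention; a floor of zero would shut the host out entirely.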