Server Downtime 3/21 1PM EST

Author	Message
alanb1951 Joined: 16 Mar 10 Posts: 218 Credit: 110,569,596 RAC: 7,437	Message 72207 - Posted: 23 Mar 2022, 19:49:20 UTC Last modified: 23 Mar 2022, 19:52:27 UTC
Tom,
Thanks for those responses above - hopefully they reach the audience that needs to read them! :-)
I'd second the suggestion made by Skillz...
"Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process?"
...but for a different reason. Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues even when the RAID array has sorted itself out! So having some down time (even prolonged) might not be that bad a thing provided it is flagged up in advance (as you did for the outage on 21st March).
I think quite a few projects have seen a rise in demand for work whilst World Community Grid is off the air - perhaps the pressure will ease a little anyway come mid-April (or thereabouts...)
Cheers - Al.
P.S. What happened to Dylan, the student who introduced himself back in January?

unixchick Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0	Message 72208 - Posted: 23 Mar 2022, 19:50:17 UTC
Thanks for the update, Tom! Reminds me of the projects I worked on during my years as a grad, most of which were not related to my thesis.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 72209 - Posted: 23 Mar 2022, 20:33:49 UTC - in response to Message 72207.
"P.S. What happened to Dylan, the student who introduced himself back in January?"
He is still working with us! He is working out the kinks after updating our experimental server build to the newest versions of Ubuntu and BOINC. Hopefully once we figure out the bugs, we can then update this server to the newest Ubuntu and BOINC versions.

Kiska Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0	Message 72210 - Posted: 23 Mar 2022, 20:37:13 UTC - in response to Message 72207.
"I'd second the suggestion made by Skillz... Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process? ...but for a different reason. Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues even when the RAID array has sorted itself out! So having some down time (even prolonged) might not be that bad a thing provided it is flagged up in advance (as you did for the outage on 21st March)"
I third this. With the db hammering the array, it's actually not doing much rebuilding, and with it being a RAID5, you need to get back that redundancy ASAP.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 72211 - Posted: 23 Mar 2022, 21:26:09 UTC
"Seems to me you can just stop sending out work, collect all the work currently out, and rebuild the new hard drive much faster than trying to keep things going while it rebuilds."
"Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues"
I have been regularly clearing the transitioner backlog, but it fills up very quickly with the large number of validated WUs coming back. I'm not sure that turning off the workunit generators will help (otherwise I'd just do that right now), because when there were no WUs in the "Tasks ready to send" queue, we were still having a large number of WUs waiting for validation that were piling up. We were also having people complain that they couldn't get work at that point too.
I think I'll make a post stating that we will have some scheduled downtime over the weekend. During that downtime I'll turn the project off for a day or so, clear the transitioner backlog, and let some of the WUs waiting for validation trickle back in. Maybe some time with the DB service turned off would be good too, in order to rebuild the new drive.

Keith Myers Joined: 24 Jan 11 Posts: 738 Credit: 566,066,763 RAC: 13,027	Message 72214 - Posted: 23 Mar 2022, 21:57:45 UTC - in response to Message 72189.
"Also if you're using consumer SSDs their rated endurance is quite low in a server environment. e.g. 960GB WD Enterprise SSD: https://www.newegg.com/western-digital-gold-960gb/p/20-250-139 vs a 1TB consumer SSD https://www.newegg.com/western-digital-1tb-black-sn850-nvme/p/N82E16820250161 The 1 TB consumer SSD has endurance of 600TBW while the enterprise drive is 1.4PBW"
Actually the latest Backblaze server storage failure statistics are showing that SSDs have a much lower average failure rate.
https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2021/#:~:text=The%20overall%20annualized%20failure%20rate,23%2C600%20drives%20over%20the%20period.

Keith Myers Joined: 24 Jan 11 Posts: 738 Credit: 566,066,763 RAC: 13,027	Message 72215 - Posted: 23 Mar 2022, 22:03:31 UTC Last modified: 23 Mar 2022, 22:07:51 UTC
In the past 10 days, I have had a total of 3 days with reported validated statistics. The rest were 0 results. Just shy of 7000 tasks awaiting validation since 7 March. That's over 3 weeks now. I'm used to my results validating on average in a day or so when the servers are running normally.
[Edit] Also my stats haven't been updated properly for the past several days. Stats still show 89 tasks in progress while in fact I have zero in progress. I returned the last of the MW work several days ago and set NNT until the servers are working correctly again.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,740,803 RAC: 16,312	Message 72220 - Posted: 24 Mar 2022, 8:15:58 UTC - in response to Message 72209.
"P.S. What happened to Dylan, the student who introduced himself back in January?"
"He is still working with us! He is working out the kinks after updating our experimental server build to the newest versions of Ubuntu and BOINC. Hopefully once we figure out the bugs, we can then update this server to the newest Ubuntu and BOINC versions."
Great news. BOINC is full of bugs; if you can get a later server version working, that would sort a lot of problems out.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,740,803 RAC: 16,312	Message 72221 - Posted: 24 Mar 2022, 8:33:20 UTC - in response to Message 72215.
"In the past 10 days, I have had a total of 3 days with reported validated statistics. The rest were 0 results. Just shy of 7000 tasks awaiting validation since 7 March. That's over 3 weeks now. I'm used to my results validating on average in a day or so when the servers are running normally. [Edit] Also my stats haven't been updated properly for the past several days. Stats still show 89 tasks in progress while in fact I have zero in progress. I returned the last of the MW work several days ago and set NNT until the servers are working correctly again."
Or look at it this way: do loads for MW now and get a huge lump of credits in a while.