Welcome to MilkyWay@home

Server Downtime 3/21 1PM EST


alanb1951

Joined: 16 Mar 10
Posts: 138
Credit: 76,053,735
RAC: 72,144
50 million credit badge · 12 year member badge · extraordinary contributions badge
Message 72207 - Posted: 23 Mar 2022, 19:49:20 UTC
Last modified: 23 Mar 2022, 19:52:27 UTC

Tom,

Thanks for those responses above - hopefully they reach the audience that needs to read them! :-)

I'd second the suggestion made by Skillz...
> Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process?
...but for a different reason. Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues even when the RAID array has sorted itself out! So having some down time (even prolonged) might not be that bad a thing provided it is flagged up in advance (as you did for the outage on 21st March)

I think quite a few projects have seen a rise in demand for work whilst World Community Grid is off the air - perhaps the pressure will ease a little anyway come mid-April (or thereabouts...)

Cheers - Al.

P.S. What happened to Dylan, the student who introduced himself back in January?
ID: 72207 · Rating: 0
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 501,672
RAC: 537
500 thousand credit badge
Message 72208 - Posted: 23 Mar 2022, 19:50:17 UTC

Thanks for the update, Tom!

Reminds me of the projects I worked on during my years as a grad, most of which were not related to my thesis.
ID: 72208 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 347
Credit: 77,353,048
RAC: 147,850
50 million credit badge · 3 year member badge
Message 72209 - Posted: 23 Mar 2022, 20:33:49 UTC - in response to Message 72207.  

> P.S. What happened to Dylan, the student who introduced himself back in January?


He is still working with us! He is working out the kinks after updating our experimental server build to the newest versions of Ubuntu and BOINC. Hopefully once we figure out the bugs, we can then update this server to the newest Ubuntu and BOINC versions.
ID: 72209 · Rating: 0
Kiska

Joined: 31 Mar 12
Posts: 55
Credit: 90,725,020
RAC: 627,181
50 million credit badge · 10 year member badge
Message 72210 - Posted: 23 Mar 2022, 20:37:13 UTC - in response to Message 72207.  

> I'd second the suggestion made by Skillz...
>> Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process?
> ...but for a different reason. Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues even when the RAID array has sorted itself out! So having some down time (even prolonged) might not be that bad a thing provided it is flagged up in advance (as you did for the outage on 21st March)


I third this: with the DB hammering the array, it's actually not doing much rebuilding, and with it being a RAID5, you need to get that redundancy back ASAP.
ID: 72210 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 347
Credit: 77,353,048
RAC: 147,850
50 million credit badge · 3 year member badge
Message 72211 - Posted: 23 Mar 2022, 21:26:09 UTC

> Seems to me you can just stop sending out work, collect all the work currently out, and rebuild the new hard drive much faster than trying to keep things going while it rebuilds.


> Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues


I have been regularly clearing the transitioner backlog, but it fills up very quickly with the large number of validated WUs coming back. I'm not sure that turning off the workunit generators will help (otherwise I'd just do that right now): even when there were no WUs in the "Tasks ready to send" queue, a large number of WUs waiting for validation were still piling up, and at that point people were also complaining that they couldn't get work.

I think I'll make a post stating that we will have some scheduled downtime over the weekend. During that downtime I'll turn the project off for a day or so, clear the transitioner backlog, and let some of the WUs waiting for validation trickle back in. Maybe some time with the DB service turned off would be good too, in order to rebuild the new drive.
ID: 72211 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 590
Credit: 474,436,984
RAC: 68,721
300 million credit badge · 11 year member badge · extraordinary contributions badge
Message 72214 - Posted: 23 Mar 2022, 21:57:45 UTC - in response to Message 72189.  



> Also, if you're using consumer SSDs, their rated endurance is quite low for a server environment. E.g., a 960GB WD enterprise SSD: https://www.newegg.com/western-digital-gold-960gb/p/20-250-139 vs. a 1TB consumer SSD: https://www.newegg.com/western-digital-1tb-black-sn850-nvme/p/N82E16820250161
> The 1TB consumer SSD has an endurance of 600TBW, while the enterprise drive is rated at 1.4PBW.

Actually, the latest Backblaze server storage failure statistics show that SSDs have a much lower average failure rate.
https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2021/#:~:text=The%20overall%20annualized%20failure%20rate,23%2C600%20drives%20over%20the%20period.
ID: 72214 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 590
Credit: 474,436,984
RAC: 68,721
300 million credit badge · 11 year member badge · extraordinary contributions badge
Message 72215 - Posted: 23 Mar 2022, 22:03:31 UTC
Last modified: 23 Mar 2022, 22:07:51 UTC

In the past 10 days, I have had a total of 3 days with reported validated statistics. The rest were 0 results.
Just shy of 7000 tasks have been awaiting validation since 7 March. That's over 2 weeks now. I'm used to my results validating on average in a day or so when the servers are running normally.
[Edit] Also, my stats haven't been updated properly for the past several days. Stats still show 89 tasks in progress while in fact I have zero in progress. I returned the last of the MW work several days ago and set NNT until the servers are working correctly again.
ID: 72215 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,640,434
RAC: 229,651
200 million credit badge · 10 year member badge
Message 72220 - Posted: 24 Mar 2022, 8:15:58 UTC - in response to Message 72209.  

>> P.S. What happened to Dylan, the student who introduced himself back in January?
> He is still working with us! He is working out the kinks after updating our experimental server build to the newest versions of Ubuntu and BOINC. Hopefully once we figure out the bugs, we can then update this server to the newest Ubuntu and BOINC versions.
Great news. BOINC is full of bugs; if you can get a later server version working, that would sort a lot of problems out.
ID: 72220 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,640,434
RAC: 229,651
200 million credit badge · 10 year member badge
Message 72221 - Posted: 24 Mar 2022, 8:33:20 UTC - in response to Message 72215.  

> In the past 10 days, I have had a total of 3 days with reported validated statistics. The rest were 0 results.
> Just shy of 7000 tasks have been awaiting validation since 7 March. That's over 2 weeks now. I'm used to my results validating on average in a day or so when the servers are running normally.
> [Edit] Also, my stats haven't been updated properly for the past several days. Stats still show 89 tasks in progress while in fact I have zero in progress. I returned the last of the MW work several days ago and set NNT until the servers are working correctly again.
Or look at it this way: do loads for MW now and get a huge lump of credits in a while.
ID: 72221 · Rating: 0

©2022 Astroinformatics Group