Server Downtime 3/21 1PM EST
Author | Message |
---|---|
Joined: 16 Mar 10 Posts: 213 Credit: 108,994,427 RAC: 29,415 |
Tom, thanks for those responses above - hopefully they reach the audience that needs to read them! :-) I'd second the suggestion made by Skillz... "Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process?" ...but for a different reason. Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues even when the RAID array has sorted itself out! So having some down time (even prolonged) might not be that bad a thing, provided it is flagged up in advance (as you did for the outage on 21st March). I think quite a few projects have seen a rise in demand for work whilst World Community Grid is off the air - perhaps the pressure will ease a little anyway come mid-April (or thereabouts...). Cheers - Al. P.S. What happened to Dylan, the student who introduced himself back in January? |
Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0 |
Thanks for the update, Tom! Reminds me of the projects I worked on during my years as a grad, most of which were not related to my thesis. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
"P.S. What happened to Dylan, the student who introduced himself back in January?" He is still working with us! He is working out the kinks after updating our experimental server build to the newest versions of Ubuntu and BOINC. Hopefully, once we figure out the bugs, we can then update this server to the newest Ubuntu and BOINC versions. |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0 |
"I'd second the suggestion made by Skillz... 'Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process?' ...but for a different reason. Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues even when the RAID array has sorted itself out! So having some down time (even prolonged) might not be that bad a thing provided it is flagged up in advance (as you did for the outage on 21st March)" I third this, since with the DB hammering the array, it's actually not doing much rebuilding, and with it being a RAID5, you need to get that redundancy back ASAP. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
"Seems to me you can just stop sending out work, collect all the work currently out, and rebuild the new hard drive much faster than trying to keep things going while it rebuilds." "Clearing the transitioner backlog should (I believe) be done with the project shut down, and I suspect that until you can get the transitioner backlog down to minutes rather than hours there will continue to be issues" I have been regularly clearing the transitioner backlog, but it fills up very quickly with the large number of validated WUs coming back. I'm not sure that turning off the workunit generators will help (otherwise I'd just do that right now), because when there were no WUs in the "Tasks ready to send" queue, we still had a large number of WUs waiting for validation piling up. People were also complaining that they couldn't get work at that point. I think I'll make a post stating that we will have some scheduled downtime over the weekend. During that downtime I'll turn the project off for a day or so, clear the transitioner backlog, and let some of the WUs waiting for validation trickle back in. Maybe some time with the DB service turned off would be good too, in order to rebuild the new drive. |
Joined: 24 Jan 11 Posts: 715 Credit: 556,874,007 RAC: 43,216 |
Actually, the latest Backblaze server storage failure statistics show that SSDs have a much lower average failure rate. https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2021/#:~:text=The%20overall%20annualized%20failure%20rate,23%2C600%20drives%20over%20the%20period. |
Joined: 24 Jan 11 Posts: 715 Credit: 556,874,007 RAC: 43,216 |
In the past 10 days, I have had a total of 3 days with reported validated statistics. The rest were 0 results. Just shy of 7000 tasks have been awaiting validation since 7 March. That's over 3 weeks now. I'm used to my results validating on average in a day or so when the servers are running normally. [Edit] Also, my stats haven't been updated properly for the past several days. Stats still show 89 tasks in progress while in fact I have zero in progress. I returned the last of the MW work several days ago and set NNT until the servers are working correctly again. |
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
"P.S. What happened to Dylan, the student who introduced himself back in January?" "He is still working with us! He is working out the kinks after updating our experimental server build to the newest versions of Ubuntu and BOINC. Hopefully once we figure out the bugs, we can then update this server to the newest Ubuntu and BOINC versions." Great news. BOINC is full of bugs; if you can get a later server version working, that would sort a lot of problems out. |
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
"In the past 10 days, I have had a total of 3 days with reported validated statistics. The rest were 0 results." Or look at it this way: do loads for MW now, and get a huge lump of credits in a while. |
©2024 Astroinformatics Group