Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)

Author	Message
Keith Myers Send message Joined: 24 Jan 11 Posts: 739 Credit: 567,035,306 RAC: 33,923	Message 72402 - Posted: 1 Apr 2022, 3:52:29 UTC - in response to Message 72400. I still haven't gotten any WUs. Thanks Keith for the confirmation. Really? I guess it's easier to fulfil smaller work requests.
4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] Starting scheduler request
4/1/2022 2:37:30 PM | Milkyway@Home | Sending scheduler request: To report completed tasks.
4/1/2022 2:37:30 PM | Milkyway@Home | Reporting 1 completed tasks
4/1/2022 2:37:30 PM | Milkyway@Home | Requesting new tasks for CPU
4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] CPU work request: 75.25 seconds; 0.00 devices
4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
4/1/2022 2:37:33 PM | Milkyway@Home | Scheduler request completed: got 1 new tasks
4/1/2022 2:37:33 PM | Milkyway@Home | [sched_op] Server version 713
4/1/2022 2:37:33 PM | Milkyway@Home | Project requested delay of 91 seconds
I shouldn't have any issues since I only carry 30 tasks on any host at any time. Only this daily driver has recovered my 30 task cache currently. The other hosts are getting no work available responses. ID: 72402 · Rating: 0 · rate: / Reply Quote
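
Event-log lines of the shape quoted in Message 72402 are easy to pick apart when comparing what different hosts asked for. Here is a minimal Python sketch, assuming the `date time AM/PM | project | message` layout seen in that post and US-style month/day dates. The function names are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Assumed layout: "<m/d/yyyy h:mm:ss AM|PM> | <project> | <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)"
    r" \| (?P<project>[^|]+?) \| (?P<msg>.*)$"
)
# Matches the [sched_op] work-request lines, e.g. "CPU work request: 75.25 seconds; ..."
REQUEST_RE = re.compile(
    r"\[sched_op\] (?P<resource>.+?) work request: "
    r"(?P<seconds>[\d.]+) seconds; (?P<devices>[\d.]+) devices"
)


def parse_line(line):
    """Split one log line into (timestamp, project, message), or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match["ts"], "%m/%d/%Y %I:%M:%S %p")
    return timestamp, match["project"], match["msg"]


def work_requests(lines):
    """Return {resource: seconds of work requested} for every work-request line found."""
    requested = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        match = REQUEST_RE.search(parsed[2])
        if match:
            requested[match["resource"]] = float(match["seconds"])
    return requested


SAMPLE = [
    "4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] CPU work request: 75.25 seconds; 0.00 devices",
    "4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices",
    "4/1/2022 2:37:33 PM | Milkyway@Home | Scheduler request completed: got 1 new tasks",
]

if __name__ == "__main__":
    print(work_requests(SAMPLE))
```

For the sample lines, this shows a small CPU request and no GPU request, which fits the "smaller work requests" remark in the post.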

unixchick Send message Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0	Message 72403 - Posted: 1 Apr 2022, 5:12:10 UTC The status has finally updated, and I'm getting WUs again. Not sure what changed, but I'm happy I have a few WUs to work on. I'm running with no cache so I'm only asking for 1 to 4 WUs at a time, so I'm not asking for much. It has been a bit slow on some requests, but this was a longer dry period than I have had lately. ID: 72403 · Rating: 0 · rate: / Reply Quote

alanb1951 Send message Joined: 16 Mar 10 Posts: 218 Credit: 110,624,597 RAC: 1,723	Message 72404 - Posted: 1 Apr 2022, 6:01:00 UTC - in response to Message 72401. I am also seeing database corruption. I am now seeing timed out tasks that I never had received on any of my hosts during the server disk failure disruption. Keith, Those are almost certainly tasks that effectively got "orphaned" because of a network timeout happening at an extremely unlucky point during a download request. Whether that constitutes a database corruption is a matter of opinion :-) I had 64 on my Ryzen and 17 on my Intel, both for requests around 18:30..18:45 UTC on 20th March; in both cases there was a network timeout message in the BOINC log associated with a transfer requested at around that time. They are currently still sitting there marked In Progress although I can tell that I never received them (wasted a lot of time doing task name checks to check what had happened...) I hope they'll time out this afternoon, as that's when their due date is. And yes, I tried a Reset on one of the machines fairly soon after noticing the MW page thought I'd got more tasks than I actually had but, possibly because of other timing issues, it didn't try to re-send them. As it only seemed to get round to dealing with tasks that were scheduled for a retry around then, perhaps no surprise... (And, as I currently have quite a few jobs buffered, I'm not going to try another reset to see if it works now...) Cheers - Al. ID: 72404 · Rating: 0 · rate: / Reply Quote

Septimus Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,905,857 RAC: 0	Message 72405 - Posted: 1 Apr 2022, 6:10:27 UTC - in response to Message 72404. Every WU seems to be going into Validation Inconclusive, will see if I get credits for them before I do any more, some go back over 14 days, don't know who is getting credits, it's not me. ID: 72405 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 24 Jan 11 Posts: 739 Credit: 567,035,306 RAC: 33,923	Message 72406 - Posted: 1 Apr 2022, 6:41:04 UTC - in response to Message 72404. I am also seeing database corruption. I am now seeing timed out tasks that I never had received on any of my hosts during the server disk failure disruption. Keith, Those are almost certainly tasks that effectively got "orphaned" because of a network timeout happening at an extremely unlucky point during a download request. Whether that constitutes a database corruption is a matter of opinion :-) I had 64 on my Ryzen and 17 on my Intel, both for requests around 18:30..18:45 UTC on 20th March; in both cases there was a network timeout message in the BOINC log associated with a transfer requested at around that time. They are currently still sitting there marked In Progress although I can tell that I never received them (wasted a lot of time doing task name checks to check what had happened...) I hope they'll time out this afternoon, as that's when their due date is. And yes, I tried a Reset on one of the machines fairly soon after noticing the MW page thought I'd got more tasks than I actually had but, possibly because of other timing issues, it didn't try to re-send them. As it only seemed to get round to dealing with tasks that were scheduled for a retry around then, perhaps no surprise... (And, as I currently have quite a few jobs buffered, I'm not going to try another reset to see if it works now...) Cheers - Al. Ok, not corrupted in the traditional sense I guess. But the database was carrying 70 tasks assigned to me that I never received sent out on the 19th and 20th. They have all timed out now. They were not even showing in my account till the hard drive rebuild completed. I had zero tasks in progress since I had set NNT on all my hosts till the hard drive situation got sorted out and I could believe the project had finally returned to some semblance of normality. 
I just started asking for work today. ID: 72406 · Rating: 0 · rate: / Reply Quote

San-Fernando-Valley Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 72410 - Posted: 1 Apr 2022, 13:13:49 UTC - in response to Message 72397. @Kiska: Would you prefer searing white? Ok, you got me wondering which is better. Sorry about me griping! Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the infos? Clicking on it does not do anything. It is not an URL. To me this is a "black" image with some "little/small" information (in tiny different colors) on it. Or, now, a "white" one. So please lead me in the right direction. With thanks and appreciation. cheers ID: 72410 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 5 Jul 11 Posts: 993 Credit: 378,253,749 RAC: 17,935	Message 72411 - Posted: 1 Apr 2022, 14:36:35 UTC - in response to Message 72396. Last modified: 1 Apr 2022, 14:38:24 UTC Because they're far too slow for the amount of data shifted today. They're too slow for a desktop let alone a server with 1000s of users. There's a reason SiDock, Universe, Rosetta, etc, etc, all use them. They all said they're brilliant. Why would you want to wait for a head to physically move from one track to another? It's an absurd amount of time to wait when 300 people want access simultaneously. Have you not seen how long the Milkyway server took to rebuild a disk while trying to cater for users? It was ridiculous. SSD wouldn't have made that much of a difference during the rebuilding process. Tom took the project offline and it finished in around 12 hours. Now the spinners are handling things just fine. There is more to validating tasks besides just moving data around and the HDD array isn't the bottleneck. It's the bottleneck for rebuilding. It's clearly the slowest thing in his setup, since one missing drive caused all this in the first place. A degraded RAID should not be too slow to run the server. Universe@Home recently changed to SSD and the admin says it's marvellous. Rosetta has 72 SSDs! Why on earth would you think it's good to use something that's 50 times slower?! Something that holds up my desktop, never mind a server with 1000s of users! ID: 72411 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 5 Jul 11 Posts: 993 Credit: 378,253,749 RAC: 17,935	Message 72412 - Posted: 1 Apr 2022, 14:39:43 UTC - in response to Message 72397. Oh, OK. Well, that is also what the project status page says. No need for a very dark unreadable "picture"? Or am I missing something? Have you all a nice day ... Would you prefer searing white? Maybe they're using a CRT. ID: 72412 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 5 Jul 11 Posts: 993 Credit: 378,253,749 RAC: 17,935	Message 72413 - Posted: 1 Apr 2022, 14:41:40 UTC - in response to Message 72402. I shouldn't have any issues since I only carry 30 tasks on any host at any time. Even on a GPU? ID: 72413 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 5 Jul 11 Posts: 993 Credit: 378,253,749 RAC: 17,935	Message 72414 - Posted: 1 Apr 2022, 14:44:18 UTC - in response to Message 72410. Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the infos? Clicking on it does not do anything. It is not an URL. To me this is a "black" image with some "little/small" information (in tiny different colors) on it. Or, now, a "white" one. So please lead me in the right direction. With thanks and appreciation. cheers If I right click the image and open in new tab, it's opened as just the image, then I can zoom in and out with the browser. Or you can right click it and save it, then view it in a photo editor. ID: 72414 · Rating: 0 · rate: / Reply Quote

Kiska Send message Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0	Message 72415 - Posted: 1 Apr 2022, 15:23:03 UTC - in response to Message 72410. Last modified: 1 Apr 2022, 15:23:24 UTC Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the infos? Clicking on it does not do anything. It is not an URL. To me this is a "black" image with some "little/small" information (in tiny different colors) on it. Or, now, a "white" one. So please lead me in the right direction. With thanks and appreciation. cheers Searing white: https://grafana.kiska.pw/goto/O8WQ-Lsnz?orgId=1 Dark theme: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1 ID: 72415 · Rating: 0 · rate: / Reply Quote

San-Fernando-Valley Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 72416 - Posted: 1 Apr 2022, 16:08:32 UTC - in response to Message 72414. @Peter: Thanks, I have never used these "methods"! It is awful learning new things while not wanting to ... cheers ID: 72416 · Rating: 0 · rate: / Reply Quote

San-Fernando-Valley Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 72417 - Posted: 1 Apr 2022, 16:10:28 UTC - in response to Message 72415. @ Kiska: Searing white: https://grafana.kiska.pw/goto/O8WQ-Lsnz?orgId=1 Dark theme: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1 OK, well now I am smarter ... Thanks and have a nice weekend! ID: 72417 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 24 Jan 11 Posts: 739 Credit: 567,035,306 RAC: 33,923	Message 72420 - Posted: 1 Apr 2022, 17:34:13 UTC - in response to Message 72413. Last modified: 1 Apr 2022, 17:47:11 UTC I shouldn't have any issues since I only carry 30 tasks on any host at any time. Even on a GPU? Yes, why should you carry any more than what is immediately required with those projects that are stable in the server uptime and task generation. Up until just recently, I put Milkyway in that camp along with Einstein. Both projects normally run quite well on a send one in - get new one back strategy. I can't count 1 time that Einstein was ever out of work. Sets the benchmark as far as stable and dependable for BOINC projects. I run a custom team client that enables me to set a specific work unit target to be constantly maintained. project: http://milkyway.cs.rpi.edu/milkyway/ gpu_limit: 30 report_delay: 750 The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection. The delay is greater than the MW 600 second timeout period after depleting a cache that normal clients incur. The scheduler connection is the default 91 seconds. So it asks for new work every 91 seconds to top off my cache. I always have my 30 task count maintained. ID: 72420 · Rating: 0 · rate: / Reply Quote
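
The numbers in Message 72420 can be checked with a little arithmetic. The sketch below assumes the top-off request is simply the gap between the limit and the tasks on hand. How the custom client actually uses `gpu_limit` and `report_delay` internally is not described in the thread, and the function names here are hypothetical.

```python
# Values quoted in the post; everything else is an illustrative assumption.
GPU_LIMIT = 30          # target number of tasks to keep on hand
REPORT_DELAY = 750      # seconds, the custom client's report_delay
SERVER_TIMEOUT = 600    # seconds, the MW timeout mentioned in the post
SCHEDULER_DELAY = 91    # seconds, the project-requested delay between contacts


def tasks_to_request(in_progress, gpu_limit=GPU_LIMIT):
    """Assumed top-off rule: ask only for the tasks missing from the cache."""
    return max(0, gpu_limit - in_progress)


def contacts_during_delay(report_delay=REPORT_DELAY, scheduler_delay=SCHEDULER_DELAY):
    """Whole scheduler contacts that fit inside one report_delay window."""
    return report_delay // scheduler_delay


if __name__ == "__main__":
    print("report_delay exceeds the server timeout:", REPORT_DELAY > SERVER_TIMEOUT)
    print("scheduler contacts per report_delay window:", contacts_during_delay())
    print("tasks to request with 27 on hand:", tasks_to_request(27))
```

With a 91-second interval, several work requests fit inside each 750-second window, which is consistent with the cache staying topped up at 30.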

Mr P Hucker Send message Joined: 5 Jul 11 Posts: 993 Credit: 378,253,749 RAC: 17,935	Message 72422 - Posted: 1 Apr 2022, 18:37:08 UTC - in response to Message 72420. Last modified: 1 Apr 2022, 18:39:14 UTC project: http://milkyway.cs.rpi.edu/milkyway/ gpu_limit: 30 report_delay: 750 The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection. The delay is greater than the MW 600 second timeout period after depleting a cache that normal clients incur. The scheduler connection is the default 91 seconds. So it asks for new work every 91 seconds to top off my cache. I always have my 30 task count maintained. I can't find "report_delay" in the Boinc configuration files. Where do I put it? This is why I currently let it get the maximum 300 per GPU because there will be a 10 minute gap at the end. ID: 72422 · Rating: 0 · rate: / Reply Quote

Kiska Send message Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0	Message 72423 - Posted: 1 Apr 2022, 19:27:52 UTC And there goes all of the validation pending: Dashboard at: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1 ID: 72423 · Rating: 0 · rate: / Reply Quote

unixchick Send message Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0	Message 72424 - Posted: 1 Apr 2022, 20:17:40 UTC Thank you Kiska for a beautiful diagnostic display ! It is great news that validation pending is 0. The computer power (volunteers) is still steadily growing as the server can send out WUs reliably again. I'm getting WUs that were put in the queue late on March 24, so I'm not sure when the resends from the recent validation inconclusives will go out. Very nice to see the system recover. I hope Tom gets some time to not think about the system. ID: 72424 · Rating: 0 · rate: / Reply Quote

Kiska Send message Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0	Message 72425 - Posted: 1 Apr 2022, 20:29:37 UTC - in response to Message 72424. Thank you Kiska for a beautiful diagnostic display ! It is great news that validation pending is 0. The computer power (volunteers) is still steadily growing as the server can send out WUs reliably again. I'm getting WUs that were put in the queue late on March 24, so I'm not sure when the resends from the recent validation inconclusives will go out. Very nice to see the system recover. I hope Tom gets some time to not think about the system. Probably a while more, I am getting tasks generated on the 16th of March :D ID: 72425 · Rating: 0 · rate: / Reply Quote

Kiska Send message Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0	Message 72426 - Posted: 1 Apr 2022, 20:32:22 UTC Last modified: 1 Apr 2022, 20:33:37 UTC This is a note for Tom only: Do NOT turn the work generators (each subproject) back on until ready to send is under 500k; if you do, you'll probably end up choking the database again. A note for your CS department: you need to get them to implement some function to pause generation of work when the transitioner backlog or ready to send buffers are above a certain limit. ID: 72426 · Rating: 0 · rate: / Reply Quote
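
The kind of guard suggested in Message 72426 might look roughly like this. It is only a sketch: the 500k ready-to-send ceiling is the figure from the post, while the backlog threshold, the `ServerStatus` fields and the function name are assumptions for illustration, not anything taken from the BOINC server code.

```python
from dataclasses import dataclass

READY_TO_SEND_LIMIT = 500_000      # figure from the post
BACKLOG_LIMIT_HOURS = 1.0          # hypothetical threshold, pick to suit the server


@dataclass
class ServerStatus:
    """Hypothetical snapshot of the two queues the post mentions."""
    ready_to_send: int
    transitioner_backlog_hours: float


def should_generate_work(status,
                         max_ready_to_send=READY_TO_SEND_LIMIT,
                         max_backlog_hours=BACKLOG_LIMIT_HOURS):
    """Allow work generation only while both queues are under their limits."""
    return (status.ready_to_send < max_ready_to_send
            and status.transitioner_backlog_hours < max_backlog_hours)


if __name__ == "__main__":
    for status in (ServerStatus(420_000, 0.0),
                   ServerStatus(650_000, 0.0),
                   ServerStatus(100_000, 5.0)):
        print(status, "->", "generate" if should_generate_work(status) else "pause")
```

A work generator could call a check like this before each batch and sleep when it returns False, so the buffers drain before more work is queued.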

Mr P Hucker Send message Joined: 5 Jul 11 Posts: 993 Credit: 378,253,749 RAC: 17,935	Message 72427 - Posted: 1 Apr 2022, 20:33:01 UTC - in response to Message 72425. Probably a while more, I am getting tasks generated on the 16th of March :D Doesn't that mean the server has caught up, but we haven't? ID: 72427 · Rating: 0 · rate: / Reply Quote