Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)
Author | Message |
---|---|
Joined: 24 Jan 11 Posts: 715 Credit: 556,390,875 RAC: 53,914 |
"I still haven't gotten any WUs. Thanks Keith for the confirmation." I shouldn't have any issues since I only carry 30 tasks on any host at any time. So far only this daily driver has recovered its 30-task cache. The other hosts are getting "no work available" responses. |
Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0 |
The status has finally updated, and I'm getting WUs again. Not sure what changed, but I'm happy I have a few WUs to work on. I'm running with no cache so I'm only asking for 1 to 4 WUs at a time, so I'm not asking for much. It has been a bit slow on some requests, but this was a longer dry period than I have had lately. |
Joined: 16 Mar 10 Posts: 213 Credit: 108,528,825 RAC: 13,125 |
"I am also seeing database corruption. I am now seeing timed out tasks that I never had received on any of my hosts during the server disk failure disruption." Keith, those are almost certainly tasks that effectively got "orphaned" because a network timeout happened at an extremely unlucky point during a download request. Whether that constitutes database corruption is a matter of opinion :-) I had 64 on my Ryzen and 17 on my Intel, both for requests around 18:30 to 18:45 UTC on 20th March; in both cases there was a network timeout message in the BOINC log associated with a transfer requested at around that time. They are currently still sitting there marked In Progress, although I can tell that I never received them (I wasted a lot of time doing task name checks to work out what had happened...). I hope they'll time out this afternoon, as that's when their due date is. And yes, I tried a Reset on one of the machines fairly soon after noticing the MW page thought I'd got more tasks than I actually had but, possibly because of other timing issues, it didn't try to re-send them. As it only seemed to get round to dealing with tasks that were scheduled for a retry around then, perhaps no surprise... (And, as I currently have quite a few jobs buffered, I'm not going to try another reset to see if it works now...) Cheers - Al. |
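For anyone else doing the same task name checks by hand, here is a minimal sketch of the comparison. It assumes you have already copied the task names from the project's task page and from your local BOINC Manager into two lists; the function name and the example task names are hypothetical and purely for illustration.

```python
def find_orphaned_tasks(server_in_progress, local_tasks):
    """Return task names the server lists as in progress but the host does not hold."""
    # A set difference keeps only names that appear on the server side alone.
    return sorted(set(server_in_progress) - set(local_tasks))


if __name__ == "__main__":
    # Hypothetical task names, for illustration only.
    server_side = ["task_0001", "task_0002", "task_0003"]
    on_host = ["task_0002"]
    for name in find_orphaned_tasks(server_side, on_host):
        print("never received:", name)
```

Anything it prints is a candidate "orphan": assigned to the host on the server but never downloaded.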
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Every WU seems to be going into Validation Inconclusive. I will see if I get credits for them before I do any more. Some go back over 14 days. I don't know who is getting the credits, but it's not me. |
Joined: 24 Jan 11 Posts: 715 Credit: 556,390,875 RAC: 53,914 |
"I am also seeing database corruption. I am now seeing timed out tasks that I never had received on any of my hosts during the server disk failure disruption." OK, not corrupted in the traditional sense, I guess. But the database was carrying 70 tasks assigned to me, sent out on the 19th and 20th, that I never received. They have all timed out now. They were not even showing in my account until the hard drive rebuild completed. I had zero tasks in progress, since I had set NNT (No New Tasks) on all my hosts until the hard drive situation got sorted out and I could believe the project had finally returned to some semblance of normality. I just started asking for work today. |
Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
@Kiska:
Ok, you got me wondering which is better. Sorry about me griping! Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the information? Clicking on it does not do anything. It is not a URL. To me this is a "black" image with some "little/small" information (in tiny, different colors) on it. Or, now, a "white" one. So please lead me in the right direction. With thanks and appreciation. cheers |
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
It's the bottleneck for rebuilding. It's clearly the slowest thing in his setup, since one missing drive caused all this in the first place. A degraded RAID should not be too slow to run the server. Universe@Home recently changed to SSD and the admin says it's marvellous. Rosetta has 72 SSDs! Why on earth would you think it's good to use something that's 50 times slower?! Something that holds up my desktop, never mind a server with 1000s of users! Because they're far too slow for the amount of data shifted today. They're too slow for a desktop, let alone a server with 1000s of users. There's a reason SiDock, Universe, Rosetta, etc. all use them. They all said they're brilliant. Why would you want to wait for a head to physically move from one track to another? It's an absurd amount of time to wait when 300 people want access simultaneously. Have you not seen how long the Milkyway server took to rebuild a disk while trying to cater for users? It was ridiculous. |
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
Maybe they're using a CRT. Oh, OK. Would you prefer searing white? |
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
"I shouldn't have any issues since I only carry 30 tasks on any host at any time." Even on a GPU? |
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
"Well, maybe I am a little dumb (or more), but how can I 'enlarge' the image you sent, so that I can read the infos?" If I right-click the image and open it in a new tab, it opens as just the image, and then I can zoom in and out with the browser. Or you can right-click it and save it, then view it in a photo editor. |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 1 |
Searing white: https://grafana.kiska.pw/goto/O8WQ-Lsnz?orgId=1 Dark theme: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1 |
Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
@Peter: Thanks, I have never used these "methods"! It is awful learning new things while not wanting to ... cheers |
Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
@Kiska:
OK, well now I am smarter ... Thanks and have a nice weekend! |
Joined: 24 Jan 11 Posts: 715 Credit: 556,390,875 RAC: 53,914 |
"I shouldn't have any issues since I only carry 30 tasks on any host at any time." "Even on a GPU?" Yes. Why should you carry any more than what is immediately required with those projects that are stable in server uptime and task generation? Up until just recently, I put Milkyway in that camp along with Einstein. Both projects normally run quite well on a send-one-in, get-a-new-one-back strategy. I can't recall a single time that Einstein was ever out of work. It sets the benchmark for stable and dependable BOINC projects. I run a custom team client that enables me to set a specific work unit target to be constantly maintained, with these settings: project: http://milkyway.cs.rpi.edu/milkyway/ ; gpu_limit: 30 ; report_delay: 750. The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection. The delay is greater than the 600-second MW timeout period that normal clients incur after depleting a cache. The scheduler connection interval is the default 91 seconds, so it asks for new work every 91 seconds to top off my cache. I always have my 30-task count maintained. |
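To illustrate the idea, here is a rough sketch of how such a top-off rule could work. This is not the custom client's actual code, only one possible reading of the description above: the function and class names are hypothetical, and the rule "never ask for work on a contact that reports results" is an assumption based on the server behaviour described in the post. Only the values 30 and 750 come from the post.

```python
from dataclasses import dataclass


@dataclass
class ContactPlan:
    """What a single scheduler contact will do (hypothetical structure)."""
    report_completed: bool
    request_tasks: int


def plan_contact(now_s, last_report_s, in_progress, completed_unreported,
                 gpu_limit=30, report_delay_s=750):
    """Decide what to do on one scheduler contact.

    Assumption: reporting and requesting are kept in separate contacts,
    because the server is said to give no work when results are reported
    in the same connection.
    """
    report_due = (completed_unreported > 0
                  and now_s - last_report_s >= report_delay_s)
    if report_due:
        # Report-only contact: hand in finished work, ask for nothing.
        return ContactPlan(report_completed=True, request_tasks=0)
    # Request-only contact: top the cache back up to the target.
    return ContactPlan(report_completed=False,
                       request_tasks=max(0, gpu_limit - in_progress))


if __name__ == "__main__":
    # 91 s after the last report, 28 tasks still running, 2 finished.
    print(plan_contact(now_s=91, last_report_s=0,
                       in_progress=28, completed_unreported=2))
```

With a contact every 91 seconds, most contacts would be request-only top-offs, and a report would go out roughly once the 750-second delay has passed.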
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
"project: http://milkyway.cs.rpi.edu/milkyway/" I can't find "report_delay" in the BOINC configuration files. Where do I put it? This is why I currently let it get the maximum 300 per GPU, because there will be a 10 minute gap at the end. |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 1 |
And there goes all of the validation pending: Dashboard at: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1 |
Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0 |
Thank you Kiska for a beautiful diagnostic display! It is great news that validation pending is 0. The computing power (volunteers) is still steadily growing now that the server can send out WUs reliably again. I'm getting WUs that were put in the queue late on March 24, so I'm not sure when the resends from the recent validation inconclusives will go out. Very nice to see the system recover. I hope Tom gets some time to not think about the system. |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 1 |
"Thank you Kiska for a beautiful diagnostic display!" Probably a while more; I am getting tasks generated on the 16th of March :D |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 1 |
This is a note for Tom only: Do NOT turn the work generators (each subproject) back on until ready to send is under 500k; if you do, you'll probably end up choking the database again. A note for your CS department: you need to get them to implement some function to pause generation of work when the transitioner backlog or ready-to-send buffers are above a certain limit. |
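As a sketch of what such a pause function might look like: the 500k figure is the one from the note above, while every other name and threshold here is an assumption made for illustration, and this is not how the project's work generators are actually written. The resume and pause limits are kept apart so generation does not flap on and off around a single value.

```python
from dataclasses import dataclass


@dataclass
class ServerStatus:
    """Snapshot of the two backlog figures (hypothetical container)."""
    ready_to_send: int             # results waiting to be sent to hosts
    transitioner_backlog_s: float  # transitioner backlog, in seconds


READY_TO_SEND_RESUME = 500_000       # from the post: stay off until below this
READY_TO_SEND_PAUSE = 1_000_000      # assumed upper limit, not from the post
TRANSITIONER_BACKLOG_MAX_S = 3600.0  # assumed limit, not from the post


def generation_allowed(status, currently_generating):
    """Return True if the work generators may run for the next cycle."""
    if status.transitioner_backlog_s > TRANSITIONER_BACKLOG_MAX_S:
        # The transitioner is behind: adding work would only deepen the backlog.
        return False
    if currently_generating:
        # Already running: keep going until the upper limit is reached.
        return status.ready_to_send < READY_TO_SEND_PAUSE
    # Currently paused: only resume once the buffer has drained below 500k.
    return status.ready_to_send < READY_TO_SEND_RESUME


if __name__ == "__main__":
    status = ServerStatus(ready_to_send=600_000, transitioner_backlog_s=0.0)
    print(generation_allowed(status, currently_generating=False))
```

A generator loop could call this once per cycle and simply sleep when it returns False.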
Joined: 5 Jul 11 Posts: 990 Credit: 376,143,149 RAC: 0 |
"Probably a while more, I am getting tasks generated on the 16th of March :D" Doesn't that mean the server has caught up, but we haven't? |
©2024 Astroinformatics Group