Welcome to MilkyWay@home

Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 556,390,875
RAC: 53,914
Message 72402 - Posted: 1 Apr 2022, 3:52:29 UTC - in response to Message 72400.  

I still haven't gotten any WUs. Thanks Keith for the confirmation.


Really? I guess it's easier to fulfil smaller work requests.

4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] Starting scheduler request
4/1/2022 2:37:30 PM | Milkyway@Home | Sending scheduler request: To report completed tasks.
4/1/2022 2:37:30 PM | Milkyway@Home | Reporting 1 completed tasks
4/1/2022 2:37:30 PM | Milkyway@Home | Requesting new tasks for CPU
4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] CPU work request: 75.25 seconds; 0.00 devices
4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
4/1/2022 2:37:33 PM | Milkyway@Home | Scheduler request completed: got 1 new tasks
4/1/2022 2:37:33 PM | Milkyway@Home | [sched_op] Server version 713
4/1/2022 2:37:33 PM | Milkyway@Home | Project requested delay of 91 seconds

I shouldn't have any issues since I only carry 30 tasks on any host at any time. Only this daily driver has recovered my 30 task cache currently. The other hosts are getting no work available responses.
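
For anyone who wants to compare what was requested with what was received, the event log lines quoted above can be summarized with a short script. This is only a sketch that assumes the log keeps the exact wording shown in this post; the function name `summarize_scheduler_log` and the variable `sample_log` are made up for illustration and are not part of BOINC.

```python
import re

# Matches e.g. "[sched_op] CPU work request: 75.25 seconds; 0.00 devices"
WORK_REQUEST = re.compile(
    r"\[sched_op\] (?P<resource>.+?) work request: "
    r"(?P<seconds>[\d.]+) seconds; (?P<devices>[\d.]+) devices"
)
# Matches e.g. "Scheduler request completed: got 1 new tasks"
TASKS_RECEIVED = re.compile(
    r"Scheduler request completed: got (?P<count>\d+) new tasks"
)


def summarize_scheduler_log(lines):
    """Return (seconds requested per resource, number of tasks received)."""
    requested = {}
    received = 0
    for line in lines:
        # The message is the last " | "-separated field of each log line.
        message = line.split(" | ")[-1]
        match = WORK_REQUEST.search(message)
        if match:
            requested[match["resource"]] = float(match["seconds"])
            continue
        match = TASKS_RECEIVED.search(message)
        if match:
            received += int(match["count"])
    return requested, received


# Lines copied from the log excerpt in this post.
sample_log = [
    "4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] CPU work request: 75.25 seconds; 0.00 devices",
    "4/1/2022 2:37:30 PM | Milkyway@Home | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices",
    "4/1/2022 2:37:33 PM | Milkyway@Home | Scheduler request completed: got 1 new tasks",
]
requested, received = summarize_scheduler_log(sample_log)
print(requested, received)
```

Here the request was for about 75 seconds of CPU work and nothing for the GPU, and the scheduler answered with one task.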
ID: 72402 · Rating: 0
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 817,008
RAC: 0
Message 72403 - Posted: 1 Apr 2022, 5:12:10 UTC

The status has finally updated, and I'm getting WUs again. Not sure what changed, but I'm happy I have a few WUs to work on.

I'm running with no cache so I'm only asking for 1 to 4 WUs at a time, so I'm not asking for much. It has been a bit slow on some requests, but this was a longer dry period than I have had lately.
ID: 72403 · Rating: 0
alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 108,528,825
RAC: 13,125
Message 72404 - Posted: 1 Apr 2022, 6:01:00 UTC - in response to Message 72401.  

I am also seeing database corruption. I am now seeing timed out tasks that I never had received on any of my hosts during the server disk failure disruption.

Keith,

Those are almost certainly tasks that effectively got "orphaned" because of a network timeout happening at an extremely unlucky point during a download request. Whether that constitutes a database corruption is a matter of opinion :-)

I had 64 on my Ryzen and 17 on my Intel, both for requests around 18:30..18:45 UTC on 20th March; in both cases there was a network timeout message in the BOINC log associated with a transfer requested at around that time. They are currently still sitting there marked In Progress although I can tell that I never received them (wasted a lot of time doing task name checks to check what had happened...) I hope they'll time out this afternoon, as that's when their due date is.

And yes, I tried a Reset on one of the machines fairly soon after noticing the MW page thought I'd got more tasks than I actually had but, possibly because of other timing issues, it didn't try to re-send them. As it only seemed to get round to dealing with tasks that were scheduled for a retry around then, perhaps no surprise... (And, as I currently have quite a few jobs buffered, I'm not going to try another reset to see if it works now...)

Cheers - Al.
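
The task name check mentioned above boils down to a set difference: names the project website lists as In Progress, minus names the local client actually holds. A minimal sketch, assuming both lists have been copied out by hand; the function name and the task names below are hypothetical and only there for illustration.

```python
def find_orphaned_tasks(server_in_progress, client_tasks):
    """Return task names the server lists as in progress but the client never received."""
    return sorted(set(server_in_progress) - set(client_tasks))


# Hypothetical task names, for illustration only.
server_side = ["task_a", "task_b", "task_c"]
client_side = ["task_b"]

orphans = find_orphaned_tasks(server_side, client_side)
print(orphans)
```

Anything left in `orphans` is a task the server believes was sent but that never arrived on the host.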
ID: 72404 · Rating: 0
Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,900,464
RAC: 0
Message 72405 - Posted: 1 Apr 2022, 6:10:27 UTC - in response to Message 72404.  

Every WU seems to be going into Validation Inconclusive. I will see if I get credits for them before I do any more. Some go back over 14 days; I don’t know who is getting the credits, but it’s not me.
ID: 72405 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 556,390,875
RAC: 53,914
Message 72406 - Posted: 1 Apr 2022, 6:41:04 UTC - in response to Message 72404.  

I am also seeing database corruption. I am now seeing timed out tasks that I never had received on any of my hosts during the server disk failure disruption.

Keith,

Those are almost certainly tasks that effectively got "orphaned" because of a network timeout happening at an extremely unlucky point during a download request. Whether that constitutes a database corruption is a matter of opinion :-)

I had 64 on my Ryzen and 17 on my Intel, both for requests around 18:30..18:45 UTC on 20th March; in both cases there was a network timeout message in the BOINC log associated with a transfer requested at around that time. They are currently still sitting there marked In Progress although I can tell that I never received them (wasted a lot of time doing task name checks to check what had happened...) I hope they'll time out this afternoon, as that's when their due date is.

And yes, I tried a Reset on one of the machines fairly soon after noticing the MW page thought I'd got more tasks than I actually had but, possibly because of other timing issues, it didn't try to re-send them. As it only seemed to get round to dealing with tasks that were scheduled for a retry around then, perhaps no surprise... (And, as I currently have quite a few jobs buffered, I'm not going to try another reset to see if it works now...)

Cheers - Al.

OK, not corrupted in the traditional sense, I guess. But the database was carrying 70 tasks assigned to me, sent out on the 19th and 20th, that I never received. They have all timed out now.
They were not even showing in my account till the hard drive rebuild completed. I had zero tasks in progress since I had set NNT on all my hosts till the hard drive situation got sorted out and I could believe the project had finally returned to some semblance of normality. I just started asking for work today.
ID: 72406 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72410 - Posted: 1 Apr 2022, 13:13:49 UTC - in response to Message 72397.  

@Kiska:


Would you prefer searing white?

Ok, you got me wondering which is better.
Sorry about me griping!
Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the infos?
Clicking on it does not do anything.
It is not an URL.

To me this is a "black" image with some "little/small" information (in tiny different colors) on it.
Or, now, a "white" one.

So please lead me in the right direction.
With thanks and appreciation.
cheers
ID: 72410 · Rating: 0
Mr P Hucker
Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 72411 - Posted: 1 Apr 2022, 14:36:35 UTC - in response to Message 72396.  
Last modified: 1 Apr 2022, 14:38:24 UTC

Because they're far too slow for the amount of data shifted today. They're too slow for a desktop let alone a server with 1000s of users. There's a reason SiDock, Universe, Rosetta, etc, etc, all use them. They all said they're brilliant. Why would you want to wait for a head to physically move from one track to another? It's an absurd amount of time to wait when 300 people want access simultaneously. Have you not seen how long the Milkyway server took to rebuild a disk while trying to cater for users? It was ridiculous.


SSD wouldn't have made that much of a difference during the rebuilding process. Tom took the project offline and it finished in around 12 hours. Now the spinners are handling things just fine. There is more to validating tasks besides just moving data around and the HDD array isn't the bottleneck.
It's the bottleneck for rebuilding. It's clearly the slowest thing in his setup, since one missing drive caused all this in the first place. A degraded RAID should not be too slow to run the server. Universe@Home recently changed to SSD and the admin says it's marvellous. Rosetta has 72 SSDs! Why on earth would you think it's good to use something that's 50 times slower?! Something that holds up my desktop, never mind a server with 1000s of users!
ID: 72411 · Rating: 0
Mr P Hucker
Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 72412 - Posted: 1 Apr 2022, 14:39:43 UTC - in response to Message 72397.  

Oh, OK.
Well, that is also what the project status page says.
No need for a very dark unreadable "picture"?
Or am I missing something?

Have you all a nice day ...
Would you prefer searing white?
Maybe they're using a CRT.
ID: 72412 · Rating: 0
Mr P Hucker
Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 72413 - Posted: 1 Apr 2022, 14:41:40 UTC - in response to Message 72402.  

I shouldn't have any issues since I only carry 30 tasks on any host at any time.
Even on a GPU?
ID: 72413 · Rating: 0
Mr P Hucker
Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 72414 - Posted: 1 Apr 2022, 14:44:18 UTC - in response to Message 72410.  

Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the infos?
Clicking on it does not do anything.
It is not an URL.

To me this is a "black" image with some "little/small" information (in tiny different colors) on it.
Or, now, a "white" one.

So please lead me in the right direction.
With thanks and appreciation.
cheers
If I right click the image and open in new tab, it's opened as just the image, then I can zoom in and out with the browser. Or you can right click it and save it, then view it in a photo editor.
ID: 72414 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 96
Credit: 152,502,225
RAC: 1
Message 72415 - Posted: 1 Apr 2022, 15:23:03 UTC - in response to Message 72410.  
Last modified: 1 Apr 2022, 15:23:24 UTC


Well, maybe I am a little dumb (or more), but how can I "enlarge" the image you sent, so that I can read the infos?
Clicking on it does not do anything.
It is not an URL.

To me this is a "black" image with some "little/small" information (in tiny different colors) on it.
Or, now, a "white" one.

So please lead me in the right direction.
With thanks and appreciation.
cheers


Searing white: https://grafana.kiska.pw/goto/O8WQ-Lsnz?orgId=1
Dark theme: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1
ID: 72415 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72416 - Posted: 1 Apr 2022, 16:08:32 UTC - in response to Message 72414.  

@Peter:

Thanks, I have never used these "methods"!
It is awful learning new things while not wanting to ...
cheers
ID: 72416 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72417 - Posted: 1 Apr 2022, 16:10:28 UTC - in response to Message 72415.  

@ Kiska:

Searing white: https://grafana.kiska.pw/goto/O8WQ-Lsnz?orgId=1
Dark theme: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1

OK, well now I am smarter ...
Thanks and have a nice weekend!
ID: 72417 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 556,390,875
RAC: 53,914
Message 72420 - Posted: 1 Apr 2022, 17:34:13 UTC - in response to Message 72413.  
Last modified: 1 Apr 2022, 17:47:11 UTC

I shouldn't have any issues since I only carry 30 tasks on any host at any time.
Even on a GPU?

Yes. Why should you carry any more than what is immediately required from projects that are stable in server uptime and task generation?
Up until just recently, I put Milkyway in that camp along with Einstein. Both projects normally run quite well on a send-one-in, get-a-new-one-back strategy.
I can't recall a single time that Einstein was out of work. It sets the benchmark for stable and dependable BOINC projects.
I run a custom team client that enables me to set a specific work unit target to be constantly maintained.

project: http://milkyway.cs.rpi.edu/milkyway/
gpu_limit: 30
report_delay: 750

The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection.
The delay is greater than the 600-second timeout that normal clients incur on MW after depleting their cache.
The scheduler request interval is the default 91 seconds, so the client asks for new work every 91 seconds to top off my cache.
I always have my 30 task count maintained.
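
To make the timing argument above concrete, here is a rough sketch under a simplified model: the client contacts the scheduler every 91 seconds and holds completed results back until at least 750 seconds have passed since the last report. This is only an assumption about how such a client might behave, not a description of the actual custom client; `classify_requests` and its parameter names are hypothetical.

```python
def classify_requests(total_seconds, request_interval=91, report_delay=750):
    """Count scheduler requests that only ask for work vs. those that also report.

    Simplified model: a request is made every `request_interval` seconds, and
    completed results are reported only once `report_delay` seconds have
    passed since the previous report.
    """
    work_only = 0
    with_report = 0
    last_report = 0
    for t in range(request_interval, total_seconds + 1, request_interval):
        if t - last_report >= report_delay:
            with_report += 1
            last_report = t
        else:
            work_only += 1
    return work_only, with_report


# One hour of scheduler contacts with the values from this post.
work_only, with_report = classify_requests(3600)
print(work_only, with_report)
```

Under this model only a handful of the hourly contacts carry a report, so most of them are plain work requests that the server is free to fill.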
ID: 72420 · Rating: 0
Mr P Hucker
Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 72422 - Posted: 1 Apr 2022, 18:37:08 UTC - in response to Message 72420.  
Last modified: 1 Apr 2022, 18:39:14 UTC

project: http://milkyway.cs.rpi.edu/milkyway/
gpu_limit: 30
report_delay: 750

The report_delay gets around the MW server misconfiguration where you are unable to request work if reporting work in the same scheduler connection.
The delay is greater than the MW 600 second timeout period after depleting a cache that normal clients incur.
The scheduler connection is the default 91 seconds. So it asks for new work every 91 seconds to top off my cache.
I always have my 30 task count maintained.
I can't find "report_delay" in the BOINC configuration files. Where do I put it? This is why I currently let it get the maximum of 300 per GPU: there will be a 10-minute gap at the end.
ID: 72422 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 96
Credit: 152,502,225
RAC: 1
Message 72423 - Posted: 1 Apr 2022, 19:27:52 UTC

And there goes all of the validation pending:
Dashboard at: https://grafana.kiska.pw/goto/Urg_aLynz?orgId=1
ID: 72423 · Rating: 0
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 817,008
RAC: 0
Message 72424 - Posted: 1 Apr 2022, 20:17:40 UTC

Thank you Kiska for a beautiful diagnostic display !

It is great news that validation pending is 0. The computer power (volunteers) is still steadily growing as the server can send out WUs reliably again.
I'm getting WUs that were put in the queue late on March 24, so I'm not sure when the resends from the recent validation inconclusives will go out.

Very nice to see the system recover. I hope Tom gets some time to not think about the system.
ID: 72424 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 96
Credit: 152,502,225
RAC: 1
Message 72425 - Posted: 1 Apr 2022, 20:29:37 UTC - in response to Message 72424.  

Thank you Kiska for a beautiful diagnostic display !

It is great news that validation pending is 0. The computer power (volunteers) is still steadily growing as the server can send out WUs reliably again.
I'm getting WUs that were put in the queue late on March 24, so I'm not sure when the resends from the recent validation inconclusives will go out.

Very nice to see the system recover. I hope Tom gets some time to not think about the system.


Probably a while longer; I am getting tasks generated on the 16th of March :D
ID: 72425 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 96
Credit: 152,502,225
RAC: 1
Message 72426 - Posted: 1 Apr 2022, 20:32:22 UTC
Last modified: 1 Apr 2022, 20:33:37 UTC

This is a note for Tom only:

Do NOT turn the work generators (for each subproject) back on until ready to send is under 500k. If you do, you'll probably end up choking the database again.

A note for your CS department: you need to get them to implement some function to pause generation of work when the transitioner backlog or the ready-to-send buffer is above a certain limit.
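
A minimal sketch of what such a pause function could look like, purely as an illustration. Only the 500k ready-to-send figure comes from this post; the class name, the backlog limit, and the resume fraction are hypothetical, and nothing here reflects how the real work generators are written.

```python
from dataclasses import dataclass


@dataclass
class GeneratorThrottle:
    """Pause work generation above a limit and resume only well below it."""

    ready_to_send_limit: int = 500_000  # figure taken from this post
    backlog_limit: int = 10_000         # hypothetical value
    resume_fraction: float = 0.8        # hypothetical hysteresis margin
    paused: bool = False

    def update(self, ready_to_send, transitioner_backlog):
        """Return True if the work generators may run for this cycle."""
        over = (
            ready_to_send > self.ready_to_send_limit
            or transitioner_backlog > self.backlog_limit
        )
        clear = (
            ready_to_send < self.ready_to_send_limit * self.resume_fraction
            and transitioner_backlog < self.backlog_limit * self.resume_fraction
        )
        if over:
            self.paused = True
        elif clear:
            self.paused = False
        # Between the two thresholds the previous state is kept.
        return not self.paused


throttle = GeneratorThrottle()
first = throttle.update(600_000, 0)   # over the limit, so generation pauses
second = throttle.update(450_000, 0)  # below the limit but not yet clear
third = throttle.update(300_000, 0)   # well below the limit, so it resumes
print(first, second, third)
```

The gap between the pause and resume thresholds keeps the generators from flapping on and off while the queue hovers around the limit.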
ID: 72426 · Rating: 0
Mr P Hucker
Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 72427 - Posted: 1 Apr 2022, 20:33:01 UTC - in response to Message 72425.  

Probably a while more, I am getting tasks generated on the 16th of March :D
Doesn't that mean the server has caught up, but we haven't?
ID: 72427 · Rating: 0