Welcome to MilkyWay@home

Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)


San-Fernando-Valley

Joined: 13 Apr 17
Posts: 160
Credit: 128,336,112
RAC: 10,115
Message 72306 - Posted: 29 Mar 2022, 14:39:22 UTC - in response to Message 72305.  

@ Tom:

no problem ...
glad you replied ...
try to have a nice day ....
cheers
ID: 72306 · Rating: 0
Septimus

Joined: 8 Nov 11
Posts: 140
Credit: 1,812,714
RAC: 4,717
Message 72307 - Posted: 29 Mar 2022, 14:42:09 UTC - in response to Message 72306.  

I did eventually get WUs about 30 minutes ago... obviously a big lag for some reason.
ID: 72307 · Rating: 0
Kiska

Joined: 31 Mar 12
Posts: 55
Credit: 90,739,444
RAC: 9,389
Message 72308 - Posted: 29 Mar 2022, 16:46:00 UTC - in response to Message 72305.  
Last modified: 29 Mar 2022, 16:56:51 UTC

Hmm. That's really strange. I hope that things didn't get "desynced" with the DB when the drive failed. I'm not sure exactly what that would mean in terms of tech, but if the server thinks that there are 1.6M WUs that are ready to be sent out, but they are actually stuck somewhere, then it probably wouldn't make more WUs that could actually get sent out.

Maybe I can clear out the WUs that are waiting in the queue somehow. I suppose this could also happen with the WUs in the validation queue, which could explain the huge amount of WUs waiting for validation that apparently aren't getting sent out either.

I'll have to dig around. If I end up purging everything I'll have to figure out a way to give out the proper credits to everyone who has WUs that have been crunched and are sitting in limbo.

Today is busy for me, I have to train students on our observatory telescope, have several meetings, and am about to teach class. But I'll try to look into things when I get a chance.


This isn't strange... the server status page is correct. You need to reduce the time it takes to look up tasks that are ready to be sent: with over 18M tasks ready to send, you are asking too much of the database. The scheduler process doesn't hit the database every time someone asks for work; instead it looks in a shared memory pool that is filled by the feeder. The feeder process queries the database for tasks marked ready to send (up to the shared memory size). The database takes some time to return a result, and the bigger the table, the longer it takes, especially with lots of other processes hitting it (e.g. the web forum serving this response, the scheduler marking tasks as sent, the feeder requesting tasks for the scheduler, stats generation, etc.). So you need to bring the number of unsent tasks down.

If this is a problem, try increasing the shared memory size so that more tasks are held in the buffer.
The other option is to turn off only the workunit generators and see whether jobs are still being sent out.
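
To make the feeder/scheduler split described above concrete, here is a minimal Python sketch. It is a toy model and not BOINC code: the `Feeder` and `Scheduler` classes, the `slots` parameter and the deque standing in for the database table are all made up for illustration.

```python
from collections import deque


class Feeder:
    """Toy feeder: copies unsent task ids from the 'database' into a fixed number of slots."""

    def __init__(self, database, slots):
        self.database = database  # deque of unsent task ids (stand-in for the DB table)
        self.slots = slots        # hypothetical capacity of the shared-memory buffer
        self.shared = []          # tasks currently visible to the scheduler

    def refill(self):
        """Top up empty slots from the database; return how many tasks were fetched."""
        fetched = 0
        while len(self.shared) < self.slots and self.database:
            self.shared.append(self.database.popleft())
            fetched += 1
        return fetched


class Scheduler:
    """Toy scheduler: serves work requests from the shared buffer only, never from the DB."""

    def __init__(self, feeder):
        self.feeder = feeder

    def request_work(self, wanted):
        granted = self.feeder.shared[:wanted]
        del self.feeder.shared[:wanted]
        return granted


if __name__ == "__main__":
    db = deque(range(10))
    feeder = Feeder(db, slots=4)
    scheduler = Scheduler(feeder)
    print(scheduler.request_work(3))  # [] because the feeder has not filled the buffer yet
    feeder.refill()
    print(scheduler.request_work(3))  # [0, 1, 2]
    print(scheduler.request_work(3))  # [3], the buffer only held four tasks
```

In this model a request that arrives before the feeder has run gets nothing, however many tasks are waiting in the database, and the buffer size caps how much can be handed out between refills.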
ID: 72308 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 347
Credit: 83,409,419
RAC: 139,082
Message 72313 - Posted: 29 Mar 2022, 18:52:52 UTC

Thanks for the insight Kiska, that is very helpful. I am going to shut down the milkyway processes for a bit in order to remove the pile-up of unsent WUs, and then I will also look at increasing the shared memory size.
ID: 72313 · Rating: 0
Peter Hucker of the Scottish Boinc Team

Joined: 5 Jul 11
Posts: 722
Credit: 305,491,398
RAC: 623,765
Message 72314 - Posted: 29 Mar 2022, 19:04:47 UTC - in response to Message 72265.  
Last modified: 29 Mar 2022, 19:05:28 UTC

I've just been notified that the server rebuild has completed, so hopefully we can get things running as per normal soon! I'll be paying attention to things throughout the day in order to fix any weirdness that comes along.
I'm guessing your rebuild took ages because the disks were under heavy load. Yes I know I've mentioned it before, but please get some SSDs from our donations. And ask for donations on the home page!
ID: 72314 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 347
Credit: 83,409,419
RAC: 139,082
Message 72316 - Posted: 29 Mar 2022, 19:29:56 UTC - in response to Message 72313.  

Our shared memory limit appears to be the max value it can be for 64 bit machines (SHMMAX = 18446744073692774399 bytes), so I don't think that the shared memory allowance is the issue here, unless I'm missing something. I did kill all jobs that hadn't been sent out yet though.
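
A quick arithmetic check on the SHMMAX figure quoted above, as a short Python sketch. It assumes the value is simply the Linux kernel's built-in default ceiling (ULONG_MAX minus 2^24 on 64-bit builds) rather than something configured on this server; the variable names are only illustrative.

```python
# Assumption: the kernel default is ULONG_MAX - (1 << 24) on 64-bit Linux.
ULONG_MAX = 2**64 - 1
DEFAULT_SHMMAX = ULONG_MAX - (1 << 24)

# Value reported in the post above.
REPORTED_SHMMAX = 18446744073692774399

print(REPORTED_SHMMAX == DEFAULT_SHMMAX)        # True
print(f"{REPORTED_SHMMAX / 2**60:.2f} EiB")     # far more than any machine's RAM
```

If that reading is right, the number is an upper limit on how large one shared memory segment is allowed to be, not an amount of memory that is actually in use.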
ID: 72316 · Rating: 0
alanb1951

Joined: 16 Mar 10
Posts: 142
Credit: 79,231,758
RAC: 71,244
Message 72317 - Posted: 29 Mar 2022, 19:33:04 UTC - in response to Message 72313.  
Last modified: 29 Mar 2022, 19:41:44 UTC

Thanks for the insight Kiska, that is very helpful. I am going to shut down the milkyway processes for a bit in order to remove the pile-up of unsent WUs, and then I will also look at increasing the shared memory size.

Tom,

I see you've turned off almost everything... Surely if you turn off the feeder there is no way work can be sent out as nothing will be put in the shared memory to be sent out! Also, if the transitioner isn't running result data won't move from one state to another. Kiska's recommendation was to turn off the work generator(s) only...

I'm confused - hopefully Kiska will chip in again...

[Edit. Just seen your most recent comment regarding deleting unsent work. And the status page now shows everything turned on again, though it seems to be an older version of the status page...]

Cheers - Al.
ID: 72317 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 347
Credit: 83,409,419
RAC: 139,082
Message 72318 - Posted: 29 Mar 2022, 19:40:58 UTC - in response to Message 72317.  

I see you've turned off almost everything... Surely if you turn off the feeder there is no way work can be sent out as nothing will be put in the shared memory to be sent out! Also, if the transitioner isn't running result data won't move from one state to another. Kiska's recommendation was to turn off the work generator(s) only...

I'm confused - hopefully Kiska will chip in again...

Cheers - Al.


Everything is back on. I think you just happened to look at the server status page while things were coming back up, so some processes were still reporting as off. I've noticed that there's a little bit of lag there, and sometimes people catch it in weird moments while I'm doing maintenance, and then they get confused because it looks like I'm turning random things off and on.
ID: 72318 · Rating: 0
Arjen

Joined: 20 Mar 12
Posts: 4
Credit: 1,054,569
RAC: 2
Message 72319 - Posted: 29 Mar 2022, 19:47:04 UTC - in response to Message 72318.  

Hi Tom,

At the moment I'm getting the following error:

29-3-2022 21:43:37 | Milkyway@Home | update requested by user
29-3-2022 21:43:37 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:43:37 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:43:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:43:48 | Milkyway@Home | Server error: feeder not running
29-3-2022 21:43:48 | Milkyway@Home | Project requested delay of 400 seconds

Is everything OK?

Regards,
Arjen
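
For anyone who wants to pull the relevant lines out of a longer event log, here is a small Python sketch. It assumes the three-field "timestamp | project | message" layout shown in the excerpt above; the `summarize` function and the `SAMPLE` constant are hypothetical names, not part of any BOINC tool.

```python
import re

SAMPLE = """\
29-3-2022 21:43:37 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:43:37 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:43:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:43:48 | Milkyway@Home | Server error: feeder not running
29-3-2022 21:43:48 | Milkyway@Home | Project requested delay of 400 seconds
"""


def summarize(log_text):
    """Return (server_errors, requested_delay_seconds) from an event-log excerpt."""
    errors = []
    delay = None
    for line in log_text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        message = parts[2]
        if message.startswith("Server error:"):
            errors.append(message[len("Server error:"):].strip())
        match = re.match(r"Project requested delay of (\d+) seconds", message)
        if match:
            delay = int(match.group(1))
    return errors, delay


if __name__ == "__main__":
    print(summarize(SAMPLE))  # (['feeder not running'], 400)
```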
ID: 72319 · Rating: 0
unixchick

Joined: 21 Feb 22
Posts: 66
Credit: 505,548
RAC: 110
Message 72320 - Posted: 29 Mar 2022, 19:48:27 UTC

Tue Mar 29 12:46:22 2022 | Milkyway@Home | Reporting 2 completed tasks
Tue Mar 29 12:46:22 2022 | Milkyway@Home | Requesting new tasks for CPU and Intel GPU
Tue Mar 29 12:46:33 2022 | Milkyway@Home | Scheduler request completed: got 0 new tasks
Tue Mar 29 12:46:33 2022 | Milkyway@Home | Server error: feeder not running
ID: 72320 · Rating: 0
Peter Hucker of the Scottish Boinc Team

Joined: 5 Jul 11
Posts: 722
Credit: 305,491,398
RAC: 623,765
Message 72321 - Posted: 29 Mar 2022, 19:48:29 UTC - in response to Message 72318.  

Everything is back on. I think you just happened to look at the server status page while things were coming back up, so some processes were still reporting as off. I've noticed that there's a little bit of lag there, and sometimes people catch it in weird moments while I'm doing maintenance, and then they get confused because it looks like I'm turning random things off and on.
Just tried to get GPU work and got:

1606 Milkyway@Home 29-03-2022 08:46 PM Requesting new tasks for CPU and AMD/ATI GPU
1607 Milkyway@Home 29-03-2022 08:46 PM [sched_op] CPU work request: 280707.60 seconds; 0.00 devices
1608 Milkyway@Home 29-03-2022 08:46 PM [sched_op] AMD/ATI GPU work request: 21780.00 seconds; 1.00 devices
1609 Milkyway@Home 29-03-2022 08:46 PM Scheduler request completed: got 0 new tasks
1610 Milkyway@Home 29-03-2022 08:46 PM Server error: feeder not running
ID: 72321 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 347
Credit: 83,409,419
RAC: 139,082
Message 72322 - Posted: 29 Mar 2022, 19:53:19 UTC
Last modified: 29 Mar 2022, 19:54:12 UTC

Looks like the same thing as last time. I'll restart the server processes and see if the feeder comes back up.

No clue why it does that.
ID: 72322 · Rating: 0
Arjen

Joined: 20 Mar 12
Posts: 4
Credit: 1,054,569
RAC: 2
Message 72323 - Posted: 29 Mar 2022, 19:56:27 UTC - in response to Message 72322.  

There's no 'feeder not running' error for me now:

29-3-2022 21:53:45 | Milkyway@Home | update requested by user
29-3-2022 21:53:45 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:53:45 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:53:46 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:53:46 | Milkyway@Home | Project requested delay of 91 seconds

So I'll just have to wait a bit I guess :)
ID: 72323 · Rating: 0
Bill F

Joined: 4 Jul 09
Posts: 53
Credit: 12,126,656
RAC: 1,670
Message 72324 - Posted: 29 Mar 2022, 20:01:33 UTC

Snapshot of the server status page right now:

Download server                        milkyway.cs.rpi.edu  Running
Upload server                          milkyway.cs.rpi.edu  Running
Scheduler                              milkyway             Not Running
feeder                                 milkyway             Not Running
transitioner                           milkyway             Not Running
db_purge                               milkyway             Not Running
file_deleter                           milkyway             Not Running
stream_fit_validator (milkyway)        milkyway             Not Running
stream_fit_assimilator (milkyway)      milkyway             Not Running
stream_fit_work_generator (milkyway)   milkyway             Not Running
nbody_validator (milkyway_nbody)       milkyway             Not Running
nbody_assimilator (milkyway_nbody)     milkyway             Not Running
nbody_work_generator (milkyway_nbody)  milkyway             Not Running

Application                      Unsent    In progress  Runtime in hours, avg (min - max)  Users in last 24 hours
Milkyway@home N-Body Simulation  13966487  576029       0.17 (0.01 - 1.3)                  185
Milkyway@home Separation         4961115   655282       0.6 (0.01 - 25.08)                 919

Upstream server release: 1.0.4
Database schema version: 27028

Task data as of 29 Mar 2022, 18:24:00 UTC
ID: 72324 · Rating: 0
Peter Hucker of the Scottish Boinc Team

Joined: 5 Jul 11
Posts: 722
Credit: 305,491,398
RAC: 623,765
Message 72325 - Posted: 29 Mar 2022, 20:03:43 UTC - in response to Message 72324.  
Last modified: 29 Mar 2022, 20:05:05 UTC

Weird, I see the same 18:24 UTC timestamp, but for me everything says Running. Yet BOINC returns "Milkyway@Home 29-03-2022 09:00 PM Project is temporarily shut down for maintenance" (I'm on UTC+1 here).
ID: 72325 · Rating: 0
Kiska

Joined: 31 Mar 12
Posts: 55
Credit: 90,739,444
RAC: 9,389
Message 72326 - Posted: 29 Mar 2022, 20:08:38 UTC - in response to Message 72316.  

Our shared memory limit appears to be the max value it can be for 64 bit machines (SHMMAX = 18446744073692774399 bytes), so I don't think that the shared memory allowance is the issue here, unless I'm missing something. I did kill all jobs that hadn't been sent out yet though.


No no no, please don't set that; use something more sensible. Unless you actually have that amount of RAM, it's not going to help.

You also need to consider that the database needs memory for its own caching.

Something like 512 MiB or 1 GiB should be sufficient for shared memory.

If you use /etc/sysctl.conf for system config, just edit it and then run sysctl -p to reload the changes.

As for turning things off: ONLY these two should be turned off to test whether the database tasks table is just really full:
stream_fit_work_generator (milkyway)
nbody_work_generator (milkyway_nbody)
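
For the sizes suggested above, here is a short Python sketch that converts them to bytes and prints them as `/etc/sysctl.conf` lines. `kernel.shmmax` is the standard Linux sysctl key for SHMMAX; the `shmmax_line` helper is a hypothetical name, and whether this limit is the right knob for this particular server is an assumption, not something confirmed in this thread.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def shmmax_line(size_bytes):
    """Format a sysctl.conf entry for the given shared memory ceiling in bytes."""
    return f"kernel.shmmax = {size_bytes}"


if __name__ == "__main__":
    for label, size in (("512 MiB", 512 * MIB), ("1 GiB", 1 * GIB)):
        print(f"# {label}")
        print(shmmax_line(size))
```

Only one `kernel.shmmax` line would go into the file; the two values are printed side by side purely for comparison.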
ID: 72326 · Rating: 0
alanb1951

Joined: 16 Mar 10
Posts: 142
Credit: 79,231,758
RAC: 71,244
Message 72327 - Posted: 29 Mar 2022, 20:12:24 UTC - in response to Message 72318.  
Last modified: 29 Mar 2022, 20:12:52 UTC

Tom - thanks for the clarification.

I'd just edited my earlier message to acknowledge seeing your "killed tasks" comment, but I'll leave the edit as is...

The task numbers appear unchanged (as the page on display at 20:00 UTC is apparently from before your most recent activity) -- I presume the server status page task data won't get updated again today - do you know how often it should update those numbers? (It didn't seem to be updated more than once a day whilst the server was struggling...)

Thanks for your efforts, especially given the other calls on your time! I just hope it sorts out soon, and you can get a bit of relative peace and quiet...

Cheers - Al.
ID: 72327 · Rating: 0
Kiska

Joined: 31 Mar 12
Posts: 55
Credit: 90,739,444
RAC: 9,389
Message 72328 - Posted: 29 Mar 2022, 20:23:24 UTC - in response to Message 72327.  


The task numbers appear unchanged (as the page on display at 20:00 UTC is apparently from before your most recent activity) -- I presume the server status page task data won't get updated again today - do you know how often it should update those numbers? (It didn't seem to be updated more than once a day whilst the server was struggling...)

Cheers - Al.


The status page seems to update once in a while; see https://grafana.kiska.pw/goto/8ChLths7k?orgId=1

Click on one of the fields on the right-hand side of the graph.
ID: 72328 · Rating: 0
Arjen

Joined: 20 Mar 12
Posts: 4
Credit: 1,054,569
RAC: 2
Message 72329 - Posted: 29 Mar 2022, 20:58:30 UTC

29-3-2022 22:57:44 | Milkyway@Home | Sending scheduler request: To fetch work.
29-3-2022 22:57:44 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 22:57:46 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 22:57:46 | Milkyway@Home | No work available
29-3-2022 22:57:46 | Milkyway@Home | Project requested delay of 91 seconds
ID: 72329 · Rating: 0
Peter Hucker of the Scottish Boinc Team

Joined: 5 Jul 11
Posts: 722
Credit: 305,491,398
RAC: 623,765
Message 72330 - Posted: 29 Mar 2022, 21:04:54 UTC - in response to Message 72329.  
Last modified: 29 Mar 2022, 21:05:34 UTC

I keep getting CPU N-Body work, but the server seems to have a problem keeping up with the GPUs. If there were a way to send out bigger GPU tasks, that would be nice. I think the huge number of them being sent back and forth could be overloading things.
ID: 72330 · Rating: 0