Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)
Joined: 13 Apr 17 · Posts: 256 · Credit: 604,411,638 · RAC: 0
@Tom: no problem ... glad you replied ... try to have a nice day ... cheers
Joined: 8 Nov 11 · Posts: 205 · Credit: 2,900,464 · RAC: 0
I did eventually get WUs about 30 minutes ago... obviously a big lag for some reason.
Joined: 31 Mar 12 · Posts: 96 · Credit: 152,502,177 · RAC: 0
Hmm. That's really strange. I hope that things didn't get "desynced" with the DB when the drive failed. I'm not sure exactly what that would mean in terms of tech, but if the server thinks that there are 1.6M WUs that are ready to be sent out, but they are actually stuck somewhere, then it probably wouldn't make more WUs that could actually get sent out.

This isn't strange... the server status page is correct; you need to reduce the time it takes to look up tasks ready to be sent. You are asking too much of the database with over 18M tasks ready to send. The scheduler process doesn't hit the database every time someone asks for work; instead it looks in a shared memory pool that is fed by the feeder. The feeder process queries the database for tasks marked ready to send (up to the shared memory size), so you need to reduce that number. The database takes time to return results, and the bigger the database the longer it takes, especially when lots of processes are hitting it (web forum pages like this one, the scheduler marking tasks as sent, the feeder requesting tasks to put into shared memory, stats generation, etc.). If lookup time is the problem, try increasing the shared memory size so more tasks are held in the buffer. The other option is to turn off the work unit generators only, and see if jobs are still being sent out.
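[For anyone following along, a minimal sketch of where the knobs Kiska mentions live, assuming a stock BOINC server install; the config path, database name, and values below are illustrative assumptions, not Milkyway's actual setup.]

    # The feeder's shared-memory pool size is set in the project's config.xml
    # (values here are only examples; BOINC's defaults are much smaller):
    #   <shmem_work_items>2000</shmem_work_items>
    #   <feeder_query_size>4000</feeder_query_size>
    # Backlog of results still marked "ready to send"
    # (server_state = 2 means UNSENT in the BOINC schema; "milkyway" DB name assumed):
    mysql milkyway -e "SELECT COUNT(*) FROM result WHERE server_state = 2;"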
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
Thanks for the insight, Kiska, that is very helpful. I am going to shut down the Milkyway processes for a bit in order to remove the pile-up of unsent WUs, and then I will also look at increasing the shared memory size.
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
I've just been notified that the server rebuild has completed, so hopefully we can get things running as per normal soon! I'll be paying attention to things throughout the day in order to fix any weirdness that comes along.

I'm guessing your rebuild took ages because the disks were under heavy load. Yes, I know I've mentioned it before, but please get some SSDs from our donations. And ask for donations on the home page!
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
Our shared memory limit appears to be the maximum value it can be for 64-bit machines (SHMMAX = 18446744073692774399 bytes), so I don't think that the shared memory allowance is the issue here, unless I'm missing something. I did kill all jobs that hadn't been sent out yet, though.
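[For reference, a read-only way to sanity-check this on a standard Linux host; nothing here changes any settings. That SHMMAX figure is just the kernel's effectively-unlimited default, so the more interesting number is how much shared memory is actually allocated.]

    cat /proc/sys/kernel/shmmax   # cap on a single shared-memory segment, in bytes
    cat /proc/sys/kernel/shmall   # cap on total shared memory, in pages
    ipcs -m                       # shared-memory segments actually allocated right now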
Joined: 16 Mar 10 · Posts: 213 · Credit: 108,368,627 · RAC: 4,272
Thanks for the insight, Kiska, that is very helpful. I am going to shut down the Milkyway processes for a bit in order to remove the pile-up of unsent WUs, and then I will also look at increasing the shared memory size.

Tom, I see you've turned off almost everything... Surely if you turn off the feeder there is no way work can be sent out, as nothing will be put into shared memory to be sent! Also, if the transitioner isn't running, result data won't move from one state to another. Kiska's recommendation was to turn off the work generator(s) only...

[Edit: Just seen your most recent comment regarding deleting unsent work. And the status page now shows everything turned on again, though it seems to be an older version of the status page...]

Cheers - Al.
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
I see you've turned off almost everything... Surely if you turn off the feeder there is no way work can be sent out, as nothing will be put into shared memory to be sent! Also, if the transitioner isn't running, result data won't move from one state to another. Kiska's recommendation was to turn off the work generator(s) only...

Everything is back on. I think you just happened to look at the server status page while things were coming back up, so some processes were still reporting as off. I've noticed that there's a little bit of lag there, and sometimes people catch it in weird moments while I'm doing maintenance, and then they get confused because it looks like I'm turning random things off and on.
Joined: 20 Mar 12 · Posts: 4 · Credit: 5,174,898 · RAC: 0
Hi Tom,
At the moment I'm getting the following error:
29-3-2022 21:43:37 | Milkyway@Home | update requested by user
29-3-2022 21:43:37 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:43:37 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:43:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:43:48 | Milkyway@Home | Server error: feeder not running
29-3-2022 21:43:48 | Milkyway@Home | Project requested delay of 400 seconds
Is everything OK?
Regards, Arjen
Joined: 21 Feb 22 · Posts: 66 · Credit: 817,008 · RAC: 0
Tue Mar 29 12:46:22 2022 | Milkyway@Home | Reporting 2 completed tasks
Tue Mar 29 12:46:22 2022 | Milkyway@Home | Requesting new tasks for CPU and Intel GPU
Tue Mar 29 12:46:33 2022 | Milkyway@Home | Scheduler request completed: got 0 new tasks
Tue Mar 29 12:46:33 2022 | Milkyway@Home | Server error: feeder not running
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
Everything is back on. I think you just happened to look at the server status page while things were coming back up, so some processes were still reporting as off. I've noticed that there's a little bit of lag there, and sometimes people catch it in weird moments while I'm doing maintenance, and then they get confused because it looks like I'm turning random things off and on.

Just tried to get GPU work and got:
1606 Milkyway@Home 29-03-2022 08:46 PM Requesting new tasks for CPU and AMD/ATI GPU
1607 Milkyway@Home 29-03-2022 08:46 PM [sched_op] CPU work request: 280707.60 seconds; 0.00 devices
1608 Milkyway@Home 29-03-2022 08:46 PM [sched_op] AMD/ATI GPU work request: 21780.00 seconds; 1.00 devices
1609 Milkyway@Home 29-03-2022 08:46 PM Scheduler request completed: got 0 new tasks
1610 Milkyway@Home 29-03-2022 08:46 PM Server error: feeder not running
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
Looks like the same thing as last time. I'll restart the server processes and see if the feeder comes back up. No clue why it does that.
Joined: 20 Mar 12 · Posts: 4 · Credit: 5,174,898 · RAC: 0
There's no 'feeder not running' error for me now:
29-3-2022 21:53:45 | Milkyway@Home | update requested by user
29-3-2022 21:53:45 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:53:45 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:53:46 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:53:46 | Milkyway@Home | Project requested delay of 91 seconds
So I'll just have to wait a bit, I guess :)
Joined: 4 Jul 09 · Posts: 97 · Credit: 17,382,546 · RAC: 1,427
Image from the Stats page now:

Download server | milkyway.cs.rpi.edu | Running
Upload server | milkyway.cs.rpi.edu | Running
Scheduler | milkyway | Not Running
feeder | milkyway | Not Running
transitioner | milkyway | Not Running
db_purge | milkyway | Not Running
file_deleter | milkyway | Not Running
stream_fit_validator (milkyway) | milkyway | Not Running
stream_fit_assimilator (milkyway) | milkyway | Not Running
stream_fit_work_generator (milkyway) | milkyway | Not Running
nbody_validator (milkyway_nbody) | milkyway | Not Running
nbody_assimilator (milkyway_nbody) | milkyway | Not Running
nbody_work_generator (milkyway_nbody) | milkyway | Not Running

Milkyway@home N-Body Simulation: 13,966,487 unsent, 576,029 in progress, runtime of last 100 tasks 0.17 h (0.01 - 1.3), 185 users in last 24 h
Milkyway@home Separation: 4,961,115 unsent, 655,282 in progress, runtime of last 100 tasks 0.6 h (0.01 - 25.08), 919 users in last 24 h

Upstream server release: 1.0.4
Database schema version: 27028
Task data as of 29 Mar 2022, 18:24:00 UTC
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
Image from the Stats page now:

Weird, I see the same 18:24 UTC, but everything says running. But BOINC returns "Milkyway@Home 29-03-2022 09:00 PM Project is temporarily shut down for maintenance" (I'm on UTC+1 here).
Joined: 31 Mar 12 · Posts: 96 · Credit: 152,502,177 · RAC: 0
Our shared memory limit appears to be the maximum value it can be for 64-bit machines (SHMMAX = 18446744073692774399 bytes), so I don't think that the shared memory allowance is the issue here, unless I'm missing something. I did kill all jobs that hadn't been sent out yet, though.

No no no, please don't set that, something more sensible please? Unless you actually have that amount of RAM it's not going to help. You also need to consider that the database needs memory for its own caching as well. Maybe something like 512 MiB or 1 GiB should be sufficient for shared memory. If you use /etc/sysctl.conf for system config, just edit it and then run sysctl -p to reload the changes.

As for turning things off, ONLY these two should be turned off, to test whether the database's tasks table is just really full:
stream_fit_work_generator (milkyway)
nbody_work_generator (milkyway_nbody)
Joined: 16 Mar 10 · Posts: 213 · Credit: 108,368,627 · RAC: 4,272
Tom - thanks for the clarification. I'd just edited my earlier message to acknowledge seeing your "killed tasks" comment, but I'll leave the edit as is...

The task numbers appear unchanged (as the page on display at 20:00 UTC is apparently from before your most recent activity), so I presume the server status page task data won't get updated again today. Do you know how often it should update those numbers? (It didn't seem to be updated more than once a day whilst the server was struggling...)

Thanks for your efforts, especially given the other calls on your time! I just hope it sorts itself out soon, and you can get a bit of relative peace and quiet...

Cheers - Al.
Joined: 31 Mar 12 · Posts: 96 · Credit: 152,502,177 · RAC: 0
It seems to update once in a while, see https://grafana.kiska.pw/goto/8ChLths7k?orgId=1 and click on one of the fields on the right-hand side of the graph.
Joined: 20 Mar 12 · Posts: 4 · Credit: 5,174,898 · RAC: 0
29-3-2022 22:57:44 | Milkyway@Home | Sending scheduler request: To fetch work.
29-3-2022 22:57:44 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 22:57:46 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 22:57:46 | Milkyway@Home | No work available
29-3-2022 22:57:46 | Milkyway@Home | Project requested delay of 91 seconds
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
29-3-2022 22:57:44 | Milkyway@Home | Sending scheduler request: To fetch work.

I keep getting CPU N-Body work, but there seems to be a problem with the server keeping up with the GPUs. If there were a way to send out bigger GPU tasks, that would be nice. I think the huge number of them being sent back and forth could be overloading things?