Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)
Joined: 13 Apr 17 · Posts: 256 · Credit: 604,411,638 · RAC: 0
@Tom: no problem ... glad you replied ... try to have a nice day ... cheers
Joined: 8 Nov 11 · Posts: 205 · Credit: 2,900,464 · RAC: 0
I did eventually get WUs about 30 minutes ago... obviously a big lag for some reason.
Joined: 31 Mar 12 · Posts: 96 · Credit: 152,502,177 · RAC: 0
Hmm. That's really strange. I hope that things didn't get "desynced" with the DB when the drive failed. I'm not sure exactly what that would mean in terms of tech, but if the server thinks that there are 1.6M WUs that are ready to be sent out, but they are actually stuck somewhere, then it probably wouldn't make more WUs that could actually get sent out.

This isn't strange... the server status page is correct; you need to reduce the time it takes to look up tasks ready to be sent. You are asking too much of the database with over 18M tasks ready to send. The scheduler process doesn't hit the database every time someone asks for work; instead it looks in a shared memory pool that is fed by the feeder. The feeder process queries the database for tasks marked ready to send (up to the shared memory size), so you need to reduce that number. The database takes time to return results, and the bigger the database the longer it takes, especially when lots of processes are hitting it (web forum pages like this one, the scheduler marking tasks as sent, the feeder requesting tasks to put into shared memory, stats generation, etc.). If lookup time is the problem, try increasing the shared memory size so more tasks are held in the buffer. The other option is to turn off the work unit generators only, and see if jobs are still being sent out.
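[For anyone following along, a minimal sketch of where the knobs Kiska mentions live, assuming a stock BOINC server install; the config path, database name, and values below are illustrative assumptions, not Milkyway's actual setup.]

    # The feeder's shared-memory pool size is set in the project's config.xml
    # (values here are only examples; BOINC's defaults are much smaller):
    #   <shmem_work_items>2000</shmem_work_items>
    #   <feeder_query_size>4000</feeder_query_size>
    # Backlog of results still marked "ready to send"
    # (server_state = 2 means UNSENT in the BOINC schema; "milkyway" DB name assumed):
    mysql milkyway -e "SELECT COUNT(*) FROM result WHERE server_state = 2;"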
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
Thanks for the insight, Kiska, that is very helpful. I am going to shut down the Milkyway processes for a bit in order to remove the pile-up of unsent WUs, and then I will also look at increasing the shared memory size.
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
I've just been notified that the server rebuild has completed, so hopefully we can get things running as per normal soon! I'll be paying attention to things throughout the day in order to fix any weirdness that comes along.

I'm guessing your rebuild took ages because the disks were under heavy load. Yes, I know I've mentioned it before, but please get some SSDs from our donations. And ask for donations on the home page!
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
Our shared memory limit appears to be the maximum value it can be for 64-bit machines (SHMMAX = 18446744073692774399 bytes), so I don't think that the shared memory allowance is the issue here, unless I'm missing something. I did kill all jobs that hadn't been sent out yet, though.
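[For reference, a read-only way to sanity-check this on a standard Linux host; nothing here changes any settings. That SHMMAX figure is just the kernel's effectively-unlimited default, so the more interesting number is how much shared memory is actually allocated.]

    cat /proc/sys/kernel/shmmax   # cap on a single shared-memory segment, in bytes
    cat /proc/sys/kernel/shmall   # cap on total shared memory, in pages
    ipcs -m                       # shared-memory segments actually allocated right now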
Joined: 16 Mar 10 · Posts: 213 · Credit: 108,368,627 · RAC: 4,272
Thanks for the insight, Kiska, that is very helpful. I am going to shut down the Milkyway processes for a bit in order to remove the pile-up of unsent WUs, and then I will also look at increasing the shared memory size.

Tom, I see you've turned off almost everything... Surely if you turn off the feeder there is no way work can be sent out, as nothing will be put into shared memory to be sent! Also, if the transitioner isn't running, result data won't move from one state to another. Kiska's recommendation was to turn off the work generator(s) only...

[Edit: Just seen your most recent comment regarding deleting unsent work. And the status page now shows everything turned on again, though it seems to be an older version of the status page...]

Cheers - Al.
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
I see you've turned off almost everything... Surely if you turn off the feeder there is no way work can be sent out, as nothing will be put into shared memory to be sent! Also, if the transitioner isn't running, result data won't move from one state to another. Kiska's recommendation was to turn off the work generator(s) only...

Everything is back on. I think you just happened to look at the server status page while things were coming back up, so some processes were still reporting as off. I've noticed that there's a little bit of lag there, and sometimes people catch it in weird moments while I'm doing maintenance, and then they get confused because it looks like I'm turning random things off and on.
Joined: 20 Mar 12 · Posts: 4 · Credit: 5,174,898 · RAC: 0
Hi Tom,
At the moment I'm getting the following error:
29-3-2022 21:43:37 | Milkyway@Home | update requested by user
29-3-2022 21:43:37 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:43:37 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:43:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:43:48 | Milkyway@Home | Server error: feeder not running
29-3-2022 21:43:48 | Milkyway@Home | Project requested delay of 400 seconds
Is everything OK?
Regards, Arjen
Joined: 21 Feb 22 · Posts: 66 · Credit: 817,008 · RAC: 0
Tue Mar 29 12:46:22 2022 | Milkyway@Home | Reporting 2 completed tasks
Tue Mar 29 12:46:22 2022 | Milkyway@Home | Requesting new tasks for CPU and Intel GPU
Tue Mar 29 12:46:33 2022 | Milkyway@Home | Scheduler request completed: got 0 new tasks
Tue Mar 29 12:46:33 2022 | Milkyway@Home | Server error: feeder not running
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
Everything is back on. I think you just happened to look at the server status page while things were coming back up, so some processes were still reporting as off. I've noticed that there's a little bit of lag there, and sometimes people catch it in weird moments while I'm doing maintenance, and then they get confused because it looks like I'm turning random things off and on.

Just tried to get GPU work and got:
1606 Milkyway@Home 29-03-2022 08:46 PM Requesting new tasks for CPU and AMD/ATI GPU
1607 Milkyway@Home 29-03-2022 08:46 PM [sched_op] CPU work request: 280707.60 seconds; 0.00 devices
1608 Milkyway@Home 29-03-2022 08:46 PM [sched_op] AMD/ATI GPU work request: 21780.00 seconds; 1.00 devices
1609 Milkyway@Home 29-03-2022 08:46 PM Scheduler request completed: got 0 new tasks
1610 Milkyway@Home 29-03-2022 08:46 PM Server error: feeder not running
Joined: 10 Apr 19 · Posts: 408 · Credit: 120,203,200 · RAC: 0
Looks like the same thing as last time. I'll restart the server processes and see if the feeder comes back up. No clue why it does that.
Joined: 20 Mar 12 · Posts: 4 · Credit: 5,174,898 · RAC: 0
There's no 'feeder not running' error for me now:
29-3-2022 21:53:45 | Milkyway@Home | update requested by user
29-3-2022 21:53:45 | Milkyway@Home | Sending scheduler request: Requested by user.
29-3-2022 21:53:45 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 21:53:46 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 21:53:46 | Milkyway@Home | Project requested delay of 91 seconds
So I'll just have to wait a bit, I guess :)
Joined: 4 Jul 09 · Posts: 97 · Credit: 17,382,546 · RAC: 1,427
Image from the Stats page now:

Download server | milkyway.cs.rpi.edu | Running
Upload server | milkyway.cs.rpi.edu | Running
Scheduler | milkyway | Not Running
feeder | milkyway | Not Running
transitioner | milkyway | Not Running
db_purge | milkyway | Not Running
file_deleter | milkyway | Not Running
stream_fit_validator (milkyway) | milkyway | Not Running
stream_fit_assimilator (milkyway) | milkyway | Not Running
stream_fit_work_generator (milkyway) | milkyway | Not Running
nbody_validator (milkyway_nbody) | milkyway | Not Running
nbody_assimilator (milkyway_nbody) | milkyway | Not Running
nbody_work_generator (milkyway_nbody) | milkyway | Not Running

Milkyway@home N-Body Simulation: 13,966,487 unsent, 576,029 in progress, runtime of last 100 tasks 0.17 h (0.01 - 1.3), 185 users in last 24 h
Milkyway@home Separation: 4,961,115 unsent, 655,282 in progress, runtime of last 100 tasks 0.6 h (0.01 - 25.08), 919 users in last 24 h

Upstream server release: 1.0.4
Database schema version: 27028
Task data as of 29 Mar 2022, 18:24:00 UTC
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
Image from the Stats page now:

Weird, I see the same 18:24 UTC, but everything says running. But BOINC returns "Milkyway@Home 29-03-2022 09:00 PM Project is temporarily shut down for maintenance" (I'm on UTC+1 here).
Joined: 31 Mar 12 · Posts: 96 · Credit: 152,502,177 · RAC: 0
Our shared memory limit appears to be the maximum value it can be for 64-bit machines (SHMMAX = 18446744073692774399 bytes), so I don't think that the shared memory allowance is the issue here, unless I'm missing something. I did kill all jobs that hadn't been sent out yet, though.

No no no, please don't set that, something more sensible please? Unless you actually have that amount of RAM it's not going to help. You also need to consider that the database needs memory for its own caching as well. Maybe something like 512 MiB or 1 GiB should be sufficient for shared memory. If you use /etc/sysctl.conf for system config, just edit it and then run sysctl -p to reload the changes.

As for turning things off, ONLY these two should be turned off, to test whether the database's tasks table is just really full:
stream_fit_work_generator (milkyway)
nbody_work_generator (milkyway_nbody)
Joined: 16 Mar 10 · Posts: 213 · Credit: 108,368,627 · RAC: 4,272
Tom - thanks for the clarification. I'd just edited my earlier message to acknowledge seeing your "killed tasks" comment, but I'll leave the edit as is...

The task numbers appear unchanged (as the page on display at 20:00 UTC is apparently from before your most recent activity), so I presume the server status page task data won't get updated again today. Do you know how often it should update those numbers? (It didn't seem to be updated more than once a day whilst the server was struggling...)

Thanks for your efforts, especially given the other calls on your time! I just hope it sorts itself out soon, and you can get a bit of relative peace and quiet...

Cheers - Al.
Joined: 31 Mar 12 · Posts: 96 · Credit: 152,502,177 · RAC: 0
It seems to update once in a while, see https://grafana.kiska.pw/goto/8ChLths7k?orgId=1 and click on one of the fields on the right-hand side of the graph.
Joined: 20 Mar 12 · Posts: 4 · Credit: 5,174,898 · RAC: 0
29-3-2022 22:57:44 | Milkyway@Home | Sending scheduler request: To fetch work.
29-3-2022 22:57:44 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
29-3-2022 22:57:46 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29-3-2022 22:57:46 | Milkyway@Home | No work available
29-3-2022 22:57:46 | Milkyway@Home | Project requested delay of 91 seconds
Joined: 5 Jul 11 · Posts: 990 · Credit: 376,143,149 · RAC: 0
29-3-2022 22:57:44 | Milkyway@Home | Sending scheduler request: To fetch work.

I keep getting CPU N-Body work, but there seems to be a problem with the server keeping up with the GPUs. If there were a way to send out bigger GPU tasks, that would be nice. I think the huge number of them being sent back and forth could be overloading things?