Server Maintenance 12:00 PM ET (16:00 UTC) 9/23/2022
Author | Message |
---|---|
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hey Everyone, We've been noticing some occasional long loading times and brief connection disruptions with the server, so I'm going to take things down at noon ET tomorrow (4PM UTC) and reboot the server. It should be back up pretty quickly. Hopefully this fixes the issues we're having. Best, Tom |
Send message Joined: 30 Dec 09 Posts: 2 Credit: 26,265,301 RAC: 0 |
Having run servers before retirement, I'd suggest you run a drive cleanup and defrag as well. Just a thought. Phil |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Well, I guess MilkyWay really wanted to be restarted, because it went down in the middle of the night last night. According to the server room admins, the machine was doing some funky stuff like not being able to turn off swap or unmount any partitions, but it force rebooted and came back up again. I'll take the server down for maintenance at the scheduled time, but it may be a few hours now because I want to make a manual backup of the DB in case anything else happens. I guess it's good that we're planning on migrating to new hardware soon! |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
After looking at what's going on in the DB, I think I might keep things up for a little while. There are a few long (~4 hr) processes running and I want to see if they complete or if they need to be terminated. I'll keep you all posted on what I plan on doing - it could be that milkyway stays up for a day and then I restart things tomorrow instead. |
Send message Joined: 22 Aug 22 Posts: 4 Credit: 1,136,950 RAC: 0 |
thanks for the update. i think i may have figured some things out too. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I think the server has caught up to all the work that you all did while it was down. The transitioner backlog is down to 0 hours, and there are a few hundred thousand returned tasks in the queue to go back out. I'm keeping an eye on things again in order to make sure that this number doesn't explode like it did when we had the drive fail. If the number of workunits waiting to go out starts rising quickly, I'll turn off the WU generators until that backlog is crunched. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Looks like the counts are dropping. Will continue monitoring but I think we're okay for now. |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"Looks like the counts are dropping. Will continue monitoring but I think we're okay for now." WOO HOO!! Thanks Tom!! |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
:) Thanks Tom! |
Send message Joined: 22 Aug 22 Posts: 4 Credit: 1,136,950 RAC: 0 |
cool yeah i had a little down time on a server farm so could have contributed to the issue. my apologies if so. |
Send message Joined: 24 Jan 11 Posts: 712 Credit: 553,982,065 RAC: 59,137 |
Web site is slow and laggy today. More than usual. |
Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
Hey Everyone, Hmmm, loading times are getting too long again. And under "Server status" I notice that "Workunits waiting for validation" (1519530) is going up. I also noticed that "Milkyway@home N-Body Simulation" (128813) is going up. Should we ignore this? |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
And, if we shouldn't ignore it, should we dog-pile onto n body to clear out the backlog? (that's been done before.....) |
Send message Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
"... (that's been done before.....)" I think that was a different problem from the one we are having now? |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
Hopefully! That last one was a nightmare! But I'm willing to move over if need be.... (that's been done before.....) |
Send message Joined: 16 Mar 10 Posts: 211 Credit: 108,210,146 RAC: 5,134 |
Regarding the high counts for N-body... Over the last few days there has been an enormous backlog of stuff waiting for validation! Bearing in mind that at present the N-body tasks are sent out with initial quorum 1 (like Separation tasks) but don't ever seem to validate without a wingman (unlike Separation!) a very large proportion of the backlog would be N-body task initial result returns, which would promptly generate a retry when eventually validated :-) So a day or so after the validation backlog started to grow the number of tasks waiting to be sent started to shoot up as well, and it seems to have settled out around the 110,000 to 120,000 mark... Until they work out why there's such a huge validation backlog, it's likely to stay as it is... I'm wondering if it's something to do with the extra code in the validator that backs off interesting results (using something called Toolkit for Asynchronous Optimization, I believe) Cheers - Al. P.S. The comments about sluggishness (in various guises) are still applicable -- this post is being made the first time I've managed to get at the server since some time yesterday, and my client has only just managed to report old results (again, stalled since yesterday...) And as for trying to download work -- not a chance! |
Send message Joined: 16 Mar 10 Posts: 211 Credit: 108,210,146 RAC: 5,134 |
[As it's too late to edit the previous post...] Work downloading seems to have just started again :-). However, the number of work units waiting for validation is still climbing :-(. I wonder how long it will be before it all bungs up again (or maybe even crashes [again?]) Cheers - Al. |
Send message Joined: 31 Mar 12 Posts: 96 Credit: 152,502,177 RAC: 11 |
"[As it's too late to edit the previous post...]" Things seem to improve and then degrade very hard... as evidenced in the image below. [Image not preserved.] |
Send message Joined: 12 Jun 10 Posts: 57 Credit: 6,171,817 RAC: 41 |
"And, if we shouldn't ignore it, should we dog-pile onto n body to clear out the backlog? (that's been done before.....)" As this will help the server I am more than happy to lend a hand. Has it helped in the past? |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"And, if we shouldn't ignore it, should we dog-pile onto n body to clear out the backlog? (that's been done before.....)" Yes it has, but it really helps if you have the file name listed and then crunch the ones with the highest number at the end first, e.g. Name de_modfit_70_bundle5_3s_south_pt2_2_1663946992_2335911_1 and Name de_modfit_70_bundle5_3s_south_pt2_3_1663946992_2258518_2. You should crunch the second one first, as it's been in the system longer, meaning you would be the wingman clearing out the backlog, as opposed to crunching a task with a _1 or even a _0 as the number. Yes, those are Separation tasks, but they are just an example. In the BOINC Manager you can set it to show the file name under Options, Select columns by ticking the box for Name. |
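
The ordering suggested in the last post (highest trailing number first) can be sketched in a few lines of Python. This is only an illustration: the helper names `trailing_number` and `crunch_order` are hypothetical, the two task names are the ones quoted above, and the idea that a higher trailing number means the task has been in the system longer is the poster's reading, not something the code checks.

```python
def trailing_number(task_name: str) -> int:
    """Return the integer after the last underscore in a task name.

    Assumption: every name ends in "_<digits>", as in the examples
    quoted in the thread.
    """
    return int(task_name.rsplit("_", 1)[1])


def crunch_order(task_names):
    """Sort task names so the highest trailing number comes first."""
    return sorted(task_names, key=trailing_number, reverse=True)


# The two example names from the post above.
tasks = [
    "de_modfit_70_bundle5_3s_south_pt2_2_1663946992_2335911_1",
    "de_modfit_70_bundle5_3s_south_pt2_3_1663946992_2258518_2",
]

ordered = crunch_order(tasks)
for name in ordered:
    print(name)
```

With these two names, the task ending in `_2` is listed first, which matches the advice to crunch the second one first.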
©2024 Astroinformatics Group