Welcome to MilkyWay@home

Server Trouble

Message boards : News : Server Trouble

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71790 - Posted: 21 Feb 2022, 19:44:42 UTC

Hey Everyone,

The server appears to be having some connectivity issues. Additionally, one of the drives on the server appears to have failed - luckily things are mirrored so we haven't lost any data. However, I need to make a backup of things, which could take several hours. Once everything is backed up I'll try to clear out this transitioner/validator backlog and get things up to speed again.

Best,
Tom
ID: 71790 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71791 - Posted: 22 Feb 2022, 19:11:15 UTC

The problem with the bad drive has been handled, but things are struggling to come back up. I'm working on it currently.
ID: 71791 · Rating: 0
alk44

Joined: 2 Mar 20
Posts: 131
Credit: 317,012,875
RAC: 29,480
Message 71792 - Posted: 22 Feb 2022, 19:49:23 UTC

Thanks a lot Tom. Really appreciate you letting us know when things are not running properly.
ID: 71792 · Rating: 0
PecosRiverM

Joined: 25 Aug 17
Posts: 12
Credit: 1,229,638,066
RAC: 31,197
Message 71793 - Posted: 22 Feb 2022, 20:59:55 UTC - in response to Message 71792.  

> Thanks a lot Tom. Really appreciate you letting us know when things are not running properly.

Double for me. I've a few that need to come back home.
ID: 71793 · Rating: 0
HRFMguy

Joined: 12 Nov 21
Posts: 236
Credit: 575,038,236
RAC: 234
Message 71794 - Posted: 22 Feb 2022, 21:07:32 UTC - in response to Message 71793.  

Things are looking up. I was able to send in 47 tasks and get one back for my slowest CPU.
ID: 71794 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71795 - Posted: 22 Feb 2022, 21:39:45 UTC

Looks like things are stable for the moment. I'm going to keep an eye on the validator and transitioner backlogs. Hopefully they will start decreasing as things begin to flow again.

I also think the server is running slower than usual because it is re-configuring itself after losing the broken hard drive.

We will have to take the server down again in a few days to replace the bad drive, but I'll let you know when that happens. I'll also keep in touch in case any more maintenance is needed before then.
ID: 71795 · Rating: 0
Cliff

Joined: 2 Oct 09
Posts: 3
Credit: 5,977,766
RAC: 3,791
Message 71796 - Posted: 22 Feb 2022, 21:58:40 UTC - in response to Message 71790.  

Tom,

No worries. Have you tried hitting it with a very large hammer? :)

Seriously though, all of us "data crunchers" truly appreciate all of the infrastructure work and support that you provide.

Kind Regards,
Cliff
Philadelphia, PA
ID: 71796 · Rating: 0
Cameron

Joined: 16 Dec 07
Posts: 37
Credit: 24,675,139
RAC: 6,766
Message 71797 - Posted: 23 Feb 2022, 1:30:10 UTC

Thanks for the hard work Tom.
One of the benefits of the 14-day deadlines is that results can wait a few days before being returned if something unexpected happens.
ID: 71797 · Rating: 0
Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 71801 - Posted: 23 Feb 2022, 6:50:21 UTC - in response to Message 71790.  
Last modified: 23 Feb 2022, 6:53:46 UTC

> Hey Everyone,
>
> The server appears to be having some connectivity issues. Additionally, one of the drives on the server appears to have failed - luckily things are mirrored so we haven't lost any data. However, I need to make a backup of things, which could take several hours. Once everything is backed up I'll try to clear out this transitioner/validator backlog and get things up to speed again.
>
> Best,
> Tom

When I ran servers we had RAID 6. Yes, mirroring isn't exactly the same, but surely you can just pull out the bad drive while it's running and shove in another, and it rebuilds in the background? I even used this function to upgrade to larger disks without interrupting the users; I just did one at a time.

Another question, sorry. How come whenever it recovers from a problem, the server ends up with 2 million tasks queued to send out instead of the usual 11000?

Anyway it was nice to walk into my garage this morning to feel the temperature of a butterfly house. I knew what that meant! My Amazon parrots thank you.
ID: 71801 · Rating: 0
Luciferius Infernalis Vel Tohu

Joined: 3 May 18
Posts: 7
Credit: 45,954
RAC: 0
Message 71803 - Posted: 23 Feb 2022, 11:57:06 UTC - in response to Message 71790.  

Hello Tom,

Can I get my points for my most recent work? It was done at the same time the server was having trouble.

Best, Luciferius Infernalis Vel Tohu
ID: 71803 · Rating: 0
Bobzilla

Joined: 27 Aug 20
Posts: 6
Credit: 39,683,314
RAC: 0
Message 71804 - Posted: 23 Feb 2022, 12:21:20 UTC - in response to Message 71790.  

Time to break out the old Tonya Harding and give her a good whack...
ID: 71804 · Rating: 0
Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 71805 - Posted: 23 Feb 2022, 15:46:33 UTC - in response to Message 71804.  

> Time to break out the old Tonya Harding and give her a good whack...

Is that Cockney Rhyming Slang?
I had to look up who she is. My god, how can a cute thing become so repulsive?
ID: 71805 · Rating: 0
Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 71806 - Posted: 23 Feb 2022, 15:48:05 UTC - in response to Message 71803.  

> Luciferius Infernalis Vel Tohu

My Latin isn't so good. Infernal devil?
ID: 71806 · Rating: 0
Baja Sandor

Joined: 4 Mar 20
Posts: 10
Credit: 10,794,354
RAC: 0
Message 71807 - Posted: 23 Feb 2022, 16:35:14 UTC

Thanks
ID: 71807 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71810 - Posted: 23 Feb 2022, 21:31:15 UTC - in response to Message 71801.  

> Another question, sorry. How come whenever it recovers from a problem, the server ends up with 2 million tasks queued to send out instead of the usual 11000?

This can be due to connection issues: the server keeps accumulating jobs but isn't able to establish connections to volunteers to hand them out. It can also be due to server memory problems, where it doesn't have the memory available to query the database fast enough to hand everything out in time.
ID: 71810 · Rating: 0
Mr P Hucker

Joined: 5 Jul 11
Posts: 990
Credit: 376,143,149
RAC: 0
Message 71811 - Posted: 23 Feb 2022, 21:36:49 UTC - in response to Message 71810.  

>> Another question, sorry. How come whenever it recovers from a problem, the server ends up with 2 million tasks queued to send out instead of the usual 11000?
>
> This can be due to connection issues, so the server keeps accumulating jobs but isn't able to establish connections to volunteers to hand them out. It can also be due to server memory problems, where it doesn't have the memory available to query the database fast enough to hand everything out in time.

I thought you'd set a limit, as it's usually precisely 10000 Separation and 1000 N-body.
ID: 71811 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71816 - Posted: 24 Feb 2022, 3:54:18 UTC - in response to Message 71811.  

I believe the server tries to keep at least that many tasks queued up to be ready to send out, but sometimes it increases if the server is struggling.
ID: 71816 · Rating: 0
Dataman

Joined: 14 Mar 09
Posts: 1
Credit: 11,711,203
RAC: 5
Message 71847 - Posted: 2 Mar 2022, 16:30:10 UTC - in response to Message 71791.  

Thanks for all you do
ID: 71847 · Rating: 0
3man001

Joined: 1 Jan 22
Posts: 1
Credit: 26,628,503
RAC: 995
Message 71848 - Posted: 2 Mar 2022, 17:34:51 UTC

Server not fixed yet?
ID: 71848 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71849 - Posted: 3 Mar 2022, 4:16:14 UTC

There is a large number of workunits that are waiting to be validated. I think that this is because the server is running more slowly while it is degraded. When we get the drive back we can restore full functionality of the server. Hopefully that isn't too long, but in the meantime this big drop in RAC might just be how things are at the moment.
ID: 71849 · Rating: 0

©2024 Astroinformatics Group