Welcome to MilkyWay@home

Work should be flowing

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 41484 - Posted: 15 Aug 2010, 22:05:35 UTC - in response to Message 41476.  

As noted elsewhere, the workaround is to set up an automatic process which stops and restarts the various processes or does a full server shutdown and restart to clear out the problems (temporarily) -- but this is only a workaround, as it is fairly clear that there is an underlying root-cause problem which needs attention.


I'm pretty sure it's because it's the summer and RPI is doing rewiring and moving machines around, so lab staff keep shutting the MilkyWay server down.
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41487 - Posted: 15 Aug 2010, 22:45:17 UTC - in response to Message 41484.  
Last modified: 15 Aug 2010, 22:49:55 UTC

Travis, I have to disagree -- this problem has been going on for months now. Every two days to a week, the validator and work generator stop validating and generating. Someone then goes in and either stops and restarts the processes or does a full server restart, and the problem 'goes away' -- until it resurfaces. This is not a power outage issue (though those have happened); it is an underlying process problem (memory leak, whatever) that keeps recurring and only gets treated symptomatically by a restart.

The situation at this instant demonstrates this: the server has reported 'all green', yet no work is available and the number of results awaiting validation is growing rapidly, because work is being reported while the validator is not working. This has been the case since early this morning, and the situation has been repeating itself in exactly this way every three days to a week for at least a couple of months -- perhaps longer.
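
The symptom described here (every daemon shown as running while no work goes out and the validation backlog grows) suggests checking throughput rather than process status. A minimal sketch in Python, assuming two counters can be sampled periodically from the server; the `Snapshot` fields and the `looks_stalled` helper are hypothetical names for illustration, not part of the project's actual code:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Snapshot:
    """One periodic reading of two (hypothetical) server counters."""
    ready_to_send: int        # work units currently available for download
    awaiting_validation: int  # results reported but not yet validated


def looks_stalled(history: Sequence[Snapshot], min_samples: int = 3) -> bool:
    """Return True if, over the most recent samples, no work was available
    and the validation backlog grew at every step."""
    if len(history) < min_samples:
        return False  # not enough data to judge
    recent = history[-min_samples:]
    no_work = all(s.ready_to_send == 0 for s in recent)
    backlog_growing = all(
        later.awaiting_validation > earlier.awaiting_validation
        for earlier, later in zip(recent, recent[1:])
    )
    return no_work and backlog_growing
```

A check like this would flag the stall even while the status page shows every process as up, since it looks at what the processes produce rather than whether they exist.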


Profile Travis
Message 41489 - Posted: 15 Aug 2010, 23:21:27 UTC - in response to Message 41487.  

There is a memory leak in the new assimilator, which I have been actively trying to debug. The problem is that it takes about 4-5 days for it to leak enough memory for it to actually crash anything.

I'm also working on a way to get the new assimilator to show up on the server status page, but that's kind of on the back burner until we have the nbody simulation stuff up and running (as it's more of a cosmetic problem).
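
A leak that needs 4-5 days to become fatal can still be spotted earlier by sampling the process's memory use and extrapolating. A minimal sketch in Python, assuming periodic (timestamp, resident memory) samples are already being collected by some means; the function names and the memory limit are hypothetical and are not taken from the assimilator code:

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in seconds, memory in bytes)


def leak_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory against time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def seconds_until_limit(samples: Sequence[Sample],
                        limit_bytes: float) -> Optional[float]:
    """Estimate how long until memory reaches limit_bytes.

    Returns None when memory is flat or shrinking (no leak visible)."""
    rate = leak_rate(samples)
    if rate <= 0:
        return None
    _, latest = samples[-1]
    return max(0.0, (limit_bytes - latest) / rate)
```

With a few hours of samples, a steady upward slope would show up long before the multi-day crash, which shortens the debugging loop.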
BarryAZ

Message 41490 - Posted: 15 Aug 2010, 23:39:06 UTC - in response to Message 41489.  

OK -- so you are aware of the memory leak -- and seemingly when it goes boom, that means validation stops and new work generation stops. And, as you now stated, it takes 4 to 5 days (actually it varies from about 3 to 7 days, perhaps depending on overall traffic).

So, until this is dealt with, the suggestion from the peanut gallery is to set up, as a stopgap, an automated server stop-and-restart process that clears the leak (temporarily) and runs every 3 days or so.

For now though, at this moment, the memory leak HAS surfaced again and no work has been available or validated for the past 8 to 10 hours or more.
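
The stopgap suggested above (an automatic stop and restart roughly every 3 days) could be sketched as follows in Python. This is only an illustration under stated assumptions: the stop and start commands are placeholders supplied by the caller, the helper names are hypothetical, and the 3-day interval is taken from the suggestion rather than from any measured value.

```python
import subprocess
from typing import Callable, Sequence

# "every 3 days or so", expressed in seconds
RESTART_INTERVAL_SECONDS = 3 * 24 * 60 * 60


def restart_due(last_restart: float, now: float,
                interval: float = RESTART_INTERVAL_SECONDS) -> bool:
    """Return True once at least `interval` seconds have passed."""
    return now - last_restart >= interval


def restart_daemons(stop_cmd: Sequence[str], start_cmd: Sequence[str],
                    run: Callable = subprocess.run) -> None:
    """Stop the daemons, then start them again.

    `stop_cmd` and `start_cmd` are placeholder command lines; with
    check=True, a failing stop raises and the start is not attempted."""
    run(stop_cmd, check=True)
    run(start_cmd, check=True)
```

In practice a scheduler such as cron would trigger the restart, so the interval check is mainly useful when the logic runs inside a longer-lived watchdog. As noted in the post, this only clears the symptom and does not fix the leak.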



Profile Travis
Message 41491 - Posted: 15 Aug 2010, 23:41:45 UTC - in response to Message 41490.  

Actually, this was a bug introduced by the updated assimilator code (which was changed to work with both the separation workunits and the nbody simulation workunits). It's fixed now, and it looks like the assimilator is happily generating work again.
BarryAZ

Message 41495 - Posted: 16 Aug 2010, 1:07:43 UTC - in response to Message 41491.  

OK - I see that things are running again -- this is good. Now if we don't see this resurface in the next 3 to 10 days, we can all be happy campers.
Profile Travis
Message 41496 - Posted: 16 Aug 2010, 1:10:14 UTC - in response to Message 41495.  

Well, at least in the next few days we should be getting the nbody simulation binaries out there, so if one assimilator crashes, hopefully the other will stay up. :)
BarryAZ

Message 41498 - Posted: 16 Aug 2010, 6:29:21 UTC - in response to Message 41496.  
Last modified: 16 Aug 2010, 6:35:00 UTC

Oops -- not sure if this is planned or not, but there is maintenance going on (as of 11:30 PM PDT) -- processes are offline again.

data-driven web pages    milkyway   Running
upload/download server   milkyway   Running
scheduler                milkyway   Running
feeder                   milkyway   Not Running
transitioner             milkyway   Not Running
milkyway_purge           milkyway   Not Running
file_deleter             milkyway   Not Running
Profile Travis
Message 41500 - Posted: 16 Aug 2010, 8:10:43 UTC - in response to Message 41498.  

Should be back. I've been watching it while working on the nbody assimilator.