Welcome to MilkyWay@home

server down?


Message boards : Number crunching : server down?

Benzini
Joined: 9 Jul 10
Posts: 4
Credit: 1,887,565
RAC: 0
Message 41215 - Posted: 3 Aug 2010, 10:35:40 UTC

Are the servers down? I can't receive any new WUs...

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 41216 - Posted: 3 Aug 2010, 10:46:40 UTC - in response to Message 41215.  

Nope, the server is up. It just does not have anything to send out. I too have no work!

Ian
Joined: 9 Feb 10
Posts: 7
Credit: 185,642,798
RAC: 0
Message 41218 - Posted: 3 Aug 2010, 11:08:42 UTC - in response to Message 41215.  

Perhaps time to head over and crunch with the mathematical catz at Collatz...

Bruce
Joined: 28 Apr 08
Posts: 1415
Credit: 2,716,428
RAC: 0
Message 41465 - Posted: 15 Aug 2010, 12:03:21 UTC

Server status at this time is

feeder milkyway Not Running
transitioner milkyway Not Running
milkyway_purge milkyway Not Running
file_deleter milkyway Not Running

I think it's time to kick the server one more time.

John Clark
Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 41467 - Posted: 15 Aug 2010, 12:33:19 UTC

With the server status showing red on certain ones, it looks like someone is in the process of kicking but more than a simple reboot.
Go away, I was asleep

banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 41472 - Posted: 15 Aug 2010, 17:21:02 UTC

Green now, but the validator is still stuck. 42k+ results.
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

BarryAZ
Joined: 1 Sep 08
Posts: 519
Credit: 283,263,673
RAC: 3,504
Message 41475 - Posted: 15 Aug 2010, 18:47:24 UTC

There is pretty clearly a root cause problem which results in a 'stuck validator' and no new work. The periodic workaround (which apparently has not been implemented) would be to simply batch stop all processes and restart them on a daily basis (or something similar to that). But this is just that: a workaround to address the symptom, not a real fix for this recurring issue.

During the past week this problem was masked by the power outage issues - which clearly trump all other issues.

But frankly, I have seen no discussion regarding this coming from the project side -- so it really isn't clear to me that the project side folks are alert to the existence of a root cause problem. If they are not aware of it, they are not looking into real problem solving.

So instead, we either lament here (in seeming isolation from the project folks) or inundate the admins via email to have them manually do the workaround of stop/starting or rebooting the server.

One would hope that at some point (this root cause issue has been around for months) the proverbial light will go on back at the shop so we can see an effort to resolve the root cause. Until that is done, we all get to make sure we have set up alternative *more reliable* projects to work with.

BarryAZ
Joined: 1 Sep 08
Posts: 519
Credit: 283,263,673
RAC: 3,504
Message 41477 - Posted: 15 Aug 2010, 20:24:11 UTC - in response to Message 41467.  
Last modified: 15 Aug 2010, 20:28:45 UTC

For me, yes the repetitive root cause problem is troublesome, but even more so is that this problem, which has persisted for months, seems to be something only those in the 'user community' are aware of. Informational vacuums are really frustrating. I begin to feel like there is no 'there' there.

I mean, a simple 'yes we have seen the message traffic, and yes we are aware that there is a core problem, and yes we are looking into it, but are still puzzled by it' would go a long way toward reducing my concern. As it is, at this juncture, based on the lack of information, I don't know that any of us who linger here have any assurance that a) MW project folks are aware of a problem, or b) that they are aware that there is a repetitive 'root cause' issue. And if they are not aware of a problem, the only thing done is 'reaction' reboots and restarts when an email storm hits.

With the server status showing red on certain ones, it looks like someone is in the process of kicking but more than a simple reboot.

The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 41479 - Posted: 15 Aug 2010, 20:48:27 UTC

In all honesty, the server needs to be rebooted every 3-4 days. How about every Friday @ midday and Monday @ midday.

BarryAZ
Joined: 1 Sep 08
Posts: 519
Credit: 283,263,673
RAC: 3,504
Message 41481 - Posted: 15 Aug 2010, 21:23:25 UTC - in response to Message 41479.  

As a workaround, that would be a start for sure.

But again, unless and until *someone* at RPI is actually aware of the problem (and they may be, we just can't tell from the message board traffic over the past few months), deploying even this workaround is probably not going to happen.

Informational vacuums -- sort of like breathing in deep space.


In all honesty, the server needs to be rebooted every 3-4 days. How about every Friday @ midday and Monday @ midday.

TJ
Joined: 12 Aug 09
Posts: 262
Credit: 92,631,041
RAC: 3
Message 41488 - Posted: 15 Aug 2010, 22:52:42 UTC

A regular reboot will not always solve problems. We (a research center) often have servers running for months without any issue, and sometimes we need a reboot 3 or 4 times in a row.

This is a science project where we donate calculation possibilities. It is not as if we are clients whom they have to keep happy. Of course they try to do so, as they need the calculation power. But at universities there are sometimes other priorities, and technical support with enough knowledge is not always available either. So give them some space(time). There are enough interesting projects. Okay, for the credit-hunters Milkyway is a must... happy crunching.
Greetings from,
TJ

mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0
Message 41492 - Posted: 15 Aug 2010, 23:44:09 UTC
Last modified: 15 Aug 2010, 23:44:21 UTC

I am downloading new work units as I type this. Go get them....

BarryAZ
Joined: 1 Sep 08
Posts: 519
Credit: 283,263,673
RAC: 3,504
Message 41493 - Posted: 15 Aug 2010, 23:57:13 UTC - in response to Message 41488.  

Oh, I agree that a reboot will not always solve the problem here -- *even as a workaround*. I am happy to report that Travis has noted he is aware of a memory leak problem which might well be the root cause here. That is good news.

And that reduces my angst -- if the folks at RPI are aware of a root cause issue then they might be able to 1) automate a workaround (even knowing it isn't a 100%, albeit temporary, solution) and 2) work toward resolving the root cause issue as well.

Also, as I (and others) noted elsewhere, there are these days at least a couple of other projects for ATI GPU crunchers. Interestingly enough, those projects seem to run more reliably with significantly less resources than this one. Of course they run with far less user traffic (Collatz has 1/6 the users and Dnetc has 1/30 the users -- though both those projects might well have more workstations per user since they don't require double precision GPU's).

For me personally, for quite a while MW was my number one running project, these days in terms of total credit it is behind Collatz and at a guess, within two months it will be behind Dnetc. That is a function of both the higher reliability of the other two projects and their configuration which allows use of single precision GPU's, even though on a per GPU cycle basis, MW is seemingly more efficient (in terms of credits).

Also, as others have noted as well, with Travis to at least some degree, 'moving on' with his goals and objectives, the kind of active project handholding that this project (like most BOINC projects) needs on an ongoing basis is clearly less available. Other folks at RPI are interested in keeping the ball rolling, but they (for various good reasons no doubt) are not as committed to the care and feeding of the project. I do get that.

One further note, to the extent that the ongoing data being produced is of value to researchers at RPI (and I don't know how valuable ongoing new results are for them these days), the reliability issues are reducing the new data flow for them. If that flow is of interest, then reliability troubleshooting will be of higher value to them than if it is not.



A regular reboot will not always solve problems. We (a research center) often have servers running for months without any issue, and sometimes we need a reboot 3 or 4 times in a row.

This is a science project where we donate calculation possibilities. It is not as if we are clients whom they have to keep happy. Of course they try to do so, as they need the calculation power. But at universities there are sometimes other priorities, and technical support with enough knowledge is not always available either. So give them some space(time). There are enough interesting projects. Okay, for the credit-hunters Milkyway is a must... happy crunching.

BarryAZ
Joined: 1 Sep 08
Posts: 519
Credit: 283,263,673
RAC: 3,504
Message 41499 - Posted: 16 Aug 2010, 6:39:20 UTC

Well the server processes were restarted and running -- for about 6 hours.



data-driven web pages milkyway Running
upload/download server milkyway Running
scheduler milkyway Running

feeder milkyway Not Running
transitioner milkyway Not Running
milkyway_purge milkyway Not Running
file_deleter milkyway Not Running
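
For anyone who wants to watch for this state without eyeballing the status page, here is a minimal sketch that reads a pasted listing like the ones in this thread and reports which processes are down. It assumes each row has the layout "program host status" with the status being either "Running" or "Not Running", which is only inferred from the listings posted here. The names parse_status and stopped_programs are made up for this example; nothing below talks to the server or is part of BOINC itself.

```python
from typing import List, NamedTuple


class DaemonStatus(NamedTuple):
    program: str   # e.g. "feeder" or "data-driven web pages"
    host: str      # e.g. "milkyway"
    running: bool


def parse_status(text: str) -> List[DaemonStatus]:
    """Parse rows shaped like 'feeder milkyway Not Running' (assumed layout)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Check "Not Running" first, because it also ends with "Running".
        if line.endswith("Not Running"):
            rest, running = line[: -len("Not Running")], False
        elif line.endswith("Running"):
            rest, running = line[: -len("Running")], True
        else:
            continue  # blank lines and anything that is not a status row
        parts = rest.split()
        if len(parts) < 2:
            continue  # need at least a program name and a host
        # The last remaining token is the host; the program name may have spaces.
        rows.append(DaemonStatus(" ".join(parts[:-1]), parts[-1], running))
    return rows


def stopped_programs(rows: List[DaemonStatus]) -> List[str]:
    """Return the names of the programs reported as not running."""
    return [row.program for row in rows if not row.running]


if __name__ == "__main__":
    listing = """
    data-driven web pages milkyway Running
    upload/download server milkyway Running
    scheduler milkyway Running

    feeder milkyway Not Running
    transitioner milkyway Not Running
    milkyway_purge milkyway Not Running
    file_deleter milkyway Not Running
    """
    print(stopped_programs(parse_status(listing)))
```

Parsing from the right-hand side keeps multi-word program names such as "data-driven web pages" intact. A check like this only flags the symptom; it says nothing about the memory leak or whatever else is behind the stalls.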

©2020 Astroinformatics Group