Message boards : Number crunching : server down?
Author | Message |
---|---|
Joined: 9 Jul 10 Posts: 4 Credit: 1,887,565 RAC: 0 |
Are the servers down? I can't receive any new WUs... |
Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
Nope, the server is up. It just does not have anything to send out. I too have no work! |
Joined: 28 Apr 08 Posts: 1415 Credit: 2,716,428 RAC: 0 |
Server status at this time: feeder (milkyway) Not Running; transitioner (milkyway) Not Running; milkyway_purge (milkyway) Not Running; file_deleter (milkyway) Not Running. I think it's time to kick the server one more time. |
Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0 |
With the server status showing red on certain daemons, it looks like someone is in the process of kicking it with more than a simple reboot. Go away, I was asleep |
Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0 |
Green now, but the validator is still stuck. 42k+ results. Doesn't expecting the unexpected make the unexpected the expected? If it makes sense, DON'T do it. |
Joined: 1 Sep 08 Posts: 520 Credit: 302,528,469 RAC: 184 |
There is pretty clearly a root cause problem which results in a 'stuck validator' and no new work. The periodic workaround (which apparently has not been implemented) would be to simply batch-stop all processes and restart them on a daily basis (or something similar). But that is just that: a workaround that addresses the symptom, not a real fix for this recurring issue. During the past week this problem was masked by the power outage issues, which clearly trump all other issues. But frankly, I have seen no discussion of this coming from the project side -- so it really isn't clear to me that the project-side folks are alert to the existence of a root cause problem; not being aware of it, they are not doing any real problem solving. So instead, we either lament here (in seeming isolation from the project folks) or inundate the admins via email to have them manually do the workaround of stopping/starting or rebooting the server. One would hope that at some point (this root cause issue has been around for months) the proverbial light will go on back at the shop and we will see an effort to resolve the root cause. Until that is done, we all get to make sure we have set up alternative *more reliable* projects to work with. |
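The "batch stop all processes and restart them daily" workaround described above could be sketched with the standard BOINC server admin scripts (`bin/stop` and `bin/start` in the project directory). This is only an illustration, not something the project is known to run; the project path, wait time, and schedule below are assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a daily stop/start of the BOINC project daemons.

Assumes a standard BOINC server layout where the project directory
contains bin/stop and bin/start; the path below is hypothetical and
would need to match the real MilkyWay@home installation.
"""
import subprocess
import sys
import time

PROJECT_DIR = "/home/boincadm/projects/milkyway"  # hypothetical path


def run(script: str) -> int:
    """Run one of the project's admin scripts and return its exit code."""
    result = subprocess.run([f"{PROJECT_DIR}/bin/{script}"], cwd=PROJECT_DIR)
    return result.returncode


def main() -> int:
    # Stop every daemon (feeder, transitioner, validator, file_deleter, ...).
    if run("stop") != 0:
        print("bin/stop failed; not attempting a restart", file=sys.stderr)
        return 1

    # Give the daemons a moment to shut down and release the database.
    time.sleep(60)

    # Bring everything back up.
    if run("start") != 0:
        print("bin/start failed; daemons are still down", file=sys.stderr)
        return 1

    print("project daemons restarted")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A crontab entry along the lines of `0 12 * * * /usr/bin/python3 /path/to/restart_daemons.py` (path and time are placeholders) would apply the workaround once a day at noon.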
Joined: 1 Sep 08 Posts: 520 Credit: 302,528,469 RAC: 184 |
For me, yes, the repetitive root cause problem is troublesome, but even more so is that this problem, which has persisted for months, seems to be something only those in the 'user community' are aware of. Informational vacuums are really frustrating. I begin to feel like there is no 'there' there. I mean, a simple 'yes, we have seen the message traffic, and yes, we are aware that there is a core problem, and yes, we are looking into it, but are still puzzled by it' would go a long way toward reducing my concern. As it is, at this juncture, based on the lack of information, those of us who linger here have no assurance that a) the MW project folks are aware of a problem, or b) that they are aware there is a repetitive 'root cause' issue; and so, since they may not be aware of the problem, the only thing done is 'reaction' reboots and restarts when an email storm hits. |
Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
In all honesty, the server needs to be rebooted every 3-4 days. How about every Friday @ midday and Monday @ midday? |
Joined: 1 Sep 08 Posts: 520 Credit: 302,528,469 RAC: 184 |
As a workaround, that would be a start for sure. But again, unless and until *someone* at RPI is actually aware of the problem (and they may be, we just can't tell from the message board traffic over the past few months), deploying even this workaround is probably not going to happen. Informational vacuums -- sort of like breathing in deep space. |
Joined: 12 Aug 09 Posts: 262 Credit: 92,631,041 RAC: 0 |
A regular reboot will not always solve problems. We (a research center) often have servers running for months without any issue, and sometimes we need to reboot 3 or 4 times in a row. This is a science project to which we donate computing power. It is not that we are clients whom they have to keep happy. Of course they try to do so, as they need the calculation power. But at universities there are sometimes other priorities, and technical support with enough knowledge is not always available either. So give them some space(time). There are enough interesting projects. Okay, for the credit-hunters MilkyWay is a must... Happy crunching. Greetings from TJ |
Joined: 25 Jun 10 Posts: 284 Credit: 260,490,091 RAC: 0 |
I am downloading new work units as I type this. Go get them.... |
Joined: 1 Sep 08 Posts: 520 Credit: 302,528,469 RAC: 184 |
Oh, I agree that a reboot will not always solve the problem here -- *even as a workaround*. I am happy to report that Travis has noted he is aware of a memory leak problem which might well be the root cause here. That is good news, and it reduces my angst -- if the folks at RPI are aware of a root cause issue, then they might be able to 1) automate a workaround (even knowing it is only a temporary, not 100%, solution) and 2) work toward resolving the root cause issue as well. Also, as I (and others) noted elsewhere, there are these days at least a couple of other projects for ATI GPU crunchers. Interestingly enough, those projects seem to run more reliably with significantly fewer resources than this one. Of course they run with far less user traffic (Collatz has 1/6 the users and Dnetc has 1/30 the users -- though both of those projects might well have more workstations per user, since they don't require double-precision GPUs). For me personally, MW was my number one running project for quite a while; these days, in terms of total credit, it is behind Collatz, and at a guess, within two months it will be behind Dnetc. That is a function of both the higher reliability of the other two projects and their configuration, which allows use of single-precision GPUs, even though on a per-GPU-cycle basis MW is seemingly more efficient (in terms of credits). Also, as others have noted, with Travis, to at least some degree, 'moving on' to his own goals and objectives, the kind of active project handholding that this project (like most BOINC projects) needs on an ongoing basis is clearly less available. Other folks at RPI are interested in keeping the ball rolling, but they (for various good reasons, no doubt) are not as committed to the care and feeding of the project. I do get that. One further note: to the extent that the ongoing data being produced is of value to researchers at RPI (and I don't know how valuable ongoing new results are for them these days), the reliability issues are reducing the new data flow for them. If that flow is of interest, then reliability troubleshooting will be of higher value to them than if it is not. |
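If a memory leak in one of the daemons really is the root cause, the "automate a workaround" idea mentioned above could look something like the watchdog sketched below: check the resident memory of the project daemons and bounce them via the standard BOINC `bin/stop` / `bin/start` scripts when a limit is crossed. The daemon names, memory limit, and project path are illustrative assumptions, not anything confirmed by the project, and psutil is a third-party dependency.

```python
#!/usr/bin/env python3
"""Watchdog sketch: restart the project daemons if one of them leaks memory.

The daemon names, memory threshold, and project path are guesses for
illustration only; bin/stop and bin/start are the standard BOINC server
admin scripts.
"""
import subprocess

import psutil  # third-party: pip install psutil

PROJECT_DIR = "/home/boincadm/projects/milkyway"   # hypothetical path
WATCHED = {"feeder", "transitioner", "validator", "file_deleter"}
RSS_LIMIT_BYTES = 2 * 1024**3                      # restart above ~2 GiB


def leaking_daemons():
    """Return the watched daemons whose resident memory exceeds the limit."""
    offenders = []
    for proc in psutil.process_iter(["name", "memory_info"]):
        name = proc.info["name"] or ""
        mem = proc.info["memory_info"]
        if name in WATCHED and mem is not None and mem.rss > RSS_LIMIT_BYTES:
            offenders.append(name)
    return offenders


def restart_daemons():
    """Stop and start all project daemons via the BOINC admin scripts."""
    subprocess.run([f"{PROJECT_DIR}/bin/stop"], cwd=PROJECT_DIR, check=True)
    subprocess.run([f"{PROJECT_DIR}/bin/start"], cwd=PROJECT_DIR, check=True)


if __name__ == "__main__":
    bad = leaking_daemons()
    if bad:
        print("restarting daemons, memory limit exceeded by: " + ", ".join(bad))
        restart_daemons()
```

Run from cron every few minutes, a watchdog like this would keep the known symptom in check without anyone having to wait for an email storm, while the actual leak gets investigated.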
Joined: 1 Sep 08 Posts: 520 Credit: 302,528,469 RAC: 184 |
Well the server processes were restarted and running -- for about 6 hours. Current status: data-driven web pages (milkyway) Running; upload/download server (milkyway) Running; scheduler (milkyway) Running; feeder (milkyway) Not Running; transitioner (milkyway) Not Running; milkyway_purge (milkyway) Not Running; file_deleter (milkyway) Not Running. |