Welcome to MilkyWay@home

Server Crash November 10

Message boards : Number crunching : Server Crash November 10

Zanth
Joined: 18 Feb 09
Posts: 158
Credit: 110,699,054
RAC: 0
Message 33281 - Posted: 14 Nov 2009, 23:26:10 UTC - in response to Message 33273.  

Those that are using ATI GPU's do have ONE current alternative. I'd love there to be more.

Yes, wouldn't we all. But how many alternatives were there when I invested in ATI cards in the early days, not so far back at the beginning of this year?

NONE. ZILCH.

No other projects. Just MilkyWay and no promise at all that there would be any other ATI projects.

It's lucky that we have Collatz, and unlucky that I can't hammer MW right now ;)



GPU Grid is possibly weeks away from ATI WUs now as well. But it's looking as if they may only accept 5000 series cards. They've begun tests but only had a 4850 to test with, and decided that, with OpenCL having to create virtual memory, it was just too slow. They expect to receive a 5870 this week to begin testing.
ID: 33281 · Rating: 0
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 33282 - Posted: 14 Nov 2009, 23:34:01 UTC - in response to Message 33270.  
Last modified: 14 Nov 2009, 23:45:23 UTC

I think you may have the more realistic assessment here -- certainly the 1 or 2 days optimism NEVER HAD A CHANCE of happening.


The news announcement did not say "The hard drives will be here in 1 or 2 days...". It actually said that they would "hopefully" have the drives in 1 or 2 days.

There is likely a Purchase Order process that has to be followed inside the university. If the PO got signed, then it probably was not signed until the 11th, so the order couldn't happen until at least the 11th. After that, it would depend upon availability of the drives at whoever the order was placed with and the method of delivery chosen (ground, next day, 2 day, 3 day). An order shipped on the 12th for 2-day delivery via UPS or FedEx would be delivered on Monday, as you have to specify Saturday Delivery with both of them.
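As a rough sketch of that date arithmetic: assuming the carrier counts only Monday to Friday as transit days and ignoring holidays (the helper name `add_business_days` is made up for illustration, not anything a carrier publishes), a package shipped on Thursday the 12th with 2-day service lands on Monday the 16th.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturday and Sunday.

    Assumption: no holiday calendar and no Saturday delivery option.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


shipped = date(2009, 11, 12)  # a Thursday
delivered = add_business_days(shipped, 2)
print(delivered.isoformat(), delivered.weekday())  # weekday 0 means Monday
```

Paying for Saturday delivery would change the answer, which is why the shipping method matters as much as the order date.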

After shipment, depending on where the drives came from, any delivery could be problematic. USPS was not picking up / delivering packages on the 11th (Veterans Day), so a small backlog would've happened. I do not know about UPS or FedEx. Additionally, parts of the Eastern Seaboard were under threat of flooding and/or strong winds, which could've delayed air/truck transportation times.

Long story made short...there are multiple reasons why ordering hardware can take longer than 1-2 days. Should Travis not have mentioned a timeframe? Most definitely. At most he should've said "as soon as possible", but even with that it would likely have been translated into 1-7 days... Once any numeric timeframe is stated, people rigidly stick to it and expect it, even if there are genuine reasons why the timeframe slipped. I saw it as a programmer. If we stated that we thought something would take 3 weeks, but once we got into it we realized it was more complex than we thought and we wanted another 1-2 weeks, we would get told that it had to be done in the original estimate, even if it meant we had to work 12-16 hour days (or longer) and not have the weekend off to make it happen. Yes, I really was told that one time...that if something wasn't done, I should consider the weekend to be regular working days... If it hadn't gotten done by the original estimate, it would not have been a "life or death" scenario, but it was treated as such...
ID: 33282 · Rating: 0
David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33283 - Posted: 14 Nov 2009, 23:47:46 UTC

For what it's worth, I can attach to Collatz so my 4850 is just sitting here atm.

I have also emailed Travis to fwd to the Professors, stating that the project should stay down until after his thesis defense, with a front page notice of same. I expect to read an update on Monday about where things are.

In the meantime, I am still playing with Linux trying to get Boinc running faster than the Vista version. No luck so far. Today I will try Fedora and see how that goes.

Meanwhile, the Cudas are happily crunching Seti.
ID: 33283 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33284 - Posted: 15 Nov 2009, 4:25:32 UTC - in response to Message 33282.  

Which would, based on your explanation of the process, have been excessively optimistic at that. I'm familiar with the procurement process, in my earlier life I was a lab administrator back at Yale, then later was in materials management at corporations. My own sense is that if a prospective timeframe were offered (and I think providing that sort of information is a good idea), then something like two or three weeks (but we hope it is less than that) would have avoided setting unrealistic expectations.

Again, MW has become a project where under performance to expectation has become increasingly the norm -- be it credit handling (where DA appears to have a significant influence), communications (which have become rather thin of late) or server reliability.

That being said, I understand and accept that for Travis, the focus is elsewhere (my wife is, in her second job, on faculty at the state university and at the same time is preparing for comps on yet another advanced degree for her -- (you'd think an MD, PsyD, and Masters degree would suffice <smile>)).

The thing is, of course, a project which offered support for CPU and GPU (especially ATI GPU), and also provided frequent regular information as MW did a year ago, now has moved in many ways back in the pack of projects where user frustration becomes an issue.

I don't really blame Travis for this, it is, as I noted before, a not unusual 'biological' element -- the life cycle of BOINC projects.

There are projects which demonstrate an extended life cycle -- I noted some of those elsewhere. Frankly, I suspect that MW will not be one of those. Should Travis be successful in his Thesis defense (and I HOPE he IS), then his focus will be on completing that work with a view toward getting his formal degree in the next semester, and then, absent a Postdoc at RPI, he'll be moving on. We've no evidence that MW has the committed support of anyone else at RPI to continue pushing the project forward.

I'm not whining here, just looking realistically down the road for this project. That is why I really hope other GPU (CUDA and ATI) projects surface. I particularly hope that GPU projects which can work effectively with lower power GPU's (like Collatz does), show up. I hope that happens particularly in the face of the prospects for this project, along with the significant dis-incentives coming forth from Berkeley regarding credit schemas which they plan to impose.



The news announcement did not say "The hard drives will be here in 1 or 2 days...". It actually said that they would "hopefully" have the drives in 1 or 2 days.



ID: 33284 · Rating: 0
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 33286 - Posted: 15 Nov 2009, 4:59:05 UTC - in response to Message 33284.  

Which would, based on your explanation of the process, have been excessively optimistic at that. I'm familiar with the procurement process, in my earlier life I was a lab administrator back at Yale, then later was in materials management at corporations. My own sense is that if a prospective timeframe were offered (and I think providing that sort of information is a good idea), then something like two or three weeks (but we hope it is less than that) would have avoided setting unrealistic expectations.


If two or three weeks had been said, then this same kind of talk would've happened at 14 days if either nothing had happened or nothing had been perceived to have happened.

If you want to see totally poor management, go look around at Cosmology. There are issues which have not been fixed in over a year. Admins continually bungle SQL scripts. The server crash in February/March of this year still has not been completely "fixed", meaning there are newer issues than those that are more than a year old, so they're not even back to where they were before the crash. The scientist had some health problems, but not a single person came onto the forum or the main page to even mention it. He then comes back, posts a handful of messages, then is gone again for another 5-6 weeks, until just this past week the transitioner failed. That's it...one piece of the BOINC server-side software. They've had 3 business days (Wednesday, Thursday, and Friday), as well as today to just get that going. Based on past experience, it won't get fixed until at least middle of next week, and possibly not until Thanksgiving or later.


Again, MW has become a project where under performance to expectation has become increasingly the norm -- be it credit handling (where DA appears to have a significant influence), communications (which have become rather thin of late) or server reliability.


Let's see. We have an admission that Dave is no longer participating, Travis said that he is working on his Thesis and that he has shown some scientists some basics of what to do, they have said in the past that they don't control the actual physical hardware, Travis just had to go to Spain for the BOINC Workshop, etc, etc, etc... Collatz goes through the same type of server problems and Jon over there made a post at one point that demonstrated how demanding people are and what it would take (him quitting jobs and working full time and then some on the project) just so that certain competitive people could have a hobby...

Maybe, just maybe, your expectations are a bit too high...

ID: 33286 · Rating: 0
David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33287 - Posted: 15 Nov 2009, 5:05:09 UTC

With regard to communication here, interestingly enough, the cosmo server went down about the same time. 93 hours later we got a one line response from Ben to say they were working on the problem, and not even on the main page at that.

We are now 135 hours in and no further communication has been forthcoming!

Edit: RE my previous 33283 ... can attach ... should have read ... can't attach .... Such a small change, such a big difference in outcome!

Regards
David
ID: 33287 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33288 - Posted: 15 Nov 2009, 7:07:31 UTC - in response to Message 33286.  

True enough -- and perhaps it is a case of too high expectations -- as seemingly a no new work and no news update in excess of three weeks would be ok with you.

It may well be, as this project is dependent on one person (Travis) who has things which are legitimately higher up the food chain for him, that we will be in a long cold spell here -- perhaps an indefinite to permanent cold spell (depending on events and priorities that Travis has).

Other projects go that route as well.

I suppose I would have had no problems had the timing post said something like, best case a few days, but perhaps a few weeks, and possibly longer than that.

For me, MW is a CPU project anyway, and there are plenty of other CPU projects out there. They have somewhat lower credit payouts (though over the past year, the difference has gotten smaller as Travis has responded to concerns from DA), but there are certainly a fair number of options there.

For folks with CUDA cards, there are some alternatives and there may be more.

For folks with ATI cards, there is only one current alternative and it (even more so than MW) is dependent on the efforts of one person (lacking any institutional support -- which makes that project even more impressive).

Sadly, to the extent other ATI projects might happen they are going to need higher end ATI cards than even MW requires.

In my project management days, I learned that setting expectations is important, but if your baseline expectations are low, then I suppose that isn't as much of an issue.





If two or three weeks had been said, then this same kind of talk would've happened at 14 days if either nothing had happened or nothing had been perceived to have happened.



ID: 33288 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33289 - Posted: 15 Nov 2009, 7:10:37 UTC - in response to Message 33287.  

Figured you meant that -- I've not added new workstations in the past week, but the current ones are posting (and getting) work.

For support over there, you will definitely need 6.10.x plus a current Catalyst driver.



Edit: RE my previous 33283 ... can attach ... should have read ... can't attach .... Such a small change, such a big difference in outcome!

Regards
David


ID: 33289 · Rating: 0
David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33292 - Posted: 15 Nov 2009, 10:30:30 UTC - in response to Message 33289.  

Barry, I can't set up an account to attach to. That's the problem. As you are an existing user you should be OK.

Seti currently has no work for the ATI client, so my 4850 x 2 waits. I was running F@H on it, but was getting conflicts with screen refresh, so I had to stop it.
ID: 33292 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33293 - Posted: 15 Nov 2009, 15:43:24 UTC - in response to Message 33292.  

You should be able to set up a new account -- I just did that temporarily on a workstation without a problem. When you go to attach, are you using this URL:

http://boinc.thesonntags.com/collatz/

I think some search engines return a location below that root and that will mess up an attach.
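A quick way to sanity-check a URL before attaching, sketched in Python. The helper `is_project_root` and the deeper example URL are hypothetical, and this only compares URL parts as strings; it says nothing about how the BOINC client itself handles URLs.

```python
from urllib.parse import urlsplit

# Project root as given above.
PROJECT_ROOT = "http://boinc.thesonntags.com/collatz/"


def is_project_root(candidate: str, root: str = PROJECT_ROOT) -> bool:
    """Return True only if `candidate` points at the project root itself."""
    c = urlsplit(candidate.strip())
    r = urlsplit(root)
    same_place = (
        c.scheme == r.scheme
        and c.netloc.lower() == r.netloc.lower()
        and c.path.rstrip("/") == r.path.rstrip("/")
    )
    # Anything with a query string or fragment is not the bare root.
    return same_place and not c.query and not c.fragment


print(is_project_root("http://boinc.thesonntags.com/collatz/"))
# Hypothetical deeper link of the kind a search engine might return:
print(is_project_root("http://boinc.thesonntags.com/collatz/forum_index.php"))
```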


Barry, I can't set up an account to attach to. That's the problem. As you are an existing user you should be OK.

Seti currently has no work for the ATI client, so my 4850 x 2 waits. I was running F@H on it, but was getting conflicts with screen refresh, so I had to stop it.


ID: 33293 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33294 - Posted: 15 Nov 2009, 15:54:39 UTC - in response to Message 33286.  

Brian, I think you are quite right regarding expectations. You also are quite right in pointing out that there are other projects which are truly 'bad citizens' in the BOINC firmament. Cosmology -- as you noted, or even worse, Predictor. And there are others which have had extended trouble times. Malaria had no new work for several months from December to March -- but they did announce that and provided an explanation. Then again, they were/are a CPU project, so folks had alternatives.

I guess for me, the reliability of a project, while important, is secondary to project communications, with a component of communications having to do with setting expectations (the 'we hope to be back in one or two days' versus 'it may be one or two months, but we hope it will be less than that'). Also, for MW, I do appreciate that it isn't a case of bad attitude (Predictor and Cosmology fit that description, and to a lesser degree some of the folks over at SETI have that), so my response to MW is more a case of frustration, and not one of anger.

The added tension here is that MW is a high credit zone that provides GPU support -- which means there are folks here that have fewer and less satisfactory alternatives.

Because of what MW offered, it developed a bit of a 'dependency' with some users. If there were multiple solid alternatives, then the stress on seeing MW running all the time would be reduced. That of course is NOT MW's 'fault'.



If you want to see totally poor management, go look around at Cosmology. There are issues which have not been fixed in over a year. Admins continually bungle SQL scripts. The server crash in February/March of this year still has not been completely "fixed", meaning there are newer issues than those that are more than a year old, so they're not even back to where they were before the crash. The scientist had some health problems, but not a single person came onto the forum or the main page to even mention it. He then comes back, posts a handful of messages, then is gone again for another 5-6 weeks, until just this past week the transitioner failed. That's it...one piece of the BOINC server-side software. They've had 3 business days (Wednesday, Thursday, and Friday), as well as today to just get that going. Based on past experience, it won't get fixed until at least middle of next week, and possibly not until Thanksgiving or later.



ID: 33294 · Rating: 0
bobgoblin
Joined: 8 Dec 07
Posts: 60
Credit: 67,028,931
RAC: 0
Message 33297 - Posted: 15 Nov 2009, 21:53:45 UTC - in response to Message 33286.  

If you want to see totally poor management, go look around at Cosmology. There are issues which have not been fixed in over a year. Admins continually bungle SQL scripts. The server crash in February/March of this year still has not been completely "fixed", meaning there are newer issues than those that are more than a year old, so they're not even back to where they were before the crash.


I crunch for several projects and there are always ups and downs with all of them. Some just vanish with no word as to why. But Cosmo... that's the perfect example of how NOT to run a project.


ID: 33297 · Rating: 0
Dan T. Morris
Joined: 17 Mar 08
Posts: 165
Credit: 410,228,216
RAC: 0
Message 33298 - Posted: 15 Nov 2009, 21:57:58 UTC - in response to Message 33297.  

If you want to see totally poor management, go look around at Cosmology. There are issues which have not been fixed in over a year. Admins continually bungle SQL scripts. The server crash in February/March of this year still has not been completely "fixed", meaning there are newer issues than those that are more than a year old, so they're not even back to where they were before the crash.


I crunch for several projects and there are always ups and downs with all of them. Some just vanish with no word as to why. But Cosmo... that's the perfect example of how NOT to run a project.





You hit that nail on the head on the first drive of the hammer. That project needs some real TLC.

DD,

ID: 33298 · Rating: 0
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 33299 - Posted: 15 Nov 2009, 23:39:05 UTC - in response to Message 33288.  

True enough -- and perhaps it is a case of too high expectations -- as seemingly a no new work and no news update in excess of three weeks would be ok with you.


That's a hypothetical situation, as that hasn't happened here, but again, those of us participating on a regular basis at Cosmology go through periods of 3-6 months without anything at all coming from the project except for tasks.

To further illustrate the problems there, even just to get those tasks there are times where you have to babysit the download queue because some new user has freaked out at the amount of time the work takes (in some cases, more than 24 hours CPU time), or the amount of memory they take (700MB - 1GB per task, so if you have a quad core you're looking at up to 4GB), so said new user issues a reset project and strands the tasks, then when the next person comes along to pick the task up when it hits the timeout, the file is not sitting on the server so the download gets stuck in retry mode until possibly your whole download queue is full of retries. I've personally had to abort 10 transfers to get 1 task to work on. That's the most severe instance of that issue for me. Most of the time I only have to abort 3-5 transfers to get 1 task...
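To put rough numbers on the memory side: assuming one task per core and treating 700MB as 0.7GB (`peak_memory_gb` is just an illustrative helper, not anything from the project), a quad core needs somewhere between about 2.8GB and 4GB for the tasks alone.

```python
def peak_memory_gb(cores: int, per_task_gb: float) -> float:
    """Memory needed if every core runs one task at the same time.

    Assumption: one task per core, no allowance for the OS or other programs.
    """
    return cores * per_task_gb


quad_core = 4
low = peak_memory_gb(quad_core, 0.7)   # 700MB per task
high = peak_memory_gb(quad_core, 1.0)  # 1GB per task
print(f"{low:.1f}GB to {high:.1f}GB")
```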


In my project management days, I learned that setting expectations is important, but if your baseline expectations are low, then I suppose that isn't as much of an issue.


The important thing is setting realistic expectations. Even then though, if a situation exists where the users, which in this case are us, do not tolerate any reevaluations of the estimated timeline, then that is not a good situation. That's what happened in the job I had, where I was told that I was required to work the weekends if I felt that I needed more time. I wasn't the only one told that either. Business users were allowed to change things in their requirements, all the way up to the point of User Acceptance Testing. If it meant that we had to work even harder, well, that was just the breaks...

That's what I see when I see several of the more demanding people here start ranting about how their needs are not being met...
ID: 33299 · Rating: 0
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 33301 - Posted: 15 Nov 2009, 23:49:24 UTC - in response to Message 33294.  
Last modified: 15 Nov 2009, 23:53:30 UTC

Brian, I think you are quite right regarding expectations. You also are quite right in pointing out that there are other projects which are truly 'bad citizens' in the BOINC firmament. Cosmology -- as you noted, or even worse, Predictor.


Cosmology is not a "bad citizen". It's just that they are out of their depth and don't want to pay enough money to get an appropriately talented and dedicated administrator.

Predictor...well...there were mistakes made on both sides of that issue...
ID: 33301 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33303 - Posted: 16 Nov 2009, 1:21:18 UTC - in response to Message 33299.  

I am hoping that you'd not be characterizing my screeds as rants -- about the only 'rantish' aspect of my posts is regarding setting expectations. Like you, if there is a two week (or longer) outage here I can readily accept that. What puts me off my feed is when (as we know now) an overly optimistic 'hope' of only a couple or few days outage is set.

What raises the temperature here is the relatively non-communicative prior few months here -- though the explanations regarding what Travis is prioritizing explains that well enough.



That's what I see when I see several of the more demanding people here start ranting about how their needs are not being met...


ID: 33303 · Rating: 0
arcturus
Joined: 20 Nov 07
Posts: 54
Credit: 2,663,789
RAC: 0
Message 33304 - Posted: 16 Nov 2009, 1:25:31 UTC - in response to Message 33301.  

Predictor...well...there were mistakes made on both sides of that issue...

What 'other' side was there? That project's lack of communication truly sucked.
ID: 33304 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33305 - Posted: 16 Nov 2009, 1:27:36 UTC - in response to Message 33301.  

Again, perhaps a case of expectation and definition. I had heard enough about Cosmology to not dabble in those waters. After all, there are options.

Regarding Predictor, yes, some end users started playing games over there partly in response to some project issues. Then it got much worse. But the last 12 months of foolishness over there -- that's on the administration (or total lack of it).

You realize their site is back up -- it surfaced about a month ago. No comment whatsoever from the project; its web pages and message boards simply went live again.

I figure that the web pages were hijacked by some psych grad students who decided to bring the message boards back up (with the same level of project participation that marred Predictor's demise) just to see what sort of message traffic would happen.

To me, about the worst insult I can make to BOINC project administrators is to suggest they are engaging in 'Predictor-like' behavior.



Cosmology is not a "bad citizen". It's just that they are out of their depth and don't want to pay enough money to get an appropriately talented and dedicated administrator.

Predictor...well...there were mistakes made on both sides of that issue...


ID: 33305 · Rating: 0
David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 33306 - Posted: 16 Nov 2009, 1:47:57 UTC - in response to Message 33293.  

You should be able to set up a new account -- I just did that temporarily on a workstation without a problem. When you go to attach, are you using this URL:

http://boinc.thesonntags.com/collatz/

I think some search engines return a location below that root and that will mess up an attach.


Thanks Barry, that worked well. Up and running.

That gives me:
28 cpu's running CPDN;
10 gpu's running Seti;
02 gpu's running Collatz; and
05 gpu's doing nothing, cause I am too dumb to work out how to get them to run on SuSe 11.2, which is a system I have never used before, and thought I would have a play with in the downtime!

Regards
David
ID: 33306 · Rating: 0
BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 2
Message 33307 - Posted: 16 Nov 2009, 3:35:43 UTC - in response to Message 33306.  

Excellent, glad to help out.



Thanks Barry, that worked well. Up and running.

That gives me:
28 cpu's running CPDN;
10 gpu's running Seti;
02 gpu's running Collatz; and
05 gpu's doing nothing, cause I am too dumb to work out how to get them to run on SuSe 11.2, which is a system I have never used before, and thought I would have a play with in the downtime!

Regards
David


ID: 33307 · Rating: 0

©2024 Astroinformatics Group