Welcome to MilkyWay@home

Aaargh! Server out of new work!

Message boards : Number crunching : Aaargh! Server out of new work!

banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 41788 - Posted: 27 Aug 2010, 20:25:32 UTC - in response to Message 41783.  

Definitely!

DNETC is unobtainable still

Milkyway is not dishing out work, as most of the servers are down.

Collatz is obtainable but intermittent and the site is slow.

That's the main GPU projects gone south.


Rosetta has not had work either.

I now have no tasks to do. Oh well.
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.
ID: 41788 · Rating: 0
Haris Dublas

Joined: 25 Feb 10
Posts: 49
Credit: 10,137,837
RAC: 0
Message 41789 - Posted: 27 Aug 2010, 20:50:07 UTC

Collatz probably went down due to the mass switching of people from MW and DNETC.
ID: 41789 · Rating: 0
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41790 - Posted: 27 Aug 2010, 21:32:08 UTC - in response to Message 41789.  

I wonder if MW will be offline for the entire weekend. Is Travis the only person at RPI who can figure this one out?
ID: 41790 · Rating: 0
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41792 - Posted: 27 Aug 2010, 21:40:07 UTC

Dnetc will be offline for at least another day -- during their software update/upgrade one of the HD's on the RAID failed -- they are in rebuild mode.

I hope they were at least in RAID 5 mode and not RAID 0 as THAT would be seriously ugly. Ideally folks running multi-drive RAID arrays are running controllers that handle RAID 5 plus a hot spare. The drives are not that expensive, but server class RAID 5 + hot spare controllers can be a bit pricey.
ID: 41792 · Rating: 0
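As a side note on why RAID 5 would make the difference here: RAID 5 keeps parity alongside the data, so the contents of one failed drive can be recomputed from the remaining drives, while RAID 0 keeps no redundancy at all. A minimal sketch of that XOR parity idea follows. The block contents and the `xor_blocks` helper are made up for illustration, and a real controller works on whole stripes with rotating parity rather than on three tiny byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks, one per data drive in a single stripe.
data = [b"milk", b"yway", b"home"]
parity = xor_blocks(data)  # would be stored on a fourth drive

# Simulate losing the second drive, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

With RAID 0 there is no `parity` block to fall back on, which is why a single drive failure there would be "seriously ugly".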
The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 41795 - Posted: 27 Aug 2010, 22:28:02 UTC

Wooohoo! I got a full MW cache. :)
ID: 41795 · Rating: 0
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41796 - Posted: 27 Aug 2010, 23:29:53 UTC - in response to Message 41795.  

It's alive!!
ID: 41796 · Rating: 0
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 41797 - Posted: 27 Aug 2010, 23:44:10 UTC

Yes, mine are now filling up since I suspended Collatz. Now need to work off the Collatz cache between Milkyway sessions.
Go away, I was asleep


ID: 41797 · Rating: 0
The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 41801 - Posted: 28 Aug 2010, 6:00:14 UTC

Validator needs a kick. It's only validating wu's that are paired. All single wu's are not being validated.
ID: 41801 · Rating: 0
TJ

Joined: 12 Aug 09
Posts: 262
Credit: 92,631,041
RAC: 0
Message 41846 - Posted: 31 Aug 2010, 7:04:27 UTC

Well, it is zero again. The validator has "overheated".
Time to let my rigs cool as well (and save some energy costs).
Greetings from,
TJ
ID: 41846 · Rating: 0
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41847 - Posted: 31 Aug 2010, 7:27:01 UTC - in response to Message 41846.  

Dnetc may be back and running by the end of the week. The dreaded 'software upgrade' -- stress tested their server and they had a 'mid upgrade' RAID drive failure as well as a memory module failure. They are pretty much in a full rebuild mode for now.
ID: 41847 · Rating: 0
CTAPbIi

Joined: 4 Jan 10
Posts: 86
Credit: 51,753,924
RAC: 0
Message 41848 - Posted: 31 Aug 2010, 10:33:52 UTC

f.ck, again... I've gotten used to a shutdown every weekend, but it's only Tuesday. Come on guys, you must be kidding me: one day of work and then one day off.

Could you PLS fix the server???
ID: 41848 · Rating: 0
Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 41853 - Posted: 31 Aug 2010, 14:43:08 UTC

Dnetc may be back and running by the end of the week. The dreaded 'software upgrade' -- stress tested their server and they had a 'mid upgrade' RAID drive failure as well as a memory module failure. They are pretty much in a full rebuild mode for now.


That is good to hear :-)

This place is just unreliable so what we need is more Boinc ATI projects!!!!
Don't drink water, that's the stuff that rusts pipes
ID: 41853 · Rating: 0
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41855 - Posted: 31 Aug 2010, 17:17:57 UTC - in response to Message 41853.  

One thing that seems odd to me. The problem appears fairly straightforward (at least the symptoms are pretty obvious and *repetitive*). The workaround resolution (either stop/start processes or a full server stop/restart) also seems reasonably straightforward.

Actually a couple of questions (although I realize that RPI folks rarely clock in over here)

Why does it take so long (12 hours or more) to go from symptom to restart?

Wouldn't it be possible to automate the stop/restart process and run it, say, every 48 hours?


I figure since this problem has been going on for months, in addition to efforts to track down the root cause, efforts to implement a workaround would be 'resource appropriate' and would have been in place by now.
ID: 41855 · Rating: 0
Fred J. Verster

Joined: 22 Apr 09
Posts: 38
Credit: 27,377,932
RAC: 0
Message 41856 - Posted: 31 Aug 2010, 18:59:42 UTC - in response to Message 41855.  

That sure would be nice, as there are still very few projects using ATI GPUs.
MilkyWay and Collatz C. are the only ones known to use ATI GPUs.

At SETI@Home (and also at SETI Beta), a user group, The Lunatics, is testing ATI GPUs for AP computing; they already have some working apps.
Here is the latest installer for Fermi GPUs.
And you'll find one of the ATI GPU apps as well.



Knight Who says Ni
ID: 41856 · Rating: 0
Werkstatt
Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 41857 - Posted: 31 Aug 2010, 19:18:14 UTC - in response to Message 41855.  

One thing that seems odd to me. The problem appears fairly straightforward (at least the symptoms are pretty obvious and *repetitive*). The workaround resolution (either stop/start processes or a full server stop/restart) also seems reasonably straightforward.

Actually a couple of questions (although I realize that RPI folks rarely clock in over here)

Why does it take so long (12 hours or more) to go from symptom to restart?

Wouldn't it be possible to automate the stop/restart process and run it, say, every 48 hours?


I figure since this problem has been going on for months, in addition to efforts to track down the root cause, efforts to implement a workaround would be 'resource appropriate' and would have been in place by now.


It looks like there is something going on at RPI. Can you remember the posting 'Screensaver coming soon' ? Or can you remember the project DNA@HOME ? Milkyway3 ?

It should not be a problem to detect that the validator stops validating. And of course, they do detect that because they stop producing wu's. But we all miss the next step which should be a rework of the validator, be it hardware, software or setup or what else. Or at least a quick restart.

It looks like nobody is responsible there. This is the best way to kill not only a project but the whole idea of distributed computing. Those responsible for a project should take its issues seriously.

Alexander
ID: 41857 · Rating: 0
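The detection step described above (notice that the validator has stopped, then trigger a quick restart) can be sketched as a small watchdog check. This is only an illustration under stated assumptions: the function names, the one-hour quiet threshold, and the idea of feeding in a "last validated" timestamp and a pending-result count are all hypothetical, and nothing here is the project's actual server code or a BOINC API.

```python
from datetime import datetime, timedelta


def validator_stalled(last_validated, pending_results, now,
                      max_quiet=timedelta(hours=1)):
    """Return True when results are waiting but nothing was validated recently."""
    if pending_results == 0:
        return False  # nothing to validate, so silence is expected
    return now - last_validated > max_quiet


def watchdog_action(last_validated, pending_results, now):
    """Map the stall check to an action name; the restart itself is not shown."""
    if validator_stalled(last_validated, pending_results, now):
        return "restart_validator"
    return "ok"


# Hypothetical readings: 1200 results waiting in both cases.
now = datetime(2010, 8, 31, 7, 0)
print(watchdog_action(now - timedelta(hours=5), 1200, now))     # restart_validator
print(watchdog_action(now - timedelta(minutes=10), 1200, now))  # ok
```

Run from a scheduler every few minutes, a check like this would shorten the gap between symptom and restart without anyone needing to be on call, though it does nothing about the root cause.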
BarryAZ

Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 41858 - Posted: 31 Aug 2010, 19:19:23 UTC - in response to Message 41856.  
Last modified: 31 Aug 2010, 19:21:24 UTC

Dnetc, when running, also supports ATI GPUs. They hope to be back up and running again later this week (they encountered something of a worst-case scenario: memory and hard drive failure in the middle of a software upgrade). They are in recovery mode for now.

I'm still sort of bemused by the informational (and response time) black hole we often encounter here. To a certain degree (good news/bad news, I suppose) it seems that folks here have gotten acclimated to the non-response (or delayed response).


That sure would be nice, as there are still very few projects using ATI GPUs.
MilkyWay and Collatz C. are the only ones known to use ATI GPUs.

At SETI@Home (and also at SETI Beta), a user group, The Lunatics, is testing ATI GPUs for AP computing; they already have some working apps.
Here is the latest installer for Fermi GPUs.
And you'll find one of the ATI GPU apps as well.


ID: 41858 · Rating: 0
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 41861 - Posted: 31 Aug 2010, 20:59:18 UTC

Barry

All the DNETC web pages have returned (home, account and forums). The only bits missing ATM are new work (a few hours yet) and the servers accepting crunched work that is "waiting to report".
Go away, I was asleep


ID: 41861 · Rating: 0
Tretboot
Joined: 20 Aug 10
Posts: 10
Credit: 63,514,783
RAC: 0
Message 41862 - Posted: 31 Aug 2010, 22:22:51 UTC

It really should not be so hard to automatically restart something that fails every day or two. Most MMOs that I have played over the years also have a daily downtime to prevent stuff like this (totally different application, but I guess the problem is sort of the same).
ID: 41862 · Rating: 0
The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 41864 - Posted: 31 Aug 2010, 22:56:30 UTC
Last modified: 31 Aug 2010, 22:58:35 UTC

A memory leak kills the system every 2 to 3 days, so reboot every 1 to 2 days until the source of the memory leak is found. Simple really.

So reboot every Monday, Wednesday and Friday morning.

ps. I think the system is conspiring to slow down my attainment of a major milestone... in this case 100mill cobblers on MW. Last time it was 100mill overall.
ID: 41864 · Rating: 0
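The arithmetic behind a Monday/Wednesday/Friday schedule can be checked with a few lines. The sketch below only computes the gaps between consecutive restarts in a repeating week; the `gaps_in_days` helper and the weekday numbering are illustrative choices, not anything the project runs.

```python
def gaps_in_days(weekdays):
    """Days between consecutive restarts in a repeating weekly schedule.

    Weekdays are numbered 0 = Monday ... 6 = Sunday.
    """
    days = sorted(set(weekdays))
    return [(later - earlier) % 7 or 7
            for earlier, later in zip(days, days[1:] + days[:1])]


MON, WED, FRI = 0, 2, 4
gaps = gaps_in_days([MON, WED, FRI])
print(gaps)       # [2, 2, 3]
print(max(gaps))  # 3
```

The Friday to Monday gap comes out at 3 days, which sits at the upper end of the 2 to 3 day failure window mentioned above. Under the same assumptions, adding a Sunday restart (`gaps_in_days([0, 2, 4, 6])`) would keep every gap at 2 days or less.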
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 41865 - Posted: 31 Aug 2010, 23:21:44 UTC

Work available again
Go away, I was asleep


ID: 41865 · Rating: 0

©2024 Astroinformatics Group