Welcome to MilkyWay@home

Could We Get a Paper On This?



Message boards : Number crunching : Could We Get a Paper On This?
voltron
Joined: 30 Mar 08
Posts: 50
Credit: 11,593,755
RAC: 0
Message 3575 - Posted: 29 May 2008, 16:13:40 UTC

With all the scientific rigor attached to the project, I believe an analysis of the recent clusterf#@K of poison work units is due the loyal crunching community.

It should certainly explain the astrological utility of "NAN".

The issue of testing certainly deserves mention.

The outage of the 27th that supposedly purged the poison workunits begs for illumination.

Mandatory should be a full chapter on how the project is run so poorly that this post needs to be made at all.

I am getting sick of babysitting poison WUs on individual computers when this is a project issue that can be solved for x number of crunchers at a single point: the server.

Nothing brings joy like seeing 6 hours of wasted crunch time on a 3 GHz quad stuck with 4 of the "nannys" overnight.

The problem has not gone away. I am close to shutting down.

Voltron
ID: 3575 · Rating: 0
-ShEm-
Joined: 27 Nov 07
Posts: 12
Credit: 543,138
RAC: 0
Message 3577 - Posted: 29 May 2008, 17:27:48 UTC - in response to Message 3575.  

Seems you missed it somehow: The project is still an alpha-project, so problems should be expected rather than being a surprise.
ID: 3577 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 3579 - Posted: 29 May 2008, 18:01:30 UTC - in response to Message 3577.  

We've finally gotten our database problem under control. The purge is still running after a day and a half, but things are fast enough now that I was finally able to log in and remove the bad workunits. If you guys see any more gs_3737... workunits, let me know. They've all been removed from the database, so they shouldn't be sent out anymore.

Once the database is purged, we're going to try to track down what went wrong. I suspect the database being so full is what caused the problems. Either way, we're trying to figure out what Nate did differently to cause all the bad workunits.
ID: 3579 · Rating: 0
ChertseyAl
Joined: 31 Aug 07
Posts: 66
Credit: 1,002,668
RAC: 0
Message 3581 - Posted: 29 May 2008, 18:30:31 UTC - in response to Message 3575.  

clusterf#@K


Excellent rant! ;)

A couple of points (and bear in mind that I'm not crunching MW ATM, so am slightly out of touch):

The project is still in Alpha. As another project said on its front page, "Be Not Surprised". I've lost plenty of credit on other alpha projects, and just lost half a day yesterday on one humble host on what I regarded as a stable application. Stuff happens.

I think Travis and Dave have done a great job on this project, and have always been responsive to problems, despite limitations in the amount of IT support that they receive, they are always polite and helpful, and to be honest this project is certainly in the top 3 of projects with good admin.

It would be a great loss to the project if you quit now: 2.75M credit and a 48K RAC? Impressive stats. Seems to have worked for you pretty well up until now, so give it another chance.

I quit Cosmo ages ago when the problems became too much to justify the electricity costs on it. It came back to being a reasonable project though. I currently run a bunch of pre-alpha projects that don't even export stats. Really frustrating when 'schoolboy errors' trash hours or days of crunching, but I like the fun aspect of trying something new and, well, 'dangerous' ;)

MW is a very interesting project, and it's going to get even more interesting once the 'real stuff' starts getting crunched. If nothing else, enjoy the decent credit rate and put the odd few thousand lost credits down to experience.

But then you knew all of above already - Just thought I'd chip in to support Travis and Dave :)

Al.
ID: 3581 · Rating: 0
voltron
Joined: 30 Mar 08
Posts: 50
Credit: 11,593,755
RAC: 0
Message 3583 - Posted: 29 May 2008, 20:15:11 UTC - in response to Message 3581.  

I think Travis and Dave have done a great job on this project, and have always been responsive to problems, despite limitations in the amount of IT support that they receive, they are always polite and helpful, and to be honest this project is certainly in the top 3 of projects with good admin.

MW is a very interesting project, and it's going to get even more interesting once the 'real stuff' starts getting crunched. If nothing else, enjoy the decent credit rate and put the odd few thousand lost credits down to experience.

Al, thanks for the kind words. I ran some errands today and decided to shut down each of my rigs that was stuck on a "nanny" when I returned home. It is 3pm CST and I have two rigs left running. One is a sympathy rig I use to surf, check mail, and burn Linux distros. Shutting it down is not an efficient option. The other is one of my quads I call "hope". I will monitor the project, and the name should be self-explanatory. I agree with your comments about Travis and Dave and the merits of the project. I sympathize with the admins. It is the boneheads who believe a computer can compensate for any "alpha" excuse they can produce who tarnish the reputation of the sponsoring institution.


I have reset every rig that was stuck on a "nanny" regardless of work in progress. Time for the Aussies to step up to the leaderboard.

Voltron
ID: 3583 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 3584 - Posted: 29 May 2008, 20:40:37 UTC - in response to Message 3583.  







Just another update as to what's going on. I'm still having some database problems: when I went into MySQL to manually delete the bad workunits, my connection kept timing out. So I'm starting the purge program back up, which should hopefully fix the problem. It's probably not the answer anyone is looking for, but it might be a day or two before the purge is done and I can manually remove the remaining bad WUs from the database.
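
For readers wondering what a timeout-resistant cleanup might look like, here is a minimal sketch, not the project's actual purge program. It deletes rows matching a name prefix in small batches and commits after each batch, so no single statement runs long enough to hit a connection timeout. The `workunit` table with `id` and `name` columns, the `purge_by_prefix` and `escape_like` helpers, and the batch size are all assumptions for illustration, and Python's built-in `sqlite3` stands in for MySQL. The `gs_3737` prefix is the one mentioned earlier in the thread.

```python
import sqlite3


def escape_like(text):
    # Escape LIKE wildcards so "_" in "gs_3737" matches only a literal underscore.
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def purge_by_prefix(conn, prefix, batch_size=1000):
    """Delete rows whose name starts with `prefix`, one small batch per commit."""
    pattern = escape_like(prefix) + "%"
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM workunit WHERE id IN ("
            " SELECT id FROM workunit WHERE name LIKE ? ESCAPE '\\' LIMIT ?)",
            (pattern, batch_size),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Hypothetical demo data: an in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT)")
names = [f"gs_3737_{i}" for i in range(2500)] + ["gsX3737_0", "gs_3800_0"]
conn.executemany("INSERT INTO workunit (name) VALUES (?)", [(n,) for n in names])

removed = purge_by_prefix(conn, "gs_3737")
remaining = conn.execute("SELECT COUNT(*) FROM workunit").fetchone()[0]
print(removed, remaining)  # prints: 2500 2
```

The subquery with `LIMIT` is the SQLite spelling; MySQL accepts `DELETE ... LIMIT` on a single table directly, so the same loop would be simpler there. Escaping the underscore matters because an unescaped `_` in a `LIKE` pattern matches any single character.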
ID: 3584 · Rating: 0
Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 3586 - Posted: 29 May 2008, 21:06:49 UTC - in response to Message 3584.  

Seems like these purges involve a massive amount of computation/reading from and writing to memory.. I'm curious, how many WUs are we talking about, and how much space do they take up collectively (or individually.. whichever is simpler to determine)?
ID: 3586 · Rating: 0
Dave Przybylo
Joined: 5 Feb 08
Posts: 236
Credit: 49,648
RAC: 0
Message 3587 - Posted: 29 May 2008, 21:12:05 UTC - in response to Message 3586.  

Seems like these purges involve a massive amount of computation/reading from and writing to memory.. I'm curious, how many WUs are we talking about, and how much space do they take up collectively (or individually.. whichever is simpler to determine)?


The database grew to 45 GB in about a month. That's a lot to store in a handful of tables.
Dave Przybylo
MilkyWay@home Developer
Department of Computer Science
Rensselaer Polytechnic Institute
ID: 3587 · Rating: 0
Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 3588 - Posted: 29 May 2008, 21:19:41 UTC - in response to Message 3587.  

The database grew to 45 GB in about a month. That's a lot to store in a handful of tables.


Ew, yeah. Do you need all that data, or are you looking into making it garbage collect more frequently?
ID: 3588 · Rating: 0
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 3589 - Posted: 29 May 2008, 21:28:18 UTC - in response to Message 3588.  

The database grew to 45 GB in about a month. That's a lot to store in a handful of tables.


Ew, yeah. Do you need all that data, or are you looking into making it garbage collect more frequently?


We actually thought we had a script set up to clean the database every few days. This problem made us realize it wasn't quite working, lol. After this is all fixed, this shouldn't be a problem again.
ID: 3589 · Rating: 0
Misfit
Joined: 27 Aug 07
Posts: 915
Credit: 1,503,319
RAC: 0
Message 3593 - Posted: 30 May 2008, 5:28:52 UTC

After uFluids and Astropulse this is a cake walk. No rants here.
me@rescam.org
ID: 3593 · Rating: 0
[B^S]ST47
Joined: 9 Feb 08
Posts: 3
Credit: 126,332
RAC: 0
Message 3597 - Posted: 30 May 2008, 10:30:15 UTC

Welcome to the internet. Sometimes things go wrong. If you think you can run a project, then please do so, we'd all love to have a perfectly optimized BOINC project with no downtime and work always available...
ID: 3597 · Rating: 0
Dingo
Joined: 28 Aug 07
Posts: 35
Credit: 64,598,057
RAC: 1,063
Message 3613 - Posted: 31 May 2008, 15:01:45 UTC

I am still getting the occasional bad work unit. I aborted about six just then when I checked my PCs. I have 13 PCs and only check them once a day. I would estimate (using the SWAG principle) that over 150 hours of crunching have been lost on those WUs on my machines.

Is there going to be any credit for them? I hate to think that I have wasted all that processing time. I stopped crunching when the problem was discovered, and returned after I thought it was fixed, but they seem to be still coming.





ID: 3613 · Rating: 0
alijay
Joined: 15 Apr 08
Posts: 55
Credit: 24,047
RAC: 0
Message 3616 - Posted: 31 May 2008, 16:47:12 UTC - in response to Message 3613.  

I am still getting the occasional bad work unit. I aborted about six just then when I checked my PCs. I have 13 PCs and only check them once a day. I would estimate (using the SWAG principle) that over 150 hours of crunching have been lost on those WUs on my machines.

Is there going to be any credit for them? I hate to think that I have wasted all that processing time. I stopped crunching when the problem was discovered, and returned after I thought it was fixed, but they seem to be still coming.





Just a thought but if I KNOW there has been a problem and THINK it MAY have been fixed I check my machine more often than once a day. Or I crunch another project until I Know it has been fixed.
ID: 3616 · Rating: 0
Cameron
Joined: 16 Dec 07
Posts: 25
Credit: 1,202,149
RAC: 2
Message 3623 - Posted: 1 Jun 2008, 0:54:44 UTC - in response to Message 3616.  

I am still getting the occasional bad work unit. I aborted about six just then when I checked my PCs. I have 13 PCs and only check them once a day. I would estimate (using the SWAG principle) that over 150 hours of crunching have been lost on those WUs on my machines.

Is there going to be any credit for them? I hate to think that I have wasted all that processing time. I stopped crunching when the problem was discovered, and returned after I thought it was fixed, but they seem to be still coming.





Just a thought but if I KNOW there has been a problem and THINK it MAY have been fixed I check my machine more often than once a day. Or I crunch another project until I Know it has been fixed.


Well, I'm only crunching MW one WU at a time when I'm connected, and then only if I can get a WU. I have 2 or 3 other projects I can run when I'm offline. It's enough to give me a satisfying MW total, and I can ride out any fluxes in server/WU availability.
ID: 3623 · Rating: 0
Dingo
Joined: 28 Aug 07
Posts: 35
Credit: 64,598,057
RAC: 1,063
Message 3634 - Posted: 1 Jun 2008, 16:32:00 UTC
Last modified: 1 Jun 2008, 16:33:08 UTC

Well, they are still getting through, but not as many as before. I checked my PCs this morning and there was one WU on each of seven machines that had run for over 6 hours. That's almost another 50 hours of dead time that could be spent processing science for this or another project. My original estimate was way off, and the real figure could be double the 150 hours I stated before.

How hard is it to purge a database when you know that all the wu's start with the same numbers ??

ID: 3634 · Rating: 0
Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 3637 - Posted: 1 Jun 2008, 22:39:29 UTC - in response to Message 3634.  

How hard is it to purge a database when you know that all the wu's start with the same numbers ??

Well, Travis did say he was having trouble accessing the database at all.
ID: 3637 · Rating: 0
AnRM
Joined: 6 Mar 08
Posts: 15
Credit: 3,006,602
RAC: 0
Message 3638 - Posted: 2 Jun 2008, 0:27:34 UTC - in response to Message 3637.  

How hard is it to purge a database when you know that all the wu's start with the same numbers ??

Well, Travis did say he was having trouble accessing the database at all.

It must be very tricky as MW isn't the only project that has had trouble purging their db of rogue WUs.. :( ....
ID: 3638 · Rating: 0
Labbie
Joined: 29 Aug 07
Posts: 327
Credit: 116,463,193
RAC: 0
Message 3639 - Posted: 2 Jun 2008, 0:44:41 UTC
Last modified: 2 Jun 2008, 0:47:00 UTC

I just got another one, but I'm on Windows, so no big deal. This one has been sitting with the same host since it was issued and just timed out today.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=30453588

[EDIT]

And one more, but it has been around for a while, I should be the last cruncher this one goes to.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=30558957

[/EDIT]

ID: 3639 · Rating: 0


©2019 Astroinformatics Group