Could We Get a Paper On This?
Author | Message |
---|---|
Joined: 30 Mar 08 Posts: 50 Credit: 11,593,755 RAC: 0 |
With all the scientific rigor attached to the project, I believe an analysis of the recent clusterf#@K of poison work units is due the loyal crunching community. It should certainly explain the astrological utility of "NAN". The issue of testing certainly deserves mention. The outage of the 27th that supposedly purged the poison workunits begs for illumination. Mandatory should be a full chapter on how the project is run so poorly that this post needs to be made at all. I am getting sick of babysitting poison WUs on individual computers when this is a project issue and can be solved for x number of crunchers at a single point: the server. Nothing brings joy like seeing 6 hours of wasted crunch time on a 3 GHz quad stuck with 4 of the "nannys" overnight. The problem has not gone away. I am close to shutting down. Voltron |
Joined: 27 Nov 07 Posts: 12 Credit: 543,138 RAC: 0 |
Seems you missed it somehow: The project is still an alpha-project, so problems should be expected rather than being a surprise. |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
We've finally gotten our database problem under control -- the purge is still running after a day and a half, but things are fast enough now that I was finally able to log in and remove the bad workunits. If you guys see any more gs_3737... workunits let me know -- they're all removed from the database so they shouldn't be sent out anymore. Once the database is purged we're going to try and track down what went wrong. I suspect the database was so full that it caused the problems. Either way, we're trying to figure out what Nate did differently to cause all the bad workunits. |
Joined: 31 Aug 07 Posts: 66 Credit: 1,002,668 RAC: 0 |
[quote]clusterf#@K[/quote] Excellent rant! ;) A couple of points (and bear in mind that I'm not crunching MW ATM, so am slightly out of touch): The project is still in Alpha. As another project said on its front page, "Be Not Surprised". I've lost plenty of credit on other alpha projects, and just lost half a day yesterday on one humble host on what I regarded as a stable application. Stuff happens. I think Travis and Dave have done a great job on this project, and have always been responsive to problems, despite limitations in the amount of IT support that they receive. They are always polite and helpful, and to be honest this project is certainly in the top 3 of projects with good admin. It would be a great loss to the project if you quit now: 2.75M credit and a 48K RAC? Impressive stats. Seems to have worked for you pretty well up until now, so give it another chance. I quit Cosmo ages ago when the problems became too much to justify the electricity costs on it. It came back to being a reasonable project though. I currently run a bunch of pre-alpha projects that don't even export stats. Really frustrating when 'schoolboy errors' trash hours or days of crunching, but I like the fun aspect of trying something new and, well, 'dangerous' ;) MW is a very interesting project, and it's going to get even more interesting once the 'real stuff' starts getting crunched. If nothing else, enjoy the decent credit rate and put the odd few thousand lost credits down to experience. But then you knew all of the above already. Just thought I'd chip in to support Travis and Dave :) Al. |
Joined: 30 Mar 08 Posts: 50 Credit: 11,593,755 RAC: 0 |
[quote]I think Travis and Dave have done a great job on this project, and have always been responsive to problems, despite limitations in the amount of IT support that they receive. They are always polite and helpful, and to be honest this project is certainly in the top 3 of projects with good admin. MW is a very interesting project, and it's going to get even more interesting once the 'real stuff' starts getting crunched. If nothing else, enjoy the decent credit rate and put the odd few thousand lost credits down to experience.[/quote] Al; thanks for the kind words. I ran some errands today and decided to shut down each of my rigs that was stuck on a "nanny" when I returned home. It is 3pm CST and I have two rigs left running. One is a sympathy rig I use to surf, check mail, and burn Linux distros. Shutting it down is not an efficient option. The other is one of my quads I call "hope". I will monitor the project and the name should be self-explanatory. I agree with your comments about Travis and Dave and the merits of the project. I sympathize with the admins. It is the boneheads who believe a computer can compensate for any "alpha" excuse they can produce that tarnish the reputation of the sponsoring institution. I have reset every rig that was stuck on a "nanny" regardless of work in progress. Time for the Aussies to step up to the leaderboard. Voltron |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
[quote][quote]I think Travis and Dave have done a great job on this project, and have always been responsive to problems, despite limitations in the amount of IT support that they receive. They are always polite and helpful, and to be honest this project is certainly in the top 3 of projects with good admin. MW is a very interesting project, and it's going to get even more interesting once the 'real stuff' starts getting crunched. If nothing else, enjoy the decent credit rate and put the odd few thousand lost credits down to experience.[/quote] Al; thanks for the kind words. I ran some errands today and decided to shut down each of my rigs that was stuck on a "nanny" when I returned home. It is 3pm CST and I have two rigs left running. One is a sympathy rig I use to surf, check mail, and burn Linux distros. Shutting it down is not an efficient option. The other is one of my quads I call "hope". I will monitor the project and the name should be self-explanatory. I agree with your comments about Travis and Dave and the merits of the project. I sympathize with the admins. It is the boneheads who believe a computer can compensate for any "alpha" excuse they can produce that tarnish the reputation of the sponsoring institution. I have reset every rig that was stuck on a "nanny" regardless of work in progress. Time for the Aussies to step up to the leaderboard. Voltron[/quote] Just another update as to what's going on. I'm still having some database problems -- when I went into mysql to manually delete the bad workunits my connection kept timing out. So I'm starting the purge program back up, which should hopefully fix the problem. It's probably not the answer anyone is looking for, but it might be a day or two before the purge is done and I can manually remove the remaining bad WUs from the database. |
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
Seems like these purges involve a massive amount of computation and reading from and writing to memory. I'm curious, how many WUs are we talking about, and how much space do they take up collectively (or individually, whichever is simpler to determine)? |
Joined: 5 Feb 08 Posts: 236 Credit: 49,648 RAC: 0 |
[quote]Seems like these purges involve a massive amount of computation and reading from and writing to memory. I'm curious, how many WUs are we talking about, and how much space do they take up collectively (or individually, whichever is simpler to determine)?[/quote] The database grew to 45 GB in about a month. It's a lot to store in a handful of tables. Dave Przybylo, MilkyWay@home Developer, Department of Computer Science, Rensselaer Polytechnic Institute |
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
[quote]The database grew to 45 GB in about a month. It's a lot to store in a handful of tables.[/quote] Ew, yeah. Do you need all that data, or are you looking into making it garbage collect more frequently? |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
[quote]The database grew to 45 GB in about a month. It's a lot to store in a handful of tables.[/quote] We actually thought we had a script set up to clean the database every few days. This problem made us realize it wasn't quite working, lol. After this is all fixed this shouldn't be a problem again. |
Joined: 27 Aug 07 Posts: 915 Credit: 1,503,319 RAC: 0 |
After uFluids and Astropulse this is a cake walk. No rants here. me@rescam.org |
Joined: 9 Feb 08 Posts: 3 Credit: 126,332 RAC: 0 |
Welcome to the internet. Sometimes things go wrong. If you think you can run a project, then please do so, we'd all love to have a perfectly optimized BOINC project with no downtime and work always available... |
Joined: 28 Aug 07 Posts: 35 Credit: 89,086,005 RAC: 742 |
I am still getting the occasional bad work unit. I aborted about six just then when I checked my PCs. I have 13 PCs and only check them once a day. I would estimate (using the SWAG principle) that over 150 hours of crunching has been lost on those WUs on my machines. Is there going to be any credit for them? I hate to think that I have wasted all that processing time. I stopped crunching when the problem was discovered, and returned after I thought it was fixed, but they seem to be still coming. |
Joined: 15 Apr 08 Posts: 55 Credit: 24,047 RAC: 0 |
[quote]I am still getting the occasional bad work unit. I aborted about six just then when I checked my PCs. I have 13 PCs and only check them once a day. I would estimate (using the SWAG principle) that over 150 hours of crunching has been lost on those WUs on my machines.[/quote] Just a thought, but if I KNOW there has been a problem and THINK it MAY have been fixed, I check my machine more often than once a day. Or I crunch another project until I KNOW it has been fixed. |
Joined: 16 Dec 07 Posts: 37 Credit: 26,399,956 RAC: 3,742 |
[quote]I am still getting the occasional bad work unit. I aborted about six just then when I checked my PCs. I have 13 PCs and only check them once a day. I would estimate (using the SWAG principle) that over 150 hours of crunching has been lost on those WUs on my machines.[/quote] Well, I'm only crunching MW one WU at a time when I'm connected, and then only if I can get a WU. I have 2 or 3 other projects I can run when I'm offline. It's enough to give me a satisfying MW total and I can ride out any fluxes in server/WU availability. |
Joined: 28 Aug 07 Posts: 35 Credit: 89,086,005 RAC: 742 |
Well, they are still getting through, but not as many as before. I checked my PCs this morning and there was one WU on each of seven machines that had run for over 6 hours. That's almost another 50 hours of dead time that could be spent processing science for this or another project. My original estimate was way off and could be double the 150 hours I stated before. How hard is it to purge a database when you know that all the WUs start with the same numbers? |
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
[quote]How hard is it to purge a database when you know that all the WUs start with the same numbers?[/quote] Well, Travis did say he was having trouble accessing the database at all. |
Joined: 6 Mar 08 Posts: 15 Credit: 3,006,602 RAC: 0 |
[quote]How hard is it to purge a database when you know that all the WUs start with the same numbers?[/quote] It must be very tricky, as MW isn't the only project that has had trouble purging its db of rogue WUs. :( |
Joined: 29 Aug 07 Posts: 327 Credit: 116,463,193 RAC: 0 |
I just got another one, but I'm on Windows, so no big deal. This one has been sitting with the same host since it was issued and just timed out today. http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=30453588 [EDIT] And one more, but it has been around for a while; I should be the last cruncher this one goes to. http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=30558957 [/EDIT] |
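
For anyone wondering how a purge by name prefix might be done without one giant delete that times out, here is a minimal Python sketch. It is not the project's purge program and says nothing about how the real server works. It uses an in-memory SQLite database as a stand-in for the project's MySQL server, and the table layout (`workunit` with `id` and `name`), the `purge_by_prefix` helper, the batch size, and the example workunit names are all assumptions made up for illustration.

```python
import sqlite3


def purge_by_prefix(conn, prefix, batch_size=1000):
    """Delete rows whose name starts with `prefix`, one small batch at a time.

    Hypothetical helper: the `workunit` table and its columns are assumed.
    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        # Pick at most `batch_size` matching ids and delete only those,
        # so each transaction stays short.
        cur = conn.execute(
            "DELETE FROM workunit WHERE id IN ("
            "  SELECT id FROM workunit"
            "  WHERE substr(name, 1, ?) = ? LIMIT ?)",
            (len(prefix), prefix, batch_size),
        )
        conn.commit()  # commit after every batch
        if cur.rowcount == 0:
            break  # nothing left that matches
        total += cur.rowcount
    return total


# Toy data: made-up names sharing the bad prefix, plus a few good ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
names = [f"gs_3737_{i}" for i in range(2500)] + [f"gs_3800_{i}" for i in range(10)]
conn.executemany("INSERT INTO workunit (name) VALUES (?)", [(n,) for n in names])
conn.commit()

deleted = purge_by_prefix(conn, "gs_3737", batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM workunit").fetchone()[0]
print(deleted, remaining)
```

The idea is only that small committed batches keep each statement short, so a dropped connection loses one batch rather than the whole job. Whether that would have helped with the overloaded database described in this thread is not something the thread tells us.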