Welcome to MilkyWay@home

Problem with new W/Us

Webmaster Yoda
Joined: 21 Dec 07
Posts: 69
Credit: 7,048,412
RAC: 0
Message 3573 - Posted: 29 May 2008, 15:14:57 UTC

Sure is an ongoing plague. Did not have time to check my computers for an hour today. Sure enough, when I got back to them all 26 cores were hanging on those nasty things.

Will have to suspend MW overnight so I don't waste another 150-200 hours of CPU time on useless work units.


Join the #1 Aussie Alliance on MilkyWay!

AnRM
Joined: 6 Mar 08
Posts: 15
Credit: 3,006,602
RAC: 0
Message 3574 - Posted: 29 May 2008, 15:36:01 UTC - in response to Message 3572.  

Since Windows WUs error out quickly, rather than taking many hours to get nowhere as they do on Linux, I've enabled Work Fetch on 2 of my boxes to help clear them out. Apart from lost credit, I'm hoping there won't be any problems.

You see, I don't only crunch for credits...

Rod

Good idea, Rod. We have added the rest of our Win machines to the fray....Cheers, Rog.

Odd-Rod
Joined: 7 Sep 07
Posts: 444
Credit: 5,712,523
RAC: 0
Message 3578 - Posted: 29 May 2008, 18:01:28 UTC - in response to Message 3574.  

Since Windows WUs error out quickly, rather than taking many hours to get nowhere as they do on Linux, I've enabled Work Fetch on 2 of my boxes to help clear them out. Apart from lost credit, I'm hoping there won't be any problems.

You see, I don't only crunch for credits...

Rod

Good idea, Rod. We have added the rest of our Win machines to the fray....Cheers, Rog.


Ah, spoken like a true alpha tester!

I have now put all my hosts back on again.

Since putting on those first 2 hosts, I have completed 9 WUs with only 1 error. I don't think I got much less credit for the time spent on those 9 than for the same time on some other projects. So credit crunchers, it might not be so bad... :-)

Rod

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 3580 - Posted: 29 May 2008, 18:04:11 UTC - in response to Message 3578.  

Our database has been mostly purged, so it's a bit more manageable. I'm running the SQL to delete the workunits related to the bad search from the database right now. I'm not quite sure how long it will take, but just to give you an idea: it took over an hour just to do a GET on the WUs from the bad search. We've updated our daily scripts so that we're running the correct script to clean out the database; I think this is what might have caused the problems. Let me know if you get any more of the bad workunits. I'm pretty sure they should all be cleared out of the database within an hour or so.

AnRM
Joined: 6 Mar 08
Posts: 15
Credit: 3,006,602
RAC: 0
Message 3582 - Posted: 29 May 2008, 19:21:04 UTC

Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog.

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 3585 - Posted: 29 May 2008, 20:43:26 UTC - in response to Message 3582.  

Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog.


Another update on the problem :( I've been trying to manually remove the bad workunits from the database, but the connection keeps timing out. We might have to drag labstaff into this to try and fix the database. Either way, the server shouldn't be generating any more bad workunits, so I'm hoping they'll work their way out of the system even if I can't manually get them out of the database.
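
For readers wondering why a single large delete can time out: one common workaround is to remove rows in small batches, committing after each one, so that no single statement holds the connection for long. The following is only a sketch of that general idea, not the project's actual cleanup script. It uses an in-memory SQLite database, and the `workunit` table, its columns, and the sample names are assumptions made for illustration.

```python
import sqlite3


def delete_in_batches(conn, name_prefix, batch_size=1000):
    """Delete rows whose name starts with name_prefix, one small batch per commit.

    The table and column names are hypothetical, chosen for this sketch only.
    substr() is used instead of LIKE because '_' is a wildcard in LIKE patterns.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM workunit WHERE id IN ("
            "  SELECT id FROM workunit WHERE substr(name, 1, ?) = ? LIMIT ?"
            ")",
            (len(name_prefix), name_prefix, batch_size),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Build a small in-memory table with made-up workunit names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT)")
names = [f"gs_3737082_{i}" for i in range(2500)] + [f"gs_596_{i}" for i in range(500)]
conn.executemany("INSERT INTO workunit (name) VALUES (?)", [(n,) for n in names])
conn.commit()

removed = delete_in_batches(conn, "gs_3737", batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM workunit").fetchone()[0]
print(removed, remaining)
```

On a real server the batch size and the pause between batches would need tuning against the live load; this sketch ignores both.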

Odd-Rod
Joined: 7 Sep 07
Posts: 444
Credit: 5,712,523
RAC: 0
Message 3590 - Posted: 29 May 2008, 21:35:34 UTC - in response to Message 3585.  

Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog.


I second that!


Another update on the problem :( I've been trying to manually remove the bad workunits from the database, but the connection keeps timing out. We might have to drag labstaff into this to try and fix the database. Either way, the server shouldn't be generating any more bad workunits, so I'm hoping they'll work their way out of the system even if I can't manually get them out of the database.


I fully appreciate crunchers putting Linux hosts on No New Work (NNW) because their systems get stuck for hours. However, I will leave all my Windows hosts fetching work and accept the occasional error WU. After all, it's only minutes wasted, not hours. (Well, on my slowest host it IS almost an hour! Just over 56min!)

Rod

Wang Solutions
Joined: 22 Dec 07
Posts: 13
Credit: 46,606,530
RAC: 0
Message 3591 - Posted: 30 May 2008, 3:12:41 UTC
Last modified: 30 May 2008, 3:14:17 UTC

It looks like we have new work starting to come through, hopefully all clean this time. Thanks Travis.

EDIT: But no, the first one I got was a dud. :(
Proud member of BOINC@AUSTRALIA

stefsaber
Joined: 2 Apr 08
Posts: 32
Credit: 1,017,362
RAC: 0
Message 3592 - Posted: 30 May 2008, 4:06:26 UTC - in response to Message 3591.  

It looks like we have new work starting to come through, hopefully all clean this time. Thanks Travis.

EDIT: But no, the first one I got was a dud. :(


Got 2 WUs. One is of the gs_373 variety and seems to be running forever; the gs_596 unit is running at top speed, though. It kind of looks like their progress is off by a factor of ten: 14% versus 1.4% completion.

I'll leave MilkyWay running to help get rid of those 37s, even if it does chew up a bit of time.

Thanks for the updates and continuing attempts to solve this though!

Misfit
Joined: 27 Aug 07
Posts: 915
Credit: 1,503,319
RAC: 0
Message 3594 - Posted: 30 May 2008, 5:31:53 UTC

It's only a true alpha when a WU is granted -1M credits.
me@rescam.org

RAMen
Joined: 8 Apr 08
Posts: 45
Credit: 161,943,995
RAC: 0
Message 3595 - Posted: 30 May 2008, 5:56:00 UTC

Got a batch of new work units, including some of the gs_3737082_1211996843_1095635 variety. One unit ran for 25 minutes with zero progress on the Ubuntu Linux workstation, so I aborted the lot.

Thanks for the update on cleaning the database, Travis.
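
Anyone aborting these by hand on several hosts could script the selection step. The sketch below only builds the command lines and does not run them. It assumes the bad tasks can be recognised by a name prefix (which the thread suggests but does not guarantee), uses a placeholder project URL, and relies on the `boinccmd --task <url> <name> abort` form of the BOINC command-line tool.

```python
import shlex

# Placeholder URL for this sketch; substitute the project URL your client is attached to.
PROJECT_URL = "http://example.org/project/"


def abort_commands(task_names, bad_prefixes, project_url):
    """Return one boinccmd argument list per task whose name starts with a bad prefix."""
    prefixes = tuple(bad_prefixes)
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in task_names
        if name.startswith(prefixes)
    ]


# The first name is the reported workunit name with a "_0" result suffix added
# as an assumption; the second task name is the one quoted in the thread.
tasks = [
    "gs_3737082_1211996843_1095635_0",
    "gs_591_1212007885_142924_0",
]

cmds = abort_commands(tasks, ["gs_373"], PROJECT_URL)
for cmd in cmds:
    print(shlex.join(cmd))  # review before running anything
```

Note that a prefix filter would not catch every looping task: gs_591_1212007885_142924_0 was also reported as stuck in this thread but does not match gs_373.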

Hefto99
Joined: 29 Dec 07
Posts: 9
Credit: 101,275,462
RAC: 28,248
Message 3598 - Posted: 30 May 2008, 10:58:28 UTC

I have got one gs_3737* too:
this one

Altivo
Joined: 5 Dec 07
Posts: 6
Credit: 1,687,632
RAC: 0
Message 3599 - Posted: 30 May 2008, 11:14:35 UTC

I ended up aborting a whole stack of gs_3737... units because they just looped and ran forever. Now I had to abort gs_591_1212007885_142924_0 for the same reason. It was running for 11+ hours. This is on a Linux workstation (Slackware) with BOINC 5.8.16.

It is particularly troubling that these tasks do not seem to surrender control of the CPU when their hour time slot is up. Looping or not, this is hostile behavior that keeps other projects from getting their share of the available services. I've noticed Milkyway tasks doing this before, and I don't particularly like it.

This has been a severe enough problem that I frankly think it should have been posted to the project front page sooner than it was. I lost hours of processing time on multiple machines because they were busily looping away and locking out other legitimate projects.

Hefto99
Joined: 29 Dec 07
Posts: 9
Credit: 101,275,462
RAC: 28,248
Message 3600 - Posted: 30 May 2008, 12:40:05 UTC
Last modified: 30 May 2008, 12:42:28 UTC

Is the problem with gs_373* related to the SDSS Stripe 82 search?

EDIT: The date would fit - 27th May

Hefto99
Joined: 29 Dec 07
Posts: 9
Credit: 101,275,462
RAC: 28,248
Message 3601 - Posted: 30 May 2008, 12:54:02 UTC
Last modified: 30 May 2008, 12:55:09 UTC

gs_3737* are still being generated:

WU1
WU2

stefsaber
Joined: 2 Apr 08
Posts: 32
Credit: 1,017,362
RAC: 0
Message 3602 - Posted: 30 May 2008, 12:59:48 UTC - in response to Message 3601.  

Just got a batch of 20 new WUs and happy to report no sign of the 373s!

alijay
Joined: 15 Apr 08
Posts: 55
Credit: 24,047
RAC: 0
Message 3605 - Posted: 30 May 2008, 14:44:58 UTC

Only had one '373' early today. It was a re-issue that has now reached its "too many errors" limit and will not be reissued again.

Since then everything has gone through perfectly.

Sysadm@Nbg
Joined: 24 Jan 08
Posts: 6
Credit: 14,836
RAC: 0
Message 3606 - Posted: 30 May 2008, 15:36:15 UTC - in response to Message 3547.  

And at the moment: "no new work" accepted from Milky, sorry !!

I have begun crunching again. So far no bad WUs have been seen...
Thanks to Travis for his work; good job!
Sysadm@Nbg
Member of Team


Webmaster Yoda
Joined: 21 Dec 07
Posts: 69
Credit: 7,048,412
RAC: 0
Message 3607 - Posted: 30 May 2008, 15:41:47 UTC
Last modified: 30 May 2008, 15:51:29 UTC

Am still getting the odd one or two here (just aborted one less than a minute ago) so it doesn't look like whatever Travis has done to try and get rid of them is working 100%.

FWIW, they are all re-issues - perhaps there's a way (at the server end) to stop these from being re-issued? At this rate the problem will not go away for quite some time, especially if the work units end up going to unattended/unmonitored Linux hosts.

EDIT: had to abort two more (freshly downloaded) within minutes of posting
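
To illustrate the kind of server-side rule being asked about, here is a toy model of a re-issue decision: a work unit stops being sent out once it has been cancelled or has collected too many errored results. This is not the project's scheduler code. The `WorkUnit` class, its field names, and the limit of 3 are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Hypothetical work unit record; every field here is invented for illustration."""
    name: str
    error_results: int = 0
    max_error_results: int = 3  # assumed limit, not a known project setting
    cancelled: bool = False


def should_reissue(wu: WorkUnit) -> bool:
    """Re-issue only if the unit is not cancelled and is still under its error limit."""
    return not wu.cancelled and wu.error_results < wu.max_error_results


def record_error(wu: WorkUnit) -> bool:
    """Count one errored or aborted result and report whether a re-issue is still allowed."""
    wu.error_results += 1
    return should_reissue(wu)


wu = WorkUnit("gs_3737_example")
outcomes = [record_error(wu) for _ in range(3)]
print(outcomes)  # the third error reaches the limit, so no further re-issue

flagged = WorkUnit("gs_3737_flagged", cancelled=True)
print(should_reissue(flagged))  # a cancelled unit is never re-issued
```

Under this model, lowering the limit or flagging the bad units as cancelled would stop them circulating without waiting for each one to fail repeatedly on volunteers' hosts.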
Join the #1 Aussie Alliance on MilkyWay!

alijay
Joined: 15 Apr 08
Posts: 55
Credit: 24,047
RAC: 0
Message 3608 - Posted: 30 May 2008, 17:41:45 UTC - in response to Message 3607.  



FWIW, they are all re-issues - perhaps there's a way (at the server end) to stop these from being re-issued? At this rate the problem will not go away for quite some time, especially if the work units end up going to unattended/unmonitored Linux hosts.



I just hope that there is not a poor Linux user out there who has not checked his/her cruncher for the last three days!!!!

©2024 Astroinformatics Group