Problem with new W/Us

Author	Message
Webmaster Yoda Joined: 21 Dec 07 Posts: 69 Credit: 7,048,412 RAC: 0	Message 3573 - Posted: 29 May 2008, 15:14:57 UTC Sure is an ongoing plague. Did not have time to check my computers for an hour today. Sure enough, when I got back to them, all 26 cores were hanging on those nasty things. Will have to suspend MW overnight so I don't waste another 150-200 hours of CPU time on useless work units. Join the #1 Aussie Alliance on MilkyWay!

AnRM Joined: 6 Mar 08 Posts: 15 Credit: 3,006,602 RAC: 0	Message 3574 - Posted: 29 May 2008, 15:36:01 UTC - in response to Message 3572. [quote]Since Windows WUs error out quickly, rather than taking many hours to get nowhere as they do on Linux, I've enabled Work Fetch on 2 of my boxes to help clear them out. Apart from lost credit, I'm hoping there won't be any problems. You see, I don't only crunch for credits... Rod[/quote] Good idea, Rod. We have added the rest of our Win machines to the fray....Cheers, Rog.

Odd-Rod Joined: 7 Sep 07 Posts: 444 Credit: 5,715,481 RAC: 0	Message 3578 - Posted: 29 May 2008, 18:01:28 UTC - in response to Message 3574. [quote][quote]Since Windows WUs error out quickly, rather than taking many hours to get nowhere as they do on Linux, I've enabled Work Fetch on 2 of my boxes to help clear them out. Apart from lost credit, I'm hoping there won't be any problems. You see, I don't only crunch for credits... Rod[/quote] Good idea, Rod. We have added the rest of our Win machines to the fray....Cheers, Rog.[/quote] Ah, spoken like a true alpha tester! I have now put all my hosts back on again. Since putting on those first 2 hosts, I have completed 9 WUs with only 1 error. I don't think I got much less credit for the time spent on those 9 than for the same time on some other projects. So credit crunchers, it might not be so bad... :-) Rod

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 3580 - Posted: 29 May 2008, 18:04:11 UTC - in response to Message 3578. Our database has been mostly purged so it's a bit more manageable. I'm running the SQL to delete the workunits related to the bad search from the database right now. I'm not quite sure how long it will take, but just to give you an idea: it took over an hour just to do a GET on the WUs from the bad search. We've updated our daily scripts so we're running the correct script to clean out the database; I think this is what might have caused the problems. Let me know if you get any more of the bad workunits. I'm pretty sure they should all be cleared out of the database within an hour or so.

AnRM Joined: 6 Mar 08 Posts: 15 Credit: 3,006,602 RAC: 0	Message 3582 - Posted: 29 May 2008, 19:21:04 UTC Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog.

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 3585 - Posted: 29 May 2008, 20:43:26 UTC - in response to Message 3582. [quote]Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog.[/quote] another update on the problem :( i've been trying to manually remove the bad workunits from the database, but the connection keeps timing out. we might have to drag labstaff into this to try and fix the database. either way the server shouldn't be generating any more bad workunits, so i'm hoping they'll work their way out of the system even if i can't manually get them out of the database.

Odd-Rod Joined: 7 Sep 07 Posts: 444 Credit: 5,715,481 RAC: 0	Message 3590 - Posted: 29 May 2008, 21:35:34 UTC - in response to Message 3585. [quote]Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog.[/quote] I second that! [quote]another update on the problem :( i've been trying to manually remove the bad workunits from the database, but the connection keeps timing out. we might have to drag labstaff into this to try and fix the database. either way the server shouldn't be generating any more bad workunits, so i'm hoping they'll work their way out of the system even if i can't manually get them out of the database.[/quote] I fully appreciate crunchers putting Linux hosts on No New Work (NNW) because their systems get stuck for hours. However, I will leave all my Windows hosts fetching work and accept the occasional error WU. After all, it's only minutes wasted, not hours. (Well, on my slowest host it IS almost an hour! Just over 56min!) Rod

Wang Solutions Joined: 22 Dec 07 Posts: 13 Credit: 46,606,530 RAC: 0	Message 3591 - Posted: 30 May 2008, 3:12:41 UTC Last modified: 30 May 2008, 3:14:17 UTC It looks like we have new work starting to come through, hopefully all clean this time. Thanks Travis. EDIT: But no, the first one I got was a dud. :( Proud member of BOINC@AUSTRALIA

stefsaber Joined: 2 Apr 08 Posts: 32 Credit: 1,017,362 RAC: 0	Message 3592 - Posted: 30 May 2008, 4:06:26 UTC - in response to Message 3591. [quote]It looks like we have new work starting to come through, hopefully all clean this time. Thanks Travis. EDIT: But no, the first one I got was a dud. :([/quote] Got 2 WU's. One is of the gs_373 variety and seems to be running forever; the gs_596 unit is running at top speed though... Kind of looks like they're running off by a factor of ten (14% to 1.4% completion)... I'll leave milkyway running to help get rid of those 37's, even if it does chew up a bit of time. Thanks for the updates and continuing attempts to solve this though!

Misfit Joined: 27 Aug 07 Posts: 915 Credit: 1,503,319 RAC: 0	Message 3594 - Posted: 30 May 2008, 5:31:53 UTC It's only a true alpha when a WU is granted -1M credits. me@rescam.org

RAMen Joined: 8 Apr 08 Posts: 45 Credit: 161,943,995 RAC: 0	Message 3595 - Posted: 30 May 2008, 5:56:00 UTC Got a batch of new work units, including the gs_3737082_1211996843_1095635 variety. One unit ran for 25 minutes with zero progress on the Ubuntu Linux workstation, so I aborted the lot. Thanks for the update on cleaning the database, Travis.

Hefto99 Joined: 29 Dec 07 Posts: 9 Credit: 106,219,317 RAC: 1,514	Message 3598 - Posted: 30 May 2008, 10:58:28 UTC I have got one gs_3737* too: this one

Altivo Joined: 5 Dec 07 Posts: 6 Credit: 1,687,632 RAC: 0	Message 3599 - Posted: 30 May 2008, 11:14:35 UTC I ended up aborting a whole stack of gs_3737... units because they just looped and ran forever. Now I had to abort gs_591_1212007885_142924_0 for the same reason. It was running for 11+ hours. This is on a Linux workstation (Slackware) with BOINC 5.8.16. It is particularly troubling that these tasks do not seem to surrender control of the CPU when their hour time slot is up. Looping or not, this is hostile behavior that keeps other projects from getting their share of the available services. I've noticed Milkyway tasks doing this before, and I don't particularly like it. This has been a severe enough problem that I frankly think it should have been posted to the project front page sooner than it was. I lost hours of processing time on multiple machines because they were busily looping away and locking out other legitimate projects.

Hefto99 Joined: 29 Dec 07 Posts: 9 Credit: 106,219,317 RAC: 1,514	Message 3600 - Posted: 30 May 2008, 12:40:05 UTC Last modified: 30 May 2008, 12:42:28 UTC Is the problem with gs_373* related to the SDSS Stripe 82 search? [edit]Date would fit - 27th May[/edit]

Hefto99 Joined: 29 Dec 07 Posts: 9 Credit: 106,219,317 RAC: 1,514	Message 3601 - Posted: 30 May 2008, 12:54:02 UTC Last modified: 30 May 2008, 12:55:09 UTC gs_3737* are still being generated: WU1 WU2

stefsaber Joined: 2 Apr 08 Posts: 32 Credit: 1,017,362 RAC: 0	Message 3602 - Posted: 30 May 2008, 12:59:48 UTC - in response to Message 3601. Just got a batch of 20 new WU's and happy to report no sight of the 373's!

alijay Joined: 15 Apr 08 Posts: 55 Credit: 24,047 RAC: 0	Message 3605 - Posted: 30 May 2008, 14:44:58 UTC Only had one '373' early today. It was a re-issue that has now reached its "too many errors" notice and will not be reissued again. Since then everything has gone through perfectly.

Sysadm@Nbg Joined: 24 Jan 08 Posts: 6 Credit: 14,836 RAC: 0	Message 3606 - Posted: 30 May 2008, 15:36:15 UTC - in response to Message 3547. [quote]And at the moment: "no new work" accepted from Milky, sorry !![/quote] I have begun to crunch again. So far no bad WU has been seen ... thanks to Travis for his work; good job! Sysadm@Nbg Member of Team

Webmaster Yoda Joined: 21 Dec 07 Posts: 69 Credit: 7,048,412 RAC: 0	Message 3607 - Posted: 30 May 2008, 15:41:47 UTC Last modified: 30 May 2008, 15:51:29 UTC Am still getting the odd one or two here (just aborted one less than a minute ago) so it doesn't look like whatever Travis has done to try and get rid of them is working 100%. FWIW, they are all re-issues - perhaps there's a way (at the server end) to stop these from being re-issued? At this rate the problem will not go away for quite some time, especially if the work units end up going to unattended/unmonitored Linux hosts. EDIT: had to abort two more (freshly downloaded) within minutes of posting. Join the #1 Aussie Alliance on MilkyWay!

alijay Joined: 15 Apr 08 Posts: 55 Credit: 24,047 RAC: 0	Message 3608 - Posted: 30 May 2008, 17:41:45 UTC - in response to Message 3607. [quote]FWIW, they are all re-issues - perhaps there's a way (at the server end) to stop these from being re-issued? At this rate the problem will not go away for quite some time, especially if the work units end up going to unattended/unmonitored Linux hosts.[/quote] I just hope that there is not a poor Linux user out there who has not checked his/her cruncher for the last three days!!!!