Problem with new W/Us
Author | Message |
---|---|
Send message Joined: 21 Dec 07 Posts: 69 Credit: 7,048,412 RAC: 0 |
Sure is an ongoing plague. Did not have time to check my computers for an hour today. Sure enough, when I got back to them all 26 cores were hanging on those nasty things. Will have to suspend MW overnight so I don't waste another 150-200 hours of CPU time on useless work units. Join the #1 Aussie Alliance on MilkyWay! |
Send message Joined: 6 Mar 08 Posts: 15 Credit: 3,006,602 RAC: 0 |
Since Windows WUs error out quickly, rather than taking many hours to get nowhere as they do on Linux, I've enabled Work Fetch on 2 of my boxes to help clear them out. Apart from lost credit, I'm hoping there won't be any problems. Good idea, Rod. We have added the rest of our Win machines to the fray....Cheers, Rog. |
Send message Joined: 7 Sep 07 Posts: 444 Credit: 5,712,523 RAC: 0 |
Since Windows WUs error out quickly, rather than taking many hours to get nowhere as they do on Linux, I've enabled Work Fetch on 2 of my boxes to help clear them out. Apart from lost credit, I'm hoping there won't be any problems. Ah, spoken like a true alpha tester! I have now put all my hosts back on again. Since putting on those first 2 hosts, I have completed 9 WUs with only 1 error. I don't think I got much less credit for the time spent on those 9 than for the same time on some other projects. So credit crunchers, it might not be so bad... :-) Rod |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Our database has been mostly purged so it's a bit more manageable. I'm running the SQL to delete the workunits related to the bad search from the database right now. I'm not quite sure how long it will take, but just to give you an idea: it took over an hour just to do a GET on the WUs from the bad search. We've updated our daily scripts so we're running the correct script to clean out the database; I think this is what might have caused the problems. Let me know if you get any more of the bad workunits. I'm pretty sure they should all be cleared out of the database within an hour or so. |
Send message Joined: 6 Mar 08 Posts: 15 Credit: 3,006,602 RAC: 0 |
Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog. Another update on the problem :( I've been trying to manually remove the bad workunits from the database, but the connection keeps timing out. We might have to drag labstaff into this to try and fix the database. Either way, the server shouldn't be generating any more bad workunits, so I'm hoping they'll work their way out of the system even if I can't manually get them out of the database. |
Send message Joined: 7 Sep 07 Posts: 444 Credit: 5,712,523 RAC: 0 |
I second that! Thanks for the update, Travis. You guys do a great job communicating. It's appreciated.....Cheers, Rog. I fully appreciate crunchers putting Linux hosts on No New Work (NNW) because their systems get stuck for hours. However, I will leave all my Windows hosts fetching work and accept the occasional error WU. After all, it's only minutes wasted, not hours. (Well, on my slowest host it IS almost an hour! Just over 56min!) Rod |
Send message Joined: 22 Dec 07 Posts: 13 Credit: 46,606,530 RAC: 0 |
It looks like we have new work starting to come through, hopefully all clean this time. Thanks Travis. EDIT: But no, the first one I got was a dud. :( Proud member of BOINC@AUSTRALIA |
Send message Joined: 2 Apr 08 Posts: 32 Credit: 1,017,362 RAC: 0 |
It looks like we have new work starting to come through, hopefully all clean this time. Thanks Travis. Got 2 WUs: one is of the gs_373 variety and seems to be running forever; the gs_596 unit is running at top speed though... Kind of looks like they're running off by a factor of ten (14% to 1.4% completion)... I'll leave MilkyWay running to help get rid of those 37's, even if it does chew up a bit of time. Thanks for the updates and continuing attempts to solve this though! |
Send message Joined: 27 Aug 07 Posts: 915 Credit: 1,503,319 RAC: 0 |
It's only a true alpha when a WU is granted -1M credits. me@rescam.org |
Send message Joined: 8 Apr 08 Posts: 45 Credit: 161,943,995 RAC: 0 |
Got a batch of new work units, including the gs_3737082_1211996843_1095635 variety. One unit ran for 25 minutes with zero progress on the Ubuntu Linux workstation, so I aborted the lot. Thanks for the update on cleaning the database, Travis. |
Send message Joined: 29 Dec 07 Posts: 9 Credit: 101,276,558 RAC: 27,792 |
I have got one gs_3737* too: this one |
Send message Joined: 5 Dec 07 Posts: 6 Credit: 1,687,632 RAC: 0 |
I ended up aborting a whole stack of gs_3737... units because they just looped and ran forever. Now I had to abort gs_591_1212007885_142924_0 for the same reason. It was running for 11+ hours. This is on a Linux workstation (Slackware) with BOINC 5.8.16. It is particularly troubling that these tasks do not seem to surrender control of the CPU when their hour time slot is up. Looping or not, this is hostile behavior that keeps other projects from getting their share of the available services. I've noticed Milkyway tasks doing this before, and I don't particularly like it. This has been a severe enough problem that I frankly think it should have been posted to the project front page sooner than it was. I lost hours of processing time on multiple machines because they were busily looping away and locking out other legitimate projects. |
Send message Joined: 2 Apr 08 Posts: 32 Credit: 1,017,362 RAC: 0 |
Just got a batch of 20 new WUs and happy to report no sign of the 373's! |
Send message Joined: 15 Apr 08 Posts: 55 Credit: 24,047 RAC: 0 |
Only had one '373' early today; it was a re-issue and has now reached its "too many errors" notice, so it will not be reissued again. Since then everything has gone through perfectly. |
Send message Joined: 21 Dec 07 Posts: 69 Credit: 7,048,412 RAC: 0 |
Am still getting the odd one or two here (just aborted one less than a minute ago) so it doesn't look like whatever Travis has done to try and get rid of them is working 100%. FWIW, they are all re-issues - perhaps there's a way (at the server end) to stop these from being re-issued? At this rate the problem will not go away for quite some time, especially if the work units end up going to unattended/unmonitored Linux hosts. EDIT: had to abort two more (freshly downloaded) within minutes of posting Join the #1 Aussie Alliance on MilkyWay! |
Send message Joined: 15 Apr 08 Posts: 55 Credit: 24,047 RAC: 0 |
I just hope that there is not a poor Linux user out there who has not checked his/her cruncher for the last three days!!!! |
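
Several posts above describe spotting the looping work units by name (the gs_3737... batch) and aborting them by hand. A minimal sketch of that triage in Python, assuming the bad batch can be recognised by a name prefix. The `is_suspect` and `abort_commands` helpers, the prefix list, and the placeholder project URL are illustrative assumptions, not anything supplied by the project. The script only builds `boinccmd --task <url> <name> abort` command lines and runs nothing; the names are used as posted in the thread, and real task names may carry a result suffix such as `_0`.

```python
from typing import Iterable, List, Sequence

# Assumption: the bad search is recognisable by this name prefix, taken from
# the work unit names quoted in the thread (e.g. gs_3737082_...).
SUSPECT_PREFIXES: Sequence[str] = ("gs_373",)

# Placeholder only; substitute the project URL shown by your BOINC client.
PROJECT_URL = "http://project.example/"


def is_suspect(task_name: str, prefixes: Sequence[str] = SUSPECT_PREFIXES) -> bool:
    """Return True if the task name starts with any suspect prefix."""
    return task_name.startswith(tuple(prefixes))


def abort_commands(task_names: Iterable[str], project_url: str = PROJECT_URL) -> List[List[str]]:
    """Build (but do not run) one boinccmd abort command per suspect task."""
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in task_names
        if is_suspect(name)
    ]


if __name__ == "__main__":
    # Names copied from posts in this thread.
    tasks = ["gs_3737082_1211996843_1095635", "gs_591_1212007885_142924_0"]
    for cmd in abort_commands(tasks):
        print(" ".join(cmd))
```

Note that gs_591_1212007885_142924_0 is not caught by the prefix even though one poster reported it looping as well, so name matching alone is probably not a complete check; watching for tasks that accumulate CPU time with no progress would still be needed.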