Message boards : Number crunching : Lack of communication
Joined: 5 Dec 07 Posts: 6 Credit: 1,687,632 RAC: 0
Of the numerous projects for which I've been volunteering my CPU time, this one seems to have the worst management and absolutely the worst communication. The cavalier attitude about thousands of work units with improper timing parameters is a good example. Has anyone considered the fact that these units not only waste our time by running for 22 hours and then crashing on "exceeded CPU time limit", but, because of the ridiculously short deadlines set for completion, BOINC also gives them top priority, locking out work for other projects?

This last issue is really inexcusable. It's one thing to waste volunteer time on bad work units, but it's quite another to steal time from other projects that manage their work more effectively. On my machines that run MilkyWay, almost everything else has been completely shut out as these badly coded work units hog all the available cycles and then crash. Yet the only response from project management has been "It will work itself out in due time." The truth is, no, it won't. After a WU crashes, it just gets reassigned to someone else and crashes the same way, until all of those thousands of units eventually get assigned to someone who has a hot-rodded quad-core machine or something and manages to get through them. Meanwhile the rest of us continue to burn electricity and cycles for nothing, and our other project work is shut out.

No more for me. I'm now blocking MilkyWay units on all my machines.
Joined: 28 Aug 07 Posts: 146 Credit: 10,703,601 RAC: 0
Has anyone considered the fact that these units not only waste our time by running for 22 hours and then crashing on "exceeded CPU time limit", but, because of the ridiculously short deadlines set for completion...

Sorry to hear that your machine is too slow to crunch the current WUs; 22 hours is too long indeed. :-( I remember this has already been addressed elsewhere in one of the other threads, so the admins should probably be aware of it.

Besides that, have you ever read some of the other threads? I wouldn't call the communication here bad; Travis and Nathan do a very good job of keeping us users satisfied most of the time. Go and look in the SHA-1 or vtu-forums; you won't find any admin there posting as much as here. That is what I call bad communication. And remember, this project is hosted at a university; especially in summer there are not many people around. ;-)

For the deadline problem: as Nathan said, only Travis is able to change it. He's not in town at the moment, I guess, so he will have to fix it when he's back...

Member of BOINC@Heidelberg and ATA! My BOINCstats
Joined: 19 Apr 08 Posts: 7 Credit: 3,067 RAC: 0
Hi,

I remember this has already been addressed elsewhere in one of the other threads, so the admins should probably be aware of it.

Yes, the problem is well known: the deadlines and the maximum allowed CPU time don't reflect the new WU length. We were told that Travis is the only one who knows how to adjust these values, but he's out of town. The best solution at the moment is: go back to the status quo before the change until they figure out how to set the values properly.

Regards, Carsten
Joined: 5 Dec 07 Posts: 6 Credit: 1,687,632 RAC: 0
Sure, I've read the other threads. Mostly they pose questions, and no answers are being given. This problem is large enough that it ought to be on the "home" page of the project, though, warning people of what is going on and giving them some official advice about what to do. Digging through verbose threads on a forum only to find others reporting the same problems isn't really that helpful. All it does is confirm that there really is a problem.

Don't they TEST things when making major changes, rather than just dumping thousands of work units out on all of us? Apparently not.

As far as I can see, I'm the only one who has raised the issue of unfairness to other projects that have their act more together, like WCG or Rosetta. Locking their units out by insisting that MW needs more immediate priority, which is what happens with these malformed units, is really irresponsible on the part of this project's administration. That's my serious complaint.
Joined: 30 Aug 07 Posts: 125 Credit: 207,206 RAC: 0
Don't they TEST things when making major changes, rather than just dumping thousands of work units out on all of us? Apparently not.

Nathan wrote: In response to the longer WU problem, I give you WUs much longer than what you've been experiencing. I'm not quite sure how much longer, so let me know how they are performing; I don't want you crunching so slow it's disheartening now. :)

Nathan wrote: Thanks for the quick feedback! I'll cut this down, I had a feeling it'd be a little overkill after I had started it.

Nathan wrote: After seeing some of the times, I'm thinking that the "medium" length ones will probably work out the best for everyone, so figure on timings roughly around what you're getting with the 373... series.

Now the only person we haven't heard from yet is Travis, who is the lead person for changes like this.

Jord. The BOINC FAQ Service.
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0
Also, once it was discovered that the BOINC run parameters were not correct and that this might cause problems for 'slower' hosts, the steps one could take to work around the issue until Travis gets back and makes the permanent fix were posted. So it's not like there was nowhere to turn for help over this. I will grant you that a front page news 'heads up' item is appropriate. Then folks could take a look here and decide if they want to get involved with the workarounds or just suspend MW until the official fixes are in place.

FWIW, my PII (450), K6-2's (500's), and PIII (550) can all make the deadline, even on the long 372x's. For the medium length 373x's they can finish in well under half the current deadline, so they wouldn't even have to be run flat out or run MW exclusively. In addition, depending on what you have for a cache setting it might seem like MW is hogging the host, but let me assure you that the BOINC CC will shut it down in pretty short order on its own once it figures out what's going on with the new work (or you can help it get there quicker manually).

This is not to say that I haven't had to make a few manual interventions in order to correct problems the change here made in the work stream for them (they run other projects as well). However, it is far from being a 'showstopper', or even all that big a deal AFAICT.

Alinator
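To make the deadline arithmetic in the post above a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. The 22-hour runtime, the duty cycles, and the 5-day deadline are assumptions taken from figures quoted elsewhere in this thread, not measured values.

```python
# Rough deadline-feasibility check: can a host that only crunches part of the
# day still return a long MilkyWay WU in time? Numbers are illustrative.

def can_make_deadline(runtime_hours, hours_per_day, deadline_days):
    """True if the task finishes before the deadline at the given duty cycle."""
    days_needed = runtime_hours / hours_per_day
    return days_needed <= deadline_days

if __name__ == "__main__":
    # A long "372x" unit at roughly 22 hours of CPU time, 5-day deadline (assumed):
    print(can_make_deadline(22, hours_per_day=24, deadline_days=5))  # True: 24/7 host
    print(can_make_deadline(22, hours_per_day=12, deadline_days=5))  # True: ~1.8 days
    print(can_make_deadline(22, hours_per_day=2,  deadline_days=5))  # False: part-time laptop
```

Under these assumptions a host running about half the day still makes it comfortably, while a laptop that is only on for a couple of hours a day does not, which is the gap the next post points out.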
Joined: 19 Apr 08 Posts: 7 Credit: 3,067 RAC: 0
Hi,

FWIW, my PII (450), K6-2's (500's), and PIII (550) can all make the deadline, even on the long 372x's.

But only if you crunch 24/7. My PIII laptop would need 10 days before it could return a long WU, given its current uptime.

Regards, Carsten
Joined: 9 Jul 08 Posts: 85 Credit: 44,842,651 RAC: 0
As far as I can see, I'm the only one who has raised the issue of unfairness to other projects that have their act more together, like WCG or Rosetta. Locking their units out by insisting that MW needs more immediate priority, which is what happens with these malformed units, is really irresponsible on the part of this project's administration. That's my serious complaint.

I don't want to turn this thread into a disagreement, because I do agree with you that it's been too long with no communication from Travis, the project admin. However, I think the reason you've not heard a lot of complaints about MW running high-priority and stopping other projects from running is that it's not really a problem. BOINC has a 'debt' system for projects that ensures that projects will get the priority YOU assign in the long term. For example, I had 12 computers that went into 'panic mode' to finish MW units, but out of those 12, only 2 have gone back to processing MW once they emptied the queue. Once the 'long-term debt' that is owed to the projects that couldn't work while MW was forcing the issue is satisfied, MW will start running again. It looks like for most of the machines that will be 4-7 days of running no MW. I didn't do anything... didn't touch any files or edit anything or force any updates or detach or anything. BOINC just did it on its own and everything has worked out.

Secondly, I think it's worth bearing in mind that this project is still in its beta stage. I believe it is a bit unfair to attach your computers to a project that you know isn't fully functional and then expect it to run without causing any problems.

Finally, bear in mind that the problem came about because the project had run out of work and many users were complaining because there was no work available. The project scientist who created the new work simply made a mistake. He posted that he was trying something new and that it might cause problems at the time he did it. If you're going to run a beta project, I really think you must accept that you have to follow the message boards, be prepared to adjust to problems, and provide feedback to the project staff.
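To illustrate the long-term debt behaviour described above, here is a minimal Python sketch. The project names, the equal resource shares, the hourly time step, and the simple debt-update rule are all illustrative assumptions; this is not BOINC's actual scheduler code, just the general idea of why a project that ran in panic mode gets shut out for a few days afterwards.

```python
# Minimal long-term-debt style scheduler sketch (assumptions, not BOINC internals).

def update_debts(debts, shares, running, dt):
    """Accrue debt for idle projects and pay it down for the running one."""
    total_share = sum(shares.values())
    for name in debts:
        expected = dt * shares[name] / total_share   # CPU time the project "deserves"
        actual = dt if name == running else 0.0      # CPU time it actually got
        debts[name] += expected - actual
    return debts

def pick_project(debts):
    """Run whichever project is owed the most CPU time."""
    return max(debts, key=debts.get)

if __name__ == "__main__":
    shares = {"MilkyWay": 100, "Rosetta": 100, "WCG": 100}
    debts = {name: 0.0 for name in shares}

    # MilkyWay hogs the CPU in "panic mode" for 48 hours...
    for _ in range(48):
        debts = update_debts(debts, shares, running="MilkyWay", dt=1.0)

    # ...then the scheduler chooses freely for the next few days.
    schedule = []
    for _ in range(96):
        choice = pick_project(debts)
        schedule.append(choice)
        debts = update_debts(debts, shares, running=choice, dt=1.0)

    print("First 12 hours after panic mode:", schedule[:12])
    # MilkyWay stays off the schedule until the other projects' debt is repaid,
    # which with these numbers takes on the order of a few days.
```

With equal shares and a 48-hour panic-mode binge, MilkyWay only reappears in the schedule after roughly four days, which is in the same ballpark as the 4-7 days reported above.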
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0
Hi,

Actually, I could run all of them at somewhere around 12 hours per day and still meet the deadlines for the long work. The G3 can probably do it at 12 per day as well, but it would be really close! However, they all run 24/7, but only because they have a primary job to do which is not crunching. If they didn't, they most likely would not be on at all. ;-)

As you point out, though, an older laptop which still gets used like a laptop is going to have trouble cutting it on MW, even if the deadline goes to 7 days. Even with a couple of extra days added, it still represents a big increase in the deadline tightness factor for MW.

Alinator
Joined: 30 Mar 08 Posts: 50 Credit: 11,593,755 RAC: 0
Meanwhile the rest of us continue to burn electricity and cycles for nothing, and our other project work is shut out.

You think this is bad? Try Orbit@home. I pulled a couple of work units from them that went 137 hours each on a "hot-rodded" A64 dual core. What do you expect to accomplish with a hagged-out lappy running a Pentium III from 1999? The Katmai was the first iteration of the P III and one step up from the P II. I am having some difficulty understanding why you are complaining about swimming with the sharks when all you brought to the party was a rubber ducky.

Voltron
Joined: 19 Apr 08 Posts: 7 Credit: 3,067 RAC: 0
Hi,

You think this is bad? Try Orbit@home.

Orbit doesn't have a deadline of five days ...

I am having some difficulty understanding why you are complaining about swimming with the sharks when all you brought to the party was a rubber ducky.

Complaining about an improper project setup has nothing to do with "complaining about swimming with the sharks".

Regards, Carsten
Joined: 8 Oct 07 Posts: 289 Credit: 3,690,838 RAC: 0
I haven't had any problems on P4 and above... sorry, but the needs of the many outweigh the needs of the few... When the server was freezing up, countless hosts were not able to contact the server, jamming those systems up. Murphy's Law: what can go wrong will go wrong!
Joined: 27 Nov 07 Posts: 5 Credit: 87,063 RAC: 0
The deadlines are really too short for slow computers. The problem is not only the "run high priority" issue. On my Mac G4 1GHz the WUs need more than 48 hours, as they are not optimized for Macintosh PPC and it's an older (slower) computer. So I will have to cancel many WUs which can't meet the deadline. :(

In my opinion it would be better if there were a statement on how to deal with WUs which can't meet the deadline. Furthermore, a little warning on the front page should be posted. Not every user can follow all the threads, especially when taking part in many projects. ;)

I know it's often a hard job to keep things running, and not all projects can communicate as well as, for example, Climate Prediction does, but I would prefer some more information and planning. So when changing things like the length of the WUs, it may be better to do it when the admin has time to manage the (unexpected) effects - only as a hint ;)
Joined: 5 Dec 07 Posts: 6 Credit: 1,687,632 RAC: 0
I haven't had any problems on P4 and above... sorry, but the needs of the many outweigh the needs of the few... When the server was freezing up, countless hosts were not able to contact the server, jamming those systems up. Murphy's Law: what can go wrong will go wrong!

My P4s have been failing just like the rest. "Exceeded CPU time limit" or "Exceeded memory limit", or just not completing before the deadline. And because the deadlines are too short, BOINC apparently prioritizes the work units and locks everything else out. Because the time estimates are too short, it downloads several at once when it won't be able to complete even one within the allotted time frame. About one out of four or five work units has managed to complete in time and without error. The rest fail and get dumped back in the pool so they can be downloaded by some other poor sucker and fail again.
Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0
Thunder wrote: BOINC has a 'debt' system for projects that ensures that projects will get the priority YOU assign in the long term. For example, I had 12 computers that went into 'panic mode' to finish MW units, but out of those 12, only 2 have gone back to processing MW once they emptied the queue. Once the 'long-term debt' that is owed to the projects that couldn't work while MW was forcing the issue is satisfied, MW will start running again. It looks like for most of the machines that will be 4-7 days of running no MW. I didn't do anything... didn't touch any files or edit anything or force any updates or detach or anything. BOINC just did it on its own and everything has worked out.

This is *great* information, especially for a newb like me. Like many others, apparently, I got hammered with long WUs and short deadlines not long ago. Since they finished, I've been wondering -- and fretting, sorry to say -- why I've heard nothing from MW@H in several days. I now regret my bit of manual intervention -- several updates and a reset or two -- to try to coax a few WUs my way.

I've also learned that computations don't always quit after the deadline, and you don't necessarily have to abort WUs for fear of wasting CPU time and not getting credit for what you've done. If you check the WU ID, you can, among other things: (1) see how many results the project requires; (2) see which other host(s), if any, have been given the same WU; and (3) get a pretty good idea of whether or not you can finish before anyone else. If you finish after your deadline, but before other hosts who have been given the same WU, you should still get credit. I'm not sure how all projects work, but my experience with MW@H bears this out.
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0
LOL... Yes, you have discovered a large part of the fun of the game!

Normally, BOINC takes care of the day-to-day routine on its own quite well, and is quite boring from the user's POV. The trick is knowing when and where to intervene when unusual things happen and it gets into trouble for one reason or another.

FWIW, many of the third-party tools available for managing and monitoring BOINC, like BoincView, BoincLogX, etc., can help you get a better feel for what's going on than BOINC Manager can, and are a big help in deciding when to step in. In any event, they sure beat having to root around in the BOINC directories and state files to get the answers to simple questions! ;-)

Alinator
Joined: 27 Aug 07 Posts: 85 Credit: 405,705 RAC: 0
A typical problem is when the DCF (duration correction factor) is too low at work fetch time and a huge number of tasks is downloaded. After the first one completes, the estimated completion times go up and it is now obvious that there is a PROBLEM. Some of the tasks need to be aborted for sanity. Another problem is tasks that run forever. BOINC will quit them eventually (but it may be VERY eventually) because the CPU time limit is exceeded.

BOINC WIKI
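As an illustration of the over-fetch pattern described above, here is a minimal Python sketch. The server estimate, the cache size, the DCF values, and the one-step DCF correction are illustrative assumptions, not the exact BOINC client algorithm.

```python
# Sketch of how a stale, too-low DCF leads to over-fetching, and how the first
# completed task corrects the estimates. Numbers are illustrative assumptions.

def estimated_hours(raw_estimate_hours, dcf):
    """Client-side runtime estimate: server estimate scaled by the host's DCF."""
    return raw_estimate_hours * dcf

def tasks_to_fetch(cache_hours, raw_estimate_hours, dcf):
    """How many tasks the client thinks it needs to fill its cache."""
    return int(cache_hours // estimated_hours(raw_estimate_hours, dcf))

if __name__ == "__main__":
    raw_estimate = 2.0      # hours the server claims a task takes
    dcf = 0.25              # stale DCF left over from the old, much shorter WUs
    cache = 24.0            # a one-day cache setting

    print("Fetched at old DCF:", tasks_to_fetch(cache, raw_estimate, dcf))   # 48 tasks

    # The first task actually takes 22 hours, so the DCF jumps...
    actual = 22.0
    dcf = actual / raw_estimate
    # ...and the remaining tasks are suddenly estimated at 22 hours each,
    # which is when the client realises it is in trouble and aborts may be needed.
    print("New estimate per task:", estimated_hours(raw_estimate, dcf), "hours")
    print("Fetched at new DCF:", tasks_to_fetch(cache, raw_estimate, dcf))   # 1 task
```

Under these assumed numbers the host grabs dozens of tasks it can never finish, then re-estimates them at 22 hours each after the first result, which is exactly the "huge number of tasks downloaded" problem described in the post.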
Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0
LOL... Yes, that is very true. It's also very true that if folks really stop and listen to what you are telling them, they can avoid most of the major pitfalls when boundary conditions begin to apply. I know for a fact that I chose to ignore your advice in a couple of cases, to my regret. That pretty much sums up your 'street cred' in a nutshell. ;-) Of course, giving away the whole show takes away half the fun of the 'hunt'! :-D

Alinator
Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0
Where is Travis? Wasn't he supposed to have been back by now (or last week)? I clearly remember "by Wednesday". How about the other admins?

Doesn't expecting the unexpected make the unexpected the expected? If it makes sense, DON'T do it.
Joined: 7 Sep 07 Posts: 444 Credit: 5,712,523 RAC: 0
Where is Travis? Wasn't he supposed to have been back by now (or last week)? I clearly remember "by Wednesday".

I guess he should have specified which Wednesday... ;)