Welcome to MilkyWay@home

Lack of communication

Altivo

Joined: 5 Dec 07
Posts: 6
Credit: 1,687,632
RAC: 0
Message 4332 - Posted: 22 Jul 2008, 14:51:21 UTC

Of the numerous projects for which I've been volunteering my CPU time, this one seems to have the worst management and absolutely the worst communication.

The cavalier attitude about thousands of work units with improper timing parameters is a good example. Has anyone considered the fact that not only do these units waste our time by running for 22 hours and then crashing on "exceeded CPU time limit", but because of the ridiculously short deadlines set for completion, BOINC gives them top priority, locking out work for other projects? This last issue is really inexcusable. It's one thing to waste volunteer time on bad workunits, but it's quite another to steal time from other projects that manage their work more effectively.

On my machines that run Milkyway, almost everything else has been completely shut out as these badly coded work units hog all the available cycles and then crash. Yet the only response from project management has been "It will work itself out in due time." The truth is, no, it won't. After the wu crashes, it just gets reassigned to someone else and crashes the same way, until all of those thousands of units eventually get assigned to someone who has a hot-rodded quad core machine or something and manages to get through them. Meanwhile the rest of us continue to burn electricity and cycles for nothing, and our other project work is shut out.

No more for me. I'm now blocking MilkyWay units on all my machines.

DoctorNow

Joined: 28 Aug 07
Posts: 146
Credit: 10,703,601
RAC: 0
Message 4337 - Posted: 22 Jul 2008, 15:37:37 UTC - in response to Message 4332.  
Last modified: 22 Jul 2008, 15:40:24 UTC

Has anyone considered the fact that not only do these units waste our time by running for 22 hours and then crashing on "exceeded CPU time limit" but because of the ridiculously short deadlines set for completion...

Sorry to hear that your machine is too slow to crunch the current WUs; 22 hours is indeed too long. :-(
I remember this has already been addressed in one of the other threads, so the admins are probably aware of it.

Besides that, have you ever read any of the other threads?
I wouldn't call this bad communication; Travis and Nathan do a very good job of keeping us users satisfied most of the time.
Go and look in the SHA-1 or vtu forums: you won't find any admin there posting as much as here. That is what I call bad communication.
And remember, this project is hosted at a university; especially in summer there are not many people around. ;-)

For the deadline problem:
As Nathan said, only Travis is able to change it. I guess he's not in town at the moment, so he will have to fix it when he's back...
Member of BOINC@Heidelberg and ATA!

Idefix

Joined: 19 Apr 08
Posts: 7
Credit: 3,067
RAC: 0
Message 4338 - Posted: 22 Jul 2008, 15:51:18 UTC - in response to Message 4337.  

Hi,
I remember this is already addressed elsewhere in one of the other threads, so the admins should be aware of that probably.

Yes, the problem is well known: The deadlines and the maximum allowed CPU time don't reflect the new wu length.
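
As a rough illustration of why tasks end with "exceeded CPU time limit": as far as I understand it, the BOINC client derives that limit from the workunit's rsc_fpops_bound divided by the host's benchmarked speed. The sketch below assumes that relationship. The function names and all numbers are made up for illustration and are not the project's actual settings.

```python
def cpu_time_limit_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """Approximate CPU-time limit for one task on one host (assumed model)."""
    return rsc_fpops_bound / host_flops


def would_be_aborted(needed_fpops: float, rsc_fpops_bound: float) -> bool:
    """True if the task needs more operations than the bound allows."""
    return needed_fpops > rsc_fpops_bound


# Hypothetical values: a bound that was sized for the old, shorter work units.
OLD_BOUND = 7.92e13   # floating-point operations (assumption)
HOST_FLOPS = 1.0e9    # benchmarked speed of a slower host (assumption)

limit_hours = cpu_time_limit_seconds(OLD_BOUND, HOST_FLOPS) / 3600
print(f"CPU time limit on this host: {limit_hours:.0f} h")

# A new, longer work unit that needs more operations than the stale bound.
print(would_be_aborted(needed_fpops=1.0e14, rsc_fpops_bound=OLD_BOUND))
```

If this model is right, a faster host hits the same operation bound sooner in wall-clock terms but also finishes sooner, so only the ratio between the work actually needed and the bound matters.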

We were told that Travis is the only one who knows how to adjust these values, but he's out of town. The best solution at the moment is to go back to the status quo before the change until they have figured out how to set up the values properly.

Regards,
Carsten

Idefix

Joined: 5 Dec 07
Posts: 6
Credit: 1,687,632
RAC: 0
Message 4339 - Posted: 22 Jul 2008, 16:02:27 UTC - in response to Message 4337.  

Sure, I've read the other threads. Mostly they pose questions and no answers are being given.

This is a problem large enough that it ought to be on the "home" page of the project though, warning people of what is going on and giving them some official advice about what to do. Digging through verbose threads on a forum only to find others reporting the same problems isn't really that helpful. All it does is confirm that there really is a problem.

Don't they TEST things when making major changes, rather than just dumping thousands of work units out on all of us? Apparently not.

As far as I can see, I'm the only one who has raised the issue of unfairness to other projects that have their act more together, like WCG or Rosetta. Locking their units out by insisting that MW needs more immediate priority, which is what happens with these malformed units, is really irresponsible on the part of this project's administration. That's my serious complaint.

Jord

Joined: 30 Aug 07
Posts: 125
Credit: 207,206
RAC: 0
Message 4340 - Posted: 22 Jul 2008, 16:14:43 UTC - in response to Message 4339.  
Last modified: 22 Jul 2008, 16:16:47 UTC

Don't they TEST things when making major changes, rather than just dumping thousands of work units out on all of us? Apparently not.

Nathan's post:
Nathan wrote:
In response to the longer WU problem, I give you WUs much longer than what you've been experiencing. I'm not quite sure how much longer, so let me know how they are performing; I don't want you crunching so slow it's disheartening now. :)


Nathan's post:
Nathan wrote:
Thanks for the quick feedback! I'll cut this down, I had a feeling it'd be a little overkill after I had started it.

Travis is at a conference for the next couple of days, so I can't change the deadlines or anything until I talk to him, as I'm not sure how to do this.


Nathan's post:
Nathan wrote:
After seeing some of the times, I'm thinking that the "medium" length ones will probably work out the best for everyone, so figure on timings roughly around what you're getting with the 373... series.

As for the rest, I got my times wrong and Travis is flying back today, so tonight or tomorrow we should have some of the timing issues fixed. I need to talk to him and confirm things, but what I've been thinking is upping the deadline to about a week (this way we get results back on a reasonable schedule) and reducing the number of max work units (so you can actually finish all that you get). If this sounds unreasonable in any way please let us know. Also, if you can suggest a good number of WUs given the new runtimes associated with the 373 series, it would be very helpful.


Now the only person we haven't heard from yet is Travis, who is the lead person for changes like this.
Jord.

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 4342 - Posted: 22 Jul 2008, 16:57:18 UTC

Also, once it was discovered that the BOINC run parameters were incorrect and might cause problems for 'slower' hosts, the steps one could take to work around the issue until Travis gets back and makes the permanent fix were posted. So it's not as if there was nowhere to turn for help over this.

I will grant you that a front page news 'heads up' item is appropriate. Then folks could take a look here and decide if they want to get involved with the workarounds or just suspend MW until the official fixes are in place.

FWIW, my PII (450), K6-2's (500's), and PIII (550) can all make the deadline, even on the long 372x's. For the medium length 373x's they can finish them in well under half the current deadline, so they wouldn't even have to be run flat out or run MW exclusively. In addition, depending on what you have for a cache setting it might seem like MW is hogging the host, but let me assure you that the BOINC CC will shut it down in pretty short order on its own once it figures out what's going on with the new work (or you help it get there quicker manually).

This is not to say that I haven't had to make a few manual interventions to correct problems that the change here caused in the work stream for them (they run other projects as well). However, it is far from being a 'showstopper', or even all that big a deal AFAICT.

Alinator

Idefix

Joined: 19 Apr 08
Posts: 7
Credit: 3,067
RAC: 0
Message 4343 - Posted: 22 Jul 2008, 17:14:25 UTC - in response to Message 4342.  

Hi,
FWIW, my PII (450), K6-2's (500's), and PIII (550) can all make the deadline, even on the long 372x's.

But only if you crunch 24/7. My PIII laptop would need 10 days before it could return a long WU, given its current uptime.
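
To put numbers on the uptime point, here is a small back-of-the-envelope sketch. The CPU hours, daily crunching hours and the deadline are hypothetical values picked for illustration, not measurements from any host in this thread.

```python
def days_to_finish(cpu_hours_needed: float, crunch_hours_per_day: float) -> float:
    """Wall-clock days until a task is done, given daily crunching time."""
    return cpu_hours_needed / crunch_hours_per_day


def meets_deadline(cpu_hours_needed: float,
                   crunch_hours_per_day: float,
                   deadline_days: float) -> bool:
    """True if the task can be returned within the deadline."""
    return days_to_finish(cpu_hours_needed, crunch_hours_per_day) <= deadline_days


# Hypothetical slow host: a long work unit needs 50 CPU hours on it.
CPU_HOURS = 50.0
DEADLINE_DAYS = 5.0  # assumed deadline, for illustration only

for hours_per_day in (5.0, 12.0, 24.0):
    days = days_to_finish(CPU_HOURS, hours_per_day)
    ok = meets_deadline(CPU_HOURS, hours_per_day, DEADLINE_DAYS)
    print(f"{hours_per_day:>4.0f} h/day -> {days:.1f} days, in time: {ok}")
```

The same task that is comfortable on a machine running around the clock becomes impossible on one that is only switched on for a few hours a day.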

Regards,
Carsten

Thunder

Joined: 9 Jul 08
Posts: 85
Credit: 44,842,651
RAC: 0
Message 4346 - Posted: 22 Jul 2008, 17:31:56 UTC - in response to Message 4339.  

As far as I can see, I'm the only one who has raised the issue of unfairness to other projects that have their act more together, like WCG or Rosetta. Locking their units out by insisting that MW needs more immediate priority, which is what happens with these malformed units, is really irresponsible on the part of this project's administration. That's my serious complaint.


I don't want to turn this thread into a disagreement, because I do agree with you that it's been too long with no communication from Travis, the project admin.

However, I think the reason you've not heard a lot of complaints about MW running high-priority and stopping other projects from running is that it's not really a problem. BOINC has a 'debt' system for projects that ensures that projects will get the priority YOU assign in the long term. For example, I had 12 computers that went into 'panic mode' to finish MW units, but out of those 12, only 2 have gone back to processing MW once they emptied the queue. Once the 'long term debt' that is owed to the projects that couldn't work while MW was forcing the issue is satisfied, MW will start running again. Looks like for most of the machines that will be 4-7 days of running no MW. I didn't do anything... didn't touch any files or edit anything or force any updates or detach or anything. BOINC just did it on its own and everything has worked out.
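
For anyone who wants to see the idea in numbers, here is a deliberately simplified toy model of long-term debt. It is not BOINC's actual scheduler code: the update rule, the project names and the hours are all assumptions chosen to show the principle that a project which took more than its share has to "pay back" the others.

```python
def update_debts(debts: dict, shares: dict, cpu_used: dict) -> dict:
    """Toy rule: each project earns its share of the period and pays what it used."""
    total_share = sum(shares.values())
    total_time = sum(cpu_used.values())
    return {
        project: debts[project]
        + total_time * shares[project] / total_share
        - cpu_used[project]
        for project in debts
    }


# Two projects with equal resource share (hypothetical setup).
shares = {"MW": 100, "Other": 100}
debts = {"MW": 0.0, "Other": 0.0}

# 48 hours in which MW ran in high-priority mode and took everything.
debts = update_debts(debts, shares, {"MW": 48.0, "Other": 0.0})
print(debts)  # MW is now in the red, the other project is owed time

# 48 hours in which only the other project runs.
debts = update_debts(debts, shares, {"MW": 0.0, "Other": 48.0})
print(debts)  # both debts are back at zero
```

In this toy setup the books balance after the other project has had the CPU to itself for as long as MW did, which is the "it evens out in the long term" effect described above.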

Secondly, I think it's worth bearing in mind that this project is still in beta stage. I believe it is a bit unfair to attach your computers to a project that you know isn't fully functional and then expect it to run without causing any problems.

Finally, bear in mind that the problem came about because the project had run out of work and many users were complaining because there was no work available. The project scientist that created the new work simply made a mistake. He posted that he was trying something new and that it might cause problems at the time he did it. If you're going to run a beta project, I really think you must accept that you have to follow the message boards and be prepared to adjust to problems and provide feedback to the project staff.

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 4347 - Posted: 22 Jul 2008, 17:40:25 UTC - in response to Message 4343.  
Last modified: 22 Jul 2008, 17:43:17 UTC

Hi,
FWIW, my PII (450), K6-2's (500's), and PIII (550) can all make the deadline, even on the long 372x's.

But only if you crunch 24/7. My PIII laptop would need 10 days before it could return a long WU, given its current uptime.

Regards,
Carsten


Actually, I could run all of them at somewhere around 12 hours per day and meet the deadlines for the long length work. The G3 can probably do it at 12 hours per day as well, but it would be really close!

However they all run 24/7, but only because they have a primary job to do which is not crunching. If they didn't, they most likely would not be on at all. ;-)

As you point out though, an older laptop which still gets used like a laptop is going to have trouble cutting it on MW, even if the deadline goes to 7 days. Even with adding a couple of extra days, it still represents a big increase in the deadline tightness factor for MW.

Alinator

voltron

Joined: 30 Mar 08
Posts: 50
Credit: 11,593,755
RAC: 0
Message 4382 - Posted: 22 Jul 2008, 22:21:58 UTC - in response to Message 4332.  

Meanwhile the rest of us continue to burn electricity and cycles for nothing, and our other project work is shut out.

No more for me. I'm now blocking MilkyWay units on all my machines.


You think this is bad, try Orbit@home. I pulled a couple of work units from them that went 137 hours each on a "hotrodded" A64 dual core. What do you expect to accomplish with a hagged out lappy running a Pentium III from 1999?

The Katmai was the first iteration of the P III and one step up from the P II.

I am having some difficulty understanding why you are complaining about swimming with the sharks when all you brought to the party was a rubber ducky.

Voltron

Idefix

Joined: 19 Apr 08
Posts: 7
Credit: 3,067
RAC: 0
Message 4395 - Posted: 22 Jul 2008, 23:55:24 UTC - in response to Message 4382.  

Hi,
You think this is bad, try Orbit@home.

Orbit doesn't have a deadline of five days ...

I am having some difficulty understanding why you are complaining about swimming with the sharks when all you brought to the party was a rubber ducky.

Complaining about an improper project setup has nothing to do with "complaining about swimming with the sharks".

Regards,
Carsten

Jayargh

Joined: 8 Oct 07
Posts: 289
Credit: 3,690,838
RAC: 0
Message 4406 - Posted: 23 Jul 2008, 3:12:35 UTC
Last modified: 23 Jul 2008, 3:55:14 UTC

I haven't had any problems on P4 and above... sorry, but the needs of the many outweigh the needs of the few... when the server was freezing up, countless hosts were not able to contact the server, jamming those systems up... Murphy's Law: what can go wrong will go wrong!

[b@h] tomcat

Joined: 27 Nov 07
Posts: 5
Credit: 87,063
RAC: 0
Message 4411 - Posted: 23 Jul 2008, 9:04:31 UTC

The deadlines are really too short for slow computers.
The problem is not only the "run high priority" behavior.

On my Mac G4 1GHz the WUs need more than 48 hours, as they are not optimized for Macintosh PPC and it's an older (slower) computer.

So I will have to cancel many wu's which can't meet the deadline. :(

In my opinion it would be better if there was a statement on how to deal with WUs which can't meet the deadline.
Furthermore, a little warning should be put on the front page.

Not every user can follow all threads, especially if taking part in many projects ;)

I know it's often a hard job to keep things running, and not all projects can communicate as well as, for example, Climate Prediction does, but I would prefer some more information and planning.
So when changing things like the length of the WUs, it may be better to do so when the admin has time to manage the (unexpected) effects. Only as a hint ;)

Altivo

Joined: 5 Dec 07
Posts: 6
Credit: 1,687,632
RAC: 0
Message 4414 - Posted: 23 Jul 2008, 11:07:47 UTC - in response to Message 4406.  

I haven't had any problems on P4 and above... sorry, but the needs of the many outweigh the needs of the few... when the server was freezing up, countless hosts were not able to contact the server, jamming those systems up... Murphy's Law: what can go wrong will go wrong!


My P4s have been failing just like the rest. "Exceeded CPU time limit" or "Exceeded memory limit" or just not completing before the deadline. And because the deadlines are too short, BOINC apparently prioritizes the workunits and locks everything else out. Because the time estimates are too short, it downloads several at once when it won't be able to complete even one within the allotted time frame.

About one out of four or five work units has managed to complete in time and without error. The rest fail and get dumped back in the pool so they can be downloaded by some other poor sucker and fail again.

ritterm

Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 4452 - Posted: 24 Jul 2008, 16:50:11 UTC - in response to Message 4346.  
Last modified: 24 Jul 2008, 17:00:39 UTC

Thunder wrote:

BOINC has a 'debt' system for projects that ensures that projects will get the priority YOU assign in the long term. For example, I had 12 computers that went into 'panic mode' to finish MW units, but out of those 12, only 2 have gone back to processing MW once they emptied the queue. Once the 'long term debt' that is owed to the projects that couldn't work while MW was forcing the issue is satisfied, MW will start running again. Looks like for most of the machines that will be 4-7 days of running no MW. I didn't do anything... didn't touch any files or edit anything or force any updates or detach or anything. BOINC just did it on its own and everything has worked out.


This is *great* information, especially for a newb like me. Like many others, apparently, I got hammered with long WUs and short deadlines not long ago. Since they finished, I've been wondering -- and fretting, sorry to say -- why I've heard nothing from MW@H in several days. I now regret my bit of manual intervention -- several updates and a reset or two -- to try and coax a few WUs my way.

I've also learned that computations don't always quit after the deadline and you don't necessarily have to abort WUs in fear of wasting CPU time and not getting credit for what you've done. If you check the WU ID, you can, among other things:

(1) See how many results a project requires;
(2) See which other host(s), if any, have been given the same WU; and
(3) Get a pretty good idea whether or not you can finish before anyone else.

If you finish after your deadline, but before other hosts who have been given the same WU, you should still get credit. I'm not sure how all projects work, but my experience with MW@H bears this out.

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 4454 - Posted: 24 Jul 2008, 19:04:15 UTC - in response to Message 4452.  



This is *great* information, especially for a newb like me. Like many others, apparently, I got hammered with long WUs and short deadlines not long ago. Since they finished, I've been wondering -- and fretting, sorry to say -- why I've heard nothing from MW@H in several days. I now regret my bit of manual intervention -- several updates and a reset or two -- to try and coax a few WUs my way.

I've also learned that computations don't always quit after the deadline and you don't necessarily have to abort WUs in fear of wasting CPU time and not getting credit for what you've done. If you check the WU ID, you can, among other things:

(1) See how many results a project requires;
(2) See which other host(s), if any, have been given the same WU; and
(3) Get a pretty good idea whether or not you can finish before anyone else.

If you finish after your deadline, but before other hosts who have been given the same WU, you should still get credit. I'm not sure how all projects work, but my experience with MW@H bears this out.


LOL...

Yes, you have discovered a large part of the fun of the game! Normally, BOINC takes care of the day-to-day routine on its own quite well, and is quite boring from the user's POV.

The trick is knowing when and where to intervene when unusual things happen and it gets into trouble for one reason or another.

FWIW, many of the third-party tools available for managing and monitoring BOINC, like BoincView, BoincLogX, etc., can help you get a better feel for what's going on than BOINC Manager can, and are a big help in deciding when to step in.

In any event, they sure beat having to root around in the BOINC directories and state files to get the answers to simple questions! ;-)

Alinator

John McLeod VII

Joined: 27 Aug 07
Posts: 85
Credit: 405,705
RAC: 0
Message 4460 - Posted: 25 Jul 2008, 1:15:50 UTC - in response to Message 4454.  



This is *great* information, especially for a newb like me. Like many others, apparently, I got hammered with long WUs and short deadlines not long ago. Since they finished, I've been wondering -- and fretting, sorry to say -- why I've heard nothing from MW@H in several days. I now regret my bit of manual intervention -- several updates and a reset or two -- to try and coax a few WUs my way.

I've also learned that computations don't always quit after the deadline and you don't necessarily have to abort WUs in fear of wasting CPU time and not getting credit for what you've done. If you check the WU ID, you can, among other things:

(1) See how many results a project requires;
(2) See which other host(s), if any, have been given the same WU; and
(3) Get a pretty good idea whether or not you can finish before anyone else.

If you finish after your deadline, but before other hosts who have been given the same WU, you should still get credit. I'm not sure how all projects work, but my experience with MW@H bears this out.


LOL...

Yes, you have discovered a large part of the fun of the game! Normally, BOINC takes care of the day-to-day routine on its own quite well, and is quite boring from the user's POV.

The trick is knowing when and where to intervene when unusual things happen and it gets into trouble for one reason or another.

FWIW, many of the third-party tools available for managing and monitoring BOINC, like BoincView, BoincLogX, etc., can help you get a better feel for what's going on than BOINC Manager can, and are a big help in deciding when to step in.

In any event, they sure beat having to root around in the BOINC directories and state files to get the answers to simple questions! ;-)

Alinator

A typical problem is when the DCF (duration correction factor) is too low at work fetch time and a huge number of tasks are downloaded. After the first one finishes, the estimated completion times go up and it becomes obvious that there is a PROBLEM. Some of the tasks then need to be aborted for sanity.
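
A minimal sketch of that over-fetch effect, assuming the client estimates a task's runtime as rsc_fpops_est divided by the host speed, scaled by the DCF, and then asks for enough tasks to fill its cache. The function names and every number here are hypothetical, and the real work-fetch logic is more involved than this.

```python
def estimated_seconds(rsc_fpops_est: float, host_flops: float, dcf: float) -> float:
    """Estimated runtime of one task (assumed model: est / speed * DCF)."""
    return rsc_fpops_est / host_flops * dcf


def tasks_to_fill_cache(cache_seconds: float, estimate_seconds: float) -> int:
    """How many whole tasks fit into the work cache at the current estimate."""
    return int(cache_seconds // estimate_seconds)


FPOPS_EST = 3.6e12        # stale estimate from the old short work units (assumption)
HOST_FLOPS = 1.0e9        # benchmarked host speed (assumption)
CACHE_SECONDS = 86400.0   # one-day work cache (assumption)

# Before any long task has finished, the DCF is still low.
before = estimated_seconds(FPOPS_EST, HOST_FLOPS, dcf=1.0)
print(tasks_to_fill_cache(CACHE_SECONDS, before), "tasks fetched")

# The first task really takes 22 hours, so the DCF jumps accordingly.
after = estimated_seconds(FPOPS_EST, HOST_FLOPS, dcf=22.0)
print(tasks_to_fill_cache(CACHE_SECONDS, after), "task would be fetched now")
```

With these made-up figures the host already holds weeks of work by the time the corrected estimate shows that only one task fits in the cache, which is why aborting some of them can be the sane choice.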

Another problem is tasks that run forever. BOINC will quit eventually (but it may be VERY eventually) because of time exceeded.


Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 4467 - Posted: 25 Jul 2008, 6:09:19 UTC - in response to Message 4460.  
Last modified: 25 Jul 2008, 6:13:55 UTC


A typical problem is when the DCF is too low at work fetch time and a huge number of tasks is downloaded. After the first one, the estimated completion times go up and it is now obvious that there is a PROBLEM. Some of the tasks need to be aborted for sanity.

Another problem is tasks that run forever. BOINC will quit eventually (but it may be VERY eventually) because of time exceeded.


LOL...

Yes, that is very true. It's also very true that if folks really stop and listen to what you are telling them, they can avoid most of the major pitfalls when boundary conditions begin to apply.

I know for a fact I chose to ignore your advice in a couple of cases, to my regret. That pretty much sums up your 'street cred' in a nutshell. ;-)

Of course, giving away the whole show takes away half the fun of the 'hunt'! :-D

Alinator

banditwolf

Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 4484 - Posted: 25 Jul 2008, 23:01:34 UTC

Where is Travis? Wasn't he supposed to have been back by now (or last week)? I clearly remember "by Wednesday".

How about the other admins?
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Odd-Rod

Joined: 7 Sep 07
Posts: 444
Credit: 5,712,523
RAC: 0
Message 4485 - Posted: 26 Jul 2008, 7:40:21 UTC - in response to Message 4484.  

Where is Travis? Wasn't he supposed to have been back by now (or last week)? I clearly remember "by Wednesday".


I guess he should have specified which Wednesday... ;)


©2024 Astroinformatics Group