Welcome to MilkyWay@home

Posts by Altivo

1) Message boards : Number crunching : Compute errors (Message 4504)
Posted 28 Jul 2008 by Altivo
Post:
I've seen this before, and not just on this project. It happens to me on machines that have slow or dial-up network access. Apparently the timeouts are set much too tightly for that condition, so that instead of waiting patiently, things freak out and abort. It's irritating, to say the least, especially with WUs that were running successfully.

Just noticed I got a compute error on a 3 hour WU...

"<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
too many normally harmless exit(s)
</message>
<stderr_txt>
No heartbeat from core client for 31 sec - exiting

</stderr_txt>
]]>
"
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=45159505

2) Message boards : Number crunching : Lack of communication (Message 4414)
Posted 23 Jul 2008 by Altivo
Post:
I haven't had any problems on P4 and above....sorry but the needs of the many outweigh the needs of the few.... when the server was freezing up.... countless hosts were not able to contact the server.....jamming those systems up.....Murphys Law...What can go wrong will go wrong!


My P4s have been failing just like the rest. "Exceeded CPU time limit" or "Exceeded memory limit" or just not completing before the deadline. And because the deadlines are too short, BOINC apparently prioritizes the workunits and locks everything else out. Because the time estimates are too short, it downloads several at once when it won't be able to complete even one within the allotted time frame.
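To put rough numbers on the effect described above, here is a minimal sketch. It is not BOINC's actual work-fetch or scheduling code; the function names, the 12-hour work buffer, the 3-hour estimate, the 22-hour real runtime, and the 24-hour deadline are all made-up assumptions for illustration, and it assumes every task shares the same deadline and runs one at a time on a single CPU.

```python
# Illustrative sketch only; NOT BOINC's actual work-fetch or scheduling code.
# All function names and numbers here are hypothetical assumptions.

def tasks_fetched(buffer_hours: int, estimated_hours: int) -> int:
    """Tasks a client would download to fill its work buffer,
    trusting the project's (possibly too short) runtime estimate."""
    return buffer_hours // estimated_hours


def tasks_finished_by_deadline(deadline_hours: int, actual_hours: int, fetched: int) -> int:
    """Tasks that really finish before the deadline when run
    one after another on a single CPU."""
    return min(fetched, deadline_hours // actual_hours)


if __name__ == "__main__":
    fetched = tasks_fetched(buffer_hours=12, estimated_hours=3)
    finished = tasks_finished_by_deadline(
        deadline_hours=24, actual_hours=22, fetched=fetched
    )
    print(f"fetched={fetched}, finished in time={finished}, missed={fetched - finished}")
```

Under these assumed numbers the client downloads four tasks but can finish only one of them before the deadline, which is the pattern of several downloads and mostly failures reported here.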

About one out of four or five work units has managed to complete in time and without error. The rest fail and get dumped back in the pool so they can be downloaded by some other poor sucker and fail again.

3) Message boards : Number crunching : Lack of communication (Message 4339)
Posted 22 Jul 2008 by Altivo
Post:
Sure, I've read the other threads. Mostly they pose questions and no answers are being given.

This is a problem large enough that it ought to be on the "home" page of the project though, warning people of what is going on and giving them some official advice about what to do. Digging through verbose threads on a forum only to find others reporting the same problems isn't really that helpful. All it does is confirm that there really is a problem.

Don't they TEST things when making major changes, rather than just dumping thousands of work units out on all of us? Apparently not.

As far as I can see, I'm the only one who has raised the issue of unfairness to other projects that have their act more together, like WCG or Rosetta. Locking their units out by insisting that MW needs more immediate priority, which is what happens with these malformed units, is really irresponsible on the part of this project's administration. That's my serious complaint.

4) Message boards : Number crunching : Lack of communication (Message 4332)
Posted 22 Jul 2008 by Altivo
Post:
Of the numerous projects for which I've been volunteering my CPU time, this one seems to have the worst management and absolutely the worst communication.

The cavalier attitude about thousands of work units with improper timing parameters is a good example. Has anyone considered that these units not only waste our time by running for 22 hours and then crashing on "exceeded CPU time limit" but also, because of the ridiculously short deadlines set for completion, get top priority from BOINC, locking out work for other projects? This last issue is really inexcusable. It's one thing to waste volunteer time on bad workunits, but it's quite another to steal time from other projects that manage their work more effectively.

On my machines that run Milkyway, almost everything else has been completely shut out as these badly coded work units hog all the available cycles and then crash. Yet the only response from project management has been "It will work itself out in due time." The truth is, no, it won't. After the WU crashes, it just gets reassigned to someone else and crashes the same way, until all of those thousands of units eventually get assigned to someone who has a hot-rodded quad core machine or something and manages to get through them. Meanwhile the rest of us continue to burn electricity and cycles for nothing, and our other project work is shut out.

No more for me. I'm now blocking MilkyWay units on all my machines.

5) Message boards : Number crunching : Problem with new W/Us (Message 3599)
Posted 30 May 2008 by Altivo
Post:
I ended up aborting a whole stack of gs_3737... units because they just looped and ran forever. Now I have had to abort gs_591_1212007885_142924_0 for the same reason. It was running for 11+ hours. This is on a Linux workstation (Slackware) with BOINC 5.8.16.

It is particularly troubling that these tasks do not seem to surrender control of the CPU when their hour time slot is up. Looping or not, this is hostile behavior that keeps other projects from getting their share of the available services. I've noticed Milkyway tasks doing this before, and I don't particularly like it.

This has been a severe enough problem that I frankly think it should have been posted to the project front page sooner than it was. I lost hours of processing time on multiple machines because they were busily looping away and locking out other legitimate projects.

6) Questions and Answers : Web site : Unable to attach new machine to existing account (Message 2482)
Posted 21 Mar 2008 by Altivo
Post:
The messages that appear in boincmgr are:

Fri 21 Mar 2008 4:07:59 PM The web page at http://milkyway.cs.rpi.edu/milkyway/ contains no BOINC information.
Fri 21 Mar 2008 4:08:01 PM Resetting project.
Fri 21 Mar 2008 4:08:01 PM Detaching from project.

This is a Linux environment, P3 processor, BOINC version 5.8.16. Times in the messages above are US Central Daylight Time.



