Welcome to MilkyWay@home

Can't Write State File - Take 2


The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 27140 - Posted: 6 Jul 2009, 0:01:23 UTC

#903: BOINC Manager Freeze
----------------------------+-----------------------------------------------
Reporter: The Gas Giant | Owner: romw
Type: Defect | Status: new
Priority: Minor | Milestone: Undetermined
Component: Manager | Version: 6.6.28
Resolution: | Keywords: Freeze
----------------------------+-----------------------------------------------
Changes (by charlief):

* owner: charlief => romw

Comment:

I asked Nicolas for more info, and he wrote:

The client may hang for many reasons, sometimes briefly, sometimes for as long as a minute. Examples include doing a DNS request (if libcurl isn't
compiled with async name lookups and the DNS server is down), cleaning a slot with lots of files when a workunit finishes, calculating disk usage when
there are lots of files in project or slot directories, checking the MD5 of a giant file, and copying a giant file from project to slot or
vice versa (if there is <copy_file/>).

Those are possible causes for client hangs. Now for the consequences:

The manager used to hang immediately after the client hung, because it sent GUI RPC requests to the client and then *blocked* waiting for a reply.
Async GUI RPCs may have fixed this particular problem.
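
To illustrate the difference between a blocking and an asynchronous request, here is a minimal Python sketch. It is not BOINC Manager code: `slow_client_reply`, `poll_reply` and `run_ui_loop` are hypothetical names, and the "client" is just a function that sleeps. The point is only that the caller waits in short slices and keeps control between them, instead of blocking until the reply arrives.

```python
import concurrent.futures
import time


def slow_client_reply(delay_s):
    """Hypothetical stand-in for a busy client that answers late."""
    time.sleep(delay_s)
    return "reply"


def poll_reply(future, wait_s):
    """Wait at most wait_s for the reply; return None if it is not ready yet."""
    try:
        return future.result(timeout=wait_s)
    except concurrent.futures.TimeoutError:
        return None


def run_ui_loop(client_delay_s, tick_s=0.05, max_ticks=100):
    """Send one request on a worker thread and keep ticking until it completes."""
    ticks = 0
    reply = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_client_reply, client_delay_s)
        while reply is None and ticks < max_ticks:
            reply = poll_reply(future, tick_s)
            ticks += 1  # a real UI could redraw or handle input here
    return reply, ticks


if __name__ == "__main__":
    reply, ticks = run_ui_loop(client_delay_s=0.2)
    print(reply, "after", ticks, "UI ticks")
```

A blocking design would call `slow_client_reply` directly and do nothing else until it returned, which is the behavior described above.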

But there is another consequence that remains: if the client hangs for more than 30 seconds, science apps will think the client quit (because it's
not sending "heartbeat" messages), and will quit too. When the client recovers from the hang, it will notice the apps disappeared, and restart them.
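
The heartbeat rule can be sketched as a small watchdog. This is an illustration, not the BOINC API: the class and function names are hypothetical, the 30 second limit comes from the description above, and the one-heartbeat-per-second rate is an assumption made for the simulation.

```python
HEARTBEAT_TIMEOUT_S = 30.0  # from the description: apps give up after 30 s of silence


class HeartbeatWatchdog:
    """Tracks when the last heartbeat arrived (hypothetical helper)."""

    def __init__(self, now_s, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat_s = now_s

    def on_heartbeat(self, now_s):
        self.last_heartbeat_s = now_s

    def client_presumed_dead(self, now_s):
        return now_s - self.last_heartbeat_s > self.timeout_s


def app_survives(hang_start_s, hang_length_s, run_length_s=120):
    """Simulate one-second ticks; the client sends a heartbeat each tick
    (assumed rate) except while it is hung."""
    watchdog = HeartbeatWatchdog(now_s=0)
    for t in range(1, run_length_s + 1):
        hung = hang_start_s <= t < hang_start_s + hang_length_s
        if not hung:
            watchdog.on_heartbeat(t)
        if watchdog.client_presumed_dead(t):
            return False  # the app concludes the client quit, and exits
    return True


if __name__ == "__main__":
    print(app_survives(hang_start_s=10, hang_length_s=20))  # short hang
    print(app_survives(hang_start_s=10, hang_length_s=60))  # minute-long hang
```

In this model a 20 second hang goes unnoticed by the app, while a 60 second hang (such as a DNS timeout) makes it exit even though the client is still running.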

The user-visible behavior of this set of problems is strange: the user's Internet connection goes down and BOINC Manager hangs, sometimes coming back to
show the bad news that WUs are giving errors: "app quit with zero status but no finished file, if this happens repeatedly you may want to reset". The
symptoms seem completely unrelated.

The full problem chain for that situation is: Internet is down, client tries to contact project, gets blocked for a whole minute before timing out on
the DNS request; meanwhile, the manager is blocked waiting for a GUI RPC reply from the client, and science apps kill themselves thinking the core client
quit because it's not sending heartbeats. When the client finally times out the DNS request, it notices science apps are gone, restarts them, and starts
answering GUI RPCs again. But since a whole minute passed, it's possible the backoff for another project reached zero by now, so the process repeats
until the Internet connection is working again, or all projects get large-ish backoffs, or the user quickly disables network activity before the client hangs again.
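
The repeating cycle can be modeled with a toy simulation. Everything here is a simplifying assumption rather than BOINC's real scheduling logic: `count_hangs` is a hypothetical function, each failed contact blocks the client for a fixed 60 seconds (the "whole minute" above), and every failed project gets the same fixed retry backoff.

```python
DNS_TIMEOUT_S = 60  # "a whole minute", per the description above


def count_hangs(backoffs_s, retry_backoff_s, outage_s):
    """Return the start times (in seconds) of client hangs during an outage.

    backoffs_s: seconds until each project next wants to contact its server.
    retry_backoff_s: backoff assigned after a failed attempt (assumed fixed).
    outage_s: how long the Internet connection stays down.
    """
    next_contact = list(backoffs_s)
    if not next_contact:
        return []
    now = 0
    hang_starts = []
    while True:
        due = min(next_contact)
        start = max(now, due)  # the client cannot start while still blocked
        if start >= outage_s:
            break
        hang_starts.append(start)
        project = next_contact.index(due)
        now = start + DNS_TIMEOUT_S  # client is blocked for the whole timeout
        next_contact[project] = now + retry_backoff_s
    return hang_starts


if __name__ == "__main__":
    # Three projects with short backoffs: another one is due after every hang.
    print(count_hangs([0, 30, 45], retry_backoff_s=60, outage_s=600))
    # Same projects, but failures lead to hour-long backoffs.
    print(count_hangs([0, 30, 45], retry_backoff_s=3600, outage_s=600))
```

With short retry backoffs the model hangs back to back for the entire ten-minute outage. With hour-long backoffs it hangs once per project and then stays quiet, which matches the "large-ish backoffs" exit condition described above.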

A lot of this description may be outdated by now. I'm describing what used to happen back when I saw this problem on my own machines. The situation has
surely changed since then. For example, BOINC Manager has async RPCs, and I think the current Windows binaries have async DNS enabled (they only lacked it
for a short period of time). But since I don't use Windows on this computer anymore (Ubuntu's libcurl definitely has async DNS and I use self-compiled
BOINC), and I have never used 6.4/6.6, I don't know what the current situation really is on the most common platform.


©2024 Astroinformatics Group