Can't Write State File - Take 2
Author | Message |
---|---|
Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
#903: BOINC Manager Freeze
----------------------------+-----------------------------------------------
Reporter: The Gas Giant | Owner: romw
Type: Defect | Status: new
Priority: Minor | Milestone: Undetermined
Component: Manager | Version: 6.6.28
Resolution: | Keywords: Freeze
----------------------------+-----------------------------------------------
Changes (by charlief):
* owner: charlief => romw

Comment:

I asked Nicolas for more info, and he wrote:

The client may hang for many reasons, sometimes briefly, sometimes for as long as a minute. Possible causes include doing a DNS request (if libcurl isn't compiled with async name lookups and the DNS server is down), cleaning a slot with lots of files when a workunit finishes, calculating disk usage when there are lots of files in project or slot directories, checking the MD5 of a giant file, copying a giant file from project to slot or vice versa (if there is <copy_file/>), etc.

Now for the consequences: The manager used to hang immediately after the client hung, because it sent GUI RPC requests to the client and then *blocked* waiting for a reply. Async GUI RPCs may have fixed this particular problem. But there is another consequence that remains: if the client hangs for more than 30 seconds, science apps will think the client quit (because it's not sending "heartbeat" messages), and will quit too. When the client comes out of its blockage, it will notice the apps disappeared, and restart them.

The user-visible behavior of this set of problems is strange: the user's Internet connection goes down and BOINC Manager hangs, sometimes coming back to show the bad news that WUs are giving errors: "app quit with zero status but no finished file, if this happens repeatedly you may want to reset". The symptoms seem completely unrelated.

The full problem chain for that situation is: the Internet is down, the client tries to contact a project and gets blocked for a whole minute before timing out on the DNS request; meanwhile, the manager is blocked waiting for a GUI RPC reply from the client, and science apps kill themselves thinking the core client quit because it's not sending heartbeats. When the DNS request finally times out, the client notices the science apps are gone, restarts them, and starts answering GUI RPCs again. But since a whole minute has passed, it's possible the backoff for another project has reached zero by now, so the process repeats! This continues until the Internet connection is working again, or all projects get large-ish backoffs, or the user quickly disables network activity before the client hangs again.

A lot of this description may be outdated by now. I'm describing what used to happen back when I saw this problem on my own machines. Now the situation has surely changed. For example, BOINC Manager has async RPCs, and I think the current Windows binaries have async DNS enabled (they only lacked it for a short period of time). But since I don't use Windows on this computer anymore (Ubuntu's libcurl definitely has async DNS and I use self-compiled BOINC), and I have never used 6.4/6.6, I don't know what the current situation really is on the most common platform. |
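The heartbeat part of the chain above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not BOINC's actual code: the class and function names are made up, the 1-second heartbeat interval and the stall starting at t=10 are arbitrary choices, and only the 30-second timeout comes from the post.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; the figure quoted in the post


class HeartbeatWatchdog:
    """Hypothetical app-side check: treat the client as gone after a silent period."""

    def __init__(self, now, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_heartbeat = now

    def on_heartbeat(self, now):
        # Record the time of the most recent heartbeat from the client.
        self.last_heartbeat = now

    def client_seems_dead(self, now):
        # True once the client has been silent for longer than the timeout.
        return now - self.last_heartbeat > self.timeout


def simulate(stall_seconds, heartbeat_interval=1.0):
    """Return the simulated time at which the app gives up, or None if it survives.

    Assumed scenario: the client sends a heartbeat every heartbeat_interval
    seconds, then blocks (for example on a DNS lookup) from t=10 for
    stall_seconds, sending nothing while blocked.
    """
    stall_start = 10.0
    stall_end = stall_start + stall_seconds
    watchdog = HeartbeatWatchdog(now=0.0)
    t = 0.0
    while t <= stall_end + 5.0:
        client_blocked = stall_start < t < stall_end
        if not client_blocked:
            watchdog.on_heartbeat(t)
        elif watchdog.client_seems_dead(t):
            return t  # the app concludes the client quit and exits
        t += heartbeat_interval
    return None


if __name__ == "__main__":
    print(simulate(60))  # one-minute stall: the app gives up partway through
    print(simulate(20))  # short stall: the app keeps running
```

With these assumptions, a one-minute stall makes the app exit well before the client recovers, while a 20-second stall goes unnoticed. That matches the post's point that only hangs longer than the heartbeat timeout produce the misleading "app quit" errors.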
©2024 Astroinformatics Group