Welcome to MilkyWay@home

WU Freezes BOINC Manager...

Message boards : Number crunching : WU Freezes BOINC Manager...

Jimmy Gondek

Joined: 28 Sep 11
Posts: 60
Credit: 22,764,173
RAC: 0
Message 51505 - Posted: 25 Oct 2011, 16:59:51 UTC
Last modified: 25 Oct 2011, 17:01:46 UTC

Hi Folks,

I thought I'd post this problem here first instead of on the BOINC board, since you've got the luxury of being able to look at the WU in question. If the problem turns out not to be on the MWAH end of things, I'll be sure to post about it over at BOINC, too.

I've been running MWAH (on an iMac 12,2 @ 3.4GHz 16GB, OSX 10.6.8, BOINC Manager 6.12.35, MWAH 0.82 & N 0.60 mt) constantly for the past few weeks and have been encountering an occasional freeze on particular WUs. At first I attributed it to an inexplicable hiccup, but then it occurred a second time, and now it is occurring every few days.

What happens is that the MWAH WU stops processing midstream with no error logged, and this then freezes BOINC Manager. Any attempt to access BOINC Manager menus or interface controls brings up a BOINC Manager communication popup with the message, "Communicating with BOINC Client. Please wait ..." and two buttons to choose from, "Quit BOINC manager" or "Cancel". This has occurred both while the computer is in active use (web browsing, editing, etc.) and while it is idle (overnight, only BOINC running).
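
For anyone trying to narrow this down: the Manager talks to the BOINC client over a local TCP connection (the GUI RPC port, 31416 by default), so a quick first check is whether anything is still accepting connections there. The following is a minimal sketch, assuming the default port has not been changed; the function name is made up for illustration. A successful connect only shows that something is listening, not that the client is actually answering requests.

```python
import socket

# Default BOINC GUI RPC port. Assumption: the local install has not changed it.
BOINC_GUI_RPC_PORT = 31416


def client_port_reachable(host="127.0.0.1", port=BOINC_GUI_RPC_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `client_port_reachable()` while the "Communicating with BOINC Client" popup is showing would tell you whether the client process is still listening at all, or whether it has gone away entirely.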

So I decided to keep an eye on things the past few occurrences and noticed a few things...

Using the OSX utility Activity Monitor I discovered that once this freeze occurred only 4 of the 8 CPUs were churning, and then, only at about 50% usage. Normally all 8 CPUs are performing at nearly 100% while working on either 8 simultaneous 0.82's or a single 0.60 mt WU.

Also, of the 2 PIDs that BOINC maintains, after the freeze the one assigned to user boinc_master was using ~86% CPU and the one assigned to user me was using 0.1%.

Also, all idle tasks (WUs to BOINC, processes to Activity Monitor) that are normally kept in virtual memory (both suspended CPDN tasks and all currently active MWAH tasks) had been dumped from memory, including the frozen WU that BOINC Manager stated was running.
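
The same per-process numbers can be captured as text instead of screenshots, which makes them easier to paste into a post. Here is a rough sketch that parses `ps`-style output with four columns (PID, user, %CPU, command) and keeps the BOINC-related rows. The column layout is an assumption (something like `ps -A -o pid=,user=,%cpu=,comm=`), and the function names and the sample process names in the test are hypothetical.

```python
def parse_ps(ps_text):
    """Parse lines of 'PID USER %CPU COMMAND' into (pid, user, cpu, command) tuples."""
    rows = []
    for line in ps_text.splitlines():
        # Split into at most 4 fields so a command containing spaces stays intact.
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, user, cpu, command = parts
        try:
            rows.append((int(pid), user, float(cpu), command.strip()))
        except ValueError:
            # Header line or a malformed row: skip it.
            continue
    return rows


def boinc_rows(rows, needle="boinc"):
    """Keep rows whose user or command mentions `needle` (case-insensitive)."""
    return [r for r in rows if needle in r[1].lower() or needle in r[3].lower()]
```

Feeding it the output captured during a freeze and again during normal crunching gives two small lists that can be compared side by side.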

I made screen captures of these various phenomena and retained the stdoutdae.txt file for the most recent event...I can send all of this along via email should anyone want to peruse it. (Or if someone can walk me through how to post images off my HD into these posts I can add those here! Doh!)

FWIW, I did a search for this error and came up with the BOINC FAQ for this error...

BOINC FAQ Service:
http://boincfaq.mundayweb.com/index.php?language=1&view=516

...however, their account has to do with firewalls or having too many tasks in BOINC Manager and neither of these issues applies to me here...I had ~25 MWAH WU's in the queue and 8 CPDN.

If you'd like to rerun the MWAH WU ps_nbody_test3_5442548_1 that caused the hang on your end, that can be found here...

Task 11180604:
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=11180604

...please note that BOINC Manager showed this WU hung at 40.182% and your results page shows the WU 100% Completed and Validated! Huh?

If there are any further details I can provide, let me know!

:)
Jimmy G

Jimmy Gondek
Message 52183 - Posted: 2 Jan 2012, 19:56:49 UTC

No love for this problem?

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,939,976
RAC: 22,667
Message 52186 - Posted: 2 Jan 2012, 21:48:04 UTC - in response to Message 52183.  


It happens on most projects and NO ONE has an answer, except that when you see it, suspend BOINC and then restart it; the unit should go back to wherever it was when things went bad and then finish normally. It is just time lost, and it is so random it hasn't been tracked down yet. If you can make it happen at will, contact Dr. David Anderson at SETI; he would be VERY interested!!!

Jimmy Gondek
Message 52193 - Posted: 3 Jan 2012, 1:49:06 UTC - in response to Message 52186.  

Thanks for the info, Mikey, I'll keep it aside as an unsolved mystery! :)

mikey
Message 52198 - Posted: 3 Jan 2012, 12:55:24 UTC - in response to Message 52193.  

No problem, it is usually one of those GDMF moments that we all see, though rarely, and it means we lost x number of hours of crunching time. I usually exit BOINC and on the restart it picks right back up again, but you can also just suspend the unit and resume it, and it will pick back up that way too. That is why I now check my PCs on a daily basis to make sure they are up and running okay, and also why there is 3rd-party software to keep track of all of your BOINC machines. I have 15 PCs running here at home and I check them all every day, remotely of course.
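
The daily check can be partly automated by comparing two snapshots of task progress taken some minutes apart and flagging anything that has not moved. Below is a minimal sketch that works on task listings in the style printed by `boinccmd --get_tasks`; the exact line format (`name:` and `fraction done:` fields) is an assumption that may differ between client versions, and both function names are made up for illustration. It cannot tell a hung task from one that is merely suspended or waiting to run, so treat its output as candidates to look at, not as proof of a freeze.

```python
def parse_fraction_done(tasks_text):
    """Map task name -> fraction done from a task listing.

    Assumes each task block has a 'name: ...' line followed later by a
    'fraction done: ...' line (an assumption about the listing format).
    """
    fractions = {}
    current = None
    for raw in tasks_text.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            # 'WU name:' lines do not match because they start with 'WU'.
            current = line.split(":", 1)[1].strip()
        elif line.startswith("fraction done:") and current is not None:
            fractions[current] = float(line.split(":", 1)[1])
    return fractions


def stalled_tasks(earlier, later, min_progress=1e-6):
    """Return sorted names of unfinished tasks whose fraction done did not advance."""
    return sorted(
        name
        for name, frac in later.items()
        if name in earlier and frac < 1.0 and frac - earlier[name] < min_progress
    )
```

Take one listing, wait a while, take another, and pass the two parsed dictionaries to `stalled_tasks`; any name it returns is a task worth suspending and resuming by hand as described above.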

Jimmy Gondek
Message 52373 - Posted: 11 Jan 2012, 3:02:35 UTC - in response to Message 52198.  

Hi Mikey,

The remote monitoring sounds like an interesting solution, but I'm hoping for just some "no need to babysit" time with MWAH...between the glitches I've mentioned, the inordinate share of MWAH Waiting for GPU errors and, of course, the ongoing server issues, I find that running MWAH does require a bit more attention than my other two projects.

One other glitch also exists for me with MWAH...since the doled out WUs only keep my machine busy for several hours at best, I cannot run the project overnight as I shut my internet connection down completely and, well, I'd run out of work. So MWAH is a daytime-only app for me and I run CPDN overnight on the desktop and SETI continuously on the laptop...they don't require burpings.

Unless things even out with MWAH with the new server my time with it may become more limited...my daytime hours will become much busier as springtime nears and I won't have the luxury of time to keep an eye on things like I've had to thus far. (Fingers crossed!)

Thanks again for your time on this!

:)

Werkstatt
Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 52379 - Posted: 11 Jan 2012, 9:30:22 UTC

Hi,
I reinstalled my main system between Xmas and New Year, using an SSD as the main drive. I have seen the same problem since then, but I assumed the new installation was the source. I'm glad to see that this is a common problem.
It happens nearly every morning when I start my system. I need to reinstall BM with the Repair option, then it works.
If someone gives me instructions on where to look, I may be able to help solve this problem.

Alexander

Werkstatt
Message 52380 - Posted: 11 Jan 2012, 10:50:17 UTC

I took a closer look at this and found a thread here:

http://boinc.berkeley.edu/dev/forum_thread.php?id=7055#41055
It seems to be a really hard-to-find problem.


mikey
Message 52383 - Posted: 11 Jan 2012, 14:35:39 UTC - in response to Message 52373.  

You can adjust your priorities for each project so it gets more or less work than another project, but until MWAH settles down I am not sure it will help; it hasn't for me. I run Moo and MWAH as my only two GPU projects right now and have Moo set at 40%, but it gets far more work than MWAH due to MWAH's restrictions.

Jimmy Gondek
Message 53472 - Posted: 28 Feb 2012, 14:00:45 UTC

Resolution Update: Looks like the problem was identified as a narrow Mac OSX 10.6.8 BOINC 6.12.34/35 issue which was cleared with BOINC 6.12.41 as noted here...

"WU Freezes BOINC Manager" Redux...:
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=2789&nowrap=true#53464

...thanks Kashi and scasady for figuring this out! Also thanks to everyone at MWAH & BOINC who spent time with me trying to solve this!

Oh, happy day,
:)

mikey
Message 53492 - Posted: 29 Feb 2012, 12:22:22 UTC - in response to Message 52373.  

Yeah, MW is still experiencing growth and is slowly raising the daily limit on the total number of workunits any one PC can get at one time, but I can go through them at 1 1/2 minutes each, so they don't last long for me either! I too run other projects so my PCs keep busy 24/7!!

Jimmy Gondek
Message 53503 - Posted: 1 Mar 2012, 12:36:04 UTC - in response to Message 53492.  

Hi Mikey,

Thanks again for your early help on this issue (and those very insightful PMs!).

It was actually something you said to me about running projects full time that got me to experiment with letting MWAH run through the night to see how the system would handle that environment...good thing, as it helped bring into stark relief how the MWAH/BOINC socketpair issue was manifesting! The shorter MWAH WUs brought this right out in short order, whereas longer WUs (à la SETI and CPDN) would have taken longer for the patterns and symptoms related to the issue to become noticeable. In fact, that's exactly what was happening when I was juggling all three projects: these odd crashes and hiccups looked to be sporadic issues that could well have been caused by non-BOINC software compatibility issues!

MWAH user Kashi (over on the other thread) has got me seriously considering doing the upgrade to 6.12.43 which should solve this issue. When I'm ready to start pulling my hair out again I'll be testing v.7 builds for Ageless and company! Ha!

Thanks again!

:)

mikey
Message 53504 - Posted: 1 Mar 2012, 17:19:12 UTC - in response to Message 53503.  

No problem, I am just glad it worked!!

Jimmy Gondek
Message 54069 - Posted: 19 Apr 2012, 13:01:19 UTC

6-Week Final Update: Looks like BOINC Manager 6.12.43 has solved all of my issues with spontaneous restarts, "Waiting for GPU" statuses and mDNSResponder system freezes...yay!...

...seeing how 6.12.35 was not playing well with OSX 10.6.8, perhaps the kind folks at BOINC would consider elevating 6.12.43 as their preferred v.6 OSX install on this page?...

Download BOINC client software:
http://boinc.berkeley.edu/download_all.php

...and again, my sincere thanks to everyone for all your time, help, insights and suggestions!... :)

©2024 Astroinformatics Group