Welcome to MilkyWay@home

How can I control the BOINC client from a remote computer?

Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49384 - Posted: 16 Jun 2011, 22:24:47 UTC

you guys are right - i should eliminate the password variable first, since that's the error i'm getting, and also b/c it should theoretically reconnect to the client every time it loses its connection. so here's how i'm gonna go at it - and of course i'm gonna try one thing at a time to guarantee isolation of the source of the problem:

1) clear the contents of the gui_rpc_auth.cfg file (remove the client's password requirement) - really i just expect this to allow the manager to reconnect to the client without issue, but i'm skeptical that it'll prevent the client-manager connection from failing...

2) close ClamWin Antivirus - this is the only malware/spyware/virus software that i let run in the background, and i have no active firewall (including the Windows firewall)

3) show active tasks only in the BOINC manager - if the above doesn't just "solve it," perhaps this will. if the manager is in fact too busy juggling tasks to authenticate a password every time it needs to communicate with the client, then maybe this is the key...though i don't expect it to do anything.

4) kill boinctray.exe and disable it on startup

5) exit the BOINC manager and leave the apps running (to eliminate the possibility that the manager itself is somehow causing the problem)

6) try BAM or BoincTasks to see if it's any easier to control things remotely (this and the next item are last resorts in the event that i just can't solve the client-manager connection issue)

7) try LogMeIn, TightVNC, or some other "global" remote control software to see if it's any easier to control things remotely


god only knows how long it'll take me to get through all these tests, but i'll report back with results, good or bad. thanks for the help, all of you.
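For anyone repeating step 1, here is a minimal sketch that reports whether a gui_rpc_auth.cfg file still holds a password. The helper name and the example path are assumptions for illustration; the BOINC data directory differs between Windows versions and installs.

```python
from pathlib import Path

# Hypothetical location of the file; adjust to the actual BOINC data directory.
EXAMPLE_CFG = r"C:\Documents and Settings\All Users\Application Data\BOINC\gui_rpc_auth.cfg"

def rpc_password_set(cfg_path):
    """Return True if the file exists and contains a non-blank password."""
    path = Path(cfg_path)
    if not path.is_file():
        return False
    return path.read_text(encoding="utf-8", errors="replace").strip() != ""
```

Calling `rpc_password_set(EXAMPLE_CFG)` after clearing the file should return False. A missing file also returns False, so it is worth confirming the file is really there before reading anything into the result.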
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49385 - Posted: 16 Jun 2011, 22:56:21 UTC
Last modified: 16 Jun 2011, 23:04:14 UTC

ok, removing the client password has been done and partially tested - unfortunately it did not stop the client-manager connection from failing. the manager has lost its connection to the client twice now since i cleared the gui_rpc_auth.cfg file, both lasting for ~40 seconds. the 2nd part of this test requires me to let it run for quite a while to make sure that the password error won't crop up again after several periods of disconnection. i'm tempted to run it through the morning just to see if anything goes awry overnight...

one thing's for sure, as i just confirmed it: unlike Beyond during his disconnection experiences, my tasks definitely stop running during my moments of disconnection, both CPU & GPU usage dropping to 0%. i'd almost be content with letting the disconnections go on if i knew my tasks would keep running (particularly if the removal of a client password eliminates the "reconnection" part of the problem), but the fact that they do stop running during the outages means i have no choice but to eliminate the connection problem altogether. that being said, i'm tempted to jump straight to the next step b/c i now know that client password removal didn't solve the problem entirely. but i'd still like to see if the password error doesn't pop up again, so i'll let it run.
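Rather than watching the manager by hand, the outages could be logged with a small poller. This is only a sketch: it assumes the client is listening on BOINC's default GUI RPC port (31416) on the local machine, and the function names are made up for illustration.

```python
import socket
import time
from datetime import datetime

def client_reachable(host="127.0.0.1", port=31416, timeout=2.0):
    """Return True if a TCP connection to the client's GUI RPC port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(poll_seconds=5, checks=None, host="127.0.0.1", port=31416):
    """Print a timestamped line whenever reachability changes.

    checks=None keeps polling until interrupted with Ctrl+C.
    """
    last = None
    done = 0
    while checks is None or done < checks:
        now = client_reachable(host, port)
        if now != last:
            state = "UP" if now else "DOWN"
            print(f"{datetime.now():%Y-%m-%d %H:%M:%S}  client {state}")
            last = now
        done += 1
        time.sleep(poll_seconds)
```

Running `watch()` in a console and redirecting its output to a file would give a record of when each outage started and ended, including the ones that happen overnight.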
robertmiles

Joined: 30 Sep 09
Posts: 211
Credit: 36,977,315
RAC: 0
Message 49386 - Posted: 16 Jun 2011, 23:16:15 UTC - in response to Message 49375.  
Last modified: 16 Jun 2011, 23:32:52 UTC

Yes, BOINC Manager refuses to connect to BOINC until boinctray is killed. The difference is that in my case boinc.exe and the project apps keep running. They can also be accessed via BoincTasks and BoincView from another machine on the local network. BTW, BoincTasks is excellent, BoincView is now badly outdated. If boinc.exe is failing and the client apps aren't running you may have a more serious problem. Make sure BOINC and the client apps aren't getting restricted by any security programs.


A few things I've noticed:

The usual Microsoft-provided way of logging into another computer interferes with GPU crunching at both ends of the connection. Look for ways provided by someone other than Microsoft.

For me, losing access between the BOINC client and the BOINC manager is easily handled by killing the boincmgr.exe process, then starting another copy, whenever I see the problem. I suspect that this problem is due to some interference from my antivirus program.

i run ClamWin Antivirus b/c it's free and interferes less than most other antivirus software. as far as i know none of its settings should allow it to interfere w/ BOINC, but i'm not 100% sure of that. what kind of things should i be looking for that might indicate that ClamWin may be causing problems with BOINC? i suppose the easiest way is to just close ClamWin, let BOINC run for a while, and see if the client-manager connection problems persist.


The problems I've identified with Norton Internet Security:

1. Probable cause of the fairly frequent shutdowns I see of communications between boincmgr.exe and boinc.exe.

2. Considers it bad for any program to come close to using 100% of a CPU core for very long, such as most BOINC application programs. Treats this as a warning, though, and therefore allows you to tell it to stop such warnings FOR THAT VERSION of that application program.

3. Decides that some application programs from alpha test BOINC projects are behaving strangely enough, and the program is unfamiliar enough, that it should halt the program and delete its executable before even telling you about this.

4. A well-known cause of interference with Microsoft email/newsreader programs, such as Windows Mail and Windows Live Mail. Updates try to make it more compatible with the email sections, but not the other sections.

5. Makes it rather difficult to temporarily shut it down, for cases where some installation procedure requires this, or you want to test if it's the cause of a certain type of problem.

6. May have been the cause of the usual Windows Vista backups program working so poorly over a year ago that I had to give up backups for around a year, until I had a disk server with a different backups program installed.

7. Needs to have both the BOINC directory trees excluded from what it should scan whenever it is in use.

8. For a total number of files as large as I have, automatic full scans never finish before I return to the computer in the morning, and that automatic scan halts. The next automatic full scan will start at the same point as the previous one, instead of where the previous one halted.


I'm already looking for a replacement antivirus program, but want it to be one with a good reputation among BOINC users, experts at Windows Live Mail, and also experts at Windows Mail (with the last two including newsgroups use).


The McAfee and Trend antivirus programs are the two others listed as well-known causes of Microsoft email/newsreader problems.


If you're using a free antivirus program, you may want to check if it includes antispyware functions (looking for something that often acts like computer viruses, but doesn't quite meet their definition). Many of the older free antivirus programs don't, and therefore need to have a separate antispyware program installed.
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49387 - Posted: 17 Jun 2011, 0:22:33 UTC

a few more observations i've made in the meantime - sometimes BOINC loses its connection with the client every 5 minutes or so, and yet other times it'll go hours before losing the connection. when i mentioned in my last post that i had already seen two periods of disconnection...well i saw 3 or 4 more after that, all within a few minutes of each other. as of now though, it's been 35 minutes since the last "outage."
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49388 - Posted: 17 Jun 2011, 3:00:29 UTC

well BOINC managed to stay connected to the client for ~70 minutes this time before losing its connection. on the upside, it's going on 4 hours now, and i haven't seen the password error box pop up yet. typically, i leave for work around 8:00am and by noon my BOINC client connection has failed to the point where it won't reconnect...so perhaps this is a good sign in the sense that if i don't completely eliminate the failed connections by tomorrow morning, BOINC will be able to reconnect all day long without password problems. but i'll feel more comfortable if she makes it through the night without throwing the password error. the good thing is that, if the password error disappears, i'll no longer lose half a day's productivity (~6 hours during the work day when i can't be home to reconnect to the client, and ~6 hours at night while i'm sleeping and don't realize that BOINC needs to be reconnected). the bad thing is that i'm still losing quite a bit of productivity to this "failed connection" issue. as i mentioned in a previous post, sometimes BOINC goes for hours before losing its connection, and other times it seems like such a failure occurs just about every few minutes. one thing i've observed (though i'm not sure if it's coincidence or not) is that BOINC seems to stay connected longer while i'm away from the computer and not surfing the internet/checking email/etc. for calculation's sake though, let's assume the worst case scenario and suppose that BOINC runs for only 5 minutes or so between outages and that each outage lasts approx. 60 seconds. that's 1 minute lost out of every 6, or ~16.7% of my productivity. granted, i'm probably not losing quite that much production, given BOINC sometimes goes hours without losing its connection. but it's still a significant production loss when you have a powerful GPU at your disposal. so i really need to get this fixed so i'm not wasting GPU cycles and electricity.
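The worst-case estimate can be checked with a couple of lines. The sketch below assumes each stretch of crunching is followed by one outage, which is the reading that gives 1/6; if an outage instead begins every 5 minutes on the clock, the loss comes out higher.

```python
def lost_fraction(run_seconds, outage_seconds):
    """Fraction of wall-clock time lost when each stretch of running
    is followed by one outage."""
    return outage_seconds / (run_seconds + outage_seconds)

# 5 minutes of crunching, then a 60-second outage: 1 minute lost out of 6.
print(f"{lost_fraction(5 * 60, 60):.1%}")   # 16.7%

# An outage starting every 5 minutes leaves only 4 minutes of crunching per cycle.
print(f"{lost_fraction(4 * 60, 60):.1%}")   # 20.0%
```

Either way the figure is an upper bound, since the outages are often hours apart rather than minutes.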
Beyond
Joined: 15 Jul 08
Posts: 383
Credit: 729,293,740
RAC: 0
Message 49392 - Posted: 17 Jun 2011, 12:49:16 UTC

Sunny, does boinc.exe disappear from task manager when the client apps stop? Also try 6.12.33 instead of 6.10.60. I've noticed that since 6.12.28 I'm getting far fewer client disconnection issues. We discussed this problem on the lists after 6.12.26...

Things to check: also make sure to set Activity to Run Always. In preferences, check "While computer is in use" and "Use GPU while computer is in use". Set "While processor usage is less than" to 0.
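The three preference settings can also be checked from a local preferences override file instead of the manager dialogs. This is a sketch under assumptions: the tag names are the ones I believe BOINC uses in global_prefs_override.xml, and the function name is invented. It does not cover the Activity menu's Run Always mode, which is not stored as one of these preferences.

```python
import xml.etree.ElementTree as ET

# Suggested values, keyed by assumed preference tag names.
SUGGESTED = {
    "run_if_user_active": 1.0,       # "While computer is in use"
    "run_gpu_if_user_active": 1.0,   # "Use GPU while computer is in use"
    "suspend_cpu_usage": 0.0,        # "While processor usage is less than" (0 = no limit)
}

def mismatched_prefs(xml_text):
    """Return {tag: found_value} for each suggested setting that is missing or different."""
    root = ET.fromstring(xml_text)
    problems = {}
    for tag, wanted in SUGGESTED.items():
        node = root.find(tag)
        found = float(node.text) if node is not None and node.text else None
        if found != wanted:
            problems[tag] = found
    return problems
```

An empty result means all three settings match the suggestion; a value of None means the tag was not present in the file at all.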
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49393 - Posted: 17 Jun 2011, 14:25:14 UTC - in response to Message 49392.  

ok, so i woke up this morning around 7:30am local time to find BOINC still connected to the client. but i clicked on the messages tab and saw client startup dialogue at around 6:30am. so the bad news continues to be that BOINC is still losing its connection with the client regularly. the good news is that, since removing the client password, BOINC has yet to fail to reconnect to the client..which means that, aside from approx. 60 seconds of wasted CPU & GPU cycles for every lost connection, i'm no longer experiencing super-long outages caused by a password error.

on that note, there were several successfully completed MW@H tasks waiting to report at around 7:30am local time this morning. so i checked the server status page, and sure enough some servers are down. looking at the "tasks in progress" on my account page, i noticed that the most recent WU's were sent to me around 7:00am local time - so the servers had only been down for approx. 30 minutes at that point...which means that my host must have been crunching away all night (assuming the servers didn't go down during the night). the servers weren't back up by the time i left for work, so i switched back to crunching S@H AP tasks in the meantime. one thing i noticed is that the failed connections still occur, even though S@H is running now and MW@H isn't...so the problem isn't specifically related to MW@H. in fact i'm inclined to believe it's got nothing to do with the particular projects/applications i'm running.

so i'm at a bit of a stand-still in that i cannot proceed to the next troubleshooting step (disable ClamWin Antivirus) until i get home. in the mean time though, it'll be further confirmation that removing the client password got rid of the password error if i get home and BOINC still reconnects without issue after a failed connection.


Sunny, does boinc.exe disappear from task manager when the client apps stop? Also try 6.12.33 instead of 6.10.60. I've noticed that since 6.12.28 I'm getting far fewer client disconnection issues. We discussed this problem on the lists after 6.12.26...

i don't know, as i haven't checked that yet...i don't know why i didn't think to check that earlier - probably b/c i'm always watching the CPU usage drop to zero on the performance tab, instead of looking at the processes tab. but i'll have a look at that when i get home this afternoon/evening.
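The processes-tab check can be scripted so it does not depend on catching the outage by eye. A minimal sketch, assuming Windows and its built-in `tasklist` command with CSV output; the function names are illustrative.

```python
import csv
import io
import subprocess

def image_names(tasklist_csv):
    """Extract lower-cased image names from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return {row[0].lower() for row in rows if row}

def boinc_client_running():
    """Windows only: ask tasklist whether boinc.exe is in the process list."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "boinc.exe" in image_names(out)
```

If `boinc_client_running()` returns False during an outage, the client itself has exited, as opposed to the manager merely losing contact with a client that is still alive.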


Things to check: also make sure to set Activity to Run Always. In preferences, check "While computer is in use" and "Use GPU while computer is in use". Set "While processor usage is less than" to 0.

well i'm running multiple projects/applications, so wouldn't i want BOINC to run based on my preferences, as opposed to always? or are you suggesting trying this to eliminate yet another variable (like i ended up doing with the password)? as for the other three things you mentioned, i've already got them set as you suggested.
Beyond
Joined: 15 Jul 08
Posts: 383
Credit: 729,293,740
RAC: 0
Message 49397 - Posted: 17 Jun 2011, 17:17:39 UTC - in response to Message 49393.  

ok, so i woke up this morning around 7:30am local time to find BOINC still connected to the client. but i clicked on the messages tab and saw client startup dialogue at around 6:30am. so the bad news continues to be that BOINC is still losing its connection with the client regularly. the good news is that, since removing the client password, BOINC has yet to fail to reconnect to the client..which means that, aside from approx. 60 seconds of wasted CPU & GPU cycles for every lost connection, i'm no longer experiencing super-long outages caused by a password error.

on that note, there were several successfully completed MW@H tasks waiting to report at around 7:30am local time this morning. so i checked the server status page, and sure enough some servers are down. looking at the "tasks in progress" on my account page, i noticed that the most recent WU's were sent to me around 7:00am local time - so the servers had only been down for approx. 30 minutes at that point...which means that my host must have been crunching away all night (assuming the servers didn't go down during the night). the servers weren't back up by the time i left for work, so i switched back to crunching S@H AP tasks in the mean time. one thing i noticed is that the failed connections still occur, even though S@H is running now and MW@H isn't...so the problem isn't specifically related to MW@H. in fact i'm inclined to believe its got nothing to do with the particular projects/applications i'm running.

I guess I'd avoid apps that don't checkpoint for now. It's just possible that there's a particular project or other app that's killing BOINC so keep that in mind.

Things to check: also make sure to set Activity to Run Always. In preferences, check "While computer is in use" and "Use GPU while computer is in use". Set "While processor usage is less than" to 0.

well i'm running multiple projects/applications, so wouldn't i want BOINC to run based on my preferences, as opposed to always? or are you suggesting trying this to eliminate yet another variable (like i ended up doing with the password)? as for the other three things you mentioned, i've already got them set as you suggested.

I set all 3 options under Activity to Always no matter what projects I'm running.
robertmiles

Joined: 30 Sep 09
Posts: 211
Credit: 36,977,315
RAC: 0
Message 49409 - Posted: 18 Jun 2011, 5:00:19 UTC - in response to Message 49397.  

I set all 3 options under Activity to Always no matter what projects I'm running.


If you do that, make sure you do it only on computers that can stand BOINC using 100% of the time available on all the CPU cores you let it use - it disables any settings for telling it to use less. Rather important on laptops, but desktops are often able to stand it without overheating.
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49416 - Posted: 19 Jun 2011, 2:49:05 UTC
Last modified: 19 Jun 2011, 2:52:56 UTC

*UPDATE*

ok, so once i got step one out of the way (eliminating the password error), the next few steps did not take long at all to test...

2) close ClamWin Antivirus - this is the only malware/spyware/virus software that i let run in the background, and i have no active firewall (including the Windows firewall)

disabled this and still got lost connections regularly...

3) show active tasks only in the BOINC manager - if the above doesn't just "solve it," perhaps this will. if the manager is in fact too busy juggling tasks to authenticate a password every time it needs to communicate with the client, then maybe this is the key...though i don't expect it to do anything.

again, BOINC lost its connection with the client, even with only the active tasks showing...

4) kill boinctray.exe and disable it on startup

didn't do anything...

5) exit the BOINC manager and leave the apps running (to eliminate the possibility that the manager itself is somehow causing the problem)

this actually seemed promising at first, except for the fact that the boinc.exe process and the individual task processes would terminate seemingly randomly on their own shortly after closing the BOINC manager. to me the outage seemed very similar to the lost connections i would get while still using the BOINC manager, only this time the tasks would fail to restart. this makes sense though if you think about it: first of all, with boincmgr.exe terminated [and disabled on startup], there's no longer a program running that'll re-execute boinc.exe if it closes due to a failed BOINC client connection [or after a system restart/reboot]. whereas, with the BOINC manager running, it'll always reconnect to the client if the connection fails (provided you have no client password errors, or no client password at all). so it's no wonder that if the BOINC manager is not running, the client (boinc.exe) and all the active tasks will not restart in the event that a failed connection occurs.

knowing that the BOINC client (boinc.exe) would not start with Windows on a restart/reboot if the BOINC manager is not enabled on startup, i placed a short-cut to boinc.exe in the Windows startup folder (C:\Documents and Settings\All Users\Start Menu\Programs\Startup). after a restart, boinc.exe opened in what visually appeared to be a command prompt-style window simulating the messages tab of the BOINC manager that i had disabled. so i let it run for a while to see if the last time boinc.exe and all the active tasks terminated while the BOINC manager was disabled was a fluke...unfortunately, it wasn't. only this time, i got an error message in the message window (which i didn't get on the messages tab of the BOINC manager during outages b/c it always went blank). here's what it looked like:



though i'm not sure what it means, such an error message seems to suggest taking a whole new direction in tackling this problem. i get this feeling like maybe i should try a different version of BOINC. of course the reason i switched from v6.12.26 back to v6.10.60 was b/c of the excessive project back-off times i was getting while running SETI@Home - switching to v6.10.60 fixed all that. but that was the first time i had used v6.10.60, and so i hadn't yet used it with MW@H...maybe it just doesn't play well with my mix of hardware, projects, and applications. so i guess the next step is to see if any other versions of BOINC work better for me...
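The restart behavior described for step 5 (the manager relaunching boinc.exe when it dies) can be imitated with a small watchdog while the manager is disabled. This is a sketch only: the install path is hypothetical, the client may also need its working directory set to the data folder, and a watchdog hides the crash rather than explaining it.

```python
import subprocess
import time

def ensure_running(is_running, start):
    """Call start() if is_running() reports False. Returns True when a restart was attempted."""
    if is_running():
        return False
    start()
    return True

def watchdog(is_running, start, poll_seconds=30, rounds=None):
    """Check every poll_seconds; rounds=None means loop until interrupted."""
    restarts = 0
    done = 0
    while rounds is None or done < rounds:
        if ensure_running(is_running, start):
            restarts += 1
        done += 1
        time.sleep(poll_seconds)
    return restarts

def start_boinc(exe=r"C:\Program Files\BOINC\boinc.exe"):
    # Hypothetical install path; adjust to wherever boinc.exe actually lives.
    subprocess.Popen([exe])
```

`watchdog` takes the "is it running?" check as a parameter, so any process test can be plugged in alongside `start_boinc`. Counting the restarts also gives a rough tally of how often the client is dying.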

6) run always/use GPU always/ network always available

didn't do anything...

on that note, the newest observations are as follows:
1) CPU tasks seem to stop during all outages, whereas GPU tasks do not necessarily stop running during the outages. @ Beyond - boinc.exe does disappear from the processes list during the failed connections.
2) running only 1 MW@H task at a time seems to eliminate the problem, but i've only tested this for up to 30 minutes so far. i'd rather not think of running 2 MW@H tasks simultaneously as the problem. but if it comes down to it, and i just can't eliminate these constant and repetitive failed connections, i'll happily run only 1 MW@H task at a time. after all, the amount of production lost from running 1 MW@H task at a time vs 2 at a time pales in comparison to the amount of production lost to errors, invalids, and downtime due to my client connection problems.
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49417 - Posted: 19 Jun 2011, 4:00:31 UTC - in response to Message 49416.  
Last modified: 19 Jun 2011, 4:08:25 UTC

...running only 1 MW@H task at a time seems to eliminate the problem, but i've only tested this for up to 30 minutes so far...

i take it back - BOINC ran for approx 60 minutes in this configuration before the client connection failed...so i guess cutting back to only 1 GPU task at a time doesn't get rid of my client connection problem like i thought it might. though i must admit, BOINC ran a lot longer before losing its connection this time than it typically did when 2 GPU tasks were running simultaneously...of course that could just be a coincidence - after all, this was the first failed connection i've witnessed since switching back to 1 GPU task at a time. i'm gonna let it run this way overnight and see if i can't monitor how often the outages recur with this configuration. while i'm not exactly sure how that information is going to benefit me, it's better to have it and not need it than to need it and not have it. after that, i'm gonna try some other versions of BOINC, as well as try to run multiple clients, and then i'll report back.
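Comparing the 1-task and 2-task configurations comes down to the time between outages. A sketch of that calculation, assuming outage start times have been noted down as text; the timestamps below are made up purely for illustration.

```python
from datetime import datetime

def minutes_between_outages(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Given outage start times as strings, return the gaps between them in minutes."""
    times = sorted(datetime.strptime(s, fmt) for s in stamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

def mean(values):
    """Arithmetic mean, or None for an empty list."""
    return sum(values) / len(values) if values else None

# Invented example log, not real data from this thread.
log = ["2011-06-19 01:00:00", "2011-06-19 02:05:00", "2011-06-19 02:35:00"]
gaps = minutes_between_outages(log)
print(gaps, mean(gaps))   # [65.0, 30.0] 47.5
```

A single long gap proves little on its own. Collecting a night's worth of gaps for each configuration and comparing the means would show whether the difference is more than coincidence.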
dnolan
Joined: 26 Oct 09
Posts: 55
Credit: 352,166,802
RAC: 0
Message 49418 - Posted: 19 Jun 2011, 4:46:17 UTC

Just making some guesses here, but...
When you switched versions of Boinc, is it possible one or more of the files from the other version remained in place? I would try un-installing Boinc from the control panel (stop Boinc first, you should not lose any work, but you can back up your data folder just in case and disable network communications if you want to be safe), then make sure the PROGRAM dir for Boinc is completely empty (not the DATA directory), then try installing a version of Boinc again. The error you have in the command prompt window has been attributed to tasks exceeding the allotted computation time, but I'm not thinking this is what's causing it on your system.
I do see this in one of your results right now:
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' 
Error reading astronomy parameters from file 'astronomy_parameters.txt'
  Trying old parameters file
Number of streams does not match
<search_application> milkywayathome_client separation 0.82 Windows x86 double CAL++ </search_application>
18:47:19 (1536): called boinc_finish

So it's possible this is what's causing boinc to exit, though I'm not sure.
One other thing you could try, if the above is the problem, is set your system to no new work, finish what you have right now, then detach from MW, exit Boinc, remove any files in the MW project folder, (save your app_info.xml file if you have one) then re-attach (and put back your app_info.xml file if needed, after the MW program and data have re-downloaded).
Those are all the ideas I have for the moment...
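To see whether the same messages show up in every failing result, the stderr text of each task can be filtered for lines beginning with "Error". This is a sketch with an invented helper name, using the excerpt quoted above as sample input. It does not say whether these lines are harmless or the actual cause.

```python
def error_lines(stderr_text):
    """Return the stripped lines of a task's stderr that begin with 'Error'."""
    return [ln.strip() for ln in stderr_text.splitlines()
            if ln.strip().startswith("Error")]

sample = """Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
  Trying old parameters file
Number of streams does not match
18:47:19 (1536): called boinc_finish"""

for line in error_lines(sample):
    print(line)
```

If valid results carry the same two lines, they are probably not what is taking boinc.exe down.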

-Dave
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49426 - Posted: 19 Jun 2011, 14:44:34 UTC - in response to Message 49417.  

...so i guess cutting back to only 1 GPU task at a time doesn't get rid of my client connection problem like i thought it might. though i must admit, BOINC ran a lot longer before losing its connection this time than it typically did when 2 GPU tasks were running simultaneously...of course that could just be a coincidence - after all, this was the first failed connection i've witnessed since switching back to 1 GPU task at a time. i'm gonna let it run this way overnight and see if i can't monitor how often the outages recur with this configuration...

so after letting it run like this overnight, it still does seem that the outages occur less frequently while running 1 GPU task at a time than they do while running 2 at a time...however, there's a lot i could have missed during the hours i was asleep. that being said, the one thing i am sure about is this: while i cannot say with much confidence whether the outages occur less frequently while only running 1 GPU task at a time, my number of "consecutive valid tasks" has increased drastically. that is to say, when i was running 2 GPU tasks simultaneously, i would get far more errors and invalids than i do while running only 1 GPU task at a time. i don't think this phenomenon is directly related to the frequency of the outages, but i can't say for sure at this point.


Just making some guesses here, but...
When you switched versions of Boinc, is it possible one or more of the files from the other version remained in place? I would try un-installing Boinc from the control panel (stop Boinc first, you should not lose any work, but you can back up your data folder just in case and disable network communications if you want to be safe), then make sure the PROGRAM dir for Boinc is completely empty (not the DATA directory), then try installing a version of Boinc again.

i suppose this is quite possible. of the several times i've switched back and forth between BOINC versions in the past 6 months, not once did i uninstall before reinstalling. i was told that i could install right over the previous installation without any problem. but while this may be true in most instances, perhaps something didn't go quite right during one of my installations. so i went ahead and did what you suggested - i uninstalled BOINC and reinstalled with the same v6.10.60 just to see if anything would change...unfortunately, i was able to reproduce the failed connection. so i guess it isn't a bogus installation that's causing the problem...


The error you have in the command prompt window has been attributed to tasks exceeding the allotted computation time, but I'm not thinking this is what's causing it on your system.
I do see this in one of your results right now:
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' 
Error reading astronomy parameters from file 'astronomy_parameters.txt'
  Trying old parameters file
Number of streams does not match
<search_application> milkywayathome_client separation 0.82 Windows x86 double CAL++ </search_application>
18:47:19 (1536): called boinc_finish

So it's possible this is what's causing boinc to exit, though I'm not sure.
One other thing you could try, if the above is the problem, is set your system to no new work, finish what you have right now, then detach from MW, exit Boinc, remove any files in the MW project folder, (save your app_info.xml file if you have one) then re-attach (and put back your app_info.xml file if needed, after the MW program and data have re-downloaded).

i'm about to give this a try, but i'm not expecting anything good to come of it, simply b/c i was having the same problem with MW@H suspended and S@H running instead...i'll let you know what happens though.
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49429 - Posted: 19 Jun 2011, 15:58:27 UTC
Last modified: 19 Jun 2011, 16:18:36 UTC

ok, so i set MW@H to NNT, let the remaining tasks finish, made a copy of my app_info.xml, and detached from MW@H. i then reattached to MW@H, but files started downloading and tasks started running before i could insert my app_info.xml into the newly created MW@H project folder. so i figured i would let it run like this for a while, seeing as how the only reason i was using an app_info.xml before was to allow 2 MW@H tasks to run simultaneously. the good news is that MW@H seems to be crunching away with the stock app without a problem, and has been doing so for ~60 minutes now without a failed connection. i'll let it run for a while and see if i can't reproduce the failed connection under this configuration.
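The save-and-restore of app_info.xml around a detach can be scripted so the copy is ready to drop back in. A minimal sketch; the function names and both folder paths are placeholders supplied by the caller, not BOINC conventions.

```python
import shutil
from pathlib import Path

def backup_app_info(project_dir, backup_dir):
    """Copy app_info.xml out of the project folder; return the backup path, or None if absent."""
    src = Path(project_dir) / "app_info.xml"
    if not src.is_file():
        return None
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir / src.name))

def restore_app_info(backup_path, project_dir):
    """Copy a saved app_info.xml back into the (re-created) project folder."""
    return Path(shutil.copy2(backup_path, Path(project_dir) / "app_info.xml"))
```

Suspending network activity before re-attaching would give time to restore the file before the stock application starts downloading, though that ordering is a suggestion rather than something tested in this thread.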

the odd thing is this: before i detached and reattached to the project, i would always have 12 tasks in the queue. after reattaching to the project, i got 12 new tasks. but it isn't maintaining a 12-task queue anymore - that is to say, it isn't asking for new work every time a completed task gets reported like it used to before i detached. rather, it crunched through the initial 12 tasks, and didn't ask for new work until the next to last task was getting reported. in other words, everything is crunching away fine on the stock app, but my queue depth is now only 1 task deep, compared to 12 tasks deep before i detached (when i was using an anonymous platform via the app_info.xml). i wonder if this has anything to do with the fact that i've never really used the stock app before - i've always used an anonymous platform. under "application details" on my account pages, it shows that i've completed 20,139 tasks under the "anonymous platform" app, but only 21 completed tasks under the "MilkyWay@Home 0.82 windows_intelx86 (ati14)" app. maybe this is one of those instances where i need a certain number of valid tasks under my belt before BOINC allows my queue depth to reach the max allowable # of tasks? *EDIT* - this must be the case, b/c i've completed 36 tasks using the stock app now, and my queue depth is back to 12 tasks deep.

at any rate, as i mentioned above, i have yet to see a failed connection since detaching and reattaching to MW@H. so i'm beginning to wonder if the presence of an app_info.xml file was causing the problem. recall that i was still getting failed connections even while running only 1 GPU task at a time. the only difference between then and now was the presence of the app_info.xml file. and perhaps that's why i was also seeing failed connections with MW@H suspended and S@H running instead - b/c i'm using an app_info.xml file with S@H too. the problem there is that i cannot test S@H without an app_info.xml file to see if the failed connection problems go away b/c the ATI GPU app is not a stock app (in other words the presence of an app_info.xml file is absolutely necessary in order for me to crunch S@H AP or MB tasks on my ATI GPU). *EDIT* - it's not the presence (or absence) of an app_info.xml file that's responsible, b/c the manager just lost its connection to the client...

i'm eager to get this round of testing out of the way though. the sporadic termination of boinc.exe and all its associated active tasks while i was testing with the BOINC manager disabled leads me to believe that my "failed connection" problem has less to do with an actual connection to the client, and more to do with the client itself. if boinc.exe and all its active tasks weren't randomly terminating, then the BOINC manager would not lose its connection with the client in the first place. and i probably wouldn't have even seen this odd behavior if i hadn't tested boinc.exe without boincmgr.exe. so i probably shouldn't be asking "why is the manager failing to connect to the client?" b/c we already know why - b/c the client and all of its active tasks are randomly terminating for some reason. so then the question should really be "why do boinc.exe and all its active tasks terminate randomly?" if i could solve this problem, then i wouldn't ever have to worry about the manager losing its connection to the client. and i would therefore never have to worry about tasks terminating or pausing - i.e. productivity would be back to normal for me.
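
one way to catch the exact moments when the client goes away, without relying on the manager at all, might be to poll the client's GUI RPC port from a small script. this is only a rough sketch: it assumes the client is listening on the default GUI RPC port (31416) on the local machine, and the function names and the 10-second interval are made up for illustration.

```python
import socket
import time
from datetime import datetime


def client_reachable(host="localhost", port=31416, timeout=3.0):
    """Return True if a TCP connection to the GUI RPC port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not reachable".
        return False


def watch(host="localhost", port=31416, interval=10.0, checks=None):
    """Print a timestamped line whenever reachability changes.

    checks=None polls forever; a number limits how many polls are made.
    """
    last = None
    done = 0
    while checks is None or done < checks:
        now = client_reachable(host, port)
        if now != last:
            state = "reachable" if now else "NOT reachable"
            print(f"{datetime.now():%Y-%m-%d %H:%M:%S} client {state}")
            last = now
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)
```

calling watch() and leaving it running would give a list of timestamps to compare against the BOINC message log. note that it only tests whether the port accepts a connection - it doesn't authenticate or send any RPC, so it can't tell a hung client from a healthy one.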
ID: 49429
kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 49435 - Posted: 19 Jun 2011, 23:20:10 UTC

Have you tried suspending your Einstein and any other CPU project tasks, then stopping and restarting BOINC and running MilkyWay ATI only? With 3GB of memory feeding 6 cores of CPU and GPU computation as well, you may be running short of memory to support GPU processing. Even when lots of free memory is shown in task manager it is possible to run short of sufficient system memory to support GPU processing. This has happened a number of times to those with a relatively small amount of memory running Collatz ATI. Running 2 tasks concurrently on the GPU would make this worse. This would usually just result in errored tasks but it may be possible that a shortage of resources is also causing problems with boinc.exe itself.

Another potential area that can cause problems is the GPU throttling down to a very low clockspeed when the monitor is turned off. It is a powersaving feature of the Catalyst driver. Although this is more common on multiple GPU configurations it is also possible with a single GPU. A way to test this is to use a blank screensaver and not turn off the monitor.

These are just suggestions. In my experience, the main causes of boinc.exe freezing and then terminating were in the GPU itself: either a corrupted Catalyst driver or an overheating/damaged GPU. A way to reduce heat and strain on GPUs when processing MilkyWay ATI tasks is to use a very low GPU memory speed. I use 500MHz, some use lower. If boinc.exe is failing to start or freezing and terminating even when processing CPU tasks only, then a way to test whether the GPU/driver is responsible is to disable the GPU in BOINC using the <no_gpus>1</no_gpus> option in a cc_config.xml file.
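
For anyone who has not made a cc_config.xml before, the option goes inside an <options> element under the <cc_config> root. A minimal sketch that generates such a file follows; the helper names are hypothetical, and it assumes you have no existing cc_config.xml, since writing the file this way would overwrite any other options already in it.

```python
import xml.etree.ElementTree as ET


def build_cc_config(no_gpus=True):
    """Build a cc_config.xml document that sets <no_gpus> inside <options>."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "no_gpus").text = "1" if no_gpus else "0"
    return ET.tostring(root, encoding="unicode")


def write_cc_config(path, no_gpus=True):
    """Write the file; `path` should point into the BOINC data directory.

    Caution: this replaces the whole file, including any existing options.
    """
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(build_cc_config(no_gpus) + "\n")
```

The file belongs in the BOINC data directory, and as far as I know BOINC needs to be restarted before the setting takes effect. Setting the value back to 0 (or removing the file) re-enables the GPU.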

Something is causing boinc.exe to terminate. You need to determine first whether it is related to BOINC CPU processing or GPU processing/recognition, then go on from there to investigate whether it is an overheating/damaged GPU, insufficient/faulty/mismatched system memory, corrupt drivers or other software, a software quirk such as a GPU or CPU powersaving feature, or a virus scanner or other process that is causing trouble. Also sometimes people experience trouble with BOINC due to summer heat issues, so you should ensure the core and VRMs of your GPU are not overheating and similarly that the CPU itself is not becoming unstable due to overheating. For example, I sometimes need to reduce the CPU overclock on hotter summer days. Even system memory can overheat and cause random errors if load/speed is high and case ventilation is insufficient for the conditions. Running 6 cores of Einstein and GPU processing as well is very demanding on computer resources and requires good cooling too.

Just some possibilities for you to consider, none of this may apply to your configuration.
ID: 49435
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49437 - Posted: 20 Jun 2011, 1:34:50 UTC - in response to Message 49435.  

Have you tried suspending your Einstein and any other CPU project tasks, then stopping and restarting BOINC and running MilkyWay ATI only? With 3GB of memory feeding 6 cores of CPU and GPU computation as well, you may be running short of memory to support GPU processing. Even when lots of free memory is shown in task manager it is possible to run short of sufficient system memory to support GPU processing. This has happened a number of times to those with a relatively small amount of memory running Collatz ATI. Running 2 tasks concurrently on the GPU would make this worse. This would usually just result in errored tasks but it may be possible that a shortage of resources is also causing problems with boinc.exe itself.

yes. i have no such problems when running CPU tasks only or GPU tasks only. it's only when i run both CPU & GPU tasks concurrently that the problem occurs. i've been warned before about WinXP 32-bit only recognizing ~3.25GB of system memory even with 4GB present, and i've also heard about how the task manager can be deceiving about memory resources. but for the past few days i've been testing with only 1 GPU task and 5 CPU tasks concurrently, and BOINC is still having these problems. in fact, these "failed client connection" issues first cropped up for me about 6 months ago and lasted for a few weeks, but then just went away on their own. during that time, i went from running CPU tasks on all 6 cores to running only 5 tasks in order to leave a free core for GPU tasks. that didn't specifically fix my problem - as i said, it just seemed to vanish on its own. but ever since then, i've only been running 5 concurrent tasks on the CPU. for about 3 months, march through may, i was running 5-6 CPU tasks and 2-4 GPU tasks concurrently without so much as a hiccup from BOINC. and now all of a sudden the problem is back...so it's hard to imagine that i'm having a resource issue considering i'm running fewer concurrent CPU & GPU tasks now than i was for the last 3 problem-free months.

if it is in fact a resource issue, do you think WinXP 64-bit would be a sufficient upgrade? then i could add more system memory, since my motherboard and 64-bit OS's can handle 16GB of memory (or something like that). i would really prefer to stay away from Windows 7 or Vista, since they both have 10 times as much resource-consuming bullshit running in the background as WinXP ever did...

Another potential area that can cause problems is the GPU throttling down to a very low clockspeed when the monitor is turned off. It is a powersaving feature of the Catalyst driver. Although this is more common on multiple GPU configurations it is also possible with a single GPU. A way to test this is to use a blank screensaver and not turn off the monitor.

i'm pretty sure the Catalyst drivers aren't throttling down my GPU, even though i power my monitor down all the time when i'm not in front of the computer. the reason i'm sure about this is b/c MSI Afterburner shows a ~5 minute history of GPU stats, including the core and memory clocks, which never appear to "throttle down" for any longer than the ~12 second gap between MW@H tasks or the ~60 second disconnections i'm experiencing.

These are just suggestions. In my experience, the main causes of boinc.exe freezing and then terminating were in the GPU itself: either a corrupted Catalyst driver or an overheating/damaged GPU. A way to reduce heat and strain on GPUs when processing MilkyWay ATI tasks is to use a very low GPU memory speed. I use 500MHz, some use lower. If boinc.exe is failing to start or freezing and terminating even when processing CPU tasks only, then a way to test whether the GPU/driver is responsible is to disable the GPU in BOINC using the <no_gpus>1</no_gpus> option in a cc_config.xml file.

i've been using that trick too for a while now - my HD 5870 runs at the stock core clock @ 850MHz, but with underclocked memory @ 600MHz (down from the stock memory clock of 1200MHz). as i said above, i cannot reproduce the client connection problem when running either CPU tasks only or GPU tasks only, so i'm pretty sure it isn't video driver conflicts.

Something is causing boinc.exe to terminate. You need to determine first whether it is related to BOINC CPU processing or GPU processing/recognition, then go on from there to investigate whether it is an overheating/damaged GPU, insufficient/faulty/mismatched system memory, corrupt drivers or other software, a software quirk such as a GPU or CPU powersaving feature, or a virus scanner or other process that is causing trouble. Also sometimes people experience trouble with BOINC due to summer heat issues, so you should ensure the core and VRMs of your GPU are not overheating and similarly that the CPU itself is not becoming unstable due to overheating. For example, I sometimes need to reduce the CPU overclock on hotter summer days. Even system memory can overheat and cause random errors if load/speed is high and case ventilation is insufficient for the conditions. Running 6 cores of Einstein and GPU processing as well is very demanding on computer resources and requires good cooling too.

i think my cooling is sufficient - got a nice mesh case with plenty of fans, good flow, none of the many GPU temp sensors in GPU-Z ever reads above 70°C, and the CPU never runs hotter than 43°C, even though it's OCed to 3.7GHz and under 100% load 24/7.

thanks again for the tips...the more possibilities i can eliminate, the better off i am.
ID: 49437
kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 49439 - Posted: 20 Jun 2011, 3:48:26 UTC

I didn't notice that you were running XP 32-bit, that explains the 3GB of memory. It seems that you have already covered a lot of potential causes. Before you consider changing operating systems have you tried with CPU and memory at default speed? Or with CPU projects other than Einstein running?

Sometimes overclocks can be stable for some time and then become unstable and need the CPU voltage increased slightly. Overclocked computers can be stable on some projects yet crash on others even if CPU stress testing shows no problems. Changes in the same project's applications or the length and complexity of tasks can also affect stability. Memory testing applications can show no errors in memory and yet certain applications can cause errors and may require changes to the memory speed and timings or the voltages.

It could still be related to network connectivity. I have a bad line which causes ADSL disconnections after rain. It also drops out sometimes if the phone rings and also the wireless part always dies if someone turns on the microwave oven. I have noticed that sometimes when the connection drops during BOINC network activity that BOINC Manager freezes until a connection request times out. I didn't notice that it caused boinc.exe to terminate but a few times it caused yoyo tasks to give "exited with zero status but no 'finished' file. If this happens repeatedly you may need to reset the project." Most inconvenient with Evolution@home tasks that have no checkpointing. :)

I just looked up that error message and, as well as a bad internet connection, a number of other things are mentioned that can cause BOINC to lock up. They include the BOINC Daemon not being able to write the Client State File and file system activity causing too long a queue, particularly anything scanning all the files.

As suggested in that wiki entry, have you had a look through the message logs from around the time it happens for any relevant clues? Naturally you lose the message log in BOINC Manager when BOINC Manager is restarted, but in Win XP a copy is kept in the C:\Documents and Settings\All Users\Application Data\BOINC folder, called stdoutdae.txt or stdoutdae.old.
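
If the logs are long, a few lines of script can pull out every point where the client started up again, which would show how often boinc.exe is really being restarted. This is only a sketch: it assumes the client writes a line containing "Starting BOINC client version" each time it starts, which is what I recall seeing in my own logs, so check the marker text against your file. The function names are hypothetical.

```python
def find_client_starts(lines, marker="Starting BOINC client version"):
    """Return the log lines that contain the client start-up marker."""
    return [line.rstrip("\n") for line in lines if marker in line]


def scan_log(path):
    """Read a log file and return the start-up lines found in it."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return find_client_starts(fh)
```

Run scan_log() on both stdoutdae.txt and stdoutdae.old, then look at the messages immediately before each start-up line for whatever was happening just before the client went down.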
ID: 49439
robertmiles

Joined: 30 Sep 09
Posts: 211
Credit: 36,977,315
RAC: 0
Message 49440 - Posted: 20 Jun 2011, 4:04:33 UTC
Last modified: 20 Jun 2011, 4:13:57 UTC

An idea to consider: See what happens if you download enough workunits to run for a few hours, then tell BOINC Manager to use a Network activity suspended setting for those few hours. That should eliminate some possible effects from internet connections. Likely to interfere with any other internet connection use during those few hours, though.
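
Since this thread is about controlling the client remotely, the same test could probably be started from another computer with the boinccmd tool rather than through BOINC Manager. The sketch below only builds and runs the command line; it assumes boinccmd is on the PATH and that your version accepts --host, --passwd and --set_network_mode with a duration in seconds (check boinccmd --help). The function names and the example host are hypothetical.

```python
import subprocess


def network_mode_command(mode, duration_s=0, host=None, password=None):
    """Build the boinccmd argument list for a network mode change."""
    if mode not in ("always", "auto", "never"):
        raise ValueError(f"unknown network mode: {mode!r}")
    cmd = ["boinccmd"]
    if host:
        cmd += ["--host", host]
    if password:
        cmd += ["--passwd", password]
    cmd += ["--set_network_mode", mode]
    if duration_s > 0:
        # Duration is given in seconds; the mode reverts afterwards.
        cmd.append(str(int(duration_s)))
    return cmd


def suspend_network(hours, host=None, password=None):
    """Suspend network activity for the given number of hours."""
    cmd = network_mode_command("never", int(hours * 3600), host, password)
    return subprocess.run(cmd, check=True)
```

For example, suspend_network(3, host="192.168.1.20", password="...") would ask the client on that machine to stop network activity for three hours, using the password from its gui_rpc_auth.cfg.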
ID: 49440
kashi
Joined: 30 Dec 07
Posts: 311
Credit: 149,490,184
RAC: 0
Message 49441 - Posted: 20 Jun 2011, 5:19:20 UTC

Good point, I often suspend BOINC network activity when my ADSL connection is mucking up. Not possible with MilkyWay though. A cache of 12 tasks only lasts about 20 minutes on an HD 5870. Sometimes I switch to another ATI project, other times I just leave the GPU idle and save some power.

However suspending BOINC network activity while processing another ATI project that allows a reasonable cache size such as Collatz, DNETC, Moo! Wrapper or PrimeGrid would certainly be a good test of whether the internet connection may be the problem.
ID: 49441
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 49448 - Posted: 20 Jun 2011, 11:47:17 UTC

good news guys...

after testing over a dozen possibilities over the last few days, i decided to resort to using a different version of BOINC. so i installed the latest developer version, v6.12.33, and haven't had a single failed connection all night long! i can confirm it b/c when i scrolled back to the beginning of the BOINC message log this morning, it showed 10:26pm, which is exactly when i restarted BOINC after installing the new version. it ran flawlessly like this all night while running 5 CPU tasks and 1 GPU task. i'm going to leave CPU tasks at 5 and increase GPU tasks to 2 at a time before i leave for work this morning, and see if BOINC v6.12.33 remains stable and the manager stays connected to the client all day until i get home.

so i have to assume that when i originally encountered this "failed client connection" problem 6 months ago, switching from BOINC release v6.10.58 to developer v6.12.18 fixed my problem even though i didn't realize it at the time. i also have to assume that the problem came back when i went from BOINC developer v6.12.18 to release v6.10.60, and again didn't realize it at the time. of course the reason i made that particular switch as i mentioned previously was b/c i was starting to get extremely large project back-off times w/ S@H when my client failed to make contact with the project servers, and switching to an older BOINC version fixed that problem. clearly it created new problems for me w/ MW@H though. v6.12.33 appears to be working quite well with MW@H now, so here's to hoping that it works well with S@H...i won't know that for a few days when i jump back to S@H for a short while...

thanks again for everyone's help and suggestions. i'll report back again and confirm that running 2 simultaneous GPU tasks isn't still causing the problem i've been having...
ID: 49448

©2024 Astroinformatics Group