Uh Oh - Ran Out Of Work Again!

Jon Boy UK - Wales (Joined: 17 Nov 07, Posts: 17, Credit: 663,827, RAC: 0)
Message 1697 - Posted: 13 Feb 2008, 19:23:51 UTC
Help... I'm a lonely Xeon CPU that badly needs some more work units to crunch! Can you help me?
Kind regards, happy crunchin',
John :0)

Travis (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist; Joined: 30 Aug 07, Posts: 2046, Credit: 26,480, RAC: 0)
Message 1698 - Posted: 13 Feb 2008, 19:42:49 UTC - in response to Message 1697.
i just noticed :) started a new search - i'll also be starting a few more when nate sends me some new data to crunch on. a little more on that later :D

Jon Boy UK - Wales
Message 1700 - Posted: 13 Feb 2008, 19:46:21 UTC
Hello Travis, you were on the ball with the new batch, kind sir! No sooner had I mentioned that the server had run out of WUs to forward to my rather hungry machines than they began coughing and spluttering back to life again! It's raining Milky Way WUs - I love it!
Happy crunchin',
John :0)

Jayargh (Joined: 8 Oct 07, Posts: 289, Credit: 3,690,838, RAC: 0)
Message 1704 - Posted: 13 Feb 2008, 23:42:48 UTC
Travis, I have a couple of hosts running Milkyway solo... it's the only project running. Is this wise? Do you foresee running out of work at any given point that would leave my machines empty of work? The server has not gone down or run out of work for a couple of weeks now :)
Thanks,
Jeff

JLDun (Joined: 17 Nov 07, Posts: 77, Credit: 202,706, RAC: 0)
Message 1709 - Posted: 14 Feb 2008, 5:08:04 UTC - in response to Message 1704.
I know I'm not Travis, but I'll try. o-o
Jayargh wrote: "Do you foresee running out of work at any given point that would have my machines empty of work?"
The 'problem' is foreseeing running out of work. This project, or any project, can run out unexpectedly due to a power outage, hardware (server, UPS, RAM) failure, building fire, etc. I personally would recommend having two or three projects ready. (If the odds of one project being out of work at a particular moment are .5 (50%), and the outages are independent, then the odds of two projects being out at the same moment are .5*.5=.25, and of three at the same time .5*.5*.5=.125=12.5%=1 in 8.)
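
The arithmetic in the post above can be checked with a few lines of Python. This is only a sketch: the function name `prob_all_out_of_work` is made up for illustration, and it assumes that outages on different projects are independent of each other, which the 50% figure in the post also takes for granted.

```python
def prob_all_out_of_work(p_single: float, n_projects: int) -> float:
    """Chance that every attached project is out of work at the same moment.

    Assumes each project is out of work with the same probability
    p_single and that the outages are independent (an assumption).
    """
    return p_single ** n_projects


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "project(s):", prob_all_out_of_work(0.5, n))
```

With a more realistic per-project outage chance than 50%, the combined figure drops off even faster as projects are added.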

Jayargh
Message 1712 - Posted: 14 Feb 2008, 12:27:01 UTC - in response to Message 1709. Last modified: 14 Feb 2008, 12:29:50 UTC
Thanks, but I was really asking about the foreseen. I know about the unforeseen, but I check the machines a lot. Two weeks ago, when the server kept crashing, it wasn't even a consideration.

Travis
Message 1722 - Posted: 15 Feb 2008, 15:43:45 UTC - in response to Message 1712.
well normally it's not an issue, i usually start up new searches when the old ones are close to being finished. right now i'm trying to see what effect the number of searches being run concurrently has on the convergence rates of our searches - so i have to wait for the previous batch of searches to stop before starting up a new batch. once these sets of results are finished things should go back to constantly available work. i've been trying to catch when the searches finish as fast as possible so hopefully there won't be much downtime.

Swordfish (Joined: 19 Nov 07, Posts: 4, Credit: 82,330,797, RAC: 0)
Message 1725 - Posted: 18 Feb 2008, 0:14:37 UTC
I'm hoping that there is plenty of work... 5 dual quads just put on the project.

Jayargh
Message 1726 - Posted: 18 Feb 2008, 15:17:06 UTC - in response to Message 1725. Last modified: 18 Feb 2008, 15:56:47 UTC
Because of the limit of 20 work units at a time here and the 20 min RPC calls, I hope you have a backup project running, because some of those cores might be idle at times if you don't.

Travis
Message 1729 - Posted: 18 Feb 2008, 18:01:54 UTC - in response to Message 1726.
all the work units now should be convolution, so i'm hoping the 20 limit should be able to cover it. quad core quad processors might be an issue though :P

Swordfish
Message 1730 - Posted: 19 Feb 2008, 2:25:01 UTC - in response to Message 1729.
Dang, did I break it? lol

Travis
Message 1733 - Posted: 19 Feb 2008, 15:56:12 UTC - in response to Message 1730.
and actually, i've been working on changing the server code to allow us to determine what work units get sent to what computers -- i might be able to set it up so that there's a limit of work units per search, which could probably let us bump the WU limit to 30 or 40... i'll have to take a look into it.

Jayargh
Message 1734 - Posted: 19 Feb 2008, 17:12:16 UTC - in response to Message 1733.
Travis, another way to accomplish this is to change the RPC calls from 20 min down to 10 or 15 min... but I don't know if you want to increase the server load by 50-100% this way.
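
A quick back-of-the-envelope check of the load figures, as a sketch only. It assumes the worst case where every deferred host contacts the scheduler exactly once per deferral interval, so the request rate scales as 1/interval; the function name `load_increase` is hypothetical. Under that assumption, going from 20 min to 15 min adds about 33% more requests and going to 10 min adds 100%.

```python
def load_increase(old_interval_min: float, new_interval_min: float) -> float:
    """Fractional rise in scheduler requests when the deferral interval shrinks.

    Assumes request rate is proportional to 1 / interval, i.e. every
    deferred host retries as soon as it is allowed to (an upper bound).
    """
    return old_interval_min / new_interval_min - 1.0


if __name__ == "__main__":
    for new in (15, 10):
        print(f"20 min -> {new} min: +{load_increase(20, new):.0%} requests")
```

In practice only hosts that have hit the per-host limit are deferred at all, so the real increase in total server load would likely be smaller than these figures.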

Travis
Message 1758 - Posted: 25 Feb 2008, 19:49:59 UTC - in response to Message 1734.
actually i'll do that -- the server seems to be fine with the current load and could handle a bit more.

Travis
Message 1772 - Posted: 27 Feb 2008, 22:42:24 UTC - in response to Message 1734.
I actually emailed the BOINC projects list about this -- Dave Anderson said that the client should automatically request new work when the number of work units is low... there shouldn't be a 20 min RPC call. He said that if you guys had any transcripts of this, to give them to him. So if you see this problem let me know and post a transcript and i'll forward it on.

Jayargh
Message 1773 - Posted: 27 Feb 2008, 23:47:51 UTC - in response to Message 1772. Last modified: 28 Feb 2008, 0:08:19 UTC
2/24/2008 4:46:25 PM|Milkyway@home|Starting gs_260_1203802157_143213_0
2/24/2008 4:46:25 PM|Milkyway@home|Starting task gs_260_1203802157_143213_0 using astronomy version 113
2/24/2008 4:52:12 PM|Milkyway@home|Sending scheduler request: To fetch work
2/24/2008 4:52:12 PM|Milkyway@home|Requesting 54653 seconds of new work
2/24/2008 4:52:22 PM|Milkyway@home|Scheduler RPC succeeded [server version 511]
2/24/2008 4:52:22 PM|Milkyway@home|Message from server: No work sent
2/24/2008 4:52:22 PM|Milkyway@home|Message from server: (reached per-host limit of 20 tasks)
2/24/2008 4:52:22 PM|Milkyway@home|Deferring communication for 20 min 0 sec
2/24/2008 4:52:22 PM|Milkyway@home|Reason: requested by project
This is the message we are receiving now. So if you run out of work in less than 20 minutes, tough bananas. It's the deferring of communication for 20 min that needs to change. Once you reach max work there has to be a pause before the next RPC call, otherwise we would be contacting the server every 7 seconds! Hence the delay. What does Dr. Anderson not understand about this?

Jon Boy UK - Wales
Message 1796 - Posted: 1 Mar 2008, 19:00:19 UTC. Last modified: 1 Mar 2008, 19:01:27 UTC
Lol... you've got to laugh, haven't you? :) Anyone know when new WUs will be available?

Travis
Message 1801 - Posted: 2 Mar 2008, 3:16:02 UTC - in response to Message 1796.
i forwarded the message on to dr. anderson so hopefully he'll have a response for me soon :)

Travis
Message 1808 - Posted: 3 Mar 2008, 1:34:55 UTC - in response to Message 1801.
FYI, here's the response from Dr. Anderson:
This is a long-standing design flaw:
- A host is sent max_wus_in_progress jobs
- It finishes one of them and starts uploading it
- It decides it needs more work and contacts the scheduler (before the upload has finished)
- The scheduler sees that it already has max_wus_in_progress jobs, refuses to give it more, and tells it to back off for 20 min
- a few seconds later the upload finishes
The right thing is to not count the finished/uploading jobs (or to not count a limited number of them). I'll look at this. If anyone has other ideas let me know.
i'll try and take a look into the scheduler code to see if i can figure out a fix for this.
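
The fix Dr. Anderson describes can be illustrated with a small Python model. This is a hypothetical sketch and not the actual BOINC scheduler code: `JobState`, `Job`, `count_naive`, `count_excluding_uploads` and `can_send_work` are names invented here, and the limit of 20 is the per-host figure quoted earlier in the thread.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional


class JobState(Enum):
    QUEUED = auto()
    RUNNING = auto()
    UPLOADING = auto()  # finished computing, result still being uploaded


@dataclass
class Job:
    name: str
    state: JobState


MAX_WUS_IN_PROGRESS = 20  # per-host limit quoted in the thread


def count_naive(jobs: List[Job]) -> int:
    """Flawed behaviour: every job on the host counts against the limit."""
    return len(jobs)


def count_excluding_uploads(jobs: List[Job],
                            max_uploads_ignored: Optional[int] = None) -> int:
    """Proposed behaviour: skip finished/uploading jobs.

    If max_uploads_ignored is given, at most that many uploading jobs
    are skipped (the "limited number" variant).
    """
    uploading = sum(1 for j in jobs if j.state is JobState.UPLOADING)
    if max_uploads_ignored is not None:
        uploading = min(uploading, max_uploads_ignored)
    return len(jobs) - uploading


def can_send_work(jobs: List[Job], counter: Callable[[List[Job]], int]) -> bool:
    """True if the host is below the per-host limit under the given count."""
    return counter(jobs) < MAX_WUS_IN_PROGRESS


# A host holding 19 queued jobs plus one that is still uploading.
jobs = [Job(f"wu_{i}", JobState.QUEUED) for i in range(19)]
jobs.append(Job("wu_19", JobState.UPLOADING))
```

With the naive count this host is refused and deferred; with uploads excluded it counts as holding 19 jobs and can be sent one more.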

Webmaster Yoda (Joined: 21 Dec 07, Posts: 69, Credit: 7,048,412, RAC: 0)
Message 1809 - Posted: 3 Mar 2008, 6:53:07 UTC - in response to Message 1808. Last modified: 3 Mar 2008, 6:54:44 UTC
I'm no expert on the server-side options of BOINC, but a search of the BOINC site shows the following seemingly relevant config options (at http://boinc.berkeley.edu/trac/wiki/ProjectOptions):
<max_wus_in_progress> N </max_wus_in_progress>
<min_sendwork_interval> N </min_sendwork_interval>
Here's what it says about that last option:
min_sendwork_interval: Minimum number of seconds to wait after sending results to a given host, before new results are sent to the same host. Helps prevent hosts with download or application problems from trashing lots of results by returning lots of error results. But don't set it to be so long that a host goes idle after completing its work, before getting new work.
What we are seeing on fast hosts, particularly with short (2 credit) work units, is exactly as described: they run out of work before they are allowed to connect again. This may become even more prevalent if the new applications are faster. Perhaps if that were set to 5 or 10 minutes, things would run more smoothly.
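
To get a feel for when a fast host goes idle, here is a rough Python estimate. It is a simplified model built on assumptions, not measured data: the host is taken to hold a full queue at the moment it is deferred, every task takes the same time, and the function name `idle_minutes_per_cycle` and the example numbers (8 cores, 3-minute tasks) are invented for illustration.

```python
import math


def idle_minutes_per_cycle(cores: int, task_limit: int,
                           task_minutes: float,
                           backoff_minutes: float) -> float:
    """Minutes the whole host sits idle before it may ask for work again.

    Simplified model: the host starts with task_limit tasks, runs one
    task per core at a time, and cannot contact the scheduler until
    backoff_minutes have passed.
    """
    batches = math.ceil(task_limit / cores)   # rounds needed to clear the queue
    drain_minutes = batches * task_minutes    # time until the queue is empty
    return max(0.0, backoff_minutes - drain_minutes)


if __name__ == "__main__":
    for backoff in (20, 10, 5):
        idle = idle_minutes_per_cycle(cores=8, task_limit=20,
                                      task_minutes=3, backoff_minutes=backoff)
        print(f"backoff {backoff:>2} min -> idle {idle} min")
```

Under these assumptions an 8-core host with 3-minute tasks empties its queue in 9 minutes, so a 20-minute deferral leaves it idle for 11 minutes, while a 5-minute deferral leaves no idle time.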