new workunit limit

Author	Message
Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 6951 - Posted: 29 Nov 2008, 21:06:40 UTC It looks like the transitioner really can't keep up with what's going on with milkyway right now, so I would like to reduce the workunit limit (5 would be ideal, 10 passable). That would reduce the size of the database, which would speed things up. Now that the server is assigning WUs at a per-core rate as opposed to a per-computer rate, I think this should work out fine; it will also give us better results for the searches we're running. I'm going to lower the WU limit to 5, and if this is really unworkable I'll raise it. Hopefully this should speed up the transitioner and make more work available.

banditwolf Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0	Message 6953 - Posted: 29 Nov 2008, 21:26:46 UTC 5 doesn't do it unless they are made longer, as in an hour, not 10 min. Doesn't expecting the unexpected make the unexpected the expected? If it makes sense, DON'T do it.

John Clark Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0	Message 6958 - Posted: 29 Nov 2008, 21:37:15 UTC Last modified: 29 Nov 2008, 21:50:19 UTC Travis, I think 5 per core is too low for the new Penryn quads. These crunch WUs in about 3 - 5 minutes. If the server contact script was modified to allow a maximum server recontact time of, say, 10 minutes, then this may work. But I would also recommend a slightly higher WU-per-core number (say 10 max). ATM the server recontact script boots through a few minutes, then quickly escalates to 45 minutes, then 55 minutes, then to 1 hour, 2 hours and 3 hours. My server recontact time is the same as JAMC's 'communication deferred' time.

JAMC Joined: 9 Sep 08 Posts: 96 Credit: 336,443,946 RAC: 0	Message 6959 - Posted: 29 Nov 2008, 21:38:14 UTC Last modified: 29 Nov 2008, 21:40:57 UTC The killer for my quads is the 'communication deferred' time in the BOINC projects tab. Is that set by the project or BOINC? Even with network activity set to always on, when the communication deferred time goes to 2 or 3+ hours I lose all hope of keeping WUs cached and always run out, and that's with a 20 WU/core limit. We have to get longer WUs to make the change to 5 WUs/core work, and I guess that means we all have to run test apps as well...

Thierry Godefroy Joined: 29 Jul 08 Posts: 9 Credit: 2,200,784 RAC: 0	Message 6972 - Posted: 29 Nov 2008, 22:44:13 UTC This is pure nonsense... Even with 20 WUs per core the queue was stalling unless I manually requested more work. The problem is that when BOINC gets the reply "Reached CPU limit" several times in a row, it starts delaying the work request, and in the end the request gets delayed by over 3 hours... And as 20 WUs are crunched in under 110 minutes, you get a queue stall (not to mention it's a hell of a nightmare to get just a few more WUs at the next request). The solution is simple: make it so that the optimized apps will need 60 minutes or so to crunch each WU (multiply the work to do per WU by 12). As it is, I would rather crunch for another project than let the queue stall and leave the computer powered on for nothing at all...
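The stall described above comes down to simple arithmetic: a host goes idle whenever the deferral lasts longer than the cached work. Here is a minimal sketch of that arithmetic, not anything taken from BOINC itself. The function name `idle_minutes` and the figure of 5.5 minutes per WU (110 minutes divided by 20 WUs) are assumptions made for illustration, and the numbers are per core.

```python
def idle_minutes(cached_wus: int, minutes_per_wu: float, deferral_minutes: float) -> float:
    """Minutes a core sits idle if the next scheduler contact is deferred.

    Hypothetical helper: assumes the cache is full when the deferral starts
    and that no work arrives until the deferral ends.
    """
    cache_minutes = cached_wus * minutes_per_wu  # how long the cached work lasts
    return max(0.0, deferral_minutes - cache_minutes)


# 20 WUs at about 5.5 minutes each is 110 minutes of work; a 3-hour deferral
# then leaves the core idle for the remaining 70 minutes.
print(idle_minutes(20, 5.5, 180))

# With WUs 12 times longer (about 66 minutes each), the same cache outlasts
# the deferral and the core never runs dry.
print(idle_minutes(20, 66, 180))
```

Under these assumptions the first call gives 70.0 and the second gives 0.0, which is the gap the post is pointing at.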

Misfit Joined: 27 Aug 07 Posts: 915 Credit: 1,503,319 RAC: 0	Message 6974 - Posted: 29 Nov 2008, 22:46:04 UTC - in response to Message 6951. Last modified: 29 Nov 2008, 22:47:13 UTC
11/29/2008 2:44:13 PM|Milkyway@home|Sending scheduler request: Requested by user. Requesting 3317 seconds of work, reporting 3 completed tasks
11/29/2008 2:44:28 PM|Milkyway@home|Scheduler request succeeded: got 0 new tasks
11/29/2008 2:44:28 PM|Milkyway@home|Message from server: No work sent
11/29/2008 2:44:28 PM|Milkyway@home|Message from server: (reached per-CPU limit of 5 tasks)
Well that brought me to the message board. :/
"I'm going to lower the WU limit to 5 and if this is really unworkable i'll raise it."
It's really unworkable. You should raise it.
me@rescam.org

Bigred Joined: 23 Nov 07 Posts: 33 Credit: 300,042,542 RAC: 0	Message 6975 - Posted: 29 Nov 2008, 22:50:24 UTC This strategy seems to be working for me. My Quads are staying at 20 tasks. As soon as any are done they are reported and replaced.

caspr Joined: 22 Mar 08 Posts: 90 Credit: 501,728 RAC: 0	Message 6977 - Posted: 29 Nov 2008, 23:00:51 UTC - in response to Message 6975. "This strategy seems to be working for me. My Quads are staying at 20 tasks. As soon as any are done they are reported and replaced." Same here, but my pendings are also still growing. A clear conscience is usually the sign of a bad memory

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 6979 - Posted: 29 Nov 2008, 23:05:32 UTC - in response to Message 6977. "Same here but also my pendings are still growing." I'm hoping with the lower limit the transitioner will be able to keep up with the work requests. I'll bump things up to 8 and see how that works out -- I don't want people getting communication deferreds if they're crunching too fast. The assimilator/validator for the new app are a lot faster than the old one, so when we make the switch to running only the new app, this should help a bit as well.

BarryAZ Joined: 1 Sep 08 Posts: 520 Credit: 302,538,504 RAC: 0	Message 6980 - Posted: 29 Nov 2008, 23:10:02 UTC - in response to Message 6951. No problem, you did also say that the work units would be 12 to 20 times longer, right? If you REALLY want to lower the stress on the transitioner, then you must increase the length of the work unit. A 25 minute per core cache is only going to increase the stress on the transitioner, as it will compel everyone running MW to be continuously hitting the server. Seriously, the problem isn't a 5, 10 or 20 WU cache limit, it is the 5 minute WU. Fix that and things would be fine; keep it the way it is, and you end up wasting your time chasing server problems along with users wasting their time chasing WUs.
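To put a rough number on "continuously hitting the server": a core that drains its cache in 25 minutes has to reach the scheduler at least once every 25 minutes to stay busy. The sketch below is only an illustration of that lower bound; the name `min_requests_per_hour` is hypothetical, and it assumes every contact refills the cache completely.

```python
def min_requests_per_hour(wu_limit: int, minutes_per_wu: float) -> float:
    """Lower bound on scheduler contacts per hour needed to keep one core busy.

    Hypothetical helper: assumes each contact refills the cache to wu_limit,
    so one contact is needed per cache duration.
    """
    cache_minutes = wu_limit * minutes_per_wu  # e.g. 5 WUs * 5 min = 25 min
    return 60.0 / cache_minutes


# 5-minute WUs with a limit of 5: at least 2.4 contacts per hour per core.
print(min_requests_per_hour(5, 5))

# 60-minute WUs with the same limit: one contact every five hours is enough.
print(min_requests_per_hour(5, 60))
```

On these assumptions, lengthening the WU by a factor of 12 cuts the minimum request rate by the same factor, whatever the limit is set to.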

BarryAZ Joined: 1 Sep 08 Posts: 520 Credit: 302,538,504 RAC: 0	Message 6982 - Posted: 29 Nov 2008, 23:11:50 UTC I see you just bumped up the cache to 8 WUs from 5. But until you have reasonably timed WUs, keeping things running smoothly is just a dream.

banditwolf Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0	Message 6984 - Posted: 29 Nov 2008, 23:15:44 UTC - in response to Message 6980. "Seriously, the problem isn't a 5, 10 or 20 WU cache limit, it is the 5 minute WU," How many times has this been said? I know I have. It was why the old, old, old WUs were made into hours in the first place, until the optimised apps came along.

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 6988 - Posted: 29 Nov 2008, 23:32:42 UTC - in response to Message 6984. "How many times has this been said? I know I have. It was why the old,old,old wu's were made into hours in the first place, until the optimised apps came along." I know we need longer WUs. In fact, I think this next meeting will be all about how we can get them much longer :P

John Clark Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0	Message 6991 - Posted: 29 Nov 2008, 23:43:05 UTC - in response to Message 6988. Last modified: 29 Nov 2008, 23:44:13 UTC "I know we need longer WUs. In fact, I think this next meeting will be all about how we can get them much longer :P" I presume you can now think about more science to make the WUs longer and more useful to your science. As an aside: so far my PCs are being kept fed, and the work ready to send on the servers has been high (compared to the past), which means that the work is there to satisfy demand. The only problem I see, when the computers here are unsupervised, is the build-up of the time due to deferring communications for xxxx. As long as this does not rise above, say, 20 minutes, I think the current WUs-per-core limit might work OK.

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 6994 - Posted: 29 Nov 2008, 23:46:36 UTC - in response to Message 6991. "I presume you can now think about more science to make the WUs longer, more useful to you science." Yes, but unfortunately that takes time :( I'm going to see what we can do more short-term, until we can take the analysis up to the next level (and hopefully make the WUs really long).

banditwolf Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0	Message 6995 - Posted: 29 Nov 2008, 23:47:11 UTC - in response to Message 6991. Not quite the same, but when I get 'No work' it will defer: 1 min, 1 min, 1 min, 3 hours (or some variation).

BarryAZ Joined: 1 Sep 08 Posts: 520 Credit: 302,538,504 RAC: 0	Message 6996 - Posted: 29 Nov 2008, 23:49:44 UTC - in response to Message 6994. Well, it might take longer for you, since you are both the cook and bottle washer: the more time you spend nursing the server, the less time you have for all the good stuff (smile)

banditwolf Joined: 12 Nov 07 Posts: 2425 Credit: 524,164 RAC: 0	Message 7006 - Posted: 30 Nov 2008, 0:25:22 UTC Seems like only a temp fix:
[As of 30 Nov 2008 0:21:21 UTC]
Results ready to send: 1,505
Results in progress: 47,165
Workunits waiting for validation: 3,524
Workunits waiting for assimilation: 461
Workunits waiting for deletion: 67
Results waiting for deletion: 92
Transitioner backlog (hours): 2
~30 min ago: ready to send: 15k; progress: 35k; valid: >100 (others about same); deletion: 2; backlog: 2 hours

m4rtyn Joined: 16 Jan 08 Posts: 18 Credit: 4,111,257 RAC: 0	Message 7008 - Posted: 30 Nov 2008, 0:55:10 UTC It's not gonna work! Already I'm getting repeated "No Work Sent" messages and my PCs are backing off to between 1 & 3 hours. Without constant attendance they'll spend most of the time with an empty cache. m4rtyn

JAMC Joined: 9 Sep 08 Posts: 96 Credit: 336,443,946 RAC: 0	Message 7009 - Posted: 30 Nov 2008, 0:59:02 UTC - in response to Message 7008. "It's not gonna work! already I'm getting repeated "No Work Sent" messages and my pc's are backing of to between 1 & 3 hours. Without constant attendance they'll spend most of the time with an empty cache." ...same here :(