
Limit of 48 workunits at once


Aaron Finney

Send message
Joined: 30 Apr 10
Posts: 10
Credit: 27,768,203
RAC: 0
Message 39661 - Posted: 13 May 2010, 0:49:54 UTC
Last modified: 13 May 2010, 1:00:40 UTC

Currently there is a limit of 48 workunits that my machine can have on it at any given time (a server restriction). With that many WUs, my machine can crunch happily for approximately 50 minutes.

Assuming I never have any network outages that last longer than that, I'm good to go. That being said, sometimes the network here gets shut off overnight. To fill that gap I would need (or would have been able to crunch, in addition) approximately 400 more workunits per night - and let's not even talk about when the net gets shut down for weekend maintenance!

Considering that I only have one card, the following is worth thinking about:

Two cards would double this output and, by the same token, double the amount of work lost while the machine sits idle. I plan on having 4 of these cards... I don't even want to think about the amount of work left uncrunched at that rate!
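To put rough numbers on that (a back-of-the-envelope sketch only; the ~8-hour outage and the per-task rate are just my assumptions):

```python
# Rough arithmetic only: 48 cached tasks, ~50 minutes to drain them on one
# fast GPU, and an assumed ~8-hour overnight network outage.
cache_tasks, drain_minutes, outage_hours = 48, 50, 8
tasks_per_hour_per_card = cache_tasks / (drain_minutes / 60)    # ~58 tasks/hour

for cards in (1, 2, 4):
    idle_hours = outage_hours - (drain_minutes / 60) / cards    # time spent with no work
    missed = idle_hours * tasks_per_hour_per_card * cards
    print(f"{cards} card(s): cache gone in ~{drain_minutes / cards:.0f} min, "
          f"~{missed:.0f} tasks left uncrunched overnight")
```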

-------

Postulate: can we get this limit increased at least somewhat, or is there any way to put more work into each WU?

Aaron Finney

Edited to add: The more I think about this, the more convinced I am that this problem is only going to get worse as time goes on and people get new machines and cards. I highly (and humbly!) suggest future-proofing things *now*, probably with a combination of the two suggestions above. Packing more into a WU isn't a bad thing if it can be done - I'd increase the amount of 'work' per WU five-fold if it is as simple as I make it sound. Increasing the number of WUs a machine is able to have on hand at any given time, while effective, is not a cure, merely a way to treat the symptom, and it may cause other areas of the project (the WU system, server, credit system, scoring, etc.) to go wonky.
Profile Paul D. Buck

Send message
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39665 - Posted: 13 May 2010, 2:54:52 UTC

The real issue here is that, yes, the fast cards do more work faster, but we also have a lot of people running these tasks on the CPU side ... so assume the worst case: all of your and my tasks run on a fast GPU (as fast as 90 seconds each) and we get "married" to wingmen that are all running on the CPU side ... it is easy to see that we pump work out so fast that we are a very heavy burden on the server side, both in how fast we pull tasks and how fast we send them back ... and yet they may not be paired and validated anywhere near that fast ... so the server side builds up huge lists in the database ...

By limiting the tasks we each have on hand, the project keeps a bit of a lid on how many tasks each of us can have pending at any one time ...

The good news is that work is being done to ease this somewhat by eliminating the generated data files ... if Travis can eliminate those, then he can eliminate the burden of the file deleter on the server ... with that done, well, no promises, but it may be possible THEN to petition to see if the limit can be raised ...

Right now it is limited to 6 tasks per CPU core ... or, for an 8-core machine, 48 tasks ... which, as you noted, takes just over an hour to run through ... faster still with dual GPUs of this class ...
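For reference, a minimal sketch of that cap arithmetic (the 6-per-core figure is from this thread; the function and the 90-second task time are only illustrative):

```python
TASKS_PER_CPU_CORE = 6          # the current server-side limit discussed above

def max_cached_tasks(cpu_cores):
    """Hypothetical helper: how many tasks a host may hold at once."""
    return TASKS_PER_CPU_CORE * cpu_cores

cores, seconds_per_gpu_task = 8, 90
cap = max_cached_tasks(cores)                        # 48 tasks on an 8-core box
print(f"{cap} tasks ~= {cap * seconds_per_gpu_task / 60:.0f} minutes on one fast GPU")
```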
scottishwebcamslive.com

Send message
Joined: 10 Oct 07
Posts: 79
Credit: 69,337,972
RAC: 0
Message 39668 - Posted: 13 May 2010, 8:16:29 UTC
Last modified: 13 May 2010, 8:36:08 UTC

Hi,

I like Aaron's point. I am very lucky to use a multi-threaded chip, so I get 48 WUs at any one time; unfortunately I also have 3 moderately fast cards in one box that can eat through those in less than 48 minutes.

If I split those cards across 3 different machines I'd have 144 workunits to crunch through, taking nearly two and a half hours.

So this means I am being penalized for having all my cards in one machine instead of three.

Maybe there's a way of linking average RAC per host to different limits, so that more powerful hosts with faster cards (or groups of cards) get larger bundles of WUs to download,

i.e.
RAC up to 100k: 48 WUs (6 × 8 cores)
RAC between 100k and 200k: 64 WUs (8 × 8 cores)
RAC above 200k: 80 WUs (10 × 8 cores)

The numbers above could be anything, but at those values my machine would have a little bank of 1 hour and 20 minutes' worth of work instead of less than 48 minutes (a rough sketch of what I mean is below).
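A rough sketch of what that might look like on the server side (the thresholds and quotas are just the example numbers above; the function is purely hypothetical, not anything the project actually runs):

```python
def tasks_allowed(rac, cores=8):
    # Hypothetical RAC-tiered quota, using the example tiers from this post.
    if rac > 200_000:
        per_core = 10          # 80 tasks on an 8-core host
    elif rac > 100_000:
        per_core = 8           # 64 tasks
    else:
        per_core = 6           # today's 48-task cap
    return per_core * cores

for rac in (50_000, 150_000, 250_000):
    print(rac, "->", tasks_allowed(rac), "WUs")
```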
Just a thought :)

Best regards
Ian
nmeofdst8

Send message
Joined: 7 Mar 10
Posts: 11
Credit: 1,284,443
RAC: 0
Message 39682 - Posted: 13 May 2010, 19:46:36 UTC

Ahhh... is that why I always max out at 72 WUs? No matter what I set the cache amount to, it's always 72 - makes sense now.

My card is doing 76 seconds per WU at the moment (the other card is away for RMA); when I run out I just run Collatz and DNETC. I would really like to find a CPU project that coexists with DNETC GPU and MW GPU tasks... it seems like when I run CPU tasks alongside those two projects, things go bonkers.

I'm all for larger caches when it becomes feasible to do so.
Profile Paul D. Buck

Send message
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39693 - Posted: 14 May 2010, 1:37:21 UTC

Well, Travis is very much active, and very much regrets (if I may put words in his mouth based on his actions over the last umpteen months) the short queue sizes ... there are actually other issues involved as well, such as the fact that some of the tasks are generated based on what we return (a la GPU Grid), so there is a "building" process where new work depends on old work returned ...

Think of it like a bricklayer ... we have to finish story 1 before we can work on story 2 ... saying we could speed that up by adding more crunchers is like putting nine women on the job to get a baby in a month ... much more "efficient" ...
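A toy illustration of that "building" process (the names and the perturbation rule are made up purely to show the dependency; this is not the actual project code):

```python
def next_batch(previous_results, batch_size=48):
    # Batch N+1 is derived from the best result of batch N, so it cannot be
    # generated until batch N has been returned and validated.
    best = min(previous_results, key=lambda r: r["fitness"])
    return [{"params": [p + 0.01 * i for p in best["params"]]}
            for i in range(batch_size)]

seed_results = [{"params": [0.5, 1.2, 3.0], "fitness": 42.0}]
print(len(next_batch(seed_results)), "new tasks, all built from the previous answers")
```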

So, as I said, this is the reason for the upcoming changes in the applications to eliminate data files ... perhaps, and I am speculating here, if the server load goes down enough we may see an increase in task issue rates ... the problem is that some people like to have 10,000 tasks on hand ... that can't happen with the server he has ... so, we all make do ...

I know, he should be independently wealthy and go out and buy a server farm like SaH so we can have as much work as we want ... sadly we don't live in that perfect world ... and even if he could buy a bigger farm to support us, it is still possible that the sequential nature of the beast would still get in the way ...

As for playing nice, I have been arguing for months now that the Strict FIFO rule causes more problems than it is worth for GPU-oriented systems ... particularly with MW's issues in the mix ... so far to no avail ...
Aaron Finney

Send message
Joined: 30 Apr 10
Posts: 10
Credit: 27,768,203
RAC: 0
Message 39695 - Posted: 14 May 2010, 3:19:39 UTC - in response to Message 39693.  

Well, I don't have 4 cards yet. :)

But when I do... simple math tells me I will only have enough work to keep me busy for 12-14 minutes before I run out. It's surely not the 'intent' of the project owner to 'penalize' (as another user put it) such powerful machines simply because they do more - I believe he has had to accept this as an unfortunate side effect of keeping users from hoarding WUs they simply don't need.

Perhaps, if there are settings that could be made or created elsewhere in the BOINC software, I could spread some talk around about the subject; I'm very eager to get back into the testing community over on BoincAlpha.

One method suggested was to use RAC to determine the size of the WU backlog, but I think there is already a way to change this, simply by making it an application flag. Isn't there a setting for this on a per-application / per-platform basis? Upping the WU allotment for the GPU app/platform should be as easy as changing a setting, and it would not allow those with the CPU app to have more WUs.

Am I correct in this, or did that change never get implemented?
Profile Paul D. Buck

Send message
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39698 - Posted: 14 May 2010, 6:32:52 UTC - in response to Message 39695.  

Aaron,

There is no intent to penalize ... it is a consequence of the "flow" of the project ... (this is an oversimplified summary of how the project works, set forth for illustrative purposes only) think of it this way ... we calculate x and return it ... now, you and I and the rest of the project participants are calculating x sub 1 through x sub 1,000,000 ... to generate x' sub 1 the project has to get a valid answer for x sub 1 FIRST ... and so on through 1,000,000 ... OK, easy so far ... right?

Well, to get x'' sub z we need both x and x' of the same index ... for x''' we need all of the prior plus x'' sub z ... also easy ...

Now the fly in the ointment ... you and I with our super-fast cards are plowing through the tasks at light speed ... and yet ... in my simplified example I ignored the fact that there is about two orders of magnitude difference in processing speed among the machines ... were I to run the tasks on the CPU side I would be taking hours to do each task, and not the 90 seconds most of my GPUs take ...

So, in the stream of tasks coming and going, you and I are generations out from x while the CPU guys are still back there crunching their hearts out ... so, instead of letting us GPU snobs hog the project and excluding those who are not fortunate enough to be able to buy high-end GPUs, they made some compromises so that everyone who wants to can be part of the work ... personally, I wish it could be otherwise and we could all store up more tasks, but that is just not in the cards ...
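To put rough numbers on that speed gap, here is a toy simulation (the task times and counts are made-up figures; the real scheduling is far more involved):

```python
import random

GPU_SECONDS, CPU_SECONDS = 90, 6 * 3600     # ~90 s on a fast GPU vs. hours on a CPU
TASKS_PER_GENERATION = 100

def generation_time(cpu_fraction):
    """One generation can only close when its slowest result comes back."""
    times = [CPU_SECONDS if random.random() < cpu_fraction else GPU_SECONDS
             for _ in range(TASKS_PER_GENERATION)]
    return max(times)

random.seed(1)
for frac in (0.0, 0.05, 0.5):
    print(f"{frac:.0%} CPU hosts -> generation closes after ~{generation_time(frac) / 60:.0f} min")
```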

All projects are delicate balancing acts and it is easy for us with our hot cards to forget that we are fewer than those that just have a small home computer ... were this not so I would not be in the world position I am with only what I think of as 5 inadequate computers ... yet, most who look at what I have might contemplate dark alleys and baseball bats ... :)

At any rate, the limits are on the server side, and were he able to, Travis would have already increased the number we can have on hand via some other measurement ... but the issue is how many tasks he can have "live" in the wild at any given moment ... and to keep a lid on that we have the current system ...

As to the last point, the application / platform segregation is both a server side and client side change, needed for several projects as it turns out ... and as far as I know not on the horizon anytime soon ... we need this change for proper DCF calculation more than anything else ... like PG with its multitude of applications with varying efficiencies ... and GPU / CPU projects where there can be wide variations in the "real" DCF for the GPU and CPU sides ... SaH has this issue as well ...
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 39730 - Posted: 15 May 2010, 7:53:30 UTC - in response to Message 39698.  

Paul pretty much hit the nail on the head.

We really don't want to increase the caches any more than they are right now. Part of the issue is that new work is generated based on the results of the old work. So the larger your cache is, the older the work you're crunching (i.e. it's less up to date, so our searches progress more slowly).

Another issue is that having large caches makes our poor server cry for mommy. If we doubled the cache size, for example, the database would have to deal with twice as many workunits out there at any given time - and things are slow enough as it is :P
Profile Paul D. Buck

Send message
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39739 - Posted: 15 May 2010, 16:37:10 UTC - in response to Message 39730.  

Paul pretty much hit the nail on the head.

We really don't want to increase the caches any more than they are right now. Part of the issue is that new work is generated based on the results of the old work. So the larger your cache is, the older the work you're crunching (i.e. it's less up to date, so our searches progress more slowly).

Another issue is that having large caches makes our poor server cry for mommy. If we doubled the cache size, for example, the database would have to deal with twice as many workunits out there at any given time - and things are slow enough as it is :P

Glad I was close enough ... :)

I've been trying to get ol' Doc Anderson to improve RRI (report results immediately) so that we can be more selective ... I want RRI on GPU Grid because you get paid better for faster results, but I don't necessarily want it for MW because it hammers the server ... but so far to no avail ... it is all or nothing ... and sadly, that means that with my configuration, project load and other considerations I have to beat on MW even though I don't want to ...
Profile The Gas Giant

Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 39742 - Posted: 15 May 2010, 17:41:28 UTC - in response to Message 39739.  

I've been trying to get ol' Doc Anderson to improve RRI (report results immediately) so that we can be more selective ... I want RRI on GPU Grid because you get paid better for faster results, but I don't necessarily want it for MW because it hammers the server ... but so far to no avail ... it is all or nothing ... and sadly, that means that with my configuration, project load and other considerations I have to beat on MW even though I don't want to ...

I think all GPU users beat the hell out of the MW server, and all because of the 6 WU/core limitation. I have my cache set to connect every 0.01 days and keep 0.1 days buffered. This, combined with a pretty high resource share, means that every time the MW backoff completes, BOINC asks for more MW work - hammer time. This happens irrespective of RRI.
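Rough arithmetic on why that turns into hammer time (the buffer settings are the ones quoted above; the 90-second task time is just an assumption):

```python
connect_every_days, extra_buffer_days = 0.01, 0.1
cache_tasks, seconds_per_task = 48, 90

requested_min = (connect_every_days + extra_buffer_days) * 24 * 60   # ~158 min of work wanted
held_min = cache_tasks * seconds_per_task / 60                       # ~72 min actually allowed
print(f"client wants ~{requested_min:.0f} min of work, the cap allows ~{held_min:.0f} min, "
      f"so every scheduler contact asks for more")
```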
Profile Paul D. Buck

Send message
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39749 - Posted: 15 May 2010, 22:18:49 UTC - in response to Message 39742.  

I've been trying to get ol' Doc Anderson to improve RRI (report results immediately) so that we can be more selective ... I want RRI on GPU Grid because you get paid better for faster results, but I don't necessarily want it for MW because it hammers the server ... but so far to no avail ... it is all or nothing ... and sadly, that means that with my configuration, project load and other considerations I have to beat on MW even though I don't want to ...

I think all GPU users beat the hell out of the MW server, and all because of the 6 WU/core limitation. I have my cache set to connect every 0.01 days and keep 0.1 days buffered. This, combined with a pretty high resource share, means that every time the MW backoff completes, BOINC asks for more MW work - hammer time. This happens irrespective of RRI.

Well, I have also been on him about the idiotic Strict FIFO rule ... which compounds things ...
Aaron Finney

Send message
Joined: 30 Apr 10
Posts: 10
Credit: 27,768,203
RAC: 0
Message 39755 - Posted: 16 May 2010, 5:42:14 UTC - in response to Message 39730.  
Last modified: 16 May 2010, 6:32:30 UTC

Paul pretty much hit the nail on the head.

We really don't want to increase the caches any more than they are right now. Part of the issue is that new work is generated based on the results of the old work. So the larger your cache is, the older the work you're crunching (i.e. it's less up to date, so our searches progress more slowly).


Let's call this Issue 1a: 'Diminishing returns with age of WU'.

Assuming I crunch 48 WUs in about 50 minutes with the GPU client... I'm 'Person A'.

This means that someone using the CPU application could have a cache that would last him well over 180 hours under the current scenario, even with a moderately priced processor... Let's call this individual 'Person B'.

*ASSUMING* we could change the cache size for *SOLELY* the GPU client/platform - say, increasing it so that Person A had a cache that lasted 12 hours rather than less than 1 hour - well, I dunno, to me it seems like a no-brainer: Person B is still crunching much more outdated work than Person A. Or am I not understanding this particular section of the problem? I think I am correct, yes?

(There are also numerous other issues that arise when lots of 'Person A' types get paired with 'Person B' types on WUs, which I might not be EXHAUSTIVELY covering, but... since my machine still steadily draws work from the server, it seems that work is being created for me at the rate I can crunch it regardless; BOINC just has to beat on the server every minute for a new WU. ASSUMING there are no periods where Person A cannot retrieve work from the server, these issues should be a non-consideration.)
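Back-of-the-envelope numbers for Person A versus Person B (the CPU task time below is just a guess on my part):

```python
cap = 48
seconds_per_gpu_task = 50 * 60 / 48      # ~62 s, from "48 WUs in about 50 minutes"
seconds_per_cpu_task = 4 * 3600          # assume ~4 hours per task on a mid-range CPU

for label, task_seconds in (("Person A (GPU)", seconds_per_gpu_task),
                            ("Person B (CPU)", seconds_per_cpu_task)):
    print(f"{label}: a full cache of {cap} tasks lasts ~{cap * task_seconds / 3600:.1f} h")

# Even a 12-hour GPU cache would return fresher results than Person B's cache,
# which can hold work that is ~190 hours old by the time it comes back.
```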

Now, another important thing to consider is the age of the WUs for Person B.

Let's call this Issue 1b: 'CPU application work value deterioration'.

If I am correct that someone with the CPU application can grab even 150 hours of work, then we have to ask the following question: assuming that 12 hours of work on the GPU application would already be pushing that client into the realm where work becomes less useful, then waiting 150+ hours for work to be completed on the CPU application must be terribly painful in terms of whether it is still 'relevant' work at that point.

In this case: have steps been implemented to limit the number of WUs that the CPU platform/application crunchers (Person B) can grab? Shouldn't they only have 1 WU per core, using the GPU ratio?
___________________________


Let's call this Issue 2a: 'Lack of server-side settings'.

Pretty straightforward: am I understanding correctly that BOINC does not allow you to increase cache sizes on a per-application basis? Would using RAC help? I think that if BOINC is going to be used on multiple platforms, this is a *SERIOUS* consideration that needs to be discussed with David and Rom: how to tackle the issues a project such as this one has, where WILDLY different processing times among users can have a significant effect on the science being done. Regardless of whether your server can handle the load now, let's look down the road - what if your project had 6x as many users? We need to develop the strategies NOW, and add the appropriate backend and frontend changes, so that they can be conceived, produced, tested, and put into production before you get there. ;)


_______


Another issue is that having large caches makes our poor server cry for mommy. If we doubled the cache size, for example, the database would have to deal with twice as many workunits out there at any given time - and things are slow enough as it is :P


Let's call this Issue 3a: 'Server overload'.

Is this a RAM issue, a file system issue, or an I/O issue? Some people around here have spare parts... deep pockets... Would a second server (i.e. a 'feeder') help? Describe the problem - maybe someone can improve things :)

The reason I ask is that as this project grows, your database load is just going to keep increasing... likely exponentially, as people upgrade their equipment and look to your project (as I did). If this is a problem that can be solved by something as simple as adding another hard drive, or some better / larger / faster >insert random part here<, then it's an easy fix that will future-proof the project and improve things all around. If you need another server... well, that's also something to consider - or at least this discussion may give you the information you need when you get to that point.

Maybe you already have plans for upgrades? If so... well, never mind, lol :)

P.S. - Since I've been gone a while and some people may not know or remember me, I feel I'd better give a little disclaimer: this is not some sort of selfish need to have more WUs or credit than the average Joe. XD I've devoted large sections of my life to testing products, games, software, etc., and I just want to improve where improvements can be made. If anything, I spend my days contemplating these things simply for the reward of improving the science. I'm sure there are many in every field who understand such needs ;).
Profile The Gas Giant

Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 39756 - Posted: 16 May 2010, 5:52:26 UTC - in response to Message 39749.  

Well, I have also been on him about the idiotic Strict FIFO rule ... which compounds things ...

Too true! That one really doesn't make a lot of sense.
Profile Paul D. Buck

Send message
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39758 - Posted: 16 May 2010, 6:42:28 UTC - in response to Message 39756.  

Well, I have also been on him about the idiotic Strict FIFO rule ... which compounds things ...

Too true! That one really doesn't make a lot of sense.

It made sense at the time. There was an instability issue and this was an attempt to solve it. Sadly it didn't ... we still had instabilities. There was a later fix that solved them - or, to put it another way, the real issue was located and changed - but the rule lingers ... since UCB never explains anything, I have no idea why they cling to it ...
Haris Dublas

Send message
Joined: 25 Feb 10
Posts: 49
Credit: 10,137,837
RAC: 0
Message 39803 - Posted: 18 May 2010, 16:35:04 UTC

How about this one? http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=171705

2381 tasks in progress and 1 pending. I'm sure all of those will time out. How did he accumulate so much work? Did he edit his cc_config and set <ncpus> to 50?
