Message boards : Number crunching : Efficiency improvement by larger GPU tasks?
Joined: 5 Jul 11 | Posts: 990 | Credit: 376,143,149 | RAC: 0
At the moment, this project gives out tiny GPU tasks that are done in under a minute. By contrast, tasks on projects like Einstein run for about an hour. Perhaps this would involve a big rewrite, but wouldn't it drastically reduce the server load if each task contained a big batch of roughly an hour's work? I read somewhere from Tom that currently each task is a bundle of 5, and each of those 5 has a parameter passed on the command line. Could this be changed so that for each task we download a text file of 1,000 parameters and work through all of them before reporting back?
Joined: 12 Nov 21 | Posts: 236 | Credit: 575,038,236 | RAC: 0
> At the moment, this project gives out tiny tasks for GPUs that are done in under a minute. By contrast, projects like Einstein are under an hour.

I would think it would cut down on the number of client/server contacts. That would be a help. So just what is the client/server contact-per-hour number? Anybody know?
Joined: 5 Jul 11 | Posts: 990 | Credit: 376,143,149 | RAC: 0
The problem I'm seeing with my machines is that, because the server won't give out tasks in the same contact in which it receives them back, my computers are pestering the server every couple of minutes asking for work when the buffer runs low, but never get any because there's always one of those tiny tasks just finished. So in my case it can be as often as every 1.5 minutes per computer (the minimum time the server allows between requests).

> I would think it would cut down on the number of client/server contacts. That would be a help. So just what is the client/server contact per hour number? Anybody know?

But my main thought was that if each task were 100 times larger, there would be 100 times fewer tasks for the server to keep track of in the database.
Joined: 12 Nov 21 | Posts: 236 | Credit: 575,038,236 | RAC: 0
Yeah, that's one way to look at it. If I could keep my GPU Separation tasks running 24/7 with no waits for downloads, I'd be a happy camper. I'm OK with the small size if I don't run out of work. I kinda get a kick out of seeing a GPU task finish every 30 seconds, while the same task on a CPU takes 50 minutes. Mind boggling. But yes, MW is leaving a lot of work on the table that could be completed if there were no gaps in task delivery to the clients. Make the tasks larger, or keep the client buffer full. Or, as the Pointy-Haired Boss in Dilbert says, "Let's do both!"

> The problem I'm seeing with my machines is because the server won't give out tasks while getting them back, my computers are pestering the server every couple of minutes asking for work when the buffer is running low, but never get work because there's always one of those tiny tasks finished. So in my case it can be as often as every 1.5 minutes per computer (the minimum time on the server between asks).
Joined: 5 Jul 11 | Posts: 990 | Credit: 376,143,149 | RAC: 0
> Yeah, that's one way to look at it. If I could keep my GPU separation running 24/7 with no waits for downloads, I'd be a happy camper. I'm OK with the small size if I don't run out of work. I kinda get a kick out of seeing a GPU task finish every 30 seconds, while the same task in CPU takes 50 minutes. Mind boggling. But yes, MW is leaving a lot of work on the table that could be completed, if there were no gaps in task delivery to the clients. Make the tasks larger, or keep the client buffer full. Or, as the Pointy Haired Boss in Dilbert says, "Let's do both!"

I like seeing them go fast too and would miss that. But the server must be doing a lot of work matching up so many millions of tasks. I remember back before they put them in bundles of 5, when you could do one in 15 seconds!
Joined: 8 May 09 | Posts: 3339 | Credit: 524,010,781 | RAC: 0
> Yeah, that's one way to look at it. If I could keep my GPU separation running 24/7 with no waits for downloads, I'd be a happy camper. I'm OK with the small size if I don't run out of work. I kinda get a kick out of seeing a GPU task finish every 30 seconds, while the same task in CPU takes 50 minutes. Mind boggling. But yes, MW is leaving a lot of work on the table that could be completed, if there were no gaps in task delivery to the clients. Make the tasks larger, or keep the client buffer full. Or, as the Pointy Haired Boss in Dilbert says, "Let's do both!"

The only problem I can see with this is whether they have another server to analyze the data we send back, or whether they use the same one. If it's the same one, then longer tasks would take longer to analyze, and the server would be 'busy' for more of the time. That again could be a money problem, but getting a server just to analyze the data should be a pretty easy thing, as it doesn't need to be the best of the best. The IT folks should be able to help them figure out what specs they need so they aren't buying another one next year to replace this year's purchase.
Joined: 24 Jan 11 | Posts: 715 | Credit: 555,441,388 | RAC: 38,699
Is this what you were looking for, Mikey? https://grafana.kiska.pw/d/boinc/boinc?orgId=1&from=now-32d&to=now&var-project=milkyway@home
Joined: 5 Jul 11 | Posts: 990 | Credit: 376,143,149 | RAC: 0
> The only problem I can see with this is 'do they have another Server to analyze the data we are sending back or do they use the same one' because if they are using the same one then longer tasks would take longer to analyze and therefore more time the server 'is busy'.

The tasks at the moment are a bundle of 5; I was thinking a bundle of 100 would be better. Overall the same number of individual jobs within the tasks would be processed, so it shouldn't mean more work for the server. But with fewer BOINC tasks in flight, since they're larger bundles, the database could perhaps be smaller for keeping track of which of us has which task and which results still need to be compared against a wingman's.
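The database-size argument above is simple arithmetic. As a toy sketch (the total job count below is invented for illustration; the bundle sizes of 5 and 100 are the ones discussed in the thread):

```python
# Toy arithmetic only: the total job count is invented; the bundle sizes
# (5 today, 100 proposed) are the ones from this thread.
jobs_total = 1_000_000

tasks_now = jobs_total // 5      # BOINC tasks the database tracks today
tasks_big = jobs_total // 100    # tasks with 100-job bundles instead

# Same science done either way, but 20x fewer rows for the scheduler,
# feeder, validator, and transitioner to walk through.
print(tasks_now, tasks_big, tasks_now // tasks_big)  # 200000 10000 20
```

The saving applies to every per-task server operation (scheduling, validation against a wingman, purging), which is why bundle size multiplies through the whole pipeline rather than just shrinking one table.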
Joined: 8 May 09 | Posts: 3339 | Credit: 524,010,781 | RAC: 0
> Is this what you were looking for Mikey?

YES! Thank you very much!!
Joined: 3 Mar 13 | Posts: 84 | Credit: 779,527,712 | RAC: 0
Was there something like "more than 5 jobs in a workunit causes the command-line parameters to overflow / run out of space", or was that something else?
Joined: 5 Jul 11 | Posts: 990 | Credit: 376,143,149 | RAC: 0
> was there something like more than 5 jobs in a workunit causes the command line parameters to overflow / run out of space

You are correct, but why do the parameters have to be passed on the command line? Why can't they be in a text file that contains enough for 100?
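The file-based approach suggested above is easy to sketch. This is illustrative Python rather than the project's actual C/OpenCL code, and the file format and function names are invented for the example:

```python
# Illustrative sketch only -- the real Separation app is C/OpenCL, and
# the file format and names here are invented for this example.

def run_job(params):
    # Stand-in for the real per-job computation
    # (e.g. one likelihood evaluation in the Separation search).
    return sum(float(p) for p in params)

def run_bundle(param_file):
    """Run every parameter set listed in the file, one job per line."""
    results = []
    with open(param_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            results.append(run_job(line.split()))
    return results

# Tiny demo "bundle" of 3 jobs; a real one could list 100 or 1,000 lines
# without ever hitting the OS limit on command-line length (ARG_MAX),
# which is what caps the current 5-job bundles.
with open("bundle_params.txt", "w") as f:
    f.write("# one parameter set per line\n1.0 2.0 3.0\n4.0 5.0\n6.0\n")

print(run_bundle("bundle_params.txt"))  # [6.0, 9.0, 6.0]
```

The trade-off, as noted later in the thread, is that a parameter file is an extra per-workunit file the server must create, distribute, and clean up, whereas command-line arguments live entirely inside the workunit record.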
Joined: 16 Mar 10 | Posts: 213 | Credit: 108,362,077 | RAC: 4,510
> was there something like more than 5 jobs in a workunit causes the command line parameters to overflow / run out of space
> You are correct, but why do they have to be in command line parameters? Why can't the parameters be in a text file and contain enough for 100?

To get a definitive answer to that, one would need to ask the original programmer(s) of both Separation and N-Body :-)

However, my guess is that because the parameters in question are the only data items specific to a single work unit, keeping them out of a file avoids the overhead of creating and cleaning up the files. (There are a couple of files associated with all work units in a given batch, but I think those are immutable from the start of the batch to the eventual convergence of results at its end.)

By the way, when the results come back it isn't just a case of "log them in the database and that's that". I believe they are using TAO, the "Toolkit for Asynchronous Optimization", in which case there is extra work associated with both validation and work-unit generation. I have no idea how well that might scale with an increase in jobs per work unit :-)

One issue which some GPU users don't seem to consider is that work units that are good for GPUs are highly unlikely to be good for users with laptops and older PCs who want to contribute, which is why places like Einstein@home tend to have distinct sub-projects for different hardware combinations. Bear in mind that not all projects can easily be cut up into sub-projects for different sorts of hardware, and the two projects here are likely to fall into that category (unless TAO can cope with multiple sub-projects working on a single optimization data set). In such situations it is then up to the project team to decide on a balance between how quickly they want results and how many volunteers they are willing to exclude.

Cheers - Al.

P.S. For examples of what can happen if a project tries to feed "GPU fever" to excess, consider how SETI@home struggled in its later days as more folks with NVIDIA GPUs got access to a high-performance CUDA app: regular transitioner backlogs and all sorts of other problems... Also, when WCG did a deliberate stress test of their OPN1/OPNG project by just "opening the floodgates", the consequences were very similar to what we have been seeing here, and they had far more powerful infrastructure to run things on... "Be careful what you wish for -- you might get it!"
©2024 Astroinformatics Group