
Question for GPU code writers of the world ... :)

Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 38983 - Posted: 21 Apr 2010, 21:59:01 UTC

I have been pondering this question for a bit, given impetus by the travails of Claggy in getting BOINC to work the way he would like so as to use his resources more effectively.

I am not sure that I am going to use the correct terms, so bear with me on this ... but the essential point of BOINC is to schedule and consume idle resources effectively and use them to a good end. To this point, some of the more fanatical of us have bent the definition of "idle resource" to include purpose-built machines whose only function in life is to do BOINC work ... be that as it may, the essential question is: are we using the GPU resources as effectively as we could?

For higher-end GPUs (or the low end, depending on how things really work out) the issue of block-scheduling an entire application may not be as critical because the runtimes are so short. However, if we go back to the old time-slicing concepts, we know that there were two competing schools: cooperative scheduling with unbounded or irregular-length slices, and preemptive scheduling with time slices of fixed length.

With the cooperative model you got greater "inner" efficiencies because you did not suspend and resume that much. With the fixed-length, preemptive style you got greater "outer" efficiencies because a long-running application could not hog the resources ...

In Claggy's case, he is running a low-GPU-usage application on his GPU and wants to run a higher-GPU-usage application when the other is busy on the CPU side (his case is AP at SaH, but I think it applies equally to Einstein's GPU application, where the GPU use is also low) ...

I guess the first question is: would it be practical to run smaller "chunks" of these GPU applications without losing too much efficiency? I mean, in the systems I have, the GPU production so outshines CPU production that it is pathetic. I just started a new GPU project on only one system and in less than a week I am halfway to my first million CS, and that is not the only project on that computer that uses the GPUs ...

Discussion please ... if this is totally stupid that's OK, say so ... if not, we need to think about this so maybe we can present the idea for consideration and in 20 years or so UCB will consider doing something about it ...
ID: 38983
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38993 - Posted: 22 Apr 2010, 3:08:52 UTC - in response to Message 38983.  

I think part of the problem here is the GPUs themselves currently don't have the capability to do what you're talking about. AFAIK there isn't much in the way of multitasking on GPUs.
ID: 38993
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 38994 - Posted: 22 Apr 2010, 3:19:02 UTC - in response to Message 38993.  

I think part of the problem here is the GPUs themselves currently don't have the capability to do what you're talking about. AFAIK there isn't much in the way of multitasking on GPUs.

What I am suggesting is that if the applications were able to be segmented (into kernels, I think they are called) and only the segments run, then you COULD do interleaving ... I know that the work cannot be done on the GPU itself, but were we to rethink the creation of GPU applications it might be possible ... and that is what I am asking for the up-or-down vote on ...
ID: 38994
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38995 - Posted: 22 Apr 2010, 3:43:36 UTC - in response to Message 38994.  

Well, the way our application works right now (at least the CUDA version) is that we do divide up the kernels, so there is some interleaving. If we didn't, it would freeze your screen.
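
(Very roughly, the pattern looks something like the sketch below. This is only an illustration with invented names - integrate_chunk, CHUNK and the dummy math are not our actual source - but it shows the idea of many short launches instead of one long one.)

    // Illustration only: splitting one long GPU computation into many short
    // kernel launches so the display driver gets regular breathing room.
    // integrate_chunk and CHUNK are invented names, not MilkyWay@home code.
    #include <cuda_runtime.h>

    __global__ void integrate_chunk(const double *in, double *out, int offset, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            out[offset + i] = in[offset + i] * 2.0;   // stand-in for the real math
    }

    void run_in_chunks(const double *d_in, double *d_out, int n)
    {
        const int CHUNK = 1 << 16;                    // keep each launch short
        for (int offset = 0; offset < n; offset += CHUNK) {
            int count   = (n - offset < CHUNK) ? (n - offset) : CHUNK;
            int threads = 256;
            int blocks  = (count + threads - 1) / threads;
            integrate_chunk<<<blocks, threads>>>(d_in, d_out, offset, count);
            cudaDeviceSynchronize();                  // yield between chunks; the screen stays responsive
        }
    }

The synchronize between chunks is what gives the display driver a chance to get in between our launches.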
ID: 38995
avidday

Joined: 8 Apr 10
Posts: 15
Credit: 534,184
RAC: 0
Message 38998 - Posted: 22 Apr 2010, 6:02:40 UTC

For a whole host of reasons, what you are suggesting can't be done.

Today's GPUs are extremely primitive and lack most of the hardware necessary for preemption - no stack or frame-counter hardware (or anything that could be used for it), no general-purpose interrupts, no full hardware MMU. None of the current hardware I am familiar with supports recursion or time-sliced execution. They are much closer in concept to a 1970s vector computer than to a general-purpose microprocessor. The host driver has to do most of the thinking for the GPU: it just pushes commands down a FIFO for the hardware to execute and waits for a signal that the GPU is done. Each GPU task associated with a different host thread sits in an isolated, driver-managed context which cannot interact with another context in any way. Context switching at the driver level is an expensive operation.

Fermi takes very small steps towards what you might be thinking about - it can be "partitioned" into four independent segments and can run four kernels from the same context simultaneously. But the partitioning is done by the driver and the configuration is mostly static. If a new task arrives in a context and there are enough resources available, the driver will launch it. If there aren't, the task must wait until the GPU is idle, and then it will be started. Because the mechanism is restricted to within a context, it is useful for one application to overlap kernel computations on the hardware, but useless for true timesharing between different tasks.
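
(To make that concrete, the within-one-context overlap looks roughly like the sketch below. kernel_a and kernel_b are placeholders, not any project's code; the point is that both launches come from the same context and the same host thread.)

    // Sketch: two independent kernels from the SAME context launched into
    // different CUDA streams. On Fermi the driver may execute them
    // concurrently if resources allow; different contexts get no such overlap.
    #include <cuda_runtime.h>

    __global__ void kernel_a(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
    __global__ void kernel_b(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] *= 2.0f; }

    void launch_overlapped(float *d_x, float *d_y, int n)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        int threads = 256, blocks = (n + threads - 1) / threads;
        kernel_a<<<blocks, threads, 0, s1>>>(d_x, n);   // may overlap with ...
        kernel_b<<<blocks, threads, 0, s2>>>(d_y, n);   // ... this one, driver permitting

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }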

ID: 38998
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39001 - Posted: 22 Apr 2010, 10:12:30 UTC - in response to Message 38998.  

Today's GPUs are extremely primitive and lack most of the hardware necessary for preemption - no stack or frame-counter hardware (or anything that could be used for it), no general-purpose interrupts, no full hardware MMU. None of the current hardware I am familiar with supports recursion or time-sliced execution. They are much closer in concept to a 1970s vector computer than to a general-purpose microprocessor. The host driver has to do most of the thinking for the GPU: it just pushes commands down a FIFO for the hardware to execute and waits for a signal that the GPU is done. Each GPU task associated with a different host thread sits in an isolated, driver-managed context which cannot interact with another context in any way. Context switching at the driver level is an expensive operation.

Um, again, I think everyone is missing what I was driving at. I know the GPUs are pretty stupid and are much like the vector processors that were "bolted on" to many of the mini and super-mini computers of yore ... and that is why I was asking whether the BOINC Manager could be altered if the GPU applications were made amenable to segmentation ...

With smarter kernel-level scheduling by BOINC we would have the exact situation that you describe: the host doing the thinking for the code executing on the GPU ...

Mostly I am suggesting not changing the code to make it smarter on the GPU, but rather a repackaging, so that we could get more effective use of the resource because BOINC would not be letting it go idle as much.

I know Collatz and MW are not the "offending" parties here, in that they have super-high GPU usage (to the point where in rare instances it can be too high), but this is where the action is for GPU innovation, which is why I am asking here ...

For example, I stay away from SaH and Einstein because, as I work the math of their GPU usage levels vs. the speed gains in processing, well, it does not yet pay to run their applications on the GPU (PG was almost as bad, though I think that was more a credit-granting issue, but I digress) ...

I think what Travis says:
we do divide up the kernels, so there is some interleaving. If we didn't, it would freeze your screen.

tells me that were BOINC to schedule at the kernel level instead of the application level, we could schedule more than one task from more than one project to run "at the same time", though they would be interleaved ... but this would address Claggy's problem quite well and address my utilization problem as well ... SaH would get slices as needed to run its kernels and, when it was done, release the GPU for kernels from MW to be run (for example), getting the best use out of a scarce resource ...

In a way, it is like the tour I got of the UCSD campus supercomputer center to see the Cray: we were shown a mini-computer (or super-mini, I forget its class) on which they ran applications to prove they would run, so that bad programs did not waste time on the Cray ... way too expensive ...
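
To make my hand-waving a bit more concrete, something like this purely hypothetical host-side loop is what I have in mind - GpuTask, run_next_kernel() and finished() are invented names, and no such interface exists in BOINC today:

    // Purely hypothetical: a host-side round-robin that gives each project's
    // task one short kernel "slice" at a time. None of these types or calls
    // exist in BOINC; they are only here to illustrate the idea.
    #include <vector>
    #include <cuda_runtime.h>

    struct GpuTask {
        virtual bool finished() const = 0;
        virtual void run_next_kernel() = 0;   // enqueue one short kernel, then return
        virtual ~GpuTask() {}
    };

    void interleave(std::vector<GpuTask*> &tasks)
    {
        bool work_left = true;
        while (work_left) {
            work_left = false;
            for (size_t i = 0; i < tasks.size(); ++i) {
                if (tasks[i]->finished()) continue;
                tasks[i]->run_next_kernel();   // e.g. one SaH slice, then one MW slice, ...
                cudaDeviceSynchronize();       // wait, so the next task gets the GPU
                work_left = true;
            }
        }
    }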
ID: 39001
avidday

Joined: 8 Apr 10
Posts: 15
Credit: 534,184
RAC: 0
Message 39008 - Posted: 22 Apr 2010, 13:12:05 UTC - in response to Message 39001.  
Last modified: 22 Apr 2010, 13:41:25 UTC


With smarter kernel-level scheduling by BOINC we would have the exact situation that you describe: the host doing the thinking for the code executing on the GPU ...


Perhaps I wasn't clear in my reply. I understand what you are asking for, but it really is infeasible. There is currently nothing at the GPU hardware level, the host driver level, or the host API level which would help facilitate this.

It might be convenient to think of these CUDA or CAL applications as standalone programs that run on the GPU to do their thing, but they really aren't. They are effectively host programs which use an API to push and pull bits of data and code to and from the GPU to get their calculations done. Kernel launches are more like asynchronous subroutine calls than standalone programs. The internal structure of each application's host code will be completely different. The APIs they use are designed around a usage model that says "I shall be connected to this GPU for the life of this application, while I am using the GPU nobody else can, and when I disconnect any context or state I had on the GPU will be lost forever".
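
For illustration, the host side of a typical CUDA application has roughly the shape sketched below (do_work is a generic placeholder, not any project's kernel); note that the kernel launch itself returns immediately, like an asynchronous call:

    // Generic shape of a CUDA "application": really a host program that pushes
    // data to the GPU, queues an asynchronous kernel launch, and pulls the
    // results back.
    #include <cuda_runtime.h>

    __global__ void do_work(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * d[i];
    }

    void host_side(const float *h_in, float *h_out, int n)
    {
        float *d = 0;
        cudaMalloc((void**)&d, n * sizeof(float));
        cudaMemcpy(d, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256, blocks = (n + threads - 1) / threads;
        do_work<<<blocks, threads>>>(d, n);    // returns at once; the GPU works asynchronously

        cudaMemcpy(h_out, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // implicitly waits for the kernel
        cudaFree(d);
    }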

To do as you are thinking, at the very least, the BOINC system would have to invent its own standardized computation API (one which could work with both CUDA and CAL) which all applications would be forced to write for and use, so that the client could intercept all interaction with the GPU and prioritise and execute it in the order it determined to be optimal. Doing that would require the client to have its own pre-emption/context migration/state preservation/checkpointing system (and I don't know whether that is even possible given what the host drivers expose), and then the client would have to be rewritten to be multi-threaded, so that each GPU task sits in its own thread and the one-host-thread-per-GPU-context model the underlying APIs use isn't violated. On top of that is the whole scheduling problem, which is hard.
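
Just to show how much ground such an API would have to cover, a minimal sketch might declare something like the functions below. These names are entirely invented - nothing of the sort exists in BOINC, CUDA or CAL:

    // Entirely hypothetical interface, only to make the scope concrete: a
    // BOINC-mediated GPU API would have to own kernel ordering, preemption
    // points, and checkpoint/restore for every application it schedules.
    #include <cstddef>

    struct boinc_gpu_kernel;   // opaque handle to one schedulable kernel slice

    // Application side: register work as small, restartable slices.
    int boinc_gpu_enqueue(boinc_gpu_kernel *slice);
    int boinc_gpu_checkpoint(const void *state, size_t size);   // save enough to resume later
    int boinc_gpu_restore(void *state, size_t size);

    // Client side: pick the next registered slice per GPU, while respecting
    // the one-host-thread-per-context rule of the underlying driver.
    int boinc_gpu_schedule_next(int device);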

I write this as the author of a multi-GPU capable heterogeneous linear algebra system for CUDA. What you imagine to be trivial is probably not doable.
ID: 39008
Vid Vidmar*

Joined: 29 Aug 07
Posts: 81
Credit: 60,360,858
RAC: 0
Message 39009 - Posted: 22 Apr 2010, 13:26:58 UTC - in response to Message 39001.  

... we could schedule more than one task from more than one project to run "at the same time", though they would be interleaved ... but this would address Claggy's problem quite well and address my utilization problem as well ... SaH would get slices as needed to run its kernels and, when it was done, release the GPU for kernels from MW to be run (for example), getting the best use out of a scarce resource ...


I have had no problems running Collatz and MW on the same (ATI) card "at the same time", when I could trick BOINC into it (because of BOINC's FIFO handling of GPU tasks). So I really miss your point here, unless it's quite a different story in the CUDA world.
BR
ID: 39009
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 39016 - Posted: 22 Apr 2010, 18:56:53 UTC - in response to Message 39008.  

I write this as the author of a multi-GPU capable heterogeneous linear algebra system for CUDA. What you imagine to be trivial is probably not doable.

I never suggested it was trivial...

I guess that it is not yet possible ... well, that was the point of the question, to see if it might be possible ...
ID: 39016
