Welcome to MilkyWay@home

CUDA compute exclusivity problem?



Message boards : Application Code Discussion : CUDA compute exclusivity problem?

avidday

Joined: 8 Apr 10
Posts: 15
Credit: 534,184
RAC: 0
Message 38418 - Posted: 8 Apr 2010, 19:59:05 UTC
Last modified: 8 Apr 2010, 20:09:49 UTC

I have been playing around with the CUDA client today and am seeing what I believe is a flaw in the GPU selection logic of the CUDA 2.3 Linux client. On my development box (dual GTX 275s), I see a lot of this:

Thu 08 Apr 2010 10:20:50 PM EEST	Milkyway@home	Computation for task de_s222_3s_13_118279_1270735874_0 finished
Thu 08 Apr 2010 10:20:52 PM EEST	Milkyway@home	Started upload of de_s222_3s_13_118279_1270735874_0_0
Thu 08 Apr 2010 10:20:55 PM EEST	Milkyway@home	Finished upload of de_s222_3s_13_118279_1270735874_0_0
Thu 08 Apr 2010 10:21:10 PM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Thu 08 Apr 2010 10:21:10 PM EEST	Milkyway@home	Reporting 1 completed tasks, requesting new tasks for GPU
Thu 08 Apr 2010 10:21:15 PM EEST	Milkyway@home	Scheduler request completed: got 1 new tasks
Thu 08 Apr 2010 10:21:17 PM EEST	Milkyway@home	Started download of de_s222_3s_13_2263_1270754406_search_parameters
Thu 08 Apr 2010 10:21:20 PM EEST	Milkyway@home	Finished download of de_s222_3s_13_2263_1270754406_search_parameters
Thu 08 Apr 2010 10:21:20 PM EEST	Milkyway@home	Can't get available GPU RAM: 999
Thu 08 Apr 2010 10:22:20 PM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Thu 08 Apr 2010 10:22:20 PM EEST	Milkyway@home	Requesting new tasks for GPU
Thu 08 Apr 2010 10:22:25 PM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks

A GPU job finishes, another is requested from your server, the launch is attempted (and fails), and then the job sits unstarted in the queue with an idle GPU available to do the work. That job is still sitting in the job queue; it has probably been rescheduled to run on the CPU (I am a bit hazy on how the BOINC scheduler works, as this is the first time I have tried it).

I am pretty sure the reason for this is that I use the nvidia-smi utility to mark one of my two GPUs as compute-exclusive or compute-prohibited so I can use it for other things. When a GPU is marked as compute-exclusive or compute-prohibited, the driver limits the number of contexts permitted on the device to 1 or 0 respectively. Any host thread which tries to establish a context on a GPU marked in this way will receive an error from the CUDA driver and won't be able to continue.
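The per-device context policy described above can be sketched as a tiny lookup. The enum values here deliberately mirror the CU_COMPUTEMODE_* constants in cuda.h (0 = default, 1 = exclusive, 2 = prohibited); the function name is just an illustrative stand-in, not a real driver API:

```c
#include <limits.h>

/* Mirror of the CU_COMPUTEMODE_* values from cuda.h. */
enum compute_mode { MODE_DEFAULT = 0, MODE_EXCLUSIVE = 1, MODE_PROHIBITED = 2 };

/* Hypothetical helper: how many contexts the driver will permit on a
 * device in the given compute mode. */
int max_contexts_allowed(enum compute_mode m)
{
    switch (m) {
    case MODE_PROHIBITED: return 0;        /* no contexts permitted at all */
    case MODE_EXCLUSIVE:  return 1;        /* a single context only */
    default:              return INT_MAX;  /* normal mode: no driver-imposed limit */
    }
}
```

Any context creation beyond these limits is what produces the driver error the client is tripping over.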

It seems that your code is finding a compute 1.3 capable GPU but then not checking its compute mode, which results in the failure I am seeing.

I do a lot of CUDA development myself and would be happy to suggest how to fix this (it should probably be a modification to your choose_cuda_13() function). In my own apps I usually do something like this:

void gpuIdentify(struct gpuThread * g)
{
    /* maxstring, constMb, gpuAssert(), gpuDiagMsg() and struct gpuThread
     * are helpers defined elsewhere in my own code. */
    char compModeString[maxstring];
    char identstring[maxstring];

    gpuAssert( cuDeviceGet(&g->deviceHandle, g->deviceNumber) );
    gpuAssert( cuDeviceGetName(g->deviceName, maxstring, g->deviceHandle) );
    gpuAssert( cuDeviceGetProperties(&g->deviceProps, g->deviceHandle) );
    gpuAssert( cuDeviceTotalMem(&g->deviceMemoryTot, g->deviceHandle) );
    gpuAssert( cuDeviceGetAttribute(&g->deviceCompMode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, g->deviceHandle) );
    gpuAssert( cuDeviceComputeCapability(&g->deviceCC[0], &g->deviceCC[1], g->deviceHandle) );

    switch (g->deviceCompMode) {
    case CU_COMPUTEMODE_PROHIBITED:
        sprintf(compModeString,"Compute Prohibited mode");
        break;
    case CU_COMPUTEMODE_DEFAULT:
        sprintf(compModeString, "Normal mode");
        break;
    case CU_COMPUTEMODE_EXCLUSIVE:
        sprintf(compModeString, "Compute Exclusive mode");
        break;
    default:
        sprintf(compModeString, "Unknown");
        break;
    }

    sprintf(identstring, "%d %s, %d MHz, %d MB, Compute Capability %d.%d, %s", 
            g->deviceNumber, g->deviceName, g->deviceProps.clockRate/1000,
            g->deviceMemoryTot / constMb, g->deviceCC[0], g->deviceCC[1], compModeString);

    gpuDiagMsg(stderr, identstring, __FILE__, __LINE__);
}


The query for CU_DEVICE_ATTRIBUTE_COMPUTE_MODE is the salient part here. This attribute shouldn't be set to anything other than 0 on OS X or Windows (except perhaps with the compute-only Tesla driver NVIDIA released a few months ago), but on Linux it can be set to other values, depending on the GPU administration policies on the box it runs on.
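A GPU-free sketch of the check being proposed for choose_cuda_13(): skip any device whose compute mode forbids creating a new context. The struct and function names here are hypothetical stand-ins, not the actual MilkyWay@home code; in the real application the mode field would be filled in from cuDeviceGetAttribute(..., CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, ...):

```c
/* Stand-in for a real enumerated device; only the field relevant to the
 * proposed check is modelled. 0 = default, 1 = exclusive, 2 = prohibited. */
struct fake_device { int compute_mode; };

int mode_allows_new_context(const struct fake_device *d)
{
    /* Reject exclusive as well as prohibited: an exclusive-mode device
     * may already be serving the one context it permits to whatever
     * process the administrator set the mode aside for. */
    return d->compute_mode == 0;
}

/* Return the index of the first usable device, or -1 if none qualify
 * (in which case the client should not queue GPU work at all). */
int pick_device(const struct fake_device *devs, int n)
{
    for (int i = 0; i < n; ++i)
        if (mode_allows_new_context(&devs[i]))
            return i;
    return -1;
}
```

On my dual-GPU box this would make the client settle on the one device left in default mode instead of repeatedly failing to launch on the restricted one.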

Thanks for your time.

EDIT: It occurred to me after posting this that your GPU selection code is also going to fail on Fermi, because it will reject compute 2.0 when it probably shouldn't (actually, so will the code snippet I posted, for the same reason...)
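To make the Fermi point concrete, here is the difference between testing for exactly compute 1.3 and testing for 1.3 or better, with hypothetical helper names; an ordered (major, minor) comparison accepts a 2.0 card, while an equality test rejects it:

```c
/* Buggy variant: only exactly compute 1.3 passes, so Fermi (2.0) is rejected. */
int cc_exact_13(int major, int minor)
{
    return major == 1 && minor == 3;
}

/* Fixed variant: anything at or above compute 1.3 passes, Fermi included. */
int cc_at_least_13(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 3);
}
```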
avidday

Joined: 8 Apr 10
Posts: 15
Credit: 534,184
RAC: 0
Message 38873 - Posted: 19 Apr 2010, 15:22:17 UTC - in response to Message 38418.  

Most of these symptoms turned out to be bugs in the BOINC client rather than in the Milkyway app, although the problems are (at least theoretically) the same in your application code. I have been in touch with the BOINC developers and have proposed a patch for the client scheduler that should fix the worst of the problem.

Sorry to have slightly jumped the gun...
James Eniti

Joined: 16 Jun 10
Posts: 1
Credit: 16,281,408
RAC: 0
Message 40761 - Posted: 2 Jul 2010, 7:21:59 UTC - in response to Message 38873.  

I have been seeing the app fail on Fermi with an error message that CUDA 1.3 is required. This selection error has made me shut down task requests on my dual GTX 470 SLI box. Note that SETI@home works fine, but it keeps running out of work.

Thanks
James


avidday wrote (Message 38873):
> Most of these symptoms turned out to be bugs in the BOINC client rather than in the Milkyway app, although the problems are (at least theoretically) the same in your application code. I have been in touch with the BOINC developers and have proposed a patch for the client scheduler that should fix the worst of the problem.
>
> Sorry to have slightly jumped the gun...


©2021 Astroinformatics Group