Welcome to MilkyWay@home

CUDA compute exclusivity problem?



Message boards : Application Code Discussion : CUDA compute exclusivity problem?

avidday

Joined: 8 Apr 10
Posts: 15
Credit: 534,184
RAC: 0
Message 38418 - Posted: 8 Apr 2010, 19:59:05 UTC
Last modified: 8 Apr 2010, 20:09:49 UTC

I have been playing around with the CUDA client today and am seeing what I believe is a flaw in the GPU selection logic of the CUDA 2.3 Linux client. On my development box (dual GTX 275s), I see a lot of this:

Thu 08 Apr 2010 10:20:50 PM EEST	Milkyway@home	Computation for task de_s222_3s_13_118279_1270735874_0 finished
Thu 08 Apr 2010 10:20:52 PM EEST	Milkyway@home	Started upload of de_s222_3s_13_118279_1270735874_0_0
Thu 08 Apr 2010 10:20:55 PM EEST	Milkyway@home	Finished upload of de_s222_3s_13_118279_1270735874_0_0
Thu 08 Apr 2010 10:21:10 PM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Thu 08 Apr 2010 10:21:10 PM EEST	Milkyway@home	Reporting 1 completed tasks, requesting new tasks for GPU
Thu 08 Apr 2010 10:21:15 PM EEST	Milkyway@home	Scheduler request completed: got 1 new tasks
Thu 08 Apr 2010 10:21:17 PM EEST	Milkyway@home	Started download of de_s222_3s_13_2263_1270754406_search_parameters
Thu 08 Apr 2010 10:21:20 PM EEST	Milkyway@home	Finished download of de_s222_3s_13_2263_1270754406_search_parameters
Thu 08 Apr 2010 10:21:20 PM EEST	Milkyway@home	Can't get available GPU RAM: 999
Thu 08 Apr 2010 10:22:20 PM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Thu 08 Apr 2010 10:22:20 PM EEST	Milkyway@home	Requesting new tasks for GPU
Thu 08 Apr 2010 10:22:25 PM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks

A GPU job finishes, another is requested from your server, the launch is attempted (and fails), and then the job sits unstarted in the queue with an idle GPU available to do the work. That job is still sitting in the job queue; it has probably been rescheduled to run on the CPU (I am a bit hazy on how the BOINC scheduler works, as this is the first time I have tried it).

I am pretty sure the reason for this is that I use the nvidia-smi utility to mark one of my two GPUs as compute-exclusive or compute-prohibited so I can use it for other things. When a GPU is marked as compute-exclusive or compute-prohibited, the driver limits the number of contexts permitted on the device to 1 or 0 respectively. Any host thread which tries to establish a context on a GPU marked in this way will receive an error from the CUDA driver and won't be able to continue.
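The per-device context policy described above can be sketched as a tiny lookup. The enum values here deliberately mirror the CU_COMPUTEMODE_* constants in cuda.h (0 = default, 1 = exclusive, 2 = prohibited); the function name is just an illustrative stand-in, not a real driver API:

```c
#include <limits.h>

/* Mirror of the CU_COMPUTEMODE_* values from cuda.h. */
enum compute_mode { MODE_DEFAULT = 0, MODE_EXCLUSIVE = 1, MODE_PROHIBITED = 2 };

/* Hypothetical helper: how many contexts the driver will permit on a
 * device in the given compute mode. */
int max_contexts_allowed(enum compute_mode m)
{
    switch (m) {
    case MODE_PROHIBITED: return 0;        /* no contexts permitted at all */
    case MODE_EXCLUSIVE:  return 1;        /* a single context only */
    default:              return INT_MAX;  /* normal mode: no driver-imposed limit */
    }
}
```

Any context creation beyond these limits is what produces the driver error the client is tripping over.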

It seems that your code is finding a compute 1.3 capable GPU but then not checking its compute mode, which results in the failure I am seeing.

I do a lot of CUDA development myself and would be happy to suggest how to fix this (it should probably be a modification to your choose_cuda_13() function). In my own apps I usually do something like this:

void gpuIdentify(struct gpuThread * g)
{
    /* maxstring, constMb, gpuAssert(), gpuDiagMsg() and struct gpuThread
     * are helpers defined elsewhere in my own code. */
    char compModeString[maxstring];
    char identstring[maxstring];

    gpuAssert( cuDeviceGet(&g->deviceHandle, g->deviceNumber) );
    gpuAssert( cuDeviceGetName(g->deviceName, maxstring, g->deviceHandle) );
    gpuAssert( cuDeviceGetProperties(&g->deviceProps, g->deviceHandle) );
    gpuAssert( cuDeviceTotalMem(&g->deviceMemoryTot, g->deviceHandle) );
    gpuAssert( cuDeviceGetAttribute(&g->deviceCompMode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, g->deviceHandle) );
    gpuAssert( cuDeviceComputeCapability(&g->deviceCC[0], &g->deviceCC[1], g->deviceHandle) );

    switch (g->deviceCompMode) {
    case CU_COMPUTEMODE_PROHIBITED:
        sprintf(compModeString,"Compute Prohibited mode");
        break;
    case CU_COMPUTEMODE_DEFAULT:
        sprintf(compModeString, "Normal mode");
        break;
    case CU_COMPUTEMODE_EXCLUSIVE:
        sprintf(compModeString, "Compute Exclusive mode");
        break;
    default:
        sprintf(compModeString, "Unknown");
        break;
    }

    sprintf(identstring, "%d %s, %d MHz, %d MB, Compute Capability %d.%d, %s", 
            g->deviceNumber, g->deviceName, g->deviceProps.clockRate/1000,
            g->deviceMemoryTot / constMb, g->deviceCC[0], g->deviceCC[1], compModeString);

    gpuDiagMsg(stderr, identstring, __FILE__, __LINE__);
}


The query for CU_DEVICE_ATTRIBUTE_COMPUTE_MODE is the salient part here. This attribute shouldn't be set to anything other than 0 on OS X or Windows (except perhaps with the compute-only Tesla driver NVIDIA released a few months ago), but on Linux it can be set to other values, depending on the GPU administration policies on the box it runs on.
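A GPU-free sketch of the check being proposed for choose_cuda_13(): skip any device whose compute mode forbids creating a new context. The struct and function names here are hypothetical stand-ins, not the actual MilkyWay@home code; in the real application the mode field would be filled in from cuDeviceGetAttribute(..., CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, ...):

```c
/* Stand-in for a real enumerated device; only the field relevant to the
 * proposed check is modelled. 0 = default, 1 = exclusive, 2 = prohibited. */
struct fake_device { int compute_mode; };

int mode_allows_new_context(const struct fake_device *d)
{
    /* Reject exclusive as well as prohibited: an exclusive-mode device
     * may already be serving the one context it permits to whatever
     * process the administrator set the mode aside for. */
    return d->compute_mode == 0;
}

/* Return the index of the first usable device, or -1 if none qualify
 * (in which case the client should not queue GPU work at all). */
int pick_device(const struct fake_device *devs, int n)
{
    for (int i = 0; i < n; ++i)
        if (mode_allows_new_context(&devs[i]))
            return i;
    return -1;
}
```

On my dual-GPU box this would make the client settle on the one device left in default mode instead of repeatedly failing to launch on the restricted one.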

Thanks for your time.

EDIT: It occurred to me after posting this that your GPU selection code is also going to fail on Fermi, because it will reject compute 2.0 when it probably shouldn't (actually, so will the code snippet I posted, for the same reason...)
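To make the Fermi point concrete, here is the difference between testing for exactly compute 1.3 and testing for 1.3 or better, with hypothetical helper names; an ordered (major, minor) comparison accepts a 2.0 card, while an equality test rejects it:

```c
/* Buggy variant: only exactly compute 1.3 passes, so Fermi (2.0) is rejected. */
int cc_exact_13(int major, int minor)
{
    return major == 1 && minor == 3;
}

/* Fixed variant: anything at or above compute 1.3 passes, Fermi included. */
int cc_at_least_13(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 3);
}
```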
avidday

Joined: 8 Apr 10
Posts: 15
Credit: 534,184
RAC: 0
Message 38873 - Posted: 19 Apr 2010, 15:22:17 UTC - in response to Message 38418.  

Most of these symptoms turned out to be bugs in the BOINC client rather than in the Milkyway app, although the problems are (at least theoretically) the same in your application code. I have been in touch with the BOINC developers and have proposed a patch for the client scheduler that should fix the worst of the problem.

Sorry to have slightly jumped the gun...
James Eniti

Joined: 16 Jun 10
Posts: 1
Credit: 16,281,408
RAC: 0
Message 40761 - Posted: 2 Jul 2010, 7:21:59 UTC - in response to Message 38873.  

I have been seeing the app fail on Fermi with an error message that CUDA 1.3 is required. This selection error has made me shut down task requests on my dual GTX 470 SLI box. Note that SETI@home works fine, but it keeps running out of work.

Thanks
James


avidday wrote (Message 38873):
> Most of these symptoms turned out to be bugs in the BOINC client rather than in the Milkyway app, although the problems are (at least theoretically) the same in your application code. I have been in touch with the BOINC developers and have proposed a patch for the client scheduler that should fix the worst of the problem.
>
> Sorry to have slightly jumped the gun...


©2021 Astroinformatics Group