Welcome to MilkyWay@home

Posts by avidday

1) Message boards : Number crunching : Running 2 different GPU's without SLI (GTX 470 and GTX 260) (Message 39160)
Posted 26 Apr 2010 by avidday
Post:
Thanks for the response.
I did some research before the GTX 470 arrived and discovered the need for the GPU switch in cc_config, so it was there. I now know that the architecture of the new card is very different programming-wise as well as hardware-wise (which we all knew).


Fermi is basically identical from a programming perspective to the GT200, and code which was written for the GT200 will run on Fermi without modification. The only reason the current milkyway applications don't work on Fermi is that the GT200 has been "hard-coded" into the host support code the app uses. Leaving aside QA and testing, precisely one line of code in the current 0.24 CUDA client requires changing for the application to run on Fermi cards.
2) Message boards : Number crunching : Multiple GPU's (Message 39113)
Posted 24 Apr 2010 by avidday
Post:
I got in touch with the developers after posting on the boinc forums (http://boinc.berkeley.edu/dev/forum_thread.php?id=5606&nowrap=true#32282/ is one of the posts if you are interested).
3) Message boards : Number crunching : Multiple GPU's (Message 39103)
Posted 24 Apr 2010 by avidday
Post:
It is worth pointing out that the boinc client has a bug in the way it handles multiple NVIDIA GPUs if you use Linux or the new Windows compute driver for Teslas. It causes scheduling and throughput problems if you use the compute mode settings in the driver to control which GPU will be used in a multi-GPU system.

I have made the boinc developers aware of the problem and proposed a patch for it, but I don't know when/if they will fix it.
4) Message boards : Number crunching : No Milkyway with GTX480 (Message 39043)
Posted 23 Apr 2010 by avidday
Post:
The benchmarking I have done for the GTX470 shows it to be slightly better than twice as fast at double precision compared to a GTX275. The linear algebra benchmarks I use are generally memory bandwidth limited - a stock GTX275 hits about 77 Gflop/s double precision, and the GTX470 hits about 160 Gflop/s doing the same operation running identical code. There are new architectural features in Fermi which should allow that to improve further with some tuning.

On pure compute bound jobs (and Milkyway seems to be one of the few), Cypress has a considerable advantage. On memory bandwidth bound codes (or mixed single-double precision codes), the performance gap will be a lot smaller.
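To make the distinction concrete, here is a toy illustration (my own fragments, nothing to do with the milkyway or benchmark code): the first kernel moves 24 bytes for every 2 flops and will be limited by memory bandwidth on any current GPU, while the second does thousands of flops per element and will be limited by double precision throughput, which is where Cypress shines.

/* Illustrative fragments only - not the benchmark or milkyway code. */
__global__ void daxpy(double *y, const double *x, double a, int n)
{
    /* 2 flops per 24 bytes of memory traffic: memory bandwidth bound */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

__global__ void polyiter(double *y, const double *x, int n, int iters)
{
    /* roughly 2*iters flops per 16 bytes of memory traffic: compute bound */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double v = x[i];
        for (int k = 0; k < iters; k++)
            v = v * 1.0000001 + 1.0e-7;
        y[i] = v;
    }
}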

I hope to get a Tesla C2050 to test in the next week or so.
5) Message boards : News : testing new application (milkyway3) (Message 39011)
Posted 22 Apr 2010 by avidday
Post:
The version of the CUDA milkyway3 app my client received doesn't run on CentOS 5.1 (the regular 0.24 application does). It seems to have been linked against a more recent libstdc++ than CentOS 5.1 provides:


[david@n0008 milkyway.cs.rpi.edu_milkyway]$ ldd ./milkyway3_0.02_x86_64-pc-linux-gnu__cuda23
./milkyway3_0.02_x86_64-pc-linux-gnu__cuda23: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by ./milkyway3_0.02_x86_64-pc-linux-gnu__cuda23)
linux-vdso.so.1 => (0x00007fff4a1fe000)
libcudart.so.2 => not found
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00000000006d0000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000000000b6c000)
libm.so.6 => /lib64/libm.so.6 (0x0000000000e6c000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000000003069000)
libc.so.6 => /lib64/libc.so.6 (0x000000000525f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000000010ef000)
libz.so.1 => /usr/lib64/libz.so.1 (0x0000000001309000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000000151d000)
/lib64/ld-linux-x86-64.so.2 (0x0000000000110000)


That build has been running OK with Ubuntu 9.04 for a couple of days now.
6) Message boards : Number crunching : Question for GPU code writers of the world ... :) (Message 39008)
Posted 22 Apr 2010 by avidday
Post:

With smarter kernel level scheduling by BOINC we would have the exact situation that you describe, the host doing the thinking for the executing code on the GPU...


Perhaps I wasn't clear in my reply. I understand what you are asking for, but it really is infeasible. There is currently nothing at the GPU hardware level, the host driver level, or the host API level which would facilitate this.

It might be convenient to think of these CUDA or CAL applications as standalone programs that run on the GPU to do their thing, but they really aren't. They are effectively host programs which use an API to push and pull bits of data and code to and from the GPU to get their calculations done. Kernel launches are more like asynchronous subroutine calls than standalone programs. The internal structure of each application's host code will be completely different. The APIs they use are designed around a usage model that says "I shall be connected to this GPU for the life of this application, while I am using the GPU nobody else can, and when I disconnect any context or state I had on the GPU will be lost forever".
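Stripped of the milkyway specifics, the host side of one of these applications boils down to something like this (an illustrative sketch I have just typed in, not the actual app):

/* Illustrative sketch of the host-centric usage model - not the milkyway app. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void scale(double *x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    double *h = (double *)malloc(n * sizeof(double));
    double *d;
    int i;

    for (i = 0; i < n; i++) h[i] = 1.0;

    cudaSetDevice(0);              /* binds a context to this host thread for the life of the process */
    cudaMalloc((void **)&d, n * sizeof(double));
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);   /* push data to the GPU */

    scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);                     /* asynchronous "subroutine call" */

    cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);   /* pull the results back (blocks) */
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;                      /* the context and all GPU state are gone when the process exits */
}

Everything the GPU does is driven by that host process through its private context; there is no resident "program" on the card that a third party like the boinc client could preempt or reschedule.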

To do as you are thinking, at the very least, the BOINC system would have to invent its own standardized computation API (one which could work with both CUDA and CAL) which all applications would be forced to write for and use, so that the client could intercept all interaction with the GPU, and prioritise and execute it in the order it determined to be optimal. Doing that would require the client to have its own pre-emption/context migration/state preservation/checkpointing system (and I don't know whether that is even possible given what the host drivers expose), and then the client would have to be re-written to be multi-threaded, with each GPU task in its own thread, so that the one-host-thread-per-GPU-context model the underlying APIs use isn't violated. On top of that is the whole scheduling problem, which is hard.

I write this as the author of a multi-GPU capable heterogeneous linear algebra system for CUDA. What you imagine to be trivial is probably not doable.
7) Message boards : Number crunching : Question for GPU code writers of the world ... :) (Message 38998)
Posted 22 Apr 2010 by avidday
Post:
For a whole host of reasons, what you are suggesting can't be done.

Today's GPUs are extremely primitive and lack most of the hardware necessary for preemption - no stack or frame pointer hardware (or anything that could be used for it), no general purpose interrupts, no full hardware MMU. None of the current hardware I am familiar with supports recursion or time-sliced execution. They are much closer in concept to a 1970s vector computer than to a general purpose microprocessor. The host driver has to do most of the thinking for the GPU; it just pushes commands down a FIFO for the hardware to execute and waits for a signal that the GPU is done. Each GPU task associated with a different host thread sits in an isolated, driver-managed context which cannot interact with another context in any way. Context switching at the driver level is an expensive operation.

Fermi takes very small steps towards what you might be thinking about - it can be "partitioned" into four independent segments and can run four kernels from the same context simultaneously. But the partitioning is done by the driver and the configuration is mostly static. If a new task arrives in a context and there are enough resources available, the driver will launch it. If there aren't, the task must wait until the GPU is idle, and then it will be started. Because the mechanism is restricted to within a single context, it is useful for one application to overlap kernel computations on the hardware, but useless for true timesharing between different tasks.
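To be clear about what that buys you, within a single application the overlap looks something like this (an illustrative sketch, not milkyway code):

/* Illustrative sketch: two independent kernels launched into different
   streams from the same context. On Fermi they may run concurrently;
   on GT200 they simply run back to back. Either way this does nothing
   to let two separate applications share the GPU. */
#include <cuda_runtime.h>

__global__ void busywork(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double v = x[i];
        for (int k = 0; k < 10000; k++) v = v * 1.0000001 + 1.0e-7;
        x[i] = v;
    }
}

int main(void)
{
    const int n = 1 << 16;   /* small enough to leave the GPU under-occupied */
    double *a, *b;
    cudaStream_t s0, s1;

    cudaMalloc((void **)&a, n * sizeof(double));
    cudaMalloc((void **)&b, n * sizeof(double));
    cudaMemset(a, 0, n * sizeof(double));
    cudaMemset(b, 0, n * sizeof(double));
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* both launches return immediately; the driver may overlap them on Fermi */
    busywork<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
    busywork<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

    cudaThreadSynchronize();   /* CUDA 3.0 era name for a device-wide sync */
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}

Both launches come from the same context, which is exactly why the driver is allowed to overlap them; a second process would still have to wait for the whole context to be switched out.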

8) Message boards : Application Code Discussion : CUDA compute exclusivity problem? (Message 38873)
Posted 19 Apr 2010 by avidday
Post:
Most of these symptoms turned out to be bugs in the Boinc client rather than the milkyway app, although the problems (at least in theory) are the same in your application code. I have been in touch with the Boinc developers and have proposed a patch for the client scheduler that should fix the worst of the problem.

Sorry to have slightly jumped the gun...
9) Message boards : Number crunching : IMPORTANT! Nvidia's 400 series crippled by Nvidia (Message 38724)
Posted 14 Apr 2010 by avidday
Post:

Maybe they will think twice and instead order FIVE Radeon 5870s,
each with the same DP performance and for the same price


That is certainly a viable alternative, and for some people it will be the right choice (and that probably includes the majority of people here). But for others it won't be feasible, for a variety of reasons.
10) Message boards : Number crunching : No Milkyway with GTX480 (Message 38719)
Posted 14 Apr 2010 by avidday
Post:
The milkyway application itself doesn't have any influence on when jobs are downloaded; the boinc client does that. I am going to guess that the boinc client contains a compute capability test like this:

if (CUDACapabilityMajorrevisionnumber > 1) or ((CUDACapabilityMajorrevisionnumber == 1) and (CUDACapabilityMinorrevisionnumber >= 3)) then card is OK


whereas the milkyway app has a test like this:

if (CUDACapabilityMajorrevisionnumber == 1) and (CUDACapabilityMinorrevisionnumber == 3) then card is OK


so that compute 2.0 cards are OK with the boinc client and not OK with the milkyway app. The first part is a guess, because I haven't seen the boinc client code, but the second part is definitely right - you can see it in the code available for download here.
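Purely to illustrate the guess, the two tests would look something like this against the CUDA driver API (my reconstruction, not the actual boinc or milkyway code):

/* Hypothetical illustration of the two capability tests guessed at above.
   Not the actual boinc or milkyway code; it just shows why a compute 2.0
   card (major = 2, minor = 0) passes the first test and fails the second. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    int major, minor;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceComputeCapability(&major, &minor, dev);

    /* guessed boinc-style test: accept compute capability 1.3 or better */
    int boinc_ok = (major > 1) || (major == 1 && minor >= 3);

    /* milkyway 0.24-style test: accept compute capability 1.3 only */
    int milkyway_ok = (major == 1) && (minor == 3);

    printf("CC %d.%d: boinc test %s, milkyway test %s\n", major, minor,
           boinc_ok ? "passes" : "fails", milkyway_ok ? "passes" : "fails");
    return 0;
}

A GTX 480 reports compute capability 2.0, so it would pass the first test and fail the second, which matches the behaviour people are seeing.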
11) Message boards : Number crunching : IMPORTANT! Nvidia's 400 series crippled by Nvidia (Message 38703)
Posted 14 Apr 2010 by avidday
Post:
Yes, I am getting the same, but only from one ISP I use. From the other, everything is fine.
12) Message boards : Number crunching : No Milkyway with GTX480 (Message 38690)
Posted 13 Apr 2010 by avidday
Post:
Unfortunately, the cuda app has compute capability 1.3 hard-coded into it, so the app will not find a valid card and will exit. The behaviour is effectively the same as if you were running it on an older G80 or G90 card with no double precision support. The good news is that it is only a one line change to the code and a one line change to the Makefile to fix it. It will also require building against CUDA 3.0, and needs 195 series drivers on Linux or 196/197 series on Windows.
13) Message boards : Number crunching : IMPORTANT! Nvidia's 400 series crippled by Nvidia (Message 38467)
Posted 9 Apr 2010 by avidday
Post:
From what I've been told there is more than one way to gimp GPGPU performance on these chips (just a layman so...), but still have a GPU that meets DX11/10 spec and performs well in games.

As someone who has done a lot of disassembly of CUDA, OpenCL and shader language code, I don't see how it can be done. There really isn't anything inside compiled code to say "this is a single precision compute job" or "this is a tessellation call" or "this is a shader fragment". It is all just the same assembler code that runs on the same shaders and uses the same ALU/FPUs. There isn't anything obvious I can see that would allow you to artificially limit the instruction issue rate of one without affecting the other. Double precision is different, because the double precision FPUs are separate from the single precision ones, and it would be possible to have the MP scheduler artificially limit the rate of double precision instruction issue without affecting the performance of the rendering pipeline.

It all seems silly and self-defeating to me, so if you think it's a reasonable way to do product segmentation then I dunno what to say.


I never said it was reasonable, I was just making the observation that this isn't anything new and shouldn't really have been a surprise to anyone. Both hemispheres of the GPU world have been doing this sort of differentiation with their OpenGL acceleration for as long as they have both been making professional OpenGL cards based on their consumer GPUs.

As it is, NVIDIA are about to deliver (in very, very limited quantities if rumours are correct) a new high end consumer card which has about double the usable peak single and double precision performance of its predecessor. Down the road you will be able to buy a professional version which offers slightly lower single precision performance, but about 4 times the double precision and more than twice the memory (along with extras like ECC), for about 5-6 times the price. Those who truly need the additional double precision will buy the professional card. Those who don't won't.
14) Message boards : Number crunching : IMPORTANT! Nvidia's 400 series crippled by Nvidia (Message 38453)
Posted 9 Apr 2010 by avidday
Post:

There is no reason why they'd have to gimp just double precision work loads on their consumer GPU's you know. If they really wanted to they could gimp all GPGPU work loads to make Tesla/FireStream look even better.


Except, of course, that single precision GPGPU calculations are indistinguishable from DX10 or DX11 programmable shader calculations, so artificially limiting the single precision throughput would have a direct effect on the 3D graphics performance of the GPU, whereas reducing double precision performance has no such effect.

This will remain the case until Microsoft mandates double precision capability in some future version of DirectX, or until OpenCL becomes the dominant compute API and an integral part of consumer operating systems and applications, and Khronos promotes cl_khr_fp64 from the optional extension it is now to an integral part of the standard. Only then will double precision capability become a focus for consumer GPUs, and only then might the current situation change.
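As things stand, an OpenCL application has to probe for the extension itself at runtime, along these lines (an untested sketch, error handling omitted):

/* Untested sketch: probe the first GPU device on the first platform for
   the optional cl_khr_fp64 extension. Error handling omitted. */
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char extensions[4096];

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);

    printf("Double precision %s\n",
           strstr(extensions, "cl_khr_fp64") ? "supported" : "not supported");
    return 0;
}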

It is worth pointing out that this is hardly a new phenomenon for NVIDIA. For as long as there have been Quadro cards, they have been using firmware to enable a number of OpenGL hardware acceleration paths only on Quadro cards used with Quadro specific drivers, despite Quadro products being based on identical silicon to their GeForce counterparts.

15) Message boards : Application Code Discussion : CUDA compute exclusivity problem? (Message 38418)
Posted 8 Apr 2010 by avidday
Post:
I have been playing around with the cuda client today and am seeing what I believe is a flaw in the GPU selection logic of the CUDA 2.3 linux client. On my development box (dual GTX-275), I see a lot of this:

Thu 08 Apr 2010 10:20:50 PM EEST	Milkyway@home	Computation for task de_s222_3s_13_118279_1270735874_0 finished
Thu 08 Apr 2010 10:20:52 PM EEST	Milkyway@home	Started upload of de_s222_3s_13_118279_1270735874_0_0
Thu 08 Apr 2010 10:20:55 PM EEST	Milkyway@home	Finished upload of de_s222_3s_13_118279_1270735874_0_0
Thu 08 Apr 2010 10:21:10 PM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Thu 08 Apr 2010 10:21:10 PM EEST	Milkyway@home	Reporting 1 completed tasks, requesting new tasks for GPU
Thu 08 Apr 2010 10:21:15 PM EEST	Milkyway@home	Scheduler request completed: got 1 new tasks
Thu 08 Apr 2010 10:21:17 PM EEST	Milkyway@home	Started download of de_s222_3s_13_2263_1270754406_search_parameters
Thu 08 Apr 2010 10:21:20 PM EEST	Milkyway@home	Finished download of de_s222_3s_13_2263_1270754406_search_parameters
Thu 08 Apr 2010 10:21:20 PM EEST	Milkyway@home	Can't get available GPU RAM: 999
Thu 08 Apr 2010 10:22:20 PM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Thu 08 Apr 2010 10:22:20 PM EEST	Milkyway@home	Requesting new tasks for GPU
Thu 08 Apr 2010 10:22:25 PM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks

A GPU job finishes, another is requested from your server, the launch is attempted (and fails), and then the new job sits in the queue unstarted with an idle GPU available to calculate. That job is still sitting in the job queue; it has probably been rescheduled to run on the CPU (I am a bit hazy on how the boinc scheduler works, as this is the first time I have tried it).

I am pretty sure the reason for this is that I use the nvidia-smi utility to mark one of my two GPUs as compute exclusive or compute prohibited so I can use it for other things. When a GPU is marked as either compute exclusive or compute prohibited, the driver limits the number of contexts permitted on the device to 1 or 0 respectively. Any host thread which tries to establish a context on a GPU marked in this way will receive an error from the CUDA driver and won't be able to continue.

It seems that what is happening is that your code is finding a compute 1.3 capable gpu, but then not checking the compute mode, which results in the failure I am seeing.

I do a lot of CUDA development myself and am happy to suggest how to fix this [it should probably be a modification to your choose_cuda_13() function]. In my own apps I usually do something like this:

void gpuIdentify(struct gpuThread * g)
{
    char compModeString[maxstring];
    char identstring[maxstring];

    /* Query the device handle, name, properties, total memory, compute
       mode and compute capability via the driver API. */
    gpuAssert( cuDeviceGet(&g->deviceHandle, g->deviceNumber) );
    gpuAssert( cuDeviceGetName(g->deviceName, maxstring, g->deviceHandle) );
    gpuAssert( cuDeviceGetProperties(&g->deviceProps, g->deviceHandle) );
    gpuAssert( cuDeviceTotalMem(&g->deviceMemoryTot, g->deviceHandle) );
    gpuAssert( cuDeviceGetAttribute(&g->deviceCompMode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, g->deviceHandle) );
    gpuAssert( cuDeviceComputeCapability(&g->deviceCC[0], &g->deviceCC[1], g->deviceHandle) );

    /* Translate the compute mode attribute into something readable. */
    switch (g->deviceCompMode) {
    case CU_COMPUTEMODE_PROHIBITED:
        sprintf(compModeString, "Compute Prohibited mode");
        break;
    case CU_COMPUTEMODE_DEFAULT:
        sprintf(compModeString, "Normal mode");
        break;
    case CU_COMPUTEMODE_EXCLUSIVE:
        sprintf(compModeString, "Compute Exclusive mode");
        break;
    default:
        sprintf(compModeString, "Unknown");
        break;
    }

    /* Emit a one line summary of the device to the diagnostic log. */
    sprintf(identstring, "%d %s, %d MHz, %d Mb, Compute Capability %d.%d, %s", 
            g->deviceNumber, g->deviceName, g->deviceProps.clockRate/1000,
            g->deviceMemoryTot / constMb, g->deviceCC[0], g->deviceCC[1], compModeString);

    gpuDiagMsg(stderr, identstring, __FILE__, __LINE__);
}


The query for CU_DEVICE_ATTRIBUTE_COMPUTE_MODE is the salient part here. This shouldn't be set to anything other than 0 (normal mode) on OS X or Windows (except perhaps with the new compute-only Tesla driver NVIDIA released a few months ago), but on Linux it can be set to other values, depending on the GPU administration policies on the box it runs on.
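Concretely, the selection logic could take the compute mode into account along these lines (a rough sketch, not tested against your tree; chooseUsableDevice() and the skip policy are just mine):

/* Rough sketch: pick the first compute capability >= 1.3 device that is
   not in Compute Prohibited mode. Not the milkyway code, just the shape
   of the check I am suggesting. */
#include <cuda.h>

static int chooseUsableDevice(CUdevice *chosen)
{
    int count, i, major, minor, mode;

    if (cuInit(0) != CUDA_SUCCESS) return -1;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS) return -1;

    for (i = 0; i < count; i++) {
        CUdevice dev;
        if (cuDeviceGet(&dev, i) != CUDA_SUCCESS) continue;
        cuDeviceComputeCapability(&major, &minor, dev);
        cuDeviceGetAttribute(&mode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);

        /* needs double precision: compute capability 1.3 or better */
        if (major < 1 || (major == 1 && minor < 3)) continue;

        /* never try to establish a context on a prohibited device; an
           exclusive device may already be owned by another process, so a
           later cuCtxCreate() can still fail and must be handled too */
        if (mode == CU_COMPUTEMODE_PROHIBITED) continue;

        *chosen = dev;
        return i;
    }
    return -1;   /* no usable CUDA device found */
}

Handling an exclusive mode device that is already busy still needs a check on the return code of cuCtxCreate(), but at least prohibited devices never get picked.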

Thanks for your time.

EDIT: It occurred to me after I posted this that your GPU selection code is going to fail on Fermi, because it will reject compute 2.0 when it probably shouldn't (actually so will that code snippet I posted, for the same reason...)



