Message boards : Application Code Discussion : CUDA Milkyway Application v0.05
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I've released the code in the code release directory. Releases:
|
Send message Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0 |
IMO, with all the clamoring over the past, what, few months, I think you should stress that this is a work-in-progress that is at an alpha level and that people should EXPECT there to be problems. Of course, I'm sure it won't stop the inevitable complaining, but maybe it will help lessen it... |
Send message Joined: 22 Mar 08 Posts: 65 Credit: 15,715,071 RAC: 0 |
This is fantastic news. THANK YOU very much to everyone involved in making this dream a reality. I volunteer to help with any testing that is needed. Have AMD and Intel boxen. Can use: 8400, 9500GT, 9800GTX, 260, and/or C1060. Also have access to an ATI card. Best of luck moving the project forward. Aloha! -Brandon |
Send message Joined: 19 Feb 09 Posts: 33 Credit: 1,134,826 RAC: 0 |
Awesome news! Good luck with the todo list. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
The major CUDA code is in evaluation_gpu.cu; right now we're not using gpu_coords and gpu_r_constants, which were just test files to see if I could generate the inputs to the integral calculation on the GPU as well (we can, but there isn't any performance benefit and it sacrifices accuracy for single-precision GPUs).

I was just scrolling through the code and it looks like the general layout is quite similar to what the ATI app uses now (you are perfectly right, some things are just not worth doing on the GPU). I was wondering what you have changed in the actual integration regarding the DP<->SP change. I have to get some sleep now (it's 2:35am here already and I have to work tomorrow, err today), but I've seen nothing peculiar. Or have you hidden it somewhere else?

As I have said already in some thread here, with single precision the order of summing up all the individual values can become important, especially as it may deviate quite a lot between different GPUs. But I guess the simplest thing is to test it. |
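To make the last point concrete, here is a small stand-alone C demo (not taken from the MW source; the series and term count are arbitrary choices for illustration) showing how the order of single-precision additions alone can change a result:

    #include <stdio.h>

    int main(void) {
        float large_first = 0.0f, small_first = 0.0f;
        int i;
        /* sum the same terms largest-first and smallest-first */
        for (i = 1; i <= 1000000; i++) large_first += 1.0f / ((float)i * (float)i);
        for (i = 1000000; i >= 1; i--) small_first += 1.0f / ((float)i * (float)i);
        /* the two single-precision sums typically differ in the last digits */
        printf("largest first:  %.9f\nsmallest first: %.9f\n", large_first, small_first);
        return 0;
    }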
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
The major CUDA code is in evaluation_gpu.cu; right now we're not using gpu_coords and gpu_r_constants, which were just test files to see if I could generate the inputs to the integral calculation on the GPU as well (we can, but there isn't any performance benefit and it sacrifices accuracy for single-precision GPUs). Most of the single/double precision stuff is going to be handled server-side by MW_gpu@home. As it is, the CUDA code is giving me about 6-7 decimal places of accuracy, and while that isn't exactly great, it's more than enough to help out for quite some time in the initial stages of any search.

Part of the single precision testing was to see how much crunching I could get away with GPU side and how much it affected accuracy. That's why I'm not using gpu_coords and gpu_r_constants; there's also a second likelihood kernel in there which does more GPU crunching that I don't think we'll be using -- it also happens to be slower.

The rest of the single precision testing was using different summation methods for all the little values. What's in there now I think is giving us the best accuracy (each thread has its own sum, then at the end all those are summed). I tried partial sums and the CUDA reduce, but that seemed to be a bit buggy (I'm not sure why), and they didn't give particularly better results. |
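For readers following along, a simplified sketch of the per-thread-sum scheme described above (all names are invented for illustration and this is not the actual evaluation_gpu.cu code): each GPU thread keeps its own single-precision running sum, and the host copies the per-thread results back and adds them up in double precision.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* each thread accumulates its own partial sum into its own slot */
    __global__ void partial_sums_kernel(const float *values, int n_per_thread,
                                        float *thread_sums) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int i = 0; i < n_per_thread; i++)
            sum += values[tid * n_per_thread + i];
        thread_sums[tid] = sum;
    }

    /* the final reduction happens on the host, in double precision */
    double sum_on_host(const float *device_thread_sums, int n_threads) {
        float *host_sums = (float *) malloc(n_threads * sizeof(float));
        cudaMemcpy(host_sums, device_thread_sums, n_threads * sizeof(float),
                   cudaMemcpyDeviceToHost);
        double total = 0.0;
        for (int i = 0; i < n_threads; i++) total += (double) host_sums[i];
        free(host_sums);
        return total;
    }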
Send message Joined: 20 Mar 08 Posts: 46 Credit: 69,382,802 RAC: 0 |
To get the cuda lib and boinc libs to play nice, my solution has been to split the code into cuda and non-cuda sections and not include the boinc headers in any of the cuda code. Instead, introduce a middle tier in C which is nothing more than a pass-thru between the boinc C code and the cuda code. E.g., instead of

    Boinc-Code -> CUDA-Code

you would do

    Boinc-Code -> non-Boinc CUDA wrapper (w/o boinc headers) -> CUDA-Code

If you need to do I/O in the cuda portion of the app, you can pass an MFILE pointer to the cuda code and include only the MFILE.H header, and it will still "play nice" with the cuda code. So, you can still write buffered output within the cuda code. |
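A minimal sketch of that middle tier, with file and function names invented for illustration (this is not the actual MW layout): the bridge header exposes a plain C interface with no BOINC and no CUDA types, so the BOINC side and the .cu side never see each other's headers.

    /* ---- gpu_bridge.h : shared interface, no BOINC or CUDA includes ---- */
    #ifdef __cplusplus
    extern "C" {
    #endif
    int gpu_evaluate(const double *parameters, int n_parameters, double *likelihood);
    #ifdef __cplusplus
    }
    #endif

    /* ---- gpu_bridge.cu : compiled by nvcc, includes CUDA headers only ----
     * extern "C" int gpu_evaluate(const double *parameters, int n_parameters,
     *                             double *likelihood) {
     *     ...allocate device buffers, launch the kernels, copy the result back...
     *     return 0;
     * }
     */

    /* ---- main.c : compiled by the host compiler, includes boinc_api.h only ----
     * boinc_init();
     * retval = gpu_evaluate(parameters, n_parameters, &likelihood);
     * boinc_finish(retval);
     */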
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
To get the cuda lib and boinc libs to play nice, my solution has been to split the code into cuda and non-cuda sections and not include the boinc headers in any of the cuda code. Instead, introduce a middle tier in C which is nothing more than a pass-thru between the boinc C code and the cuda code. This basically is what I'm doing now... there's no I/O in the .cu files. The problem seems to be using a struct from the c code in the cuda code. I'm going to do a little rewrite today and see if that fixes it. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
As I have said already in some thread here, with single precision the order of summing up all the individual values can become important, especially as it may deviate quite a lot between different GPUs. But I guess the simplest thing is to test it.

Ahh, I see now that you do the reduction on the CPU after converting the values to doubles. That of course avoids the precision loss and the problems with different summation orders compared to a GPU reduction. I was already wondering how you got the 6 to 7 digits of precision (with a GPU reduction in single precision I would expect maybe 5 to 6). The tradeoff is that you have to transfer the individual values over the PCI-Express interface. But I guess that is bearable (most probably even for old AGP cards). Compared to the scheme the current ATI app is using (reduction on the GPU after each integration kernel, not only one reduction for the whole integral), it could even be a bit faster.

As you basically have integrals over 3 dimensions (mu, nu, r) and one integration kernel does an integration in a 2-dimensional plane, have you thought about comparing different orientations of the "integration kernel plane"? In your current code one kernel integrates in the nu-r plane and is called mu times. What do you think, could it be better to integrate in the mu-r plane and call it nu times (r should be in the plane either way for better reuse of the lookup tables and cache efficiency)? As mu is normally a factor of 10 bigger than nu, one would need fewer kernel calls (less overhead), and also only a tenth of the values would be added by the GPU (with a potential loss of half a bit of precision every time). One would of course need a factor of 10 more memory on the card (should not be an issue), and one has to transfer 10 times as much data back to the CPU (would be about 13MB for a two-stream WU). But if one considers that this has to be done only every 10 seconds or so on high-end cards (medium or low-end cards would of course need longer for the calculation), it does not appear to be that much. Maybe it's not a huge effect on precision, but I could imagine it results in a smaller deviation from the double precision result on the CPU. So maybe it's worth looking into?

And maybe related: I told you quite some time ago (it was a PM) that I'm only doing the integration on the GPU. The likelihood stuff is completely done on the CPU in the ATI app. I thought there was too much communication involved for that little computation. Maybe it's faster than a CPU (a 2 GHz Core2 needs about a second for it) on a high-end graphics card, but with slower GPUs (especially when compared with a vectorized SSE2 build) it isn't worth it. You have said yourself that using the likelihood kernel that does more on the GPU (and less on the CPU) is slower than the one you use now. I would expect that a pure CPU likelihood computation wouldn't be slower than the current solution. But I guess it doesn't matter at all because it concerns less than one percent of the total computation.

PS: The evaluation folder seems to be missing from the code release. I think one needs it to build a fully functional app (one that can handle the bigger WUs of the GPU project). |
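To illustrate the proposed reorientation, here is a rough CUDA sketch (kernel names and the placeholder integrand are invented; the step counts follow the numbers quoted later in the thread, roughly mu ~ 1600, nu ~ 160, r ~ 700): the host-side loop count is also the number of single-precision additions each per-cell accumulator sees.

    /* current layout: block = nu, grid = r, one launch per mu step,
     * so each of the nu*r accumulators receives ~1600 additions */
    __global__ void nu_r_slice(float *acc, int mu_step) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        acc[idx] += 1.0f / (float)(mu_step + 1);   /* stand-in for the real integrand */
    }

    /* proposed layout: mu tiled over grid.y, one launch per nu step,
     * so each of the mu*r accumulators receives only ~160 additions */
    __global__ void mu_r_slice(float *acc, int nu_step) {
        int mu  = blockIdx.y * blockDim.x + threadIdx.x;
        int idx = blockIdx.x * gridDim.y * blockDim.x + mu;
        acc[idx] += 1.0f / (float)(nu_step + 1);   /* stand-in for the real integrand */
    }

    void compare_layouts(float *d_acc_nu_r, float *d_acc_mu_r) {
        const int mu_steps = 1600, nu_steps = 160, r_steps = 700;
        for (int j = 0; j < mu_steps; j++)              /* 1600 kernel launches */
            nu_r_slice<<<r_steps, nu_steps>>>(d_acc_nu_r, j);
        dim3 grid(r_steps, mu_steps / 200);             /* 200 threads per block, 8 tiles of mu */
        for (int k = 0; k < nu_steps; k++)              /* only 160 kernel launches */
            mu_r_slice<<<grid, 200>>>(d_acc_mu_r, k);
    }

The cost of the second layout is the factor of 10 more device memory and the factor of 10 more data copied back, as noted above.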
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Actually my first run through the code I had a thread per convolve (which turned out to only be around a 2x speedup, which was pretty horrible). I was just thinking last night that I could get a pretty good speedup by making the grid dimensions nu by r (instead of r by nu), and then I could put the convolution values in device__r_constants in local memory (which is cached, unlike the global memory) for each iteration of the loop, and this should be quite a big speedup.

As mu is normally a factor of 10 bigger than nu, one would need fewer kernel calls (less overhead), and also only a tenth of the values would be added by the GPU (with a potential loss of half a bit of precision every time). One would of course need a factor of 10 more memory on the card (should not be an issue), and one has to transfer 10 times as much data back to the CPU (would be about 13MB for a two-stream WU). But if one considers that this has to be done only every 10 seconds or so on high-end cards (medium or low-end cards would of course need longer for the calculation), it does not appear to be that much.

At least with the graphics card I was using, there's a limit of 512 threads for any block, so that's why I put nu as the number of threads (as opposed to r or mu). On better cards it might be worthwhile to do it the other way around.

And maybe related: I told you quite some time ago (it was a PM) that I'm only doing the integration on the GPU. The likelihood stuff is completely done on the CPU in the ATI app. I thought there was too much communication involved for that little computation. Maybe it's faster than a CPU (a 2 GHz Core2 needs about a second for it) on a high-end graphics card, but with slower GPUs (especially when compared with a vectorized SSE2 build) it isn't worth it. You have said yourself that using the likelihood kernel that does more on the GPU (and less on the CPU) is slower than the one you use now. I would expect that a pure CPU likelihood computation wouldn't be slower than the current solution. But I guess it doesn't matter at all because it concerns less than one percent of the total computation.

Actually it seems like the likelihood calculation is a bit faster on my GPU the way I have it set up; it's maybe 1-2 seconds, if that. But like you said it's not that big of a deal given that the integration is most of the work. However, one of the new astronomy students we have working with us this summer might start doing runs with multiple streams, so we might have significantly larger numbers of stars to crunch.

PS: Did you download the v0.02 code? It should have all the stuff required in the evaluation folder. |
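As an aside, here is a hedged sketch of what staging the per-r convolution constants into CUDA's on-chip shared memory could look like (the fast per-block storage; CUDA's own "local memory" term actually refers to per-thread spill space). All names and the convolve count are invented for illustration; this is not the actual kernel.

    #define CONVOLVE 120   /* assumed number of convolution points, for illustration only */

    __global__ void integral_slice(const float *r_constants, const float *cell_values,
                                   float *acc, int n_cells) {
        __shared__ float shared_rc[CONVOLVE];

        /* each block copies the constants for its r step into shared memory once */
        for (int i = threadIdx.x; i < CONVOLVE; i += blockDim.x)
            shared_rc[i] = r_constants[blockIdx.x * CONVOLVE + i];
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n_cells) return;

        float sum = 0.0f;
        for (int i = 0; i < CONVOLVE; i++)         /* the inner loop now reads shared memory */
            sum += shared_rc[i] * cell_values[tid];
        acc[tid] += sum;
    }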
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
Actually my first run through the code I had a thread per convolve (which turned out to only be around a 2x speedup, which was pretty horrible). I was just thinking last night that I could get a pretty good speedup by making the grid dimensions nu by r (instead of r by nu), and then I could put the convolution values in device__r_constants in local memory (which is cached, unlike the global memory) for each iteration of the loop, and this should be quite a big speedup.

Reads from global memory (which is the RAM on the graphics card, if I remember the terminology correctly) should be cached (but access to local memory is even faster than the L1 cache). One should choose the layout which makes the best use of the caches (spatial locality). But it does not have much of an effect, as the bandwidth requirements are not that high with the MW code. That's also why I doubt you would see any pronounced effect from using the local memory. It's true that it has quite a low latency, but as long as you have enough threads in flight, latency does not matter on a GPU. From my experience the kernels are completely compute bound (I have counted about 50 computational instructions per memory fetch; it will be a bit less with SP, but you get the point) even on ATI GPUs. They reach more than 95% of the theoretical instruction throughput with the DP code, so there are virtually no stalls waiting for memory. As Nvidia cards generally have a higher memory bandwidth per execution unit available than ATI, it will have even less of an effect, I think.

At least with the graphics card I was using, there's a limit of 512 threads for any block, so that's why I put nu as the number of threads (as opposed to r or mu). On better cards it might be worthwhile to do it the other way around.

I'm not that familiar with CUDA, but you can assemble grids made out of several blocks to overcome this, can't you? As you don't have any communication between the threads in the integration kernels, it doesn't matter how you structure the execution domain (besides some effect on the caching efficiency). So it should be possible to use a block as a two-dimensional tile of the integration plane (and not a line). The memory layout may get a bit more complicated in that case. Too bad CUDA does not offer some sort of automated memory tile layout (or does it?). That should work quite well for MW.

PS: Yes, I've downloaded the 0.02 version and there is no evaluation folder at all. |
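CUDA grids can indeed be built from many blocks; a minimal sketch of a two-dimensional tiling (names invented, the integrand is a placeholder) could look like the following, where a 16x16 block covers a tile of the mu-r plane and the grid is sized to cover the whole domain, sidestepping the 512-threads-per-block limit:

    __global__ void tile_kernel(float *acc, int mu_steps, int r_steps) {
        int mu = blockIdx.x * blockDim.x + threadIdx.x;
        int r  = blockIdx.y * blockDim.y + threadIdx.y;
        if (mu >= mu_steps || r >= r_steps) return;   /* guard the ragged edges */
        acc[r * mu_steps + mu] += 1.0f;               /* placeholder for the real integrand */
    }

    void launch_tiled(float *d_acc, int mu_steps, int r_steps) {
        dim3 block(16, 16);                           /* 256 threads, well under the 512 limit */
        dim3 grid((mu_steps + block.x - 1) / block.x,
                  (r_steps  + block.y - 1) / block.y);
        tile_kernel<<<grid, block>>>(d_acc, mu_steps, r_steps);
    }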
Send message Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
Out of curiosity Travis, have you seen this article? It discusses various techniques for reducing precision loss when doing summation, including their performance penalties. To sum up: the best technique is Kahan summation, although it might involve refactoring some code. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
Out of curiosity Travis, have you seen this article? It discusses various techniques for reducing precision loss when doing summation, including their performance penalties. To sum up: the best technique is Kahan summation, although it might involve refactoring some code.

Nice find! That article is a good explanation of the problem and some solutions to it. To cite the conclusion: "Over a large number of runs a pattern becomes obvious: Simple and Sorted Summation are tied for least accurate, Pairwise Summation is a little better, Kahan Summation is better still, and the champion is Extended Precision."

What a reduction on a GPU normally does is a pairwise summation, as used by the current ATI app (it actually adds more than a pair per run, so a better name may be "tree summation"). Travis has now opted for extended precision. So a Kahan summation would not improve the result. |
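For reference, this is the textbook single-precision Kahan (compensated) summation in plain C, not code from the MW application; note that it only works if the compiler is not allowed to optimize the compensation term away (e.g. no -ffast-math).

    float kahan_sum(const float *x, int n) {
        float sum = 0.0f;
        float c   = 0.0f;                 /* running compensation for lost low-order bits */
        for (int i = 0; i < n; i++) {
            float y = x[i] - c;           /* apply the error carried from the last step */
            float t = sum + y;            /* low-order bits of y may be lost here... */
            c = (t - sum) - y;            /* ...but this recovers them */
            sum = t;
        }
        return sum;
    }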
Send message Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
Yeah, I was wondering if it might be useful for the single precision part of the program. There's also Extended Precision Kahan Summation if you want to squeeze a touch more accuracy out of the extended precision calculations ;) The article is one of the top hits if you google 'Kahan Summation'; I came across the term on the forum for the Netflix Prize, so all credit to them (and Kahan!) |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
The standard floating point units based on the 8087 concept almost always convert single and double precision values into 80-bit precision values on the coprocessor and then use programmer-selected truncation or rounding back to the output number size. It is hard to find good information about the FP units embedded in more recent processors, so I cannot say whether this is still true (though it most likely is, at least on Intel processors, as doing otherwise would likely cost customers)... On some iterative systems even that is not enough to prevent the gradual erosion of precision. In some places we had to use double externally and still lost down to 10 digits of accuracy on very long runs. I think on some of the Borland compilers you were able to use the 80-bit precision externally; not that it matters, as we are not going to be using their compilers ... YMMV :) |
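As a small illustration of the effect being described (a stand-alone C demo, not related to the MW code, with an arbitrarily chosen series), the same single-precision terms are accumulated below in a float, a double, and a long double, the last of which maps to the 80-bit extended format on x87 hardware with most compilers:

    #include <stdio.h>

    int main(void) {
        float f = 0.0f;
        double d = 0.0;
        long double e = 0.0L;
        for (int i = 1; i <= 10000000; i++) {
            float term = 1.0f / (float)i;     /* identical single-precision terms */
            f += term;                        /* float accumulator: error builds up quickly */
            d += (double)term;                /* double accumulator: much smaller error */
            e += (long double)term;           /* 80-bit accumulator where available */
        }
        printf("float: %.10f\ndouble: %.15f\nlong double: %.18Lf\n", f, d, e);
        return 0;
    }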
Send message Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
Yeah, as far as I'm aware double precision floating point values are treated as 80-bit (64-bit mantissa), unless you specify a compilation flag to make them 64-bit, but that doesn't mean they're not susceptible to errors. The Kahan summation concept is pretty clever in that it just saves the error so you can take it into account for the next calculation. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
One way I think we can squeeze out some extra precision is summing each value during each iteration of the loop; I'm going to check how much slower it is and whether we get any extra precision from it:

    for (i = 0; i < number_integrals; i++) {
        dim3 dimGrid(r_steps[i], mu_increment);

        gpu__zero_integrals<2><<<dimGrid, nu_steps[i]>>>(device__background_integrals[i], device__stream_integrals[i]);
        for (j = 0; j < mu_steps[i]; j += mu_increment) {
            switch (number_streams) {
                case 1: gpu__integral_kernel<1><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                q, r0,
                                device__lb[i], device__r_constants[i], device__V[i],
                                device__background_integrals[i],
                                device__stream_integrals[i]);
                        break;
                case 2: gpu__integral_kernel<2><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                q, r0,
                                device__lb[i], device__r_constants[i], device__V[i],
                                device__background_integrals[i],
                                device__stream_integrals[i]);
                        break;
                case 3: gpu__integral_kernel<3><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                q, r0,
                                device__lb[i], device__r_constants[i], device__V[i],
                                device__background_integrals[i],
                                device__stream_integrals[i]);
                        break;
                case 4: gpu__integral_kernel<4><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                q, r0,
                                device__lb[i], device__r_constants[i], device__V[i],
                                device__background_integrals[i],
                                device__stream_integrals[i]);
                        break;
            }
        }
        cpu__sum_integrals(i, &background_integral, stream_integrals);
    }

So instead of summing at the end, we could sum during each step of the inner loop, which might squeeze out some more precision, instead of each output data point being summed in GPU memory as a float by gpu__integral_kernel. I'm not sure how much it'll hurt us in terms of performance and how much we'll gain in accuracy, but there's only one way to find out :) |
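Roughly, the proposed change amounts to moving the host-side summation into the mu loop (a sketch only, reusing the names above); presumably the device-side accumulators would then also be zeroed again before each slice so nothing is counted twice:

    for (j = 0; j < mu_steps[i]; j += mu_increment) {
        /* ...launch gpu__integral_kernel for this mu slice exactly as above... */
        cpu__sum_integrals(i, &background_integral, stream_integrals);   /* fold this slice into doubles now */
        gpu__zero_integrals<2><<<dimGrid, nu_steps[i]>>>(device__background_integrals[i],
                                                         device__stream_integrals[i]);
    }

The price, as discussed below, is that the per-cell results have to cross the PCI-Express bus once per mu slice instead of once per integral.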
Send message Joined: 27 Feb 09 Posts: 45 Credit: 305,963 RAC: 0 |
Travis,

It's been a while since I messed around with code and compiling applications, but we are missing the simple_evaluator.h file. It is referenced in many of the .c files. Also, when I go to compile the files it is telling me that there are a lot of syntax errors. I am sure you are more than aware of these though. I did make some progress the other night. I managed to get an init_data.xml file along with a stderr.txt file. The contents of the stderr file are:

    Can't open init data file - running in standalone mode
    Couldn't find input file [astronomy_parameters.txt] to read astronomy parameters.
    APP: error reading astronomy parameters: 1
    called boinc_finish

Mars rules this confectionery war! |
Send message Joined: 19 Feb 09 Posts: 33 Credit: 1,134,826 RAC: 0 |
There's a bunch of parameters in the test_files folder that have numbers after them that probably go with the stars.txt files. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
One way I think we can squeeze out some extra precision is summing each value during each iteration of the loop; I'm going to check how much slower it is and whether we get any extra precision from it:

While it will indeed provide some precision gain, you will have to copy at least 1.4GB of data (for single stream; dual stream would be 2GB+) back from the GPU to the CPU per WU. This will immediately prevent any AGP cards from performing anywhere close to their computational limits. You also shift some more computation back to the CPU, leading to a higher CPU load, something not so preferable in my opinion (one should try to get the least interference with CPU projects on the same machine without sacrificing performance of the GPU code).

Before doing that I would prefer to perform a better summation (like Kahan) on the GPU (doing it within each convolution loop is really easy, and it can also be done for the summation of the values from the mu loop, even if that may require the use of additional array[s] for the correction values on the GPU).

And have you thought about switching mu and nu as suggested here? That means looping over nu in host code and calculating the integral in the mu-r plane. That would be something of a compromise, as one would add only a tenth of the values on the GPU compared to your original version (actually almost the same number, but into a 10 times larger array, so you have substantially fewer additions performed in a row), but it retains the performance benefit of transferring much less data back to the CPU (only 1/160 of your proposal above) combined with a lower CPU load. Furthermore, it operates on a larger domain of execution (1600x700 instead of 160x700), which increases the number of concurrently processed threads (a performance gain) combined with less overhead from the kernel calls, both likely to compensate (or even overcompensate) for the performance loss from the need to transfer 10 times as much data back to the CPU. But that would be only about 9 to 13.5 MB per WU, so much more tolerable than the several GBs needed for doing the summation entirely on the CPU. |
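For what the GPU-side Kahan idea could look like across the mu-slice kernel calls, here is a rough sketch (invented names, not actual MW code) in which every accumulator cell gets a companion correction cell in device memory, which is the extra array mentioned above:

    __global__ void add_slice_kahan(float *acc, float *corr, const float *slice, int n_cells) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_cells) return;
        float y = slice[i] - corr[i];     /* apply the correction carried from previous slices */
        float t = acc[i] + y;
        corr[i] = (t - acc[i]) - y;       /* store the low-order bits lost in this addition */
        acc[i] = t;
    }

The same compensation trick inside the convolution loop needs only two registers per thread and no extra arrays.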