
CUDA Milkyway Application v0.05

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 22954 - Posted: 21 May 2009, 22:26:35 UTC
Last modified: 1 Jun 2009, 8:23:50 UTC

I've released the code in the code release directory.

Releases:

  • v0.02 and v0.03 added files missing in the v0.01 release.
  • v0.04 should fix the compile script (it still needs to be edited to point at where CUDA is installed on your computer)
  • v0.05 combined the Linux and OS X makefiles into a single makefile, which should also compile the GPU version of the application. It adds evaluation_gpu2.cu, a faster version of the GPU code.



The major CUDA code is in evaluation_gpu.cu and evaluation_gpu2.cu. Right now we're not using gpu_coords and gpu_r_constants, which were just test files to see if I could generate the inputs to the integral calculation on the GPU as well (we can, but there isn't any performance benefit, and it sacrifices accuracy on single precision GPUs).


Anyway, the TODO for the GPU application right now is:


  • get checkpointing working
  • update the GPU code to allow compilation in single or double precision (currently it's single precision only)
  • general code cleanup (it's pretty rough)
  • test Kahan summation for accuracy and performance on single precision GPUs
  • test code on parameter sets with multiple integrals (pretty sure this is working)



Also, we're going to be setting up an SVN repository with anonymous access so you guys can download current code and stable versions, which should make releasing our code easier.

--Travis


Brian Silvers

Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 22965 - Posted: 21 May 2009, 23:39:24 UTC

IMO, with all the clamoring over the past few months, I think you should stress that this is a work in progress at an alpha level and that people should EXPECT there to be problems. Of course, I'm sure it won't stop the inevitable complaining, but maybe it will help lessen it...
Profile Westsail and *Pyxey*

Joined: 22 Mar 08
Posts: 65
Credit: 15,715,071
RAC: 0
Message 22967 - Posted: 22 May 2009, 0:09:04 UTC

This is fantastic news. THANK YOU very much to everyone involved in making this dream into reality.

I volunteer to help with any testing that is needed. I have AMD and Intel boxen and can use an 8400, 9500GT, 9800GTX, 260, and/or C1060. I also have access to an ATI card.

Best of luck with moving the project forward. Aloha! -Brandon
[B^S] Beremat

Joined: 19 Feb 09
Posts: 33
Credit: 1,134,826
RAC: 0
Message 22970 - Posted: 22 May 2009, 0:32:33 UTC


Awesome news! Good luck with the todo list.

Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 22972 - Posted: 22 May 2009, 0:38:01 UTC - in response to Message 22954.  

The major CUDA code is in evaluation_gpu.cu. Right now we're not using gpu_coords and gpu_r_constants, which were just test files to see if I could generate the inputs to the integral calculation on the GPU as well (we can, but there isn't any performance benefit, and it sacrifices accuracy on single precision GPUs).

I was just scrolling through the code, and it looks like the general layout is quite similar to what the ATI app uses now (you are perfectly right, some things are just not worth doing on the GPU). I was wondering what you have changed in the actual integration for the DP<->SP switch. I have to get some sleep now (it's already 2:35am here and I have to work tomorrow, err, today), but I've seen nothing peculiar. Or have you hidden it somewhere else?
As I have already said in another thread here, with single precision the order in which all the individual values are summed can become important, especially as it may deviate quite a bit between different GPUs. But I guess the simplest thing is to test it.
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 22977 - Posted: 22 May 2009, 1:04:59 UTC - in response to Message 22972.  

The major CUDA code is in evaluation_gpu.cu. Right now we're not using gpu_coords and gpu_r_constants, which were just test files to see if I could generate the inputs to the integral calculation on the GPU as well (we can, but there isn't any performance benefit, and it sacrifices accuracy on single precision GPUs).

I was just scrolling through the code, and it looks like the general layout is quite similar to what the ATI app uses now (you are perfectly right, some things are just not worth doing on the GPU). I was wondering what you have changed in the actual integration for the DP<->SP switch. I have to get some sleep now (it's already 2:35am here and I have to work tomorrow, err, today), but I've seen nothing peculiar. Or have you hidden it somewhere else?
As I have already said in another thread here, with single precision the order in which all the individual values are summed can become important, especially as it may deviate quite a bit between different GPUs. But I guess the simplest thing is to test it.


Most of the single/double precision stuff is going to be handled server-side by MW_gpu@home. As it is, the CUDA code is giving me about 6-7 decimal places of accuracy, and while that isn't exactly great, it's more than enough to help out for quite some time in the initial stages of any search.

Part of the single precision testing was to see how much crunching I could get away with on the GPU side and how much it affected accuracy. That's why I'm not using gpu_coords and gpu_r_constants; there's also a second likelihood kernel in there which does more GPU crunching that I don't think we'll be using -- it also happens to be slower.

The rest of the single precision testing was trying different summation methods for all the little values. What's in there now I think gives us the best accuracy (each thread has its own sum, then at the end all of those are summed). I tried partial sums and the CUDA reduce, but that seemed to be a bit buggy (I'm not sure why), and they didn't give particularly better results.
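For illustration, a minimal sketch of that scheme (per-thread single precision sums on the GPU, promoted to double and reduced on the host); the kernel and variable names are invented for the example and are not the application's actual symbols:

#include <cuda_runtime.h>

/* Sketch only: each thread keeps its own float accumulator; the host promotes
   the per-thread results to double before the final reduction. */
__global__ void sum_per_thread(const float *values, int n_per_thread, float *thread_sums)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;                              /* private per-thread sum */
    for (int i = 0; i < n_per_thread; i++)
        sum += values[tid * n_per_thread + i];
    thread_sums[tid] = sum;                        /* one float per thread copied back */
}

double reduce_on_host(const float *host_sums, int n_threads)
{
    double total = 0.0;                            /* accumulate in double on the CPU */
    for (int i = 0; i < n_threads; i++)
        total += (double) host_sums[i];
    return total;
}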
Profile Slicker [TopGun]

Joined: 20 Mar 08
Posts: 46
Credit: 69,382,802
RAC: 0
Message 23019 - Posted: 22 May 2009, 13:19:09 UTC

To get the CUDA lib and BOINC libs to play nice, my solution has been to split the code into CUDA and non-CUDA sections and not include the BOINC headers in any of the CUDA code. Instead, introduce a middle tier in C which is nothing more than a pass-through between the BOINC C code and the CUDA code.

e.g. instead of
Boinc-Code -> CUDA-Code

you would do
Boinc-Code -> non-Boinc-CUDA-wrapper w/o boinc headers -> CUDA-Code

If you need to do I/O in the CUDA portion of the app, you can pass an MFILE pointer to the CUDA code and include only the MFILE.H header, and it will still "play nice" with the CUDA code. So you can still write buffered output within the CUDA code.
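A rough sketch of that layering, assuming made-up file and function names (only boinc_init and boinc_finish are the real BOINC calls):

/* boinc_side.cpp -- the only file that includes the BOINC headers */
#include "boinc_api.h"
#include "cuda_bridge.h"              /* plain C interface, no BOINC or CUDA headers */

int main(int argc, char **argv)
{
    boinc_init();
    double result = run_integral_on_gpu(0);   /* pass-through into the CUDA side */
    (void) result;                             /* ... write output, checkpoint, etc. ... */
    boinc_finish(0);
    return 0;
}

/* cuda_bridge.h -- the middle tier; knows nothing about BOINC */
#ifdef __cplusplus
extern "C" {
#endif
double run_integral_on_gpu(int which_integral);
#ifdef __cplusplus
}
#endif

/* cuda_bridge.cu -- includes the CUDA headers and implements run_integral_on_gpu() */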
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 23032 - Posted: 22 May 2009, 18:08:50 UTC - in response to Message 23019.  

To get the CUDA lib and BOINC libs to play nice, my solution has been to split the code into CUDA and non-CUDA sections and not include the BOINC headers in any of the CUDA code. Instead, introduce a middle tier in C which is nothing more than a pass-through between the BOINC C code and the CUDA code.

e.g. instead of
Boinc-Code -> CUDA-Code

you would do
Boinc-Code -> non-Boinc-CUDA-wrapper w/o boinc headers -> CUDA-Code

If you need to do I/O in the CUDA portion of the app, you can pass an MFILE pointer to the CUDA code and include only the MFILE.H header, and it will still "play nice" with the CUDA code. So you can still write buffered output within the CUDA code.


This is basically what I'm doing now... there's no I/O in the .cu files.

The problem seems to be using a struct from the C code in the CUDA code. I'm going to do a little rewrite today and see if that fixes it.
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 23042 - Posted: 22 May 2009, 19:56:05 UTC - in response to Message 22977.  
Last modified: 22 May 2009, 20:09:39 UTC

As I have already said in another thread here, with single precision the order in which all the individual values are summed can become important, especially as it may deviate quite a bit between different GPUs. But I guess the simplest thing is to test it.

The rest of the single precision testing was trying different summation methods for all the little values. What's in there now I think gives us the best accuracy (each thread has its own sum, then at the end all of those are summed). I tried partial sums and the CUDA reduce, but that seemed to be a bit buggy (I'm not sure why), and they didn't give particularly better results.

Ahh, I see now that you do the reduction on the CPU after converting the values to doubles. That of course avoids the precision loss and the problems with different summation orders compared to a GPU reduction. I was already wondering how you got 6 to 7 digits of precision (with a GPU reduction in single precision I would expect maybe 5 to 6). The tradeoff is that you have to transfer the individual values over the PCI Express interface, but I guess that is bearable (most probably even for old AGP cards). Compared to the scheme the current ATI app uses (a reduction on the GPU after each integration kernel, not just one reduction for the whole integral), it could even be a bit faster.

As you basically have integrals over 3 dimensions (mu, nu, r) and one integration kernel does an integration in a 2-dimensional plane, have you thought about comparing different orientations of the "integration kernel plane"? In your current code one kernel integrates in the nu-r plane and is called mu times. What do you think, could it be better to integrate in the mu-r plane and call it nu times (r should be in the plane either way, for better reuse of the lookup tables and cache efficiency)?
As mu is normally a factor of 10 bigger than nu, one would need fewer kernel calls (less overhead), and only a tenth of the values would be added by the GPU (with a potential loss of half a bit of precision every time). One would of course need a factor of 10 more memory on the card (should not be an issue), and one has to transfer 10 times as much data back to the CPU (about 13MB for a two-stream WU). But considering this has to be done only every 10 seconds or so on high end cards (medium or low end cards would of course take longer for the calculation), that doesn't seem like much.
Maybe it's not a huge effect on precision, but I could imagine it results in a smaller deviation from the double precision result on the CPU. So maybe it's worth looking into?

And maybe related: I told you quite some time ago (in a PM) that I'm only doing the integration on the GPU. The likelihood stuff is done completely on the CPU in the ATI app. I thought there was too much communication involved for that little computation. Maybe it's faster than the CPU (a 2 GHz Core2 needs about a second for it) on a high end graphics card, but with slower GPUs (especially compared with a vectorized SSE2 build) it isn't worth it. You have said yourself that the likelihood kernel that does more on the GPU (and less on the CPU) is slower than the one you use now. I would expect that a pure CPU likelihood computation wouldn't be slower than the current solution. But I guess it doesn't matter at all, because it concerns less than one percent of the total computation.

PS:
I'm missing the evaluation folder in the code release. I think one needs it to build a fully functional app (one that can handle the bigger WUs of the GPU project).
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 23050 - Posted: 22 May 2009, 21:13:06 UTC - in response to Message 23042.  


As you basically have integrals over 3 dimensions (mu, nu, r) and one integration kernel does an integration in a 2-dimensional plane, have you thought about comparing different orientations of the "integration kernel plane"? In your current code one kernel integrates in the nu-r plane and is called mu times. What do you think, could it be better to integrate in the mu-r plane and call it nu times (r should be in the plane either way, for better reuse of the lookup tables and cache efficiency)?


Actually, on my first run through the code I had a thread for each convolve (which turned out to give only around a 2x speedup, which was pretty horrible). I was just thinking last night that I could get a pretty good speedup by making the grid dimensions nu by r (instead of r by nu), and then I could put the convolution values in device__r_constants into local memory (which is cached, unlike global memory) for each iteration of the loop; that should be quite a big speedup.

As mu is normally a factor of 10 bigger than nu, one would need fewer kernel calls (less overhead), and only a tenth of the values would be added by the GPU (with a potential loss of half a bit of precision every time). One would of course need a factor of 10 more memory on the card (should not be an issue), and one has to transfer 10 times as much data back to the CPU (about 13MB for a two-stream WU). But considering this has to be done only every 10 seconds or so on high end cards (medium or low end cards would of course take longer for the calculation), that doesn't seem like much.
Maybe it's not a huge effect on precision, but I could imagine it results in a smaller deviation from the double precision result on the CPU. So maybe it's worth looking into?


At least with the graphics card I was using, there's a limit of 512 threads for any block, so that's why I put nu as the number of threads (as opposed to r or mu). On better cards it might be worthwhile to do it the other way around.

And maybe related: I told you quite some time ago (in a PM) that I'm only doing the integration on the GPU. The likelihood stuff is done completely on the CPU in the ATI app. I thought there was too much communication involved for that little computation. Maybe it's faster than the CPU (a 2 GHz Core2 needs about a second for it) on a high end graphics card, but with slower GPUs (especially compared with a vectorized SSE2 build) it isn't worth it. You have said yourself that the likelihood kernel that does more on the GPU (and less on the CPU) is slower than the one you use now. I would expect that a pure CPU likelihood computation wouldn't be slower than the current solution. But I guess it doesn't matter at all, because it concerns less than one percent of the total computation.


Actually, it seems like the likelihood calculation is a bit faster on my GPU the way I have it set up; it's maybe 1-2 seconds, if that. But like you said, it's not that big of a deal given that the integration is most of the work. However, one of the new astronomy students working with us this summer might start doing runs with multiple streams, so we might have significantly larger numbers of stars to crunch.

PS:
I'm missing the evaluation folder in the code release. I think one needs it to build a fully functional app (one that can handle the bigger WUs of the GPU project).


Did you download the v0.02 code? It should have all the stuff required in the evaluation folder.
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 23067 - Posted: 23 May 2009, 0:12:22 UTC - in response to Message 23050.  

Actually, on my first run through the code I had a thread for each convolve (which turned out to give only around a 2x speedup, which was pretty horrible). I was just thinking last night that I could get a pretty good speedup by making the grid dimensions nu by r (instead of r by nu), and then I could put the convolution values in device__r_constants into local memory (which is cached, unlike global memory) for each iteration of the loop; that should be quite a big speedup.

Reads from global memory (which is the RAM on the graphics card, if I remember the terminology correctly) should be cached (but access to local memory is even faster than the L1 cache). One should choose the layout which makes the best use of the caches (spatial locality). But it does not have much of an effect, as the bandwidth requirements are not that high with the MW code. That's also why I doubt you would see any pronounced effect from using local memory. It's true it is quite low latency, but as long as you have enough threads in flight, latency does not matter on a GPU. From my experience the kernels are completely compute bound (I have counted about 50 computational instructions per memory fetch; it will be a bit less with SP, but you get the point) even on ATI GPUs. They reach more than 95% of theoretical instruction throughput with the DP code, so there are virtually no stalls waiting for memory. As Nvidia cards generally have a higher memory bandwidth per execution unit available than ATI, it will have even less of an effect, I think.

At least with the graphics card I was using, there's a limit of 512 threads for any block, so that's why I put nu as the number of threads (as opposed to r or mu). On better cards it might be worthwhile to do it the other way around.

I'm not that familiar with CUDA, but you can assemble grids made out of several blocks to overcome this, can't you? As you don't have any communication between the threads in the integration kernels, it doesn't matter how you structure the execution domain (besides some effect on the caching efficiency). So it should be possible to use a block as a two-dimensional tile of the integration plane (and not a line). The memory layout may get a bit more complicated in that case. Too bad CUDA does not offer some sort of automated memory tile layout (or does it?). That should work quite well for MW.
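For example, a launch configuration along these lines would tile the mu-r plane with 16x16 blocks, so no single block dimension comes near the 512-thread limit (the kernel and its arguments are invented for the sketch, not the project's actual code):

__global__ void gpu__integral_kernel_2d(int mu_steps, int r_steps)
{
    int mu = blockIdx.x * blockDim.x + threadIdx.x;
    int r  = blockIdx.y * blockDim.y + threadIdx.y;
    if (mu >= mu_steps || r >= r_steps) return;    /* guard the partial edge tiles */
    /* ... integrate one (mu, r) cell over the convolution here ... */
}

void launch_integral_2d(int mu_steps, int r_steps)
{
    dim3 block(16, 16);                            /* 256 threads per block */
    dim3 grid((mu_steps + block.x - 1) / block.x,  /* tiles along mu */
              (r_steps  + block.y - 1) / block.y); /* tiles along r  */
    gpu__integral_kernel_2d<<<grid, block>>>(mu_steps, r_steps);
}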

PS:
I'm missing the evaluation folder in the code release. I think one needs it to build a fully functional app (one that can handle the bigger WUs of the GPU project).

Did you download the v0.02 code? It should have all the stuff required in the evaluation folder.

Yes, I've downloaded the 0.02 version and there is no evaluation folder at all.
Emanuel

Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 23102 - Posted: 23 May 2009, 13:07:22 UTC

Out of curiosity Travis, have you seen this article? It discusses various techniques for reducing precision loss when doing summation, including their performance penalties. To sum up: the best technique is Kahan summation, although it might involve refactoring some code.
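For reference, the basic technique is only a few lines; a minimal single precision version (the function name is arbitrary):

/* Kahan (compensated) summation: the low-order bits lost in each addition are
   carried in a correction term and folded into the next value. */
float kahan_sum(const float *values, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;                  /* running compensation */
    for (int i = 0; i < n; i++) {
        float y = values[i] - c;       /* apply the correction to the next term */
        float t = sum + y;             /* low bits of y are lost in this addition */
        c = (t - sum) - y;             /* recover what was lost */
        sum = t;
    }
    return sum;
}

One caveat: the compiler must not be allowed to re-associate these additions (for example with aggressive fast-math style optimizations), or the correction term can be optimized away.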
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 23120 - Posted: 23 May 2009, 16:07:42 UTC - in response to Message 23102.  
Last modified: 23 May 2009, 16:13:15 UTC

Out of curiosity Travis, have you seen this article? It discusses various techniques for reducing precision loss when doing summation, including their performance penalties. To sum up: the best technique is Kahan summation, although it might involve refactoring some code.

Nice find! That article is a good explanation of the problem and some solutions to it. To cite the conclusion:
Over a large number of runs a pattern becomes obvious: Simple and Sorted Summation are tied for least accurate, Pairwise Summation is a little better, Kahan Summation is better still, and the champion is Extended Precision

What a reduction on a GPU normally does is a pairwise summation, as used by the current ATI app (it actually adds more than a pair per step; a better name may be "tree summation"). Travis has now opted for extended precision, so a Kahan summation would not improve the result.
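For comparison, the tree summation that a per-block GPU reduction typically performs looks roughly like this (the block size and buffer names are assumptions for the sketch):

#define BLOCK_SIZE 256

/* Sketch of a per-block tree ("pairwise") reduction in shared memory. */
__global__ void block_tree_sum(const float *in, float *block_sums)
{
    __shared__ float s[BLOCK_SIZE];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];             /* pair off partial sums */
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = s[0];             /* one partial sum per block */
}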
Emanuel

Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 23165 - Posted: 24 May 2009, 0:24:35 UTC

Yeah, I was wondering if it might be useful for the single precision part of the program. There's also Extended Precision Kahan Summation if you want to squeeze a touch more accuracy out of the extended precision calculations ;)

The article is one of the top hits if you google 'Kahan Summation'; I came across the term on the forum for the Netflix Prize, so all credit to them (and Kahan!)
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 23167 - Posted: 24 May 2009, 0:54:46 UTC

The standard floating point units based on the 8087 concept almost always convert single and double precision values into 80-bit precision values on the coprocessor and then use programmer-selected truncation or rounding back to the output number size. It is hard to find good information about the FP units embedded in more recent processors, so I cannot say whether this is still true (though it most likely is, at least on Intel processors, as doing otherwise would likely cost customers)...

On some iterative systems even that is not enough to prevent the gradual erosion of precision. In some places we had to use double externally and still lost down to 10 digits of accuracy on very long runs.

I think on some of the Borland compilers you were able to use the 80-bit precision externally; not that it matters, since we are not going to be using their compilers ...

YMMV :)
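A quick way to see the effect of the accumulator width; on many x86 toolchains long double maps to the 80-bit x87 format, though that is platform dependent, so treat this as illustrative only:

#include <stdio.h>

int main(void)
{
    float       fs = 0.0f;
    double      ds = 0.0;
    long double es = 0.0L;             /* 80-bit extended on typical x87 targets */
    for (int i = 0; i < 10000000; i++) {
        fs += 0.1f;                    /* single precision accumulator drifts badly */
        ds += 0.1;
        es += 0.1L;
    }
    printf("float: %f  double: %f  long double: %Lf\n", fs, ds, es);
    return 0;
}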
Emanuel

Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 23211 - Posted: 24 May 2009, 16:50:24 UTC
Last modified: 24 May 2009, 16:51:00 UTC

Yeah, as far as I'm aware, double precision floating point values are treated as 80-bit (64-bit mantissa) unless you specify a compilation flag to make them 64-bit, but that doesn't mean they're not susceptible to errors. The Kahan summation concept is pretty clever in that it just saves the error so you can take it into account in the next calculation.
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 23347 - Posted: 25 May 2009, 20:56:50 UTC - in response to Message 23211.  
Last modified: 25 May 2009, 20:58:08 UTC

One way I think we can squeeze out some extra precision is to sum each value during each iteration of the loop. I'm going to check how much slower it is and whether we get any extra precision from it:

for (i = 0; i < number_integrals; i++) {
        dim3 dimGrid(r_steps[i], mu_increment);

        gpu__zero_integrals<2><<<dimGrid, nu_steps[i]>>>(device__background_integrals[i], device__stream_integrals[i]);
        for (j = 0; j < mu_steps[i]; j += mu_increment) {
                switch (number_streams) {
                        case 1: gpu__integral_kernel<1><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                        q, r0,
                                        device__lb[i], device__r_constants[i], device__V[i],
                                        device__background_integrals[i],
                                        device__stream_integrals[i]);
                                break;
                        case 2: gpu__integral_kernel<2><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                        q, r0,
                                        device__lb[i], device__r_constants[i], device__V[i],
                                        device__background_integrals[i],
                                        device__stream_integrals[i]);
                                break;
                        case 3: gpu__integral_kernel<3><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                        q, r0,
                                        device__lb[i], device__r_constants[i], device__V[i],
                                        device__background_integrals[i],
                                        device__stream_integrals[i]);
                                break;
                        case 4: gpu__integral_kernel<4><<<dimGrid, nu_steps[i]>>>(convolve, j, mu_steps[i],
                                        q, r0,
                                        device__lb[i], device__r_constants[i], device__V[i],
                                        device__background_integrals[i],
                                        device__stream_integrals[i]);
                                break;
                }
        }
        cpu__sum_integrals(i, &background_integral, stream_integrals);
}


So instead of summing at the end, we could sum during each step of the inner loop, which might squeeze out some more precision, instead of each output data point being accumulated in GPU memory as a float by gpu__integral_kernel.

I'm not sure how much it'll hurt us in terms of performance or how much we'll gain in accuracy, but there's only one way to find out :)
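Roughly, the idea would be to move the copy-back and a double precision accumulation inside the mu loop, something like the sketch below (the host buffer and count names are made up, and the zeroing of the device buffers would also have to move inside the loop):

/* Sketch only: fold each mu step's float results into double accumulators on
   the host, then clear the device buffers before the next step. */
for (j = 0; j < mu_steps[i]; j += mu_increment) {
        /* ... launch gpu__integral_kernel as above ... */
        cudaMemcpy(host_background, device__background_integrals[i],
                   n_values * sizeof(float), cudaMemcpyDeviceToHost);
        for (int k = 0; k < n_values; k++)
                background_integral += (double) host_background[k];   /* promote before adding */
        gpu__zero_integrals<2><<<dimGrid, nu_steps[i]>>>(device__background_integrals[i],
                                                         device__stream_integrals[i]);
}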
SATAN

Joined: 27 Feb 09
Posts: 45
Credit: 305,963
RAC: 0
Message 23567 - Posted: 28 May 2009, 16:56:27 UTC
Last modified: 28 May 2009, 16:57:13 UTC

Travis,

It's been a while since I messed around with code and compiling applications, but we are missing the simple_evaluator.h file.

It is referenced in many of the .c files.

Also, when I go to compile the files it tells me there are a lot of syntax errors. I am sure you are more than aware of these, though.

I did make some progress the other night. I managed to get an init_data.xml file along with a stderr.txt file.

"Can't open init data file - running in standalone mode
Couldn't find input file [astronomy_parameters.txt] to read astronomy parameters.
APP: error reading astronomy parameters: 1
called boinc_finish"
- is the contents of the stderr file.
Mars rules this confectionery war!
[B^S] Beremat

Joined: 19 Feb 09
Posts: 33
Credit: 1,134,826
RAC: 0
Message 23568 - Posted: 28 May 2009, 18:42:21 UTC

There are a bunch of parameters in the test_files folder that have numbers after them, which probably go with the stars.txt files.

Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 23605 - Posted: 28 May 2009, 23:23:27 UTC - in response to Message 23347.  
Last modified: 28 May 2009, 23:24:52 UTC

One way I think we can squeeze out some extra precision is to sum each value during each iteration of the loop. I'm going to check how much slower it is and whether we get any extra precision from it:

[..]

So instead of summing at the end, we could sum during each step of the inner loop, which might squeeze out some more precision, instead of each output data point being accumulated in GPU memory as a float by gpu__integral_kernel.

I'm not sure how much it'll hurt us in terms of performance or how much we'll gain in accuracy, but there's only one way to find out :)

While it will indeed provide some precision gain, you will have to copy at least 1.4GB of data (for a single stream; dual stream would be 2GB+) back from the GPU to the CPU per WU. This will immediately prevent any AGP cards from performing anywhere close to their computational limits. You also shift some more computation back to the CPU, leading to a higher CPU load, which is not so preferable in my opinion (one should try to get the least interference with CPU projects on the same machine without sacrificing performance of the GPU code).

Before doing that, I would prefer to perform a better summation (like Kahan) on the GPU (doing it within each convolution loop is really easy, and it can also be done for the summation of the values from the mu loop, even if it may require additional array[s] for the correction values on the GPU).

And have you thought about switching mu and nu as suggested here? That means looping over nu in host code and calculating the integral in the mu-r plane. That would be something of a compromise: one would add only a tenth of the values on the GPU compared to your original version (actually almost the same number, but into a 10 times larger array, so you have substantially fewer additions performed in a row), but it retains the performance benefit of transferring much less data back to the CPU (only 1/160 of your proposal above), combined with a lower CPU load. Furthermore, it operates on a larger domain of execution (1600x700 instead of 160x700), which increases the number of concurrently processed threads (a performance gain) combined with less overhead from the kernel calls; both are likely to compensate for (or even overcompensate) the performance loss from the need to transfer 10 times as much data back to the CPU. But that would be only about 9 to 13.5 MB per WU, so much more tolerable than the several GBs needed for doing the summation entirely on the CPU.
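A per-thread compensated sum inside the convolution loop would look roughly like this (everything here is illustrative, not the actual kernel):

__global__ void convolve_with_kahan(const float *terms, int convolve, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    float c   = 0.0f;                               /* per-thread correction term */
    for (int k = 0; k < convolve; k++) {
        float y = terms[tid * convolve + k] - c;
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    out[tid] = sum;                                 /* compensated single precision sum */
}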