Joined: 30 Aug 07
I'm going to be updating to v0.05 tonight. I've got a working makefile that combines the OS X/Linux makefiles and can also compile the GPU application. I don't have a Linux machine with a GPU yet, so you'll have to test and see if the Linux parts of the makefile are working.
I've got two separate evaluation kernels in the new code:
evaluation_gpu.cu is the same as before: the grid dimensions are r_steps by mu_steps, iterating over mu in the outside loop, and the number of threads per block in the grid is nu_steps.
evaluation_gpu2.cu has new grid dimensions, mu_steps by r_steps, iterating over r in the outside loop, with the number of threads per block the same, still nu_steps.
evaluation_gpu2 moves the required r_constants into constant memory before doing the calculation, so they can be cached across all the blocks/threads (AFAIK). Unfortunately this was only around a 2-4% improvement in performance. It might be faster to keep the r_constants on the device and move them into constant memory on each iteration of the outer loop, so I'll be trying this in another kernel.
Right now what gets transferred back to the CPU (in evaluation_gpu2) is mu_steps * nu_steps * R_INCREMENT * (1 + number_streams) floats. I tried summing the data on each iteration of the loop and really didn't see any performance difference. I think my GPU just isn't particularly fast so it's highly computation bound.
Once our new undergrad/grad student (who has worked with CUDA for quite a while and knows a lot more than I do) gets back, which should be next week, we'll probably be able to make a bit more progress, since he knows what he's doing and has access to faster GPUs :) I've been learning as I go.
As to swapping mu and nu: I think I said this in another post, but it would be problematic because (at least on my GPU) the thread limit per block is 512, which is less than nu. I suppose I could do multiple nu steps in each thread, which might be better on faster GPUs that aren't compute bound.
Joined: 26 Jul 08
Have you seen my post in the other thread? It looks like Kahan summation on the GPU may already be enough to get the desired precision. I ditched the reduction on the CPU and now do it completely on the GPU. I don't know yet how much the precision of the final result is influenced by the likelihood calculation (done in DP on the CPU in my version; I have to check what a change to SP will do). But comparing only the pure integral values, I also get about 7 decimal digits of precision compared to a DP calculation.
I guess the layout of the integral (either the mu-r plane looping over nu, or the nu-r plane looping over mu) isn't that crucial. My choice only lowers the number of kernel calls and offers more parallel threads. It may not have much influence on performance with current cards, as a kernel call costs in the region of ten(s of) microseconds or so. But I guess the summations are the place where one really starts to lose precision. I've now chosen a somewhat safe route and do really every sum (also within the convolution loop) with a variant of the Kahan method.