Welcome to MilkyWay@home

Nbody on CUDA

Message boards : Application Code Discussion : Nbody on CUDA
Ian&Steve C.
Joined: 18 Nov 22
Posts: 97
Credit: 653,649,234
RAC: 14,805
Message 77957 - Posted: 7 May 2026, 15:57:05 UTC
Last modified: 7 May 2026, 16:02:32 UTC

For the past several days I've been porting the N-body CPU application to CUDA with the help of AI (Claude Code). I want to be very clear that the AI has done almost all of the work, but under my direction/supervision. I don't know enough about software dev for GPU applications to do this myself, but I do know how to do testing/validation and simulations, and I know enough to guide the AI in the right direction. It's about 10k lines of new code right now, half of which is a port of some crlibm functions for use in CUDA.

There is already an initial OpenCL baseline in the nbody src, but it mostly doesn't work, or doesn't work well enough, so it was used as a template to translate into CUDA kernels. Through exhaustive investigation and testing, many bugs from the OpenCL code were fixed in this CUDA implementation. I'm sure many will ask why port to CUDA instead of staying with OpenCL; the reality is that it's far easier to work in CUDA due to the consistency and predictability of the Nvidia ecosystem (drivers, hardware). Some of the bugs that were fixed in CUDA would be a nightmare to fix in OpenCL (and some things I needed to do in CUDA have no FP64 analogs in OpenCL), and it would probably require separate Nvidia/AMD/Intel-specific builds to deal with all the differences in the architectures and how the compiler generates code. While CUDA excludes everything but Nvidia, I don't think I could have gotten the app into an acceptable and portable state with OpenCL. Plus, I have very limited OpenCL devices to work and test with anyway.

This app is almost pure FP64 compute and isn't really limited by memory, so heavy-hitting FP64 GPUs like the P100/V100/A100 etc. are really the only worthwhile candidates to run it on. I mean, you CAN run it on any CUDA GPU, but it won't make much sense to, as CPU crunching will be a lot faster than normal GeForce/RTX cards due to their relatively poor FP64 performance.

The precision required demands FP64, and trying to use FP32 is basically impossible. Even now with full FP64, I'm still getting invalids on ~45% of results because the result precision just isn't high enough. Based on what I've seen pass/fail validation, it seems the validator is requiring ~1e-6 precision, and sometimes I make that, sometimes I don't (1e-4 or more off). There's still some 1-2 ULP slip somewhere that I'm trying to pin down. Some longer-running tasks just drift out of validator precision, and a smaller percentage randomly blow up earlier, which leads to larger differences in the likelihood.

Code is here for anyone interested: https://github.com/IanSteveC/milkywayathome_client_cuda

I'll update it as I continue working on it and try to improve the accuracy of results. Currently exploring the BH algorithm to try to bit-match the CPU code.

As far as performance, a V100 seems to be around 5-10x faster than a CPU running 4 threads. Maybe not quite as efficient as a high-core-count CPU, but V100s (SXM2->PCIe conversion) are quite cheap these days and it's not too hard to put together a multi-GPU rig. There have also been zero attempts at performance optimization yet (another thing CUDA makes easier); this is only trying to match results to CPU.

Also keep in mind, this is very much a work in progress. If you attempt to build and use this yourself, you will need to set up an app_info that makes BOINC treat this as a CPU task, since the project will not send you GPU work as there are no GPU apps. This also limits you to running on single-GPU systems; some kind of wrapper would be needed for multi-GPU to work effectively. That's out of scope for right now.
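For anyone trying this, a minimal anonymous-platform app_info.xml along these lines should make BOINC run the binary as a plain CPU app. The app name, file name, and version number below are placeholders; check your client's existing files and the BOINC documentation for the exact values:

```xml
<app_info>
    <app>
        <name>milkyway_nbody</name>
    </app>
    <file_info>
        <name>milkyway_nbody_cuda</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>milkyway_nbody</app_name>
        <version_num>182</version_num>
        <!-- no <coproc> element, so BOINC schedules this as a CPU task -->
        <avg_ncpus>1</avg_ncpus>
        <file_ref>
            <file_name>milkyway_nbody_cuda</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
```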

a few questions for the Project Admins:
    1. For the validator, is only the top likelihood checked, or do all 8 parameters need to match?

    2. How much precision is ACTUALLY required for a result to be "valid"? 1e-6 is an incredibly high bar to reach. CPU-CPU runs will match bit for bit, but it's very difficult to do the same thing efficiently on GPU. If a result matches all parameters to 1e-3, is that really a "different" result, or just an artifact of the validator being too tight? At which level of precision do you consider it a different result? Loosening the validator a bit would certainly help a lot of the invalids I've seen already.

    3. This app adopted the existing GPU constraints for the supported potentials in the Lua file. Currently Halo:Logarithmic, NFW, and NFWMass (I added this one) with LMC:Plummer are supported, but other Halo types (Triaxial, AS, WE, Plummer halo, etc.) are not implemented in CUDA. They could probably be added, but it seems the current tasks aren't using them anyway. Are there plans to use these other potential types in the future that should be added? Unsupported types fail over to CPU right now.

    4. If we can get some consensus on the required precision, maybe adjust the validator to match, and I can get the app to consistently meet that goal, would the project be interested in setting this up to be supported in the normal way, rather than with run-as-CPU hacks via anonymous platform? An app, plan_class, and app_version would need to be set up project-side so the scheduler can send tasks and the app to CUDA devices.



Welcome any feedback.



Ian&Steve C.
Message 77958 - Posted: 7 May 2026, 19:04:24 UTC
Last modified: 7 May 2026, 19:36:19 UTC

Results are looking a lot better on most tasks now with the latest update. The tasks that are going invalid or inconclusive (with a compare) seem to be missing the mark by around a 1e-4 difference from the CPU reference values.

If we can raise the validation threshold to 1e-3, that would probably clear a lot of these being marked invalid. Is that a possibility?

Xterelle

Joined: 7 Aug 22
Posts: 33
Credit: 22,705,144
RAC: 18,514
Message 77959 - Posted: 8 May 2026, 15:12:11 UTC

Should we consider splitting the load between the CPU and GPU, like on einstein@home?
Ian&Steve C.
Message 77960 - Posted: 8 May 2026, 15:37:58 UTC - in response to Message 77959.  
Last modified: 8 May 2026, 15:38:35 UTC

Should we consider splitting the load between the CPU and GPU, like on einstein@home?


One thing to note is that Einstein does fundamentally different science between CPU and GPU. They are not crunching the same thing.

What I've done here is simply port the existing CPU N-body app to run on CUDA. It's the same science, so ideally the results should match, at least very closely. But due to very large architectural differences between GPU and CPU, summation order, and other factors, it's incredibly difficult to get bit-identical results from such different devices.

The CUDA app right now does match very closely in most cases, in my opinion. About half of the cases are a good enough match to pass the existing validation limits, and many of the invalids are just outside those limits but still in strong agreement with the canonical result, just not enough for the validator at present.

There are still some outliers that I’m using for re-analysis to identify more bugs. And IMO these are correctly being marked invalid.

Splitting N-body into GPU and CPU subprojects would likely mask this and could lead to accepting some invalid results, which I do not think is the right approach. I haven't found a clear "type" of WU that seems more susceptible to errors caused by chaos.

Loosening the validator limits a bit would allow those very close matches to validate, while still correctly marking the truly invalid results.

Xterelle

Message 77961 - Posted: 8 May 2026, 16:11:58 UTC

By the way, I was once looking for information about N-body simulation. Maybe it will help with development somehow, I don't even know... https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-31-fast-n-body-simulation-cuda
Ian&Steve C.
Message 77962 - Posted: 8 May 2026, 18:07:41 UTC - in response to Message 77961.  
Last modified: 8 May 2026, 18:12:00 UTC

Interesting, maybe something in there could be helpful. Though I can tell it's woefully out of date, as they're talking about running on the very old G80 (8800 GT) architecture. Also not sure if they sacrificed precision in their methods while optimizing for speed.

The latest update that was just made should bring this CUDA app to produce exactly identical results to CPU (barring any more outliers). I will monitor and reinvestigate as needed. But if all is good, then this app is prime for adoption, at the project's discretion. Otherwise I'll have to make some kind of wrapper to properly assign tasks to GPUs in multi-GPU systems; an official project application would negate needing to do that.

Updating the validator won't be necessary for the latest code, I think. But it could still be helpful when I move on to optimization.

gimmyk
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 11 Sep 24
Posts: 33
Credit: 710,354
RAC: 477
Message 77963 - Posted: 9 May 2026, 0:50:29 UTC

To answer some of your questions:

I believe only the total likelihood is checked for the validator. I personally think we probably could get away with loosening the validator a little bit, but I don't think we will change it unless we have good reason to. We may have a reason for wanting it tighter that I am not thinking of at the moment (for the record, our notes recommend using 1e-10).

We plan to soon begin getting results for different potentials, but most of these halo types will probably go unused. Logarithmic, NFW, and spherical NFW are the ones I can say we have immediate plans for. The most likely after that is probably the other LMC functions.

We will probably only support this if we see a lot of people saying that it is something they are interested in. We have a hard enough time keeping up with our own code and this would be a large patch that we would have to learn and then continue to keep up with. I'll bring this up to the group, but we will probably decide there are other things we want to focus on first. It's always nice getting more tasks running though, so we might come back to this later on. Even if this doesn't get in right away, we really appreciate you helping to improve things!
Ian&Steve C.
Message 77964 - Posted: 9 May 2026, 1:28:47 UTC - in response to Message 77963.  

Thanks gimmyk!

The latest code seems very solid. It's doing pretty much exactly the same calculation and returns exactly identical results to the CPU now (all 8 metrics), at least on my test system (V100). So I think all the bugs for the current potentials are sorted.

Right now, any unsupported potentials fail over to CPU, so at least the app won't outright fail tasks. Actually, the same binary can be used for CPU or GPU crunching. It uses a new parameter, --use-cuda, to enable GPU processing; without it, everything runs on CPU.

Mark

Joined: 30 Aug 18
Posts: 1
Credit: 1,959,359,695
RAC: 366,685
Message 77965 - Posted: 9 May 2026, 1:39:58 UTC

I am interested, but could you also make a Windows client?
Icecold

Joined: 22 May 13
Posts: 4
Credit: 844,658,667
RAC: 498,610
Message 77966 - Posted: 9 May 2026, 1:45:46 UTC

I've tested the Linux app and it seems to work well. I haven't run a ton of tasks yet, but none have failed or failed validation so far. It would be great if it could be ported in and become a supported part of MW without needing an app_info and app_config.
Xterelle

Message 77967 - Posted: 9 May 2026, 2:24:49 UTC - in response to Message 77965.  

I am interested, but could you also make a windows client ?

+
bobsmith18

Joined: 1 Nov 10
Posts: 43
Credit: 3,436,407
RAC: 6,140
Message 77968 - Posted: 9 May 2026, 6:28:32 UTC

For a Windows build, I'd be interested in testing on older "consumer grade" GPUs, as I have a pair of GTX 1070 Tis gathering dust.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
bobsmith18

Message 77969 - Posted: 9 May 2026, 9:57:52 UTC - in response to Message 77968.  

Forgot to add: this presupposes that you are going to do such an app.


©2026 Astroinformatics Group