New Poll Regarding GPU Application of N-Body

Author	Message
Eric Mendelsohn Volunteer moderator Project administrator Project developer Project scientist Joined: 21 Aug 18 Posts: 59 Credit: 5,350,675 RAC: 0	Message 70988 - Posted: 21 Jul 2021, 17:46:03 UTC Hey everyone, we are currently looking at making a GPU version of N-Body. This code has been under development for quite some time, and the base code is finally working, though we would still need to implement some other features to run it alongside the CPU version. However, due to the complexity of our code and our need for double precision, the GPU version has a similar runtime to that of the CPU version, though there may be some speed-up on professional-grade GPU cards. For reference, the GPU version of the Separation code is roughly 50-60 times faster than its CPU counterpart, depending on the machine. Keeping that in mind, do you guys still want a GPU version of N-Body? I have put up a basic straw poll at https://www.strawpoll.me/45510486. If you wish to elaborate on your choice, please feel free to comment below. Thank you all for your input, time, and consideration, -Eric

DJStarfox Joined: 29 Sep 10 Posts: 54 Credit: 1,439,592 RAC: 605	Message 70990 - Posted: 21 Jul 2021, 21:06:11 UTC - in response to Message 70988. Last modified: 21 Jul 2021, 21:06:37 UTC When you say it's not much faster than a CPU, do you mean single-core CPU vs single GPU? Or do you mean an 8- or 12-thread CPU vs a 512 CUDA core GPU? Because N-Body can scale itself to multiple threads, and I imagine the GPU version also uses all CUDA cores on a GPU...

Speedy51 Joined: 12 Jun 10 Posts: 57 Credit: 6,506,692 RAC: 712	Message 70991 - Posted: 21 Jul 2021, 22:28:02 UTC Eric, if you want to increase the speed of the N-Body project, you believe you have the server resources available, and you are able to increase the speed, I say go for it. I have voted accordingly.

FurryGuy Joined: 1 Aug 11 Posts: 10 Credit: 51,374,490 RAC: 0	Message 70992 - Posted: 21 Jul 2021, 23:08:07 UTC Not just yes for a GPU app for N-Body, but HELL YES! Faster, slower, same time: it's no big deal for me.

dylansheils0241 Joined: 10 Jan 21 Posts: 4 Credit: 56 RAC: 0	Message 70994 - Posted: 22 Jul 2021, 3:41:32 UTC - in response to Message 70990. The comparison performed was specifically between an i9-10900K and an RTX 3070, using all available compute cores for both.

bozz4science Joined: 3 May 20 Posts: 1 Credit: 2,683,043 RAC: 0	Message 70995 - Posted: 22 Jul 2021, 8:41:43 UTC If you already have the code (almost) ready to be deployed, why not just go for it? I'd definitely like to see for myself how a 10/12/14-threaded CPU WU compares to a mid- to low-end NVIDIA GPU. Especially curious about util and power consumption. After running a while, everyone can decide for themselves what app to run. Anyway, this will likely boost overall productivity! Thanks for your effort!

zioriga Joined: 30 Aug 07 Posts: 17 Credit: 67,440,459 RAC: 7,190	Message 70996 - Posted: 22 Jul 2021, 9:07:36 UTC And do you mean NVidia or ATI GPUs? Usually ATI cards are faster than NVidia cards when double precision is required.

dylansheils0241 Joined: 10 Jan 21 Posts: 4 Credit: 56 RAC: 0	Message 70997 - Posted: 22 Jul 2021, 11:55:10 UTC - in response to Message 70996. The comparison was between an RTX 3090 and an i9-10900K, using all compute cores for each platform. You are also correct that AMD typically invests more in FP64 performance for consumer cards than Nvidia; I, myself, am quite curious to see how AMD cards perform.

Ironslug Joined: 4 Nov 10 Posts: 1 Credit: 563,717,626 RAC: 0	Message 70999 - Posted: 22 Jul 2021, 14:36:59 UTC - in response to Message 70988. Last modified: 22 Jul 2021, 15:05:21 UTC My current AMD GPUs complete a Separation 1.46 WU in about 2.5 minutes. My 16-core AMD CPU takes approximately 57 minutes to complete a single WU. If you are suggesting the adoption of a GPU implementation that will keep my GPU tied up for nearly an hour to complete only one WU, I would consider it a considerable waste of power and efficiency. Please correct me if I'm missing something here, but I'm failing to see any advantage in throughput or efficiency. Both CPU and GPU tasks are of similar computational size (approximately 42.6 GFLOPs), and my GPUs are each powering through 24-plus WUs an hour. I run several projects, and at the moment MilkyWay@home is primarily a GPU-focused task for me precisely because of the efficiency of the GPU work. I suppose that I will have to wait until you actually release your GPU product to test it live, but if I lose the efficiency I'm presently estimating, I will likely have to push my GPUs into other distributed-science projects.
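
The throughput comparison in the post above can be checked with a little arithmetic. This is only an illustrative sketch: the runtimes are the figures quoted in the post (2.5 minutes per WU on the GPU, 57 minutes per WU on the CPU), and the function name `wus_per_hour` is hypothetical, not part of any MilkyWay@home or BOINC code.

```python
def wus_per_hour(minutes_per_wu: float) -> float:
    """Work units completed per hour for a given runtime per work unit."""
    if minutes_per_wu <= 0:
        raise ValueError("runtime per work unit must be positive")
    return 60.0 / minutes_per_wu


# Runtimes as quoted in the post (assumed representative, not measured here).
gpu_rate = wus_per_hour(2.5)
cpu_rate = wus_per_hour(57.0)

print(f"GPU: {gpu_rate:.1f} WU/h, CPU: {cpu_rate:.2f} WU/h, "
      f"ratio: {gpu_rate / cpu_rate:.1f}x")
```

With those inputs the GPU rate comes out at 24 WU per hour, which matches the "24-plus WUs an hour" quoted in the post.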

[AF>Le_Pommier] Jerome_C2005 Joined: 1 Apr 08 Posts: 30 Credit: 84,816,708 RAC: 0	Message 71000 - Posted: 22 Jul 2021, 15:02:26 UTC I'm afraid the poll is a bit light: as said above, we need more details to be able to compare. - Would 1 GPU task do the same amount of work as 1 CPU thread task? (since you already say the processing time will be equivalent) - Would 1 GPU task also need a % of a CPU thread? (as is often the case with GPU tasks, especially with OpenCL; I can only run AMD OpenCL tasks on my iMac) And if so, do you know what %? And again, would that be to do the same amount of work as the regular 1-thread CPU task? (if yes, it would then be counterproductive and not equivalent!)

Sebastian* Joined: 8 Apr 09 Posts: 70 Credit: 11,035,472,190 RAC: 22,265	Message 71001 - Posted: 22 Jul 2021, 15:14:52 UTC Last modified: 22 Jul 2021, 15:15:23 UTC I would be very happy to see longer-running work units on GPUs. Especially on high-performance AMD cards (with a lot of double precision performance), I have to run several WUs in parallel, which causes driver issues. I got some fixed by AMD, but not all. If I could run an N-Body WU on my GPU and it took several hours and loaded the GPU well, it would be great. "The comparison specifically preformed was between a i9-10900k and RTX 3070, using all available compute cores for both." If this is the comparison, then AMD cards with a lot of double precision performance should do well, as should Nvidia cards. A 3070 has roughly 0.3 TFLOPS of double precision performance. A 10900K should turn out about the same, since memory bandwidth is limited on the CPU. I would expect an R9 280X to be 3 times as fast as a 3070, then.

pututu Joined: 24 Aug 17 Posts: 8 Credit: 226,778,180 RAC: 0	Message 71002 - Posted: 22 Jul 2021, 15:55:09 UTC Voted. Some of us are credit-whores, so if you set the credit accordingly, I'm sure these crunchers will gladly participate and help out with their GPUs ;). I don't recommend using BOINC's CreditScrew, I mean CreditNew, system.

adrianxw Joined: 25 May 14 Posts: 31 Credit: 56,773,485 RAC: 0	Message 71003 - Posted: 22 Jul 2021, 16:18:52 UTC If the GPU version is doing the same work as the CPU version in roughly the same time, then, no, don't distribute it. GPU versions have a habit of also needing some CPU, so the net result is some CPU plus a whole GPU for the same effect as a CPU. Waste of resources.

Sebastian* Joined: 8 Apr 09 Posts: 70 Credit: 11,035,472,190 RAC: 22,265	Message 71004 - Posted: 22 Jul 2021, 16:54:51 UTC - in response to Message 71003. Last modified: 22 Jul 2021, 16:55:20 UTC "If the GPU version is doing the same work as the CPU version in roughly the same time, then, no, don't distribute it. GPU versions have a habit of also needing some CPU, so the net result is some CPU plus a whole GPU for the same effect as a CPU. Waste of resources." That is the reason why we asked for the comparison: what CPU was used and what GPU. Consumer Nvidia cards in particular don't have a lot of double precision performance. Even so, they have a lot more memory bandwidth and will be more efficient that way. A 10900K is also not a very common CPU. People will likely have worse CPUs, but with GPUs that have even more double precision performance than the 3070. So it makes sense to release a GPU app :)

dylansheils0241 Joined: 10 Jan 21 Posts: 4 Credit: 56 RAC: 0	Message 71005 - Posted: 22 Jul 2021, 16:55:11 UTC To clarify, the GPU was tested doing the same task as the CPU. Judging purely by spec-sheet teraflops, AMD GPUs should perform much faster than the RTX 3070, as AMD invests more in this feature; however, it remains to be seen whether this advantage will be realized in practice. The main idea here is to recognize that if the GPU and CPU are doing the same amount of work and an average computer has both a GPU and a CPU, it is likely to double the amount of computation performed overall on that part of the network. I would like to point out that, although this slightly affects CPU performance, it is a minimal cost, as the CPU section of the GPU code is designed to be lightweight and to serve only for control purposes; this was confirmed when testing both running at the same time. If anyone wants to get an idea of performance for their specific system, the GitHub repo does have a working version of the GPU code, although it does not yet support some features, such as the LMC.
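
The "roughly double" argument in the post above can be sketched as a small calculation. This is a minimal illustration under stated assumptions, not a measurement: the function `combined_throughput`, its `gpu_host_overhead` parameter, and the 5% overhead value are all hypothetical, chosen only to show how a lightweight CPU-side control thread would affect the total.

```python
def combined_throughput(cpu_rate: float, gpu_rate: float,
                        gpu_host_overhead: float = 0.0) -> float:
    """Total work units per hour when the CPU and GPU apps run together.

    gpu_host_overhead is the fraction of CPU throughput lost to driving
    the GPU task (hypothetical parameter; 0.0 means no loss).
    """
    if not 0.0 <= gpu_host_overhead <= 1.0:
        raise ValueError("gpu_host_overhead must be between 0 and 1")
    return cpu_rate * (1.0 - gpu_host_overhead) + gpu_rate


# Assume the GPU and CPU each finish one WU per hour (equal runtimes, as
# described in the thread) and the GPU app costs 5% of CPU throughput
# (an assumed figure for illustration only).
cpu_only = 1.0
both = combined_throughput(cpu_rate=1.0, gpu_rate=1.0, gpu_host_overhead=0.05)
print(f"CPU only: {cpu_only:.2f} WU/h, CPU + GPU: {both:.2f} WU/h")
```

Under these assumptions the host completes a little under twice as many work units per hour as with the CPU app alone; a heavier CPU-side cost would reduce the gain accordingly.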

kk4jo Joined: 17 Jul 21 Posts: 1 Credit: 418,570 RAC: 0	Message 71006 - Posted: 22 Jul 2021, 17:21:08 UTC - in response to Message 70988. Is this GPU code targeted at a specific GPU, or is it generic enough that it would run on my M1-chipped Apple? That would be great! Thanks, Kerry

dylansheils0241 Joined: 10 Jan 21 Posts: 4 Credit: 56 RAC: 0	Message 71007 - Posted: 22 Jul 2021, 18:51:11 UTC - in response to Message 71006. At the moment, the OpenCL code is targeted to AMD and Nvidia cards only.

Wisesooth Joined: 2 Oct 14 Posts: 43 Credit: 55,516,331 RAC: 0	Message 71008 - Posted: 23 Jul 2021, 2:20:45 UTC - in response to Message 70988. There is an "anomaly" in how the latest version of the BOINC client handles tasks that use the GPU option. It has something to do with the Intel GPU. It can hang a task indefinitely. Tread carefully before checking this out. As Redbeard the pirate might say if he were still alive, "Matey, ye be warned!"

DaiKiwi Joined: 2 Apr 10 Posts: 5 Credit: 169,338,716 RAC: 24,744	Message 71010 - Posted: 23 Jul 2021, 8:10:13 UTC Well, it'd be a 'maybe' for me. AMD's older Cypress/Cayman/Tahiti/Hawaii cards have 2x-4x the raw FP64 performance of that 3070. Perhaps a test with a 'Tahiti' generation card of the same workunit would give us a bit more of an idea?

Astro 1940 Joined: 29 Aug 12 Posts: 2 Credit: 204,504,206 RAC: 0	Message 71013 - Posted: 23 Jul 2021, 16:32:46 UTC - in response to Message 70988. When will the Apple M1 GPU be addressed?