Nvidia RTX 4090 faster than Radeon Pro VII, despite much worse FP64 performance?

Author	Message
moyang_mm Joined: 31 May 23 Posts: 4 Credit: 530,468 RAC: 0	Message 75422 - Posted: 2 Jun 2023, 3:55:22 UTC Last modified: 2 Jun 2023, 4:03:34 UTC
Hello everyone! I am very new to MilkyWay@home, so forgive me if this question is too dumb. I am running Separation on my two GPUs: an Nvidia RTX 4090 and an AMD Radeon Pro VII (not Radeon VII). The former has only ~1.3 TFLOPS of FP64 throughput, and the latter has about 6 TFLOPS. However, the stats show that the average processing rate on my RTX 4090 is 1,284.17 GFLOPS, while my Radeon Pro VII reaches only 937.93 GFLOPS. I wonder why performance on the Radeon Pro VII is so low, and whether there is a way to improve it. Any suggestions are much appreciated!
Link to my host details: https://milkyway.cs.rpi.edu/milkyway/host_app_versions.php?hostid=990067
Milkyway@home Separation 1.46 windows_x86_64 (opencl_nvidia_101)
Number of tasks completed: 89
Max tasks per day: 20089
Number of tasks today: 37
Consecutive valid tasks: 89
Average processing rate: 1,284.17 GFLOPS
Average turnaround time: 0.27 days
Milkyway@home Separation 1.46 windows_x86_64 (opencl_ati_101)
Number of tasks completed: 40
Max tasks per day: 20004
Number of tasks today: 0
Consecutive valid tasks: 4
Average processing rate: 937.93 GFLOPS
Average turnaround time: 0.76 days

Link Joined: 19 Jul 10 Posts: 624 Credit: 19,299,762 RAC: 2,614	Message 75426 - Posted: 2 Jun 2023, 9:50:40 UTC - in response to Message 75422.
You need to run more WUs simultaneously to fully utilize the Radeon GPU, and the same probably goes for the Nvidia. For the standard Radeon VII, IIRC at least 4 are recommended; no idea for the Nvidia. You'll need to create an app_config.xml file for this. If you need help, just ask.
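A minimal sketch of how the app_config.xml mentioned above could be generated, using the 4 tasks per GPU suggested for the Radeon VII. It assumes the Separation application's short name is `milkyway` (an assumption, not stated in this thread; check the name your client reports before using it), and the helper name `render_app_config` is hypothetical. The file belongs in the project's folder inside the BOINC data directory.

```python
from xml.sax.saxutils import escape


def render_app_config(app_name: str, tasks_per_gpu: int, cpu_per_task: float = 1.0) -> str:
    """Build an app_config.xml body that lets `tasks_per_gpu` tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    # Each task claims 1/N of a GPU. Round *down* to 4 decimals so that
    # N tasks never add up to more than one whole GPU.
    gpu_usage = (10000 // tasks_per_gpu) / 10000
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        "    <gpu_versions>\n"
        f"      <gpu_usage>{gpu_usage:.4f}</gpu_usage>\n"
        f"      <cpu_usage>{cpu_per_task:.2f}</cpu_usage>\n"
        "    </gpu_versions>\n"
        "  </app>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # "milkyway" is an assumed app name; 4 is the count suggested in the post above.
    print(render_app_config("milkyway", tasks_per_gpu=4))
```

Note that an app-level `gpu_usage` applies to every GPU running that application, so both cards in a mixed host would get the same task count under this sketch.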

Ian&Steve C. Joined: 18 Nov 22 Posts: 84 Credit: 640,530,847 RAC: 0	Message 75433 - Posted: 3 Jun 2023, 16:11:03 UTC - in response to Message 75422.
Ignore the FLOPS estimates. They are not accurate or reliable. I sampled your last 100 reported results and averaged the runtimes for both the Nvidia and "ati" tasks. It showed that under your current configuration, the Radeon Pro VII is actually running a bit faster than the 4090 (83s average vs. 90s average).
Like Link said, you can get better performance from the Radeon Pro VII by running many tasks at the same time without increasing runtime very much, thereby increasing overall production. You can try the same thing with the 4090, but it probably won't scale in the same way that the Radeon will.
FYI, a Titan V will outperform both of these cards.
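The comparison above rests on runtime per task rather than on the reported GFLOPS figure. A small sketch of that arithmetic: the 83 s and 90 s averages come from the post, while the 4-at-once runtime of 120 s is a made-up illustrative number, not a measurement. The function name `tasks_per_day` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(avg_runtime_s: float, concurrent: int = 1) -> float:
    """Estimate daily output when `concurrent` tasks each take `avg_runtime_s` seconds."""
    if avg_runtime_s <= 0 or concurrent < 1:
        raise ValueError("runtime must be positive and concurrent at least 1")
    return concurrent * SECONDS_PER_DAY / avg_runtime_s


if __name__ == "__main__":
    # Averages quoted in the post, one task at a time per card.
    print("Radeon Pro VII:", round(tasks_per_day(83)), "tasks/day")
    print("RTX 4090:      ", round(tasks_per_day(90)), "tasks/day")
    # Hypothetical: 4 tasks at once, each slowed to 120 s (illustrative only).
    print("4x at 120 s:   ", round(tasks_per_day(120, concurrent=4)), "tasks/day")
```

Running several tasks at once pays off whenever the per-task runtime grows by less than the concurrency factor.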

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 75480 - Posted: 11 Jun 2023, 7:02:45 UTC - in response to Message 75433.
... FYI, a Titan V will outperform both of these cards.
Yes, definitely. FYI: The GV100 "kills" the Titan V.

Ian&Steve C. Joined: 18 Nov 22 Posts: 84 Credit: 640,530,847 RAC: 0	Message 75481 - Posted: 11 Jun 2023, 13:05:23 UTC - in response to Message 75480. Last modified: 11 Jun 2023, 13:06:29 UTC
The GV100 performs like a V100. Something weird/different happens with the kernel compiling at runtime between the Titan V and the G/V100, even though they are basically the same GPU die (the Titan V just has one HBM module disabled). The Titan V is also held back by its clock speeds (1335 MHz), but they can be unlocked with an nvidia-smi command. With tweaks, however, the Titan V performs much closer to the G/V100, as it should. The G/V100 still ends up being a little faster, but not by much.
All of these cards are fast enough that moving to faster cards brings diminishing returns. A lot of time is lost in starting and stopping the subtasks, as well as in starting and stopping each WU. I tried out some A100s and they were barely any faster. Really, the application needs to be updated, or the WUs need to be repackaged in a different way.

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 75482 - Posted: 12 Jun 2023, 14:20:48 UTC - in response to Message 75481.
... However, with tweaks the Titan V performs much closer to the G/V100 as it should. G/V100 ends up still being a little faster but not by much. ...
Yes, but I look at it this way:
TITAN V: 8 tasks at once, stable (10 OK but sometimes unstable)
GV100: 14 tasks at once, stable (16 OK but sometimes unstable)
My wording was VERY easy to misunderstand.
Thanks for the interesting feedback. I appreciate it!
S-F-V

Ian&Steve C. Joined: 18 Nov 22 Posts: 84 Credit: 640,530,847 RAC: 0	Message 75483 - Posted: 12 Jun 2023, 15:33:01 UTC - in response to Message 75482. Last modified: 12 Jun 2023, 15:47:36 UTC
I run 3x tasks on my Titan Vs with tweaks, doing about 2.8-2.9M ppd per card, which makes them about 15% faster than other Titan Vs here ;) I've been able to push that up to 3.1M ppd with increased power limits (but it's wasteful for such a small gain).
Under the stock application, Titan Vs are limited by VRAM. Peak VRAM use for each task is about 1500 MB, and anything more than 7x will inevitably start producing errors: at 8+ tasks you run out of VRAM whenever the stars align and all tasks hit peak VRAM use at the same time. The V100 and GV100 are better about this since they have more VRAM to spare (16/32 GB respectively). But the added VRAM on the G/V100 isn't even necessary: because of the way the application compiles the OpenCL kernels for them, they don't need to run so many tasks in parallel to reach max performance, so it's moot.
The trick is to get the Titan V to "act" like the V100, so that you get better performance while not using as much VRAM. This can only be done by modifying the application itself (or injecting new code on the fly).
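The VRAM ceiling described above is simple arithmetic. A rough sketch, assuming the Titan V's 12 GB of memory (taken as 12288 MB here; the card's capacity is not stated in the post) and the roughly 1500 MB peak per task quoted above. The `reserved_mb` allowance for the driver and display is a hypothetical placeholder, and `max_safe_tasks` is a hypothetical helper name.

```python
def max_safe_tasks(vram_mb: int, peak_per_task_mb: int, reserved_mb: int = 0) -> int:
    """Largest task count whose combined *peak* VRAM use still fits on the card."""
    if peak_per_task_mb <= 0:
        raise ValueError("peak_per_task_mb must be positive")
    usable_mb = max(vram_mb - reserved_mb, 0)
    return usable_mb // peak_per_task_mb


if __name__ == "__main__":
    TITAN_V_MB = 12 * 1024   # assumed 12 GB card
    PEAK_MB = 1500           # per-task peak quoted in the post
    # With nothing held back, 8 tasks only just fit at simultaneous peak.
    print(max_safe_tasks(TITAN_V_MB, PEAK_MB))
    # Holding back a hypothetical 500 MB for other uses leaves room for 7.
    print(max_safe_tasks(TITAN_V_MB, PEAK_MB, reserved_mb=500))
```

This treats the worst case, where every task peaks at once. Higher counts can run for a while because peaks rarely coincide, which is consistent with the occasional instability reported earlier in the thread.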

moyang_mm Joined: 31 May 23 Posts: 4 Credit: 530,468 RAC: 0	Message 75633 - Posted: 17 Jun 2023, 5:04:41 UTC - in response to Message 75480.
... FYI, a Titan V will outperform both of these cards.
Yes, definitely. FYI: The GV100 "kills" the Titan V
And I am pretty sure that the A100 and H100 will outperform the Titan V.

Ian&Steve C. Joined: 18 Nov 22 Posts: 84 Credit: 640,530,847 RAC: 0	Message 75654 - Posted: 17 Jun 2023, 12:56:11 UTC - in response to Message 75633.
Check the run times of the A100 in my profile. It's only slightly faster. Maybe. If it's faster, it's only slightly, and not in line with the rated FP64 specs.