Message boards : Number crunching : Nvidia RTX 4090 faster than Radeon Pro VII, despite much worse FP64 performance?
Joined: 31 May 23 · Posts: 4 · Credit: 530,468 · RAC: 0
Hello everyone! I am very new to MilkyWay@home, so forgive me if this question is too dumb. I am running Separation on my two GPUs: an Nvidia RTX 4090 and an AMD Radeon Pro VII (not the Radeon VII). The former only has ~1.3 TFLOPS of FP64 throughput, while the latter has about 6 TFLOPS. However, the stats show the average processing rate on my RTX 4090 is 1,284.17 GFLOPS, while my Radeon Pro VII is only at 937.93 GFLOPS. I wonder why performance on the Radeon Pro VII is so low, and whether there is a way to improve it. Any suggestions are much appreciated!

Link to my host details: https://milkyway.cs.rpi.edu/milkyway/host_app_versions.php?hostid=990067

| Milkyway@home Separation 1.46 windows_x86_64 | (opencl_nvidia_101) | (opencl_ati_101) |
|---|---|---|
| Number of tasks completed | 89 | 40 |
| Max tasks per day | 20089 | 20004 |
| Number of tasks today | 37 | 0 |
| Consecutive valid tasks | 89 | 4 |
| Average processing rate | 1,284.17 GFLOPS | 937.93 GFLOPS |
| Average turnaround time | 0.27 days | 0.76 days |
Joined: 19 Jul 10 · Posts: 624 · Credit: 19,299,762 · RAC: 2,614
You need to run more WUs simultaneously to fully utilize the Radeon GPU, and the same probably goes for the Nvidia. For the standard Radeon VII, IIRC at least 4 are recommended; no idea for the Nvidia. You'll need to create an app_config.xml file for this; if you need help, just ask.
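For reference, a minimal app_config.xml along the lines Link describes might look like the sketch below. The app name (milkyway) and the 0.25 GPU share (i.e. 4 concurrent tasks per GPU) are assumptions: check client_state.xml in your BOINC data directory for the exact app name on your host, place the file in the MilkyWay@home project directory, and then tell the client to re-read config files.

```xml
<app_config>
  <app>
    <name>milkyway</name>           <!-- assumed short name of the Separation app -->
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>   <!-- 1 / 0.25 = 4 tasks per GPU at once -->
      <cpu_usage>1.0</cpu_usage>    <!-- CPU reserved per GPU task; tune to taste -->
    </gpu_versions>
  </app>
</app_config>
```

Lower gpu_usage further (e.g. 0.125 for 8 at once) only after checking that runtimes and VRAM use hold up.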
Joined: 18 Nov 22 · Posts: 84 · Credit: 640,530,847 · RAC: 0
Ignore the FLOPS estimates; they are not accurate or reliable. I sampled your last 100 reported results and averaged the runtimes for both the Nvidia and "ati" tasks. It showed that under your current configuration, the Radeon Pro VII is actually running a bit faster than the 4090 (83 s average vs. 90 s average).

Like Link said, you can get better performance from the Radeon Pro VII by running many tasks at the same time without increasing runtime very much, thereby increasing overall production. You can try the same thing with the 4090, but it probably won't scale in the same way that the Radeon will.

FYI, a Titan V will outperform both of these cards.
Joined: 13 Apr 17 · Posts: 256 · Credit: 604,411,638 · RAC: 0
... Yes, definitely. FYI: The GV100 "kills" the Titan V.
Joined: 18 Nov 22 · Posts: 84 · Credit: 640,530,847 · RAC: 0
The GV100 performs like a V100. Something weird/different happens with the kernel compiling at runtime between the Titan V and the G/V100, even though they are basically the same GPU die (the Titan V just has one HBM module disabled). Also, the Titan V is held back by its clock speeds (1335 MHz), but they can be unlocked with an nvidia-smi command. With those tweaks the Titan V performs much closer to the G/V100, as it should. The G/V100 still ends up a little faster, but not by much.

All of these cards are fast enough that going to faster cards yields diminishing returns. A lot of time is lost in the starting/stopping of the sub-tasks as well as in starting and stopping each WU. I tried out some A100s and they were barely any faster. Really, the application needs to be updated, or even have the WUs repackaged in a different way.
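For context, the clock unlock mentioned above is typically done with nvidia-smi's lock-gpu-clocks option (needs administrator/root rights). The clock values below are placeholders, not a recommendation from the poster; query what the card actually supports first.

```
# See which clock ranges the card supports
nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS

# Lock the GPU clock range (illustrative values; pick from the query output)
nvidia-smi -i 0 -lgc 300,1900

# Revert to default clock behaviour
nvidia-smi -i 0 -rgc
```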
Joined: 13 Apr 17 · Posts: 256 · Credit: 604,411,638 · RAC: 0
... Yes, but I look at it this way:

Titan V: 8 tasks at once -- stable (10 OK, but sometimes unstable)
GV100: 14 tasks at once -- stable (16 OK, but sometimes unstable)

My wording was very easy to misunderstand. Thanks for the interesting feedback, I appreciate it!

S-F-V
Joined: 18 Nov 22 · Posts: 84 · Credit: 640,530,847 · RAC: 0
I do 3x tasks on my Titan Vs with tweaks, doing about 2.8-2.9M PPD per card, making it about 15% faster than other Titan Vs here ;) I've been able to push that up to 3.1M PPD with increased power limits (but it's wasteful for such little gain).

Under the stock application, Titan Vs are limited by VRAM. Peak VRAM use from each task is about 1500 MB, and anything more than 7x will inevitably start producing errors from running out of VRAM once you get to 8+ and the stars align, with all tasks hitting peak VRAM use at the same time. The V100 and GV100 are better about this since they have more VRAM to spare (16 GB and 32 GB respectively). But the added VRAM on the G/V100 isn't even necessary: because of the way the application compiles the OpenCL kernels for them, they don't need to run so many tasks in parallel to get max performance, so it's moot.

The trick is to get the Titan V to "act" like the V100, so that you get better performance while not using as much VRAM. This can only be done by modifying the application itself (or injecting new code on the fly).
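If you are experimenting with higher task counts as described above, one way to see how close you are getting to the VRAM ceiling (roughly 7 x 1500 MB ≈ 10.5 GB on a 12 GB Titan V, with 8+ risking out-of-memory errors) is to poll nvidia-smi while the tasks run; this is a generic monitoring command, not anything specific to MilkyWay:

```
# Report per-GPU memory use every 5 seconds while BOINC is crunching
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 5
```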
Joined: 31 May 23 · Posts: 4 · Credit: 530,468 · RAC: 0
... And I am pretty sure that the A100 and H100 will outperform the Titan V.
Joined: 18 Nov 22 · Posts: 84 · Credit: 640,530,847 · RAC: 0
Check the run times of the A100 in my profile. If it's faster at all, it's only slightly, and not in line with the rated FP64 specs.