Welcome to MilkyWay@home

Nvidia RTX 4090 faster than Radeon Pro VII, despite much worse FP64 performance?

Message boards : Number crunching : Nvidia RTX 4090 faster than Radeon Pro VII, despite much worse FP64 performance?

moyang_mm

Joined: 31 May 23
Posts: 4
Credit: 530,468
RAC: 0
Message 75422 - Posted: 2 Jun 2023, 3:55:22 UTC
Last modified: 2 Jun 2023, 4:03:34 UTC

Hello everyone!

I am very new to MilkyWay@home so forgive me if this question is too dumb.

I am running Separation on my two GPUs: an Nvidia RTX 4090 and an AMD Radeon Pro VII (not Radeon VII). The former only has ~1.3 TFLOPS of FP64 throughput, and the latter has about 6 TFLOPS. However, the stats show that the average processing rate on my RTX 4090 is 1,284.17 GFLOPS, while my Radeon Pro VII manages only 937.93 GFLOPS. I wonder why performance on the Radeon Pro VII is so low, and whether there is a way to improve it. Any suggestions are much appreciated!

link to my host details: https://milkyway.cs.rpi.edu/milkyway/host_app_versions.php?hostid=990067

Milkyway@home Separation 1.46 windows_x86_64 (opencl_nvidia_101)
Number of tasks completed	89
Max tasks per day	20089
Number of tasks today	37
Consecutive valid tasks	89
Average processing rate	1,284.17 GFLOPS
Average turnaround time	0.27 days
Milkyway@home Separation 1.46 windows_x86_64 (opencl_ati_101)
Number of tasks completed	40
Max tasks per day	20004
Number of tasks today	0
Consecutive valid tasks	4
Average processing rate	937.93 GFLOPS
Average turnaround time	0.76 days
Link
Joined: 19 Jul 10
Posts: 601
Credit: 19,029,256
RAC: 5,895
Message 75426 - Posted: 2 Jun 2023, 9:50:40 UTC - in response to Message 75422.  

You need to run more WUs simultaneously to fully utilize the Radeon GPU, and the same probably goes for the Nvidia. For the standard Radeon VII, IIRC at least 4 are recommended; no idea for the Nvidia. You'll need to create an app_config.xml file for this. If you need help, just ask.
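
For reference, a minimal sketch of generating such a file with Python. The `app_config`, `app`, `gpu_versions`, `gpu_usage` and `cpu_usage` elements follow BOINC's app_config.xml format; the app name `milkyway` and the helper `build_app_config` are assumptions for illustration, so check the app name your own client reports before using it. Note that `gpu_versions` applies to every GPU version of the app, so both cards would run the same number of tasks.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpus_per_task=1.0):
    """Return app_config.xml text that runs `tasks_per_gpu` tasks on each GPU.

    `app_name` must match the project's short app name (assumed here,
    not taken from the thread).
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of one GPU each task claims:
    # 0.25 means four tasks share a GPU.
    ET.SubElement(gpu, "gpu_usage").text = str(1.0 / tasks_per_gpu)
    # cpu_usage is the number of CPU cores budgeted per GPU task.
    ET.SubElement(gpu, "cpu_usage").text = str(cpus_per_task)
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Assumed app name; 4 tasks per GPU as suggested for the Radeon VII.
print(build_app_config("milkyway", tasks_per_gpu=4))
```

The output would be saved as app_config.xml in the MilkyWay@home project folder inside the BOINC data directory, after which the client has to re-read its config files (or be restarted).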
Ian&Steve C.
Joined: 18 Nov 22
Posts: 81
Credit: 637,715,137
RAC: 9,478
Message 75433 - Posted: 3 Jun 2023, 16:11:03 UTC - in response to Message 75422.  

ignore the flops estimates. they are not accurate or reliable.

I sampled your last 100 reported results and averaged the runtimes for both the nvidia and "ati" tasks. it showed that under your current configuration, the Radeon Pro VII is actually running a bit faster than the 4090 (83s average, vs 90s average)

like Link said, you can get better performance from the Radeon Pro VII by running several tasks at the same time: the runtime per task doesn't increase very much, so overall production goes up. you can try the same thing with the 4090, but it probably won't scale the same way the Radeon will.
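
To make the throughput argument concrete, here is a small sketch that converts average runtimes into tasks per day. The 83 s and 90 s figures are the sampled averages quoted above; the 4-at-once runtime of 120 s is a made-up number for illustration only, not a measurement, and `tasks_per_day` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(avg_runtime_s, concurrent=1):
    """Tasks finished per day when `concurrent` tasks run side by side,
    each taking `avg_runtime_s` seconds of wall-clock time."""
    return concurrent * SECONDS_PER_DAY / avg_runtime_s


# Sampled averages from the thread, one task at a time.
print(f"Radeon Pro VII, 1x: {tasks_per_day(83):.0f} tasks/day")
print(f"RTX 4090,       1x: {tasks_per_day(90):.0f} tasks/day")

# HYPOTHETICAL: 4 tasks at once, each slowed to 120 s (invented figure).
print(f"Radeon Pro VII, 4x: {tasks_per_day(120, concurrent=4):.0f} tasks/day")
```

Running more tasks at once only pays off while the per-task runtime grows more slowly than the task count, which is why the real runtimes need to be measured at each setting.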

FYI, a Titan V will outperform both of these cards.

San-Fernando-Valley

Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 75480 - Posted: 11 Jun 2023, 7:02:45 UTC - in response to Message 75433.  

...
FYI, a Titan V will outperform both of these cards.

Yes, definitely.
FYI: The GV100 "kills" the Titan V
Ian&Steve C.
Joined: 18 Nov 22
Posts: 81
Credit: 637,715,137
RAC: 9,478
Message 75481 - Posted: 11 Jun 2023, 13:05:23 UTC - in response to Message 75480.  
Last modified: 11 Jun 2023, 13:06:29 UTC

The GV100 performs like a V100. Something weird/different happens with the kernel compiling at runtime between the Titan V and the G/V100, even though they are basically the same GPU die (the TV just has one HBM module disabled). The TV is also held back by its clock speeds (1335 MHz), but they can be unlocked with an nvidia-smi command.

However, with tweaks the Titan V performs much closer to the G/V100, as it should. The G/V100 still ends up being a little faster, but not by much.

All of these cards are fast enough that moving to faster cards gives diminishing returns. A lot of time is lost in starting and stopping the sub-tasks, as well as in starting and stopping each WU. I tried out some A100s and they were barely any faster. Really, the application needs to be updated, or the WUs need to be repackaged in a different way.

San-Fernando-Valley

Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 75482 - Posted: 12 Jun 2023, 14:20:48 UTC - in response to Message 75481.  

...
However, with tweaks the Titan V performs much closer to the G/V100, as it should. The G/V100 still ends up being a little faster, but not by much.
...

Yes, but I look at it this way:

TITAN V 8 tasks at once -- stable (10 ok but sometimes unstable)

GV100 14 tasks at once -- stable (16 ok but sometimes unstable)

My wording was VERY easy to misunderstand.

Thanks for the interesting feedback.
I appreciate it !

S-F-V
Ian&Steve C.
Joined: 18 Nov 22
Posts: 81
Credit: 637,715,137
RAC: 9,478
Message 75483 - Posted: 12 Jun 2023, 15:33:01 UTC - in response to Message 75482.  
Last modified: 12 Jun 2023, 15:47:36 UTC

I do 3x tasks on my Titan Vs with tweaks. they do about 2.8-2.9M ppd per card, making them about 15% faster than other Titan Vs here ;) I've been able to push that up to 3.1M ppd with increased power limits (but it's wasteful for such little gain)

under the stock application, Titan Vs are limited by VRAM. peak VRAM use from each task is about 1500MB, so anything more than 7x will inevitably start producing errors: at 8+ you run out of VRAM whenever the stars align and all tasks hit peak VRAM use at the same time. the V100 and GV100 are better about this since they have more VRAM to spare (16/32GB respectively). but the added VRAM on the G/V100 isn't even necessary: because of the way the application compiles the OpenCL kernels for them, they don't need to run so many tasks in parallel to get max performance, so it's moot.
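
A rough sketch of the VRAM arithmetic above. It assumes the Titan V's 12 GB is 12288 MB and uses the ~1500 MB peak per task quoted in this post; the 1000 MB reserved for the driver and anything else on the card is an invented headroom figure, and `max_safe_tasks` is a hypothetical helper.

```python
def max_safe_tasks(vram_mb, peak_per_task_mb, reserved_mb=0):
    """Largest task count whose combined peak VRAM use still fits,
    after setting aside `reserved_mb` for everything else on the card."""
    return (vram_mb - reserved_mb) // peak_per_task_mb


TITAN_V_VRAM_MB = 12 * 1024   # 12 GB card
PEAK_PER_TASK_MB = 1500       # approximate peak quoted in the post

# With zero headroom, 8 tasks nominally fit (8 * 1500 = 12000 MB) ...
print(max_safe_tasks(TITAN_V_VRAM_MB, PEAK_PER_TASK_MB))

# ... but with an ASSUMED 1000 MB reserved, only 7 fit.
print(max_safe_tasks(TITAN_V_VRAM_MB, PEAK_PER_TASK_MB, reserved_mb=1000))
```

Tasks rarely peak at the same moment, which is why 8x can appear to work for a while before it eventually fails.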

the trick is to get the Titan V to "act" like the V100, so that you get better performance while not using as much VRAM. this can only be done by modifying the application itself (or injecting new code on the fly)

moyang_mm

Joined: 31 May 23
Posts: 4
Credit: 530,468
RAC: 0
Message 75633 - Posted: 17 Jun 2023, 5:04:41 UTC - in response to Message 75480.  

...
FYI, a Titan V will outperform both of these cards.

Yes, definitely.
FYI: The GV100 "kills" the Titan V


And I am pretty sure that the A100 and H100 will outperform the Titan V.
Ian&Steve C.
Joined: 18 Nov 22
Posts: 81
Credit: 637,715,137
RAC: 9,478
Message 75654 - Posted: 17 Jun 2023, 12:56:11 UTC - in response to Message 75633.  

Check the run times of the A100 in my profile.

It’s only slightly faster. Maybe. If it’s faster it’s only slightly, and not in line with the rated FP64 specs.

©2024 Astroinformatics Group