Welcome to MilkyWay@home

AMD Radeon R9 Fury X - app_info.xml and apps - optimizations

Message boards : Number crunching : AMD Radeon R9 Fury X - app_info.xml and apps - optimizations

Sutaru Tsureku
Joined: 30 Apr 09
Posts: 99
Credit: 29,853,513
RAC: 1,056
Message 64939 - Posted: 25 Jul 2016, 20:50:28 UTC

Because of my Message 64935...


Among other hardware, I have four AMD Radeon R9 Fury X VGA cards, and it's still not possible to get work for them if the project runs 'stock'. I need to use an app_info.xml file. Are the following entries and apps correct?


[Windows 64-bit; N-Body Sim. as non-multithreaded (single-thread) CPU app]

<app_info>
<app>
<name>milkyway_nbody</name>
<user_friendly_name>Milkyway N-Body Sim.</user_friendly_name>
</app>
<file_info>
<name>milkyway_nbody_1.62_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway_nbody</app_name>
<version_num>162</version_num>
<platform>windows_x86_64</platform>
<file_ref>
<file_name>milkyway_nbody_1.62_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>

<app>
<name>milkyway</name>
<user_friendly_name>Milkyway</user_friendly_name>
</app>
<file_info>
<name>milkyway_1.36_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<file_ref>
<file_name>milkyway_1.36_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway</name>
</app>
<file_info>
<name>milkyway_1.36_windows_x86_64__opencl_ati_101.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>1</avg_ncpus>
<max_ncpus>1</max_ncpus>
<plan_class>opencl_ati_101</plan_class>
<cmdline></cmdline>
<coproc>
<type>ATI</type>
<count>1</count>
</coproc>
<file_ref>
<file_name>milkyway_1.36_windows_x86_64__opencl_ati_101.exe</file_name>
<main_program/>
</file_ref>
</app_version>

<app>
<name>milkyway_separation__modified_fit</name>
<user_friendly_name>Milkyway Sep. (Mod. Fit)</user_friendly_name>
</app>
<file_info>
<name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway_separation__modified_fit</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<file_ref>
<file_name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway_separation__modified_fit</name>
</app>
<file_info>
<name>milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_ati_101.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway_separation__modified_fit</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>1</avg_ncpus>
<max_ncpus>1</max_ncpus>
<plan_class>opencl_ati_101</plan_class>
<cmdline></cmdline>
<coproc>
<type>ATI</type>
<count>1</count>
</coproc>
<file_ref>
<file_name>milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_ati_101.exe</file_name>
<main_program/>
</file_ref>
</app_version>

</app_info>
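
One way to sanity-check a hand-written app_info.xml like the one above is to confirm that every file named in a <file_ref> also has a <file_info> entry, and that every <app_version> points at a declared <app>. The following is only a sketch in Python: the function names and the shortened sample are made up for illustration, and it checks internal consistency only, not what the BOINC client itself accepts.

```python
import xml.etree.ElementTree as ET


def missing_file_infos(app_info_xml):
    """Return <file_ref> file names that have no matching <file_info> entry."""
    root = ET.fromstring(app_info_xml)
    declared = {fi.findtext("name", "").strip() for fi in root.iter("file_info")}
    referenced = [fr.findtext("file_name", "").strip() for fr in root.iter("file_ref")]
    return [name for name in referenced if name not in declared]


def undeclared_apps(app_info_xml):
    """Return <app_version> app names that have no matching <app> entry."""
    root = ET.fromstring(app_info_xml)
    declared = {app.findtext("name", "").strip() for app in root.iter("app")}
    used = [ver.findtext("app_name", "").strip() for ver in root.iter("app_version")]
    return [name for name in used if name not in declared]


# Shortened, hypothetical sample in the same layout as the file above.
SAMPLE = """<app_info>
<app>
<name>milkyway</name>
</app>
<file_info>
<name>milkyway_1.36_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway</app_name>
<version_num>136</version_num>
<file_ref>
<file_name>milkyway_1.36_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>"""

if __name__ == "__main__":
    print("missing file_info entries:", missing_file_infos(SAMPLE))
    print("undeclared apps:", undeclared_apps(SAMPLE))
```

If both lists come back empty, the names inside the file at least agree with each other. Whether the executables are actually present in the project folder has to be checked separately.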


http://milkyway.cs.rpi.edu/milkyway/download/milkyway_nbody_1.62_windows_x86_64.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_1.36_windows_x86_64.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_1.36_windows_x86_64__opencl_ati_101.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_separation__modified_fit_1.36_windows_x86_64.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_ati_101.exe


BTW: Is the milkyway_separation__modified_fit part superfluous now, so that I could delete those entries (and the two corresponding apps)?

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I searched the forum because of this item in the project preferences:
Frequency (in Hz) that should try to complete individual work chunks. Higher numbers may run slower but will provide a more responsive system. Lower may be faster but more laggy.
default 60 (corresponds to 60 fps)


It looks like only the outdated Milkyway 1.20 ATI app used this setting.
The current Milkyway 1.36 ATI app doesn't use it. I set '1' (one), but the app reports:
Using a target frequency of 60.0

Is this OK, or a bug?

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I'm a bit disappointed that a project task takes ~16 seconds with the Milkyway 1.36 ATI app on one FuryX VGA card (1 WU/GPU).
I looked at other PCs, e.g. hostid=590597 with 'R9 200 Series - Hawaii' VGA cards. That VGA card has just 44 compute units (CUs), and a task takes ~15 seconds.
My FuryX has 64 CUs, but a task takes ~16 seconds.

Is something wrong? Are there ways to optimize or fine-tune?


Thanks.
ID: 64939 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,941,565
RAC: 22,487
Message 64941 - Posted: 26 Jul 2016, 11:18:51 UTC - in response to Message 64939.  


Take a look at the memory speed and at whether the R9 280 has a 128-bit, 256-bit, or 384-bit memory bus. The faster the memory and the more bits that get transferred at one time, the faster the card crunches. It's not JUST the CUs anymore.
ID: 64941 · Rating: 0
Kylinblue
Joined: 23 Aug 11
Posts: 7
Credit: 498,188
RAC: 0
Message 64942 - Posted: 26 Jul 2016, 15:07:18 UTC - in response to Message 64941.  


Don't you know that the Fury X beats the R9 290X's memory bandwidth by a factor of 1.6 (512 GB/s vs. 320 GB/s)?
ID: 64942 · Rating: 0
Sutaru Tsureku
Joined: 30 Apr 09
Posts: 99
Credit: 29,853,513
RAC: 1,056
Message 64945 - Posted: 27 Jul 2016, 1:08:39 UTC
Last modified: 27 Jul 2016, 1:43:24 UTC

I read a lot in the forum and found that the GPU apps can be fine-tuned.

I found:

  • --non-responsive
  • --gpu-target-frequency N
  • --gpu-polling-mode N
  • --gpu-wait-factor N
  • --process-priority N
  • --gpu-disable-checkpointing




On both of my PCs I currently use:
--non-responsive --gpu-target-frequency 1 --gpu-polling-mode 0 --gpu-wait-factor 0.01 --process-priority 4 --gpu-disable-checkpointing

[On my FuryX: 3 WUs/GPU; on the GT730: 2 WUs/GPU.
At SETI only 1 WU/GPU is possible on the FuryX; running more tasks per GPU simultaneously gives invalid results or computation errors.]



BTW:
I'm still on Crimson 15.12 (includes the 1912.5 driver).
In my experience it's the 'best' driver so far, although it has a downclock bug on the FuryX: after anywhere from a few minutes to a few days, one or more VGA cards downclock permanently, and only a reboot helps.
I tested all versions (including Beta/Hotfix) up to Crimson 16.3.2, plus Crimson 16.7.2 (Hotfix? but WHQL?).
16.3.2 gave invalid AP results at SETI; 16.7.2 (July 19) froze the VGA cards.



I don't know whether these are the 'strongest/fastest' possible settings, and maybe they could even be counterproductive...?

Is there an overview somewhere that explains all possible cmdline settings and how to set them?
If not, would it be possible to create one (maybe as a sticky thread in Number Crunching)?



This is the app_info.xml file currently used on my FuryX PC (same apps as above):

<app_info>
<app>
<name>milkyway_nbody</name>
<user_friendly_name>Milkyway N-Body Sim.</user_friendly_name>
</app>
<file_info>
<name>milkyway_nbody_1.62_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway_nbody</app_name>
<version_num>162</version_num>
<platform>windows_x86_64</platform>
<file_ref>
<file_name>milkyway_nbody_1.62_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway</name>
<user_friendly_name>Milkyway</user_friendly_name>
</app>
<file_info>
<name>milkyway_1.36_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<file_ref>
<file_name>milkyway_1.36_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway</name>
</app>
<file_info>
<name>milkyway_1.36_windows_x86_64__opencl_ati_101.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>0.17</avg_ncpus>
<max_ncpus>0.17</max_ncpus>
<plan_class>opencl_ati_101</plan_class>
<cmdline>--non-responsive --gpu-target-frequency 1 --gpu-polling-mode 0 --gpu-wait-factor 0.01 --process-priority 4 --gpu-disable-checkpointing</cmdline>
<coproc>
<type>ATI</type>
<count>0.33</count>
</coproc>
<file_ref>
<file_name>milkyway_1.36_windows_x86_64__opencl_ati_101.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway_separation__modified_fit</name>
<user_friendly_name>Milkyway Sep. (Mod. Fit)</user_friendly_name>
</app>
<file_info>
<name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway_separation__modified_fit</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<file_ref>
<file_name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway_separation__modified_fit</name>
</app>
<file_info>
<name>milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_ati_101.exe</name>
<executable/>
</file_info>
<app_version>
<app_name>milkyway_separation__modified_fit</app_name>
<version_num>136</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>0.17</avg_ncpus>
<max_ncpus>0.17</max_ncpus>
<plan_class>opencl_ati_101</plan_class>
<cmdline>--non-responsive --gpu-target-frequency 1 --gpu-polling-mode 0 --gpu-wait-factor 0.01 --process-priority 4 --gpu-disable-checkpointing</cmdline>
<coproc>
<type>ATI</type>
<count>0.33</count>
</coproc>
<file_ref>
<file_name>milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_ati_101.exe</file_name>
<main_program/>
</file_ref>
</app_version>
</app_info>


(The 'milkyway_separation__modified_fit' part(s) are still needed, because old resends are still around.)



The FuryX PC has two Xeon CPUs with 6 cores/12 threads each, but HT is off for faster GPU app calculation, so 12 CPU cores are available in total.

With the app_info.xml entries above:
2 CPU cores for GPU app support
10 CPU cores for CPU tasks
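
The 2/10 split can be reproduced from the numbers in the file. This is only a rough sketch in Python, assuming that all four FuryX cards sit in this one PC and that BOINC simply adds up the <avg_ncpus> of the running GPU tasks; the function names are made up for illustration.

```python
def coproc_count(tasks_per_gpu):
    """GPU fraction per task, rounded to two decimals as written in the file."""
    return round(1.0 / tasks_per_gpu, 2)


def reserved_cpu_cores(num_gpus, tasks_per_gpu, avg_ncpus):
    """CPU cores budgeted for GPU support across all concurrent GPU tasks."""
    return num_gpus * tasks_per_gpu * avg_ncpus


NUM_GPUS = 4        # assumption: all four FuryX cards are in this PC
TASKS_PER_GPU = 3   # from the post: 3 WUs/GPU
AVG_NCPUS = 0.17    # <avg_ncpus> in the file above
TOTAL_CORES = 12    # two 6-core Xeons, HT off

if __name__ == "__main__":
    gpu_cores = reserved_cpu_cores(NUM_GPUS, TASKS_PER_GPU, AVG_NCPUS)
    print(f"coproc count per task: {coproc_count(TASKS_PER_GPU)}")
    print(f"cores for GPU support: {gpu_cores:.2f}")
    print(f"cores left for CPU tasks: {TOTAL_CORES - gpu_cores:.2f}")
```

Under these assumptions, 12 GPU tasks at 0.17 each come to roughly 2 cores, which leaves roughly 10 of the 12 cores for CPU tasks.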

With these settings the whole PC draws ~1,100 W at the wall plug. That's no problem, since the PC has a 2,000 W PSU.
At SETI it was ~850 W.
The GPU temps are higher (~ +4 °C) and the (PC) room is hotter; the ambient temperature is currently 38.5 °C. :-(

With these settings it works out to ~10 seconds per GPU task (theoretically).
The tasks take ~30 seconds each, but then 3 tasks finish together.
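
The arithmetic behind the ~10 seconds can be written out as a small Python sketch. The function names are made up for illustration; the 16-second figure is the 1 WU/GPU time from my first post.

```python
def seconds_per_task(batch_seconds, concurrent_tasks):
    """Effective time per task when several tasks share one GPU."""
    return batch_seconds / concurrent_tasks


def tasks_per_day(effective_seconds):
    """Tasks one GPU would finish in 24 hours at that effective rate."""
    return 86400 / effective_seconds


SINGLE_TASK_SECONDS = 16.0  # 1 WU/GPU, from the first post
BATCH_SECONDS = 30.0        # 3 WUs/GPU take ~30 s together
CONCURRENT_TASKS = 3

if __name__ == "__main__":
    effective = seconds_per_task(BATCH_SECONDS, CONCURRENT_TASKS)
    print(f"effective seconds per task: {effective:.1f}")
    print(f"speed-up vs. 1 WU/GPU: {SINGLE_TASK_SECONDS / effective:.2f}x")
    print(f"tasks per GPU per day: {tasks_per_day(effective):.0f}")
```

At that rate a single GPU would need several thousand tasks per day, which fits with how hard it is to keep the PC fed.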

It's really hard work for BOINC to feed the PC with enough tasks 24/7. :-( and ;-)



If anyone else has a FuryX VGA card, you are welcome to post in this thread and share your experiences.



Thanks.


ID: 64945 · Rating: 0
JHMarshall
Joined: 24 Jul 12
Posts: 40
Credit: 7,123,301,054
RAC: 0
Message 64946 - Posted: 27 Jul 2016, 2:57:10 UTC - in response to Message 64941.  


It's not just memory speed here. Since MW uses double-precision (DP) calculations, the card's DP compute capability is the real driver.

R9 280 series DP = 1/4 SP
R9 290 series DP = 1/8 SP
R9 Fury X series DP = 1/16 SP

The R9 280 series may not be the fastest single precision performer but the 1/4 DP to SP ratio makes it the leader in double precision for the $.

Joe
ID: 64946 · Rating: 0
Jozef J
Joined: 4 Mar 10
Posts: 65
Credit: 639,958,626
RAC: 0
Message 65010 - Posted: 8 Aug 2016, 20:44:33 UTC - in response to Message 64946.  


https://www.primegrid.com/forum_thread.php?id=6113
If everything in that thread is right:
GPU                        FP32 GFLOPS   FP64 GFLOPS   Ratio

Radeon R9 295X2            11264         1408          FP64 = 1/8 FP32
Radeon HD 7990             7782          1946          FP64 = 1/4 FP32
GeForce GTX Titan Black    5645          1881          FP64 = 1/3 FP32
GeForce GTX 690            5622          234           FP64 = 1/24 FP32
Radeon R9 290X             5632          704           FP64 = 1/8 FP32
GeForce GTX 780 Ti         5345          223           FP64 = 1/24 FP32
Radeon HD 6990             5099          1276          FP64 = 1/4 FP32
GeForce GTX 980            4981          156           FP64 = 1/32 FP32
Radeon R9 290              4849          606           FP64 = 1/8 FP32
GeForce GTX Titan          4709          1523          FP64 = 1/3 FP32
Radeon HD 7970 GHz         4301          1075          FP64 = 1/4 FP32
...
"Niceee... the 7970 rocks :-))" Actually, I get 15-16 seconds per task.
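
The ratio column can be cross-checked against the two GFLOPS columns. Here is a quick Python sketch that uses only the numbers copied from the table above; the dictionary and function names are made up for illustration.

```python
# (FP32 GFLOPS, FP64 GFLOPS) as listed in the table above.
TABLE = {
    "Radeon R9 295X2": (11264, 1408),
    "Radeon HD 7990": (7782, 1946),
    "GeForce GTX Titan Black": (5645, 1881),
    "GeForce GTX 690": (5622, 234),
    "Radeon R9 290X": (5632, 704),
    "GeForce GTX 780 Ti": (5345, 223),
    "Radeon HD 6990": (5099, 1276),
    "GeForce GTX 980": (4981, 156),
    "Radeon R9 290": (4849, 606),
    "GeForce GTX Titan": (4709, 1523),
    "Radeon HD 7970 GHz": (4301, 1075),
}


def fp64_denominator(fp32_gflops, fp64_gflops):
    """Return n such that FP64 is roughly 1/n of FP32."""
    return round(fp32_gflops / fp64_gflops)


if __name__ == "__main__":
    for gpu, (fp32, fp64) in TABLE.items():
        print(f"{gpu}: FP64 = 1/{fp64_denominator(fp32, fp64)} FP32")
```

For these eleven cards the computed ratios agree with the ratio column, so the table is at least consistent with itself.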
ID: 65010 · Rating: 0
Jozef J
Joined: 4 Mar 10
Posts: 65
Credit: 639,958,626
RAC: 0
Message 65012 - Posted: 8 Aug 2016, 22:42:42 UTC - in response to Message 65010.  

It runs in 13 seconds now, with no errors. I wonder whether I can run more tasks per 7970 graphics card without affecting the results. Where is the thread with the app_info?
ID: 65012 · Rating: 0
[AF>EDLS]GuL
Joined: 5 Jun 08
Posts: 21
Credit: 245,803,013
RAC: 0
Message 65013 - Posted: 9 Aug 2016, 7:23:53 UTC - in response to Message 65012.  

ID: 65013 · Rating: 0
Sutaru Tsureku
Joined: 30 Apr 09
Posts: 99
Credit: 29,853,513
RAC: 1,056
Message 65390 - Posted: 5 Oct 2016, 18:55:29 UTC
Last modified: 5 Oct 2016, 18:56:34 UTC

I don't know whether these ratios are correct...

When I look online, the 1.39 app says in <stderr_txt>:
(...)
Estimated AMD GPU GFLOP/s: 672 SP GFLOP/s, 134 DP FLOP/s
(...)

This would be a 5:1 ratio for the R9 Fury X.
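
Written out as a tiny Python sketch, assuming the 'DP FLOP/s' in that line means GFLOP/s like the SP figure (the function name is made up for illustration):

```python
def sp_to_dp_ratio(sp_gflops, dp_gflops):
    """How many times larger the SP estimate is than the DP estimate."""
    return sp_gflops / dp_gflops


SP_ESTIMATE = 672.0  # from the stderr_txt line above
DP_ESTIMATE = 134.0  # assumption: meant as GFLOP/s

if __name__ == "__main__":
    print(f"SP:DP = {sp_to_dp_ratio(SP_ESTIMATE, DP_ESTIMATE):.2f}:1")
```

The app labels both figures as 'Estimated', so this ratio may say more about how the app estimates than about the card itself.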
ID: 65390 · Rating: 0
rtennill
Joined: 22 Mar 09
Posts: 6
Credit: 778,044,893
RAC: 0
Message 65529 - Posted: 24 Oct 2016, 16:15:44 UTC - in response to Message 64939.  

Unfortunately this is expected with modern consumer cards.

This project is all about double precision compute capability. The trend for both Nvidia and AMD over the last several years has been to significantly reduce the double precision compute capability on consumer/gaming cards in favor of single precision and other features.

Wikipedia has good tables for comparing the stats across generations and models. My 6970 has more double-precision compute than the Fury X. My new card, the 7970, will have almost twice the double-precision capability.

https://en.wikipedia.org/wiki/AMD_Radeon_Rx_300_series

https://en.wikipedia.org/wiki/Radeon_HD_7000_Series
ID: 65529 · Rating: 0


©2024 Astroinformatics Group