Welcome to MilkyWay@home

AMD FirePro S9150

Message boards : Number crunching : AMD FirePro S9150
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · 5 · Next

AuthorMessage
363fc9cda368b2e14d4322e60afde2...

Send message
Joined: 28 Sep 17
Posts: 19
Credit: 60,732,047
RAC: 0
Message 67533 - Posted: 25 May 2018, 6:42:52 UTC - in response to Message 67407.  

perhaps the errors are caused by overheating if doing too many WUs. AMD specs for the S9150 require 20 cfm at 45 degree max inlet temps.
ID: 67533 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Joseph Stateson
Avatar

Send message
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 67534 - Posted: 25 May 2018, 7:41:33 UTC - in response to Message 67533.  
Last modified: 25 May 2018, 7:46:34 UTC

perhaps the errors are caused by overheating if doing too many WUs. AMD specs for the S9150 require 20 cfm at 45 degree max inlet temps.


System requirements: 20 CFM airflow cooling at 45° C maximum inlet temperature, Available PCI Express x16 (dual slot),
 3.0 for optimal performance
 Power supply plus one 2x4 (8-pin) and one 2x3 (6-pin)
AUX power connectors, 2GB system memory


Yes, I saw that, but even in my garage it never gets that hot. The attic on a hot day probably hits that 114f or higher in the middle of the summer.

My S9100 has only a single 8pin unlike the s9150 and less memory. I have ECC enabled and have never seen an error. There are no MW errors (invalids) when running one concurrent task at a time. The MW invalids increase exponsntially as more concurrent tasks are added.

Currently it is in the garage due to the 20cfm (or higher) blower as it makes too much noise. gpu-z measured temp is 65c for the S9100 and slighly less for the Q9550s cpu as reported by tthrottle. It is running 3 WUs at a time and ratio of valid to invalid stays about 500:1 When I was running 10 concurrent tasks I was getting an 8:1 ratio and that test was run inside with A/C probably 75f way under 45c.
ID: 67534 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
363fc9cda368b2e14d4322e60afde2...

Send message
Joined: 28 Sep 17
Posts: 19
Credit: 60,732,047
RAC: 0
Message 67535 - Posted: 25 May 2018, 7:55:20 UTC - in response to Message 67534.  

did you by chance notice how hot the VRMs were?
ID: 67535 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Joseph Stateson
Avatar

Send message
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 67536 - Posted: 25 May 2018, 15:21:56 UTC - in response to Message 67535.  
Last modified: 25 May 2018, 15:59:26 UTC

did you by chance notice how hot the VRMs were?


Unfortunately, S9xxx information is not as complete as most HD7950. In addition to missing measurements, the clock frequency on my S9100 is not fixed at its maximum value like the S9000 or HD79xx series. It varies with load but it does jump to its minimum (300) with no load like the other AMD boards. Just does not stay at 800 like one would expect. Maybe this is by design. Note the s9000 (equivalent to HD7950) is locked at 900, its maximum. According to AMD docs, the 9100 supports OpenCL 2.1 but is being used at 1.2 according to the MW stdout report. My guess NM did not test their program against this board to optimize their code but I dont blame them as this is not a widely used board as it is designed for servers and has no video output. The S9000 does have video but not the 9100.

[EDIT] While both HD79xx and S9000 are the same basic chip, I have given up trying to get them to co-exist on the same motherboard.



ID: 67536 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
363fc9cda368b2e14d4322e60afde2...

Send message
Joined: 28 Sep 17
Posts: 19
Credit: 60,732,047
RAC: 0
Message 67540 - Posted: 27 May 2018, 5:01:39 UTC - in response to Message 67536.  

On paper, s9150s and s9100 has so much potential with milkyway@home. hopefully you guys can figure out what causes the invalids.
ID: 67540 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Joseph Stateson
Avatar

Send message
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 67546 - Posted: 29 May 2018, 2:08:31 UTC - in response to Message 67540.  

On paper, s9150s and s9100 has so much potential with milkyway@home. hopefully you guys can figure out what causes the invalids.


Found out a couple of things. (1) a new driver released 5-24 and (2) Found out how to better identify the S9xxx which is not being recognized properly (requires binary patch to exe).

The following info can be observed by adding
<cmdline>--verbose</cmdline>
to the app_config file.

Warp size:           64
ALU per CU:          5
Double extension:    cl_khr_fp64
Double fraction:     1/5
---
---
Estimated AMD GPU GFLOP/s: 360 SP GFLOP/s, 72 DP FLOP/s
Warning: Bizarrely low flops (72). Defaulting to 100


The above shows that MW assumes only 5 arithmetic logical units (ALU) are in a compute unit and assumes that 5 single precision operations can take place in the time it takes a single double precision to complete. In actuality, according to wiki, there are 64 ALUs and the S9150 can easily complete a double precision in only twice the time it takes to do a single 5070:2530

I looked at the source code and spotted a deficiency:"Hawaii" was missing from the list of AMD boards. S9xxx boards are hawaii series.


This can be fixed (at least for me) by changing "Thames" to"Hawaii" as I do not have a Thames graphics product.

I used the free binary editor "neo" to make the change as shown here



After making the change in that ati.exe I got the following results
Warp size:           64
ALU per CU:          64
Double extension:    cl_khr_fp64
Double fraction:     1/4
---
---
Estimated AMD GPU GFLOP/s: 4608 SP GFLOP/s, 1152 DP FLOP/s
Using a target frequency of 1.0
Using a block size of 10240 with 55 blocks/chunk



Anyway, after all this work, I still do not have TESLA performance, but it is running about %20 faster. I suspect there are other factors involved, but at least my S9100 is better identified.

I am processing 4 WUs at a time on both S9100 and S9000. Each WU is bundled as either 4 or 5 units. When I compare performance improvement I have to make sure that i am comparing the same bundles.
ID: 67546 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
363fc9cda368b2e14d4322e60afde2...

Send message
Joined: 28 Sep 17
Posts: 19
Credit: 60,732,047
RAC: 0
Message 67551 - Posted: 30 May 2018, 0:38:48 UTC - in response to Message 67546.  



After making the change in that ati.exe I got the following results
Warp size:           64
ALU per CU:          64
Double extension:    cl_khr_fp64
Double fraction:     1/4
---
---
Estimated AMD GPU GFLOP/s: 4608 SP GFLOP/s, 1152 DP FLOP/s
Using a target frequency of 1.0
Using a block size of 10240 with 55 blocks/chunk





Isn't FP32 to FP64 ratio 1/2?
ID: 67551 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Joseph Stateson
Avatar

Send message
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 67552 - Posted: 30 May 2018, 2:29:27 UTC - in response to Message 67551.  



After making the change in that ati.exe I got the following results
Warp size:           64
ALU per CU:          64
Double extension:    cl_khr_fp64
Double fraction:     1/4
---
---
Estimated AMD GPU GFLOP/s: 4608 SP GFLOP/s, 1152 DP FLOP/s
Using a target frequency of 1.0
Using a block size of 10240 with 55 blocks/chunk





Isn't FP32 to FP64 ratio 1/2?


Must I do everything?

where the source code shows 16*4,4,64 that is 40,4,40 hex
change the 4 to a 2 as shown here


and get the following 1/2 ratio

Device 'Hawaii' (Advanced Micro Devices, Inc.:0x1002) (CL_DEVICE_TYPE_GPU)
Board: AMD FirePro S9100
Driver version:      2527.9
Version:             OpenCL 1.2 AMD-APP (2527.9)
Compute capability:  0.0
Max compute units:   40
Clock frequency:     900 Mhz
Global mem size:     3221225472
Local mem size:      32768
Max const buf size:  3221225472
Double extension:    cl_khr_fp64
Build log:
--------------------------------------------------------------------------------
C:\Users\JSTATE~1\AppData\Local\Temp\\OCL7972T4.cl:183:72: warning: unknown attribute 'max_constant_size' ignored
                            __constant real* _ap_consts __attribute__((max_constant_size(18 * sizeof(real)))),
                                                                       ^
C:\Users\JSTATE~1\AppData\Local\Temp\\OCL7972T4.cl:185:62: warning: unknown attribute 'max_constant_size' ignored
                            __constant SC* sc __attribute__((max_constant_size(NSTREAM * sizeof(SC)))),
                                                             ^
C:\Users\JSTATE~1\AppData\Local\Temp\\OCL7972T4.cl:186:67: warning: unknown attribute 'max_constant_size' ignored
                            __constant real* sg_dx __attribute__((max_constant_size(256 * sizeof(real)))),
                                                                  ^
3 warnings generated.

--------------------------------------------------------------------------------
Estimated AMD GPU GFLOP/s: 4608 SP GFLOP/s, 2304 DP FLOP/s
Using a target frequency of 60.0
ID: 67552 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
363fc9cda368b2e14d4322e60afde2...

Send message
Joined: 28 Sep 17
Posts: 19
Credit: 60,732,047
RAC: 0
Message 67553 - Posted: 30 May 2018, 3:49:36 UTC - in response to Message 67552.  



Must I do everything?


Yes =) I'm trying to learn source codes, but I'm a noob compared to you.

You're on to something! Will you test it and see if you get any errors? Your work rate is probably almost double compared to the original setting. Definitely a Kepler Titan killer right here.
ID: 67553 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
melk

Send message
Joined: 10 Dec 17
Posts: 47
Credit: 695,662,962
RAC: 0
Message 67554 - Posted: 30 May 2018, 16:35:31 UTC

Nice investigative work Beemer. I will need to look into this more and likely attempt the same patch you have just discovered.
ID: 67554 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Toby Broom

Send message
Joined: 13 Jun 09
Posts: 24
Credit: 137,536,729
RAC: 0
Message 67727 - Posted: 26 Aug 2018, 19:44:00 UTC

Couldn't the project team fix this in the source code? Seems like you did all the work for them
ID: 67727 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
2wxUDmCPSR1kfBuQdVKTDZsEPpcM

Send message
Joined: 10 Mar 13
Posts: 9
Credit: 523,622,956
RAC: 0
Message 67734 - Posted: 28 Aug 2018, 12:04:05 UTC - in response to Message 67727.  

Has anyone tried to PM the project admin or developers about this? Thanks to BeemerBike making the above changes to the binary as described above is easy enough, but I assume the binary will be overwritten next update.
ID: 67734 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Toby Broom

Send message
Joined: 13 Jun 09
Posts: 24
Credit: 137,536,729
RAC: 0
Message 67736 - Posted: 28 Aug 2018, 16:38:21 UTC

I just did, will post back if there is some news.
ID: 67736 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Send message
Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67740 - Posted: 29 Aug 2018, 16:24:58 UTC

Hey Everyone,

I just got pointed over to this thread. Wow you guys did a deep dive into this issue. Thank you for that. If someone already has the fix for this coded up, you can make a pull request on the source on Github. If you do that, I will do some internal testing on the new code and push it out to users sometime next week. If you do not want to do that, I will read through this thread and try to implement the fix myself.

Thank you all for all of your help,

Jake
ID: 67740 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
neofob

Send message
Joined: 4 Mar 18
Posts: 23
Credit: 264,815,255
RAC: 18
Message 67905 - Posted: 20 Nov 2018, 2:27:17 UTC
Last modified: 20 Nov 2018, 2:30:39 UTC

Hi everyone, I created a PR on github regarding to this issue.
Thanks BeemerBiker!

https://github.com/Milkyway-at-home/milkywayathome_client/pull/67

CC: Jake Weiss
ID: 67905 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
neofob

Send message
Joined: 4 Mar 18
Posts: 23
Credit: 264,815,255
RAC: 18
Message 67929 - Posted: 7 Dec 2018, 4:48:27 UTC

FYI, here is a snippet of the output from the newly merged code from github, sample from 82465335

Platform 0 information:
  Name:       AMD Accelerated Parallel Processing
  Version:    OpenCL 2.1 AMD-APP (2671.3)
  Vendor:     Advanced Micro Devices, Inc.
  Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Profile:    FULL_PROFILE
Using device 0 on platform 0
Found 1 CL device
Device 'Hawaii' (Advanced Micro Devices, Inc.:0x1002) (CL_DEVICE_TYPE_GPU)
Board: AMD Radeon FirePro W9100
Driver version:      2671.3
Version:             OpenCL 1.2 AMD-APP (2671.3)
Compute capability:  0.0
Max compute units:   44
Clock frequency:     900 Mhz
Global mem size:     16790622208
Local mem size:      32768
Max const buf size:  4244635648
Double extension:    cl_khr_fp64
Build log:
--------------------------------------------------------------------------------
/tmp/OCL16522T3.cl:183:72: warning: unknown attribute 'max_constant_size' ignored
                            __constant real* _ap_consts __attribute__((max_constant_size(18 * sizeof(real)))),
                                                                       ^
/tmp/OCL16522T3.cl:185:62: warning: unknown attribute 'max_constant_size' ignored
                            __constant SC* sc __attribute__((max_constant_size(NSTREAM * sizeof(SC)))),
                                                             ^
/tmp/OCL16522T3.cl:186:67: warning: unknown attribute 'max_constant_size' ignored
                            __constant real* sg_dx __attribute__((max_constant_size(256 * sizeof(real)))),
                                                                  ^
3 warnings generated.

--------------------------------------------------------------------------------
Estimated AMD GPU GFLOP/s: 5069 SP GFLOP/s, 2534 DP FLOP/s
Using a target frequency of 60.0
Using a block size of 11264 with 49 blocks/chunk
Using clWaitForEvents() for polling (mode -1)
Range:          { nu_steps = 320, mu_steps = 800, r_steps = 700 }
Iteration area: 560000
Chunk estimate: 1
Num chunks:     2
Chunk size:     551936
Added area:     543872
Effective area: 1103872
Initial wait:   0 ms
Integration time: 15.889514 s. Average time per iteration = 49.654731 ms
Integral 0 time = 16.441450 s
Running likelihood with 84044 stars
Likelihood time = 1.241216 s



Build steps:

mkdir /tmp/MW /tmp/MW/build
cd /tmp
git clone https://github.com/Milkyway-at-home/milkywayathome_client
cd milkywayathome_client
git submodule init
git submodule update --recursive
cd ../build
cmake -DBUILD_32=OFF -DSEPARATION=ON -DNBODY=OFF -DSEPARATION_OPENCL=ON \
  -DOPENCL_LIBRARIES=/opt/amdgpu-pro/lib/x86_64-linux-gnu/libOpenCL.so.1 \
  -DOPENCL_INCLUDE_DIRS=/opt/amdgpu-pro/include/CL/ ../milkywayathome_client
make
ID: 67929 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
bluestang

Send message
Joined: 13 Oct 16
Posts: 112
Credit: 1,174,293,644
RAC: 0
Message 68160 - Posted: 15 Feb 2019, 16:41:28 UTC

So is the MilkyWay code fixed to include "Hawaii" based GPUs now? Or is this still a manual fix?

Thanks,
blue
ID: 68160 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
neofob

Send message
Joined: 4 Mar 18
Posts: 23
Credit: 264,815,255
RAC: 18
Message 68175 - Posted: 20 Feb 2019, 3:46:44 UTC - in response to Message 68160.  

So is the MilkyWay code fixed to include "Hawaii" based GPUs now? Or is this still a manual fix?

Thanks,
blue


It's in the public github source repo but not in the public binary release yet.

I compile and overwrite the current installed binary with this latest one.
ID: 68175 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Joseph Stateson
Avatar

Send message
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68176 - Posted: 20 Feb 2019, 15:03:02 UTC - in response to Message 68175.  



It's in the public github source repo but not in the public binary release yet.

I compile and overwrite the current installed binary with this latest one.


I never got cmake to work but will look at this again. As you got it working then my guess that not all code was available was wrong.

However, the problem I see with the 9xx0 is not the failure to identify the board but the exponential increase in invalids that occur when the number of concurrent tasks increase. Looking for "results" shows nothing. Unlike setiathome "result" errors are not discernable, at least to me. On seti I can compare my results to those of the wingmen and on the few occasions I have an error, clearly the computed results of the 2 wingmen match, and are different than my computed result. I am guessing that the invalid is caused by some scheduling failure in the opencl. There are actually 5 tasks for each work unit, shown (for example in task result...)

    <number_WUs> 5 </number_WUs>
    <number_params_per_WU> 20 </number_params_per_WU



I assume the 5 tasks are "unzipped" (for lack of a better word) and each worked on possibly concurrently. On top of this, I am running 4 concurrent tasks


    <app_version>
    <app_name>milkyway</app_name>
    <plan_class>opencl_ati_101</plan_class>
    <avg_ncpus>0.25</avg_ncpus>
    <ngpus>0.25</ngpus>
    <cmdline>--non-responsive --verbose --gpu-target-frequency 1 --gpu-polling-mode -1 --gpu-wait-factor 0 --process-priority 4 --gpu-disable-checkpointing</cmdline>
    </app_version>


which make a total of 20 supposedly concurrent tasks.

My guess (from an actual experience I had years ago) is that those 5 tasks are not completely "unzipped" before the coprocessor starts on them and the situation gets worse as more tasks are added. When I was running 20 concurrent (total of 100 on an s9100) I got a huge amount of valid tasks, but the number invalids was so high the total throughput was worse then when 4 were running. However, I could have left it running like that but it would have caused delays in validation for other users I was a wingman to.

ID: 68176 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
neofob

Send message
Joined: 4 Mar 18
Posts: 23
Credit: 264,815,255
RAC: 18
Message 68177 - Posted: 20 Feb 2019, 17:54:41 UTC

@BeemerBiker: There is thread somewhere about Nvidia Titan V running MH. In that thread, the author mentions about each WU uses about 1.5GB of VRAM. So you can run about 6 WUs on a S9100 (12GB) or 8-10 WUs on S9150 (16GB).

As for compiling, here is roughly how I do it in Linux:
https://gist.github.com/neofob/8a73e2f44787541c11c0445763953950
ID: 68177 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · 5 · Next

Message boards : Number crunching : AMD FirePro S9150

©2024 Astroinformatics Group