Welcome to MilkyWay@home

Need help to build the Milkyway apps? Look here


Joseph Stateson
Joined: 18 Nov 08
Posts: 233
Credit: 1,277,266,415
RAC: 0
Message 69402 - Posted: 27 Dec 2019, 0:10:10 UTC
Last modified: 27 Dec 2019, 0:11:02 UTC

The sources for the Milkyway apps are available on-line here
https://github.com/Milkyway-at-home/milkywayathome_client

However, a few things are missing and they are not easy to find, which
makes the apps hard to build. They also need to be built under Linux,
not Windows.

I put the missing items and some notes on what to do, along with the most
recent Windows executables I built, here

https://github.com/JStateson/MilkywaySeparation

Been running my home-built version and it seems fine. Won't know how much of an improvement it is for another couple of days. The new version recognizes my "Hawaii" graphics board, unlike the old app.

Joseph Stateson
Message 69405 - Posted: 27 Dec 2019, 16:57:33 UTC
Last modified: 27 Dec 2019, 17:04:46 UTC

Reasons I am looking at the sources and building the app:

1. My S9100 gets 1 invalid result out of every 4 or 5 it processes. The slower S9000 boards do not have this problem.
When comparing my S9100 invalid results to my wingmen (AMD VII and Nvidia Titan, for one example) I observe a very close, even identical, match across all three hosts through 4 iterations**; then on the 5th iteration there is a huge difference between my S9100 and the other two wingmen that causes a rejection.

2. The flops are being calculated incorrectly. Getting the following warnings:
Estimated AMD GPU GFLOP/s: 360 SP GFLOP/s, 72 DP FLOP/s
Warning: Bizarrely low flops (72). Defaulting to 100

I fixed this by making a change in the app: I hard-coded the actual flops for the S9100 and S9000.
The actual DP flops are 2109 and 806 respectively. However, that change had no effect on the 1-out-of-4 invalids, though it gave a very slight speed-up.

3. It seems that ALL AMD boards running Milkyway generate three warning messages on each iteration. Pick any of the top computers that have AMD and you will spot warning messages like the following on each iteration**.
Build log:
--------------------------------------------------------------------------------
C:\Users\Andre\AppData\Local\Temp\\OCL14436T1.cl:183:72: warning: unknown attribute 'max_constant_size' ignored
                            __constant real* _ap_consts __attribute__((max_constant_size(18 * sizeof(real)))),
                                                                       ^
C:\Users\Andre\AppData\Local\Temp\\OCL14436T1.cl:185:62: warning: unknown attribute 'max_constant_size' ignored
                            __constant SC* sc __attribute__((max_constant_size(NSTREAM * sizeof(SC)))),
                                                             ^
C:\Users\Andre\AppData\Local\Temp\\OCL14436T1.cl:186:67: warning: unknown attribute 'max_constant_size' ignored
                            __constant real* sg_dx __attribute__((max_constant_size(256 * sizeof(real)))),
                                                                 ^
3 warnings generated.


However, these are just warning messages and do not seem to affect the results.
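
For item 2 above, the hard-coded fix can be sketched roughly as follows. This is only an illustration, not the actual patch: the function name `dp_gflops_for`, the board labels used as lookup keys, and the `min_plausible` threshold are all assumptions made for the example. The only values taken from this post are the two DP figures (2109 and 806) and the fallback of 100 shown in the warning.

```cpp
#include <map>
#include <string>

// Known double-precision GFLOP/s per board. The keys are labels chosen for
// this example (hypothetical); the figures are the ones quoted in the post.
static const std::map<std::string, double> kKnownDpGflops = {
    {"S9100", 2109.0},
    {"S9000", 806.0},
};

// Return the DP GFLOP/s figure to use for a board.
//   board:         label for the board (hypothetical lookup key)
//   estimated:     value derived from the driver-reported attributes
//   min_plausible: estimates below this are treated as bogus (assumed threshold)
//   fallback:      value used when the estimate is bogus and the board is unknown
double dp_gflops_for(const std::string& board, double estimated,
                     double min_plausible, double fallback) {
    auto it = kKnownDpGflops.find(board);
    if (it != kKnownDpGflops.end()) {
        return it->second;   // trust the hard-coded figure
    }
    if (estimated < min_plausible) {
        return fallback;     // mirrors "Defaulting to 100" in the log
    }
    return estimated;
}
```

A lookup table like this has to be updated for every board whose driver reports bad attributes, so it is a workaround rather than a fix for the underlying estimate.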

On my S9100 I observed that if I do NOT run concurrent tasks then there are no invalids. When running 10 concurrent tasks the number of invalids is so high that there is no advantage to having 10 at a time. I have found that 5 at a time seems to be OK, but I would like to find out what is happening.

** Iterations: I don't know what to call what is happening, but I am guessing that the algorithm converges to get an answer and it seems to go through 5 iterations. Since my S9100 seems to have identical results to the VII and Titan through 4 iterations, I was guessing that if I figured out a way to terminate the result early (at the 4th iteration) this would fix the invalid result problem. I looked at only one work unit, and I observed that my 4th result was close enough to the 5th iteration of the Titan and the VII that I would have passed. This is just a thought; I am not planning on doing it.
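
The eyeball comparison described above (matching through 4 iterations, then diverging on the 5th) could be automated with something like the sketch below. It assumes the per-iteration values from two hosts have already been extracted into vectors. The function name `first_divergence` and the relative tolerance are hypothetical, since the tolerance the project's validator uses is not known here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Return the index of the first entry where a and b differ by more than
// rel_tol (relative to the larger magnitude), or -1 if they agree throughout.
// Only the common prefix is compared if the lengths differ.
int first_divergence(const std::vector<double>& a,
                     const std::vector<double>& b,
                     double rel_tol) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double scale = std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (std::fabs(a[i] - b[i]) > rel_tol * scale) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

A return value of 4 (zero-based) would correspond to the "5th iteration" case described above.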

[EDIT] The S9000 is the "pro" version of the HD7950. It has more memory and supports ECC. The big advantage of the S9000 and S9100 is that there is only a single 8-pin power connector, as opposed to 6+6 and 8+6 like some other HD7950s. Cooling is a problem, but a blower can be attached with AL or CU foil if using an open-air mining rig. If the board is in a case then a blower can't be used; you need the fan and cooling assembly from a "parts only" HD7950.

Keith Myers
Joined: 24 Jan 11
Posts: 343
Credit: 224,936,029
RAC: 321,545
Message 69406 - Posted: 27 Dec 2019, 19:26:58 UTC

The S9000 series of cards can't be too common. Do you think it possible that the AMD API is not returning valid device parameters for the statement in gpu_amd.cpp?
 p_calDeviceGetAttribs =(ATI_ATTRIBS)GetProcAddress(callib, "calDeviceGetAttribs" );


I see that they had to set various cards' SIMD counts explicitly in the file to correct wrong values returned from the API.
        case CAL_TARGET_610:
            gpu_name="ATI Radeon HD 2300/2400/3200/4200 (RV610)";
            attribs.numberOfSIMD=1;        // set correct values (reported wrong by driver)
            attribs.wavefrontSize=32;
            break;
        case CAL_TARGET_630:
            gpu_name="ATI Radeon HD 2600/3650 (RV630/RV635)";
            // set correct values (reported wrong by driver)
            attribs.numberOfSIMD=3;
            attribs.wavefrontSize=32;


Could this be the case with the S9000 cards also?

Joseph Stateson
Message 69407 - Posted: 27 Dec 2019, 21:45:50 UTC - in response to Message 69406.  
Last modified: 27 Dec 2019, 22:12:24 UTC

"The S9000 series of cards can't be too common."

Correct, but not many know the S9000 is just a beefed-up version of the HD7950 and a new / unused / open-box one can be had for under 100 USD. The S91x0 boards are in a different class, so I can't recommend them.

"Do you think it possible that the AMD API is not returning valid device parameters for the statement in gpu_amd.cpp?
           attribs.wavefrontSize=32;

Could this be the case with the S9000 cards also?"

I just grep'ed that and got a hit at /boinc/boinc/lib/coproc.cpp, and discovered that Milkyway is using a much older version of that file.
They are picking up the GPU compute capabilities by using NVidia library modules**, even on an ATI GPU system. They just used BOINC code, so it is not like they did something wrong. The major value is used in the switch statement below.
        if (opencl_prop.nv_compute_capability_major) major = opencl_prop.nv_compute_capability_major;
        if (opencl_prop.nv_compute_capability_minor) minor = opencl_prop.nv_compute_capability_minor;

The 7.16.3 BOINC version has more (newer?) floating point info. Compare the following.
The GitHub source code shows the following in coproc.cpp when calculating flops:
        case 3:
            flops_per_clock = 2;
            cores_per_proc = 192;
            break;
        case 5:
        default:
            flops_per_clock = 2;
            cores_per_proc = 128;
            break;
        }
BOINC 7.16.3 shows the following:

        case 3:
            flops_per_clock = 2;
            cores_per_proc = 192;
            break;
        case 5:
            flops_per_clock = 2;
            cores_per_proc = 128;
            break;
        case 6:
            flops_per_clock = 2;
            switch (minor) {
            case 0:    // special for Tesla P100 (GP100)
                cores_per_proc = 64;
                break;
            default:
                cores_per_proc = 128;
                break;
            }
            break;
        case 7:    // for both cc7.0 (Titan V, Tesla V100) and cc7.5 (RTX, Tesla T4)
        default:
            flops_per_clock = 2;
            cores_per_proc = 64;
            break;
        }


This is probably why my S9x00 boards do not have flops calculated correctly. Looking at the code it appears there is a "better" starting point if the flops are high.
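
To make the effect of those tables concrete, here is a minimal sketch of how the per-generation figures feed into a peak flops estimate. The switch covers only the cases quoted from BOINC 7.16.3 above. The names `CoreInfo`, `core_info` and `peak_flops`, and the formula (compute units x cores x flops per clock x clock rate), are assumptions made for illustration and are not copied from coproc.cpp.

```cpp
// Per-multiprocessor figures for a given NVIDIA compute capability, following
// the BOINC 7.16.3 switch quoted above. Earlier cases are not quoted in the
// post and are omitted, so majors below 3 also land in "default" here.
struct CoreInfo {
    int flops_per_clock;
    int cores_per_proc;
};

CoreInfo core_info(int major, int minor) {
    switch (major) {
    case 3:  return {2, 192};
    case 5:  return {2, 128};
    case 6:  return {2, minor == 0 ? 64 : 128};  // cc6.0 is the Tesla P100 special case
    case 7:
    default: return {2, 64};
    }
}

// Assumed formula: peak FLOP/s = compute units * cores * flops per clock * clock (Hz).
double peak_flops(int compute_units, double clock_mhz, int major, int minor) {
    const CoreInfo ci = core_info(major, minor);
    return static_cast<double>(compute_units) * ci.cores_per_proc *
           ci.flops_per_clock * (clock_mhz * 1e6);
}
```

With the older table quoted first, a major value of 6 or 7 falls into `default` and gets 128 for cores_per_proc, so the two versions can disagree by a factor of two for the same board.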

** The NVidia and ATI OpenCL "library" files appear to come from the same source. ATI does not have it on their DEV site but has a link to GitHub. NVidia provides it with their Windows SDK. Both files have identical length, but their binary contents differ slightly (I dumped the binaries and eyeballed them).
jstateson@tb85-nvidia:~/Projects$ ls -l *.lib
-rw-r--r-- 1 jstateson jstateson 25870 Dec 26 15:16 ATIopencl.lib
-rw-r--r-- 1 jstateson jstateson 25870 Dec 27 13:06 NVopencl.lib


I built a version "1.56" of separation using first the ATI and then the NVidia library, and both "worked":
<core_client_version>7.16.33</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_separation 1.56 Windows x86_64 double OpenCL </search_application>
Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' 
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>


Using the NVidia OpenCL made no difference with those strange warnings.

[EDIT] Just noticed the project has bunched up 5!
I wonder if more credit is going to be given out. All my previous work has been
<number_WUs> 4 </number_WUs>


[EDIT2] FWIW, I created an "issue" on the coproc over at MW's GitHub.

Keith Myers
Message 69408 - Posted: 27 Dec 2019, 22:45:15 UTC - in response to Message 69407.  
Last modified: 27 Dec 2019, 22:45:58 UTC

deleted

Keith Myers
Message 69409 - Posted: 27 Dec 2019, 22:45:33 UTC

Yes, the code changes for calculating GFLOPs were broken for AMD cards running the AP tasks at Seti. The tasks were all erroring out because the calculated time to compute was ridiculously short, and they hit "exceeded time limit" errors after 14 seconds or so. This was merged into the 7.16.3 branch back in February (#2988).

The introduction of the Nvidia Turing cards introduced another issue with correctly calculating GFLOPs because of the cores_per_proc = 64 being changed from the previous standard of 128 for all previous generations. This got changed in #2706

If they are simply reusing the OpenCL API library from Nvidia as you state, then I can see that if the S9000 series had a sufficiently different architecture from the previous AMD generation, it too will calculate the wrong peak flops and probably get back the wrong device parameters from that API interrogation.

Joseph Stateson
Message 69412 - Posted: 29 Dec 2019, 15:57:35 UTC
Last modified: 29 Dec 2019, 16:40:00 UTC

I built the separation app using first the ATI and then the NVidia library, with my hard-coded S9x00 flops. I then removed the gflop "mod" and built using the latest coproc.cpp file from BOINC, which has the changes you mentioned. I gave up trying to get the anonymous platform to work and simply put
<dont_check_file_sizes>1</dont_check_file_sizes>
into the cc_config.xml file after renaming my home-built app to the same name as the project's app.

Observations
1. File size using any of the mods above was 1.6 MB, compared to the default size of 1.2 MB:
12/29/2019  12:43 AM         1,583,104 ati_milkyway_separation.exe
12/29/2019  12:43 AM           791,412 lua.exe
12/29/2019  12:43 AM         1,583,616 milkyway_1.46_windows_x86_64__opencl_ati_101.exe
11/29/2019  07:48 PM         1,184,768 milkyway_1.46_windows_x86_64__opencl_ati_101_orig.exe
12/29/2019  12:43 AM         1,583,616 milkyway_separation.exe
12/27/2019  01:19 PM         1,583,104 nv_milkyway_separation.exe


The executables starting with ati_ and nv_ are identical in size and contain the ATI and NV libraries respectively.
milkyway_separation.exe was built with the new coproc.cpp file and the ATI library.
I assume the difference in file size is due to debug symbols in the files, but that is a guess.

2. The size of the files, 1.2-1.6 MB, is tiny compared to the SETI code, which was statically built.
220581672 Dec  6 22:23 setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101
==1184768 milkyway_1.46_windows_x86_64__opencl_ati_101_orig.exe

So it seems all the library does is direct the calls to the driver, which on my system is the much older "FirePro" driver that has not been updated since 2015 for the S9x00 series of AMD boards.

3. I noticed a discrepancy between my app and the project's app (the ati 101) that I cannot account for. Possibly the project uses a compiler switch that is not mentioned on the GitHub account. I discovered the discrepancy by searching through the client_state.xml file to get the block (chunk) count and my own debugging info. First, the results from my "app" built with the newer coproc.cpp file:
C:\ProgramData\BOINC>find  "Num chunks" client_state.xml 

---------- CLIENT_STATE.XML
Num chunks:     2
--repeat--
Num chunks:     2

C:\ProgramData\BOINC>find  "jys" client_state.xml
jys Device Type: 4
jys qf:806  ef:644.8  iter:19.3603
jys estI:30.0253   timeP:16.6667   nChunk:1
jys Device Type: 4
jys qf:2304  ef:1843.2  iter:19.3603
jys estI:10.5036   timeP:16.6667   nChunk:0


The debugging info (I labeled it jys to make it easy to find) shows that the S9000 and S9100 flops are calculated correctly. However, the calculation of the chunk count for the S9100 gives 0, which the program detects, and the value 1 is returned instead, which is logical.
However, that value, the "1", is actually the chunk estimate, and the actual number of chunks to iterate over is eventually calculated as
Chunk estimate: 1
Num chunks:     2
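
The jys debug values above are consistent with the arithmetic sketched below. The relationships are inferred from the logged numbers only (ef is 0.8 x qf, estI is 1000 x iter / ef, and nChunk is estI / timeP truncated), not taken from the source code, and the struct and function names are made up for the example. The sketch stops at the chunk estimate and does not explain how the final "Num chunks: 2" is reached.

```cpp
#include <cmath>

// Reproduce the arithmetic implied by the "jys" debug lines. The formulas are
// inferred from the logged numbers, not read from the application source.
struct ChunkEstimate {
    double effective_flops;  // "ef" in the log
    double est_iter_time;    // "estI" in the log
    int    raw_chunks;       // "nChunk" in the log (may be 0)
    int    chunk_estimate;   // value after the zero check ("Chunk estimate")
};

ChunkEstimate estimate_chunks(double qf, double iter, double time_per_chunk) {
    ChunkEstimate e;
    e.effective_flops = 0.8 * qf;  // 806 -> 644.8, 2304 -> 1843.2
    e.est_iter_time   = 1000.0 * iter / e.effective_flops;
    e.raw_chunks      = static_cast<int>(std::floor(e.est_iter_time / time_per_chunk));
    e.chunk_estimate  = e.raw_chunks < 1 ? 1 : e.raw_chunks;  // 0 is bumped to 1
    return e;
}
```

Feeding in qf 806 with iter 19.3603 and timeP 16.6667 gives an nChunk of 1, and qf 2304 gives 0 (raised to 1), matching the two log entries. A faster board therefore ends up with a smaller raw chunk count in this calculation.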


Discrepancy: when running the project default app the S9100 would occasionally get 14 and 19, in addition to 2, for the chunk count. The S9000 board was always 2. More on this later.

At this point I went to the project statistics and looked at the board leaders to see what they were calculating for the block sizes. I only looked at a few as I do not have a tool to scrape the values.

4. Systems with HD7900, HD7800, or S9x00 boards, on the few that I looked at, all had the value 2 for the chunk count.
Systems with Nvidia, AMD VII and others would have various values such as 5, 7, etc. I don't recall seeing "2" on any others, only on the "island" class of AMD boards.

Analysis

Unfortunately, this part is weak, as I cannot do a cross-correlation of the results unless I can scrape values from the board leaders, so basically this is from eyeballing a few results.

When comparing the S9100 with wingmen who are using island class "HD" boards, my S9100 is granted credit.
When comparing the S9100 with VII, Titan, etc., which have nchunks much greater than 2, the S9100 is declared invalid, and when I eyeball the results there actually are differences, though very small.
When comparing S9100 results that had 14 and 19 for nchunks (few and far between), it seems the board is NOT declared invalid when the wingmen are VII, Titan, etc. However, I was unable to see the same type of behavior for the S9000 boards. They compared fine with VII, Titan, and the HD series even though the block count was just "2".

Conclusion: The S9100 series has enough "rounding problems" to cause it to fail the MW correlation test when compared against boards not in its class. This accounts for why 1 out of 4 or 5 work units is declared invalid, and the count is getting worse as more of the VII, the X5700 and Nvidia boards are added to the project. My guess: the S9100 is not calculating bad values; it is just not close enough to pass a correlation test against VII, Titan, etc. when the block count is much different. Different block counts mean more or fewer iterations when converging to get an answer, and more opportunities for rounding problems in DP arithmetic.
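
The general mechanism behind that guess, that splitting the same double-precision sum into a different number of chunks can change the rounded result, can be shown with a small sketch. This is not the Milkyway kernel: `sum_in_chunks` is a hypothetical helper, and how the real app accumulates its chunks has not been checked here.

```cpp
#include <cstddef>
#include <vector>

// Sum values by first accumulating partial sums over consecutive chunks of
// chunk_len elements, then adding the partial sums together.
double sum_in_chunks(const std::vector<double>& values, std::size_t chunk_len) {
    if (chunk_len == 0) chunk_len = 1;  // guard against a zero-length chunk
    double total = 0.0;
    for (std::size_t start = 0; start < values.size(); start += chunk_len) {
        double partial = 0.0;
        const std::size_t end =
            start + chunk_len < values.size() ? start + chunk_len : values.size();
        for (std::size_t i = start; i < end; ++i) {
            partial += values[i];
        }
        total += partial;
    }
    return total;
}
```

With the deliberately exaggerated input {1e16, 1, 1, 1}, a chunk length of 4 and a chunk length of 2 give totals that differ by 2, even though both are correctly rounded IEEE 754 double sums for their own order of operations. Real work units would show far smaller differences, but the principle is the same.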

[EDIT] There are NVidia boards with block count of just 2. Did not want to suggest they are all greater than 2

©2020 Astroinformatics Group