milkyway & milkywayGPU makefile

Message boards : Application Code Discussion : milkyway & milkywayGPU makefile

The Gas Giant

Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 24441 - Posted: 7 Jun 2009, 5:18:51 UTC - in response to Message 24391.  

looks like the cpu is faster...

Or the GPU calculates quite a bit more ;)


;))

As long as the credit is appropriate ;)
ID: 24441
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 24691 - Posted: 9 Jun 2009, 16:58:28 UTC - in response to Message 24391.  
Last modified: 9 Jun 2009, 17:02:04 UTC

looks like the cpu is faster...

Or the GPU calculates quite a bit more ;)


;))

I guess the real reason is that the test units are much smaller than the real production units, so the performance is severely limited by the administrative overhead. The integrals of current production WUs are a factor of 16 larger than those of the test WUs. And I don't know what the CPU actually did to arrive at the numbers posted by trisf. Be assured that even mainstream GPUs (let alone high-end ones) will be faster than the CPU.
ID: 24691
trisf

Joined: 30 Nov 08
Posts: 11
Credit: 25,658
RAC: 0
Message 25124 - Posted: 12 Jun 2009, 7:04:05 UTC

How do I build a static CUDA GPU binary?

dynamic: ok
static: /usr/bin/ld: cannot find -lcudart
ID: 25124
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 25591 - Posted: 15 Jun 2009, 22:59:31 UTC - in response to Message 25124.  

How do I build a static CUDA GPU binary?

dynamic: ok
static: /usr/bin/ld: cannot find -lcudart


You need to have the CUDA compiler (nvcc), runtime and drivers installed to be able to compile the application.
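
For the static link error specifically: as far as I know, CUDA toolkits of this era ship libcudart only as a shared library, so ld has nothing to link against when you force a fully static binary. A common workaround is to link the other libraries statically and keep cudart dynamic. A minimal makefile sketch of the link step (the paths, object files, and the use of the BOINC libraries here are illustrative assumptions, not the project's actual makefile):

CUDA_DIR ?= /usr/local/cuda   # use $(CUDA_DIR)/lib64 on 64-bit systems

milkyway_gpu: main.o evaluation_gpu.o
	g++ -o $@ $^ -L$(CUDA_DIR)/lib \
	    -Wl,-Bstatic -lboinc_api -lboinc \
	    -Wl,-Bdynamic -lcudart -lpthread -lm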
ID: 25591
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 26112 - Posted: 21 Jun 2009, 4:00:39 UTC - in response to Message 23998.  
Last modified: 21 Jun 2009, 4:24:58 UTC

initial likelihood: -2.98530684176687044484

Sweet. Looks like you're getting what I'm getting. Is that for stripe 20?

If that is okay, I guess my single precision ATI implementation will do it, too.

I'm getting a fitness of -2.985312812926748 for the stripe20 test unit.
As a reference point, the stock CPU app (using a complete DP calculation) arrives at a value of -2.985312797571472.

So it appears my approach is actually a bit (two digits, i.e. ~6 bits) more precise :D

At the moment I'm using the integration layout mentioned in the other thread (mu-r plane) and doing all summations with the Kahan method. This includes the convolution loop, the summation of the values between different mu-r planes, and the final reduction (done on the GPU in SP as a tree-like Kahan sum). That way I have to transfer virtually nothing (16 bytes or so for the whole integral) back from the GPU to the CPU. As it appears to me, it is unnecessary to do the reduction on the CPU. I should mention, though, that I do all CPU operations (including the likelihood computation) in DP. I still have to test whether one loses precision there (or have you tried it already, Travis?).


I have an update to this. Sorry for putting it here and not in the other thread, but the comparison values posted by trisf are here.

I now do the likelihood computation on the GPU as well. As one does it a few hundred times with the GPU project code (as opposed to only once in the legacy version), it is definitely faster than the CPU there. And I can now clearly state that the likelihood computation is not hurting the precision. I tuned it a bit and am now getting an initial likelihood value of -2.985312794995786 (the double precision CPU arrives at -2.985312797571472). Not bad at all for single precision, I would say :D
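
To make the scheme concrete, here is a minimal sketch of such a tree-like Kahan reduction (written as a CUDA kernel for readability, although my app targets ATI hardware; the names and the kernel itself are invented for illustration, not the actual project code):

// Illustrative sketch only -- not the actual project code.
// Assumes blockDim.x == 256 (a power of two); compile without
// fast-math options, or the compensation terms may be optimized away.
__global__ void kahan_reduce(const float *values, int n, float *block_sums)
{
    __shared__ float s_sum[256];   // per-thread partial sums
    __shared__ float s_c[256];     // per-thread compensation terms

    float sum = 0.0f, c = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float y = values[i] - c;   // correct the addend by the carried error
        float t = sum + y;
        c = (t - sum) - y;         // low-order bits lost in this addition
        sum = t;
    }
    s_sum[threadIdx.x] = sum;
    s_c[threadIdx.x]   = c;
    __syncthreads();

    // tree-like combination of the per-thread partial sums,
    // each step again done as a compensated addition
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            float y = (s_sum[threadIdx.x + stride] - s_c[threadIdx.x + stride])
                      - s_c[threadIdx.x];
            float t = s_sum[threadIdx.x] + y;
            s_c[threadIdx.x]   = (t - s_sum[threadIdx.x]) - y;
            s_sum[threadIdx.x] = t;
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)          // only a few bytes go back per block
        block_sums[blockIdx.x] = s_sum[0] - s_c[0];
}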

The complete output file (sorry for the font size ;):

hessian [14 x 14]:
-390.01297330587551000 19.52481024081187000 -1533.2734804029969000 0.36626257582383914000 0.02027267242965535500 -0.05197370311904592200 -0.03842967610800939600 0.69457634088720965000 1579.36089206778270000 0.17400710502120850000 0.47681164572210827000 -0.03224550256438382000 -0.38525987955395630000 -2.27792784635028060000
19.52481024081187000 -0.89935716859890202000 79.88489592047896800000 -0.00695203154303195800 -0.00148870499261377610 -0.00311565588143973040 -0.00698965191281430890 -0.06931503981899565800 -77.84448502468065100000 0.01937431696556283400 0.03816721244609410500 0.00344173763563067730 -0.01042395336714463200 0.23676720306564644000
-1533.27348040299690000 79.88489592047896800000 -6506.29849996420310000 -10.64588417420964100000 0.71608552421054117000 -0.34682442103436034000 -0.69671490798839375000 -10.06847383244746700000 6213.89129040750320000 1.37108842797791410000 0.75474071437042756000 0.17694549529304973000 0.88234974882084305000 -405.46663249152459000
0.36626257582383914000 -0.00695203154303195800 -10.64588417420964100000 -0.35369263073903312000 0.14107354173731321000 0.13820025371005487000 0.23365800035553733000 0.46932180364223086000 8.19744272462230580000 -0.03979594431768873600 -0.02920173362378856000 0.01743864312212887700 0.00161407924063420640 -0.14854321476557666000
0.02027267242965535500 -0.00148870499261377610 0.71608552421054117000 0.14107354173731321000 -0.16454788920317040000 0.01043720665450109500 0.01784725145448362200 -0.13678780330650397000 -6.07685013420677840000 0.00068149189994907511 0.01583087827494722100 -0.01504449342881741600 -0.03803707349092632500 0.32327057697401068000
-0.05197370311904592200 -0.00311565588143973040 -0.34682442103436034000 0.13820025371005487000 0.01043720665450109500 -0.08778481028512727400 0.03903026050503891100 0.09120667184466431400 0.24538889438948294000 0.02641849701963868200 -0.00499808527898437570 -0.01218546868347263800 0.00932971292814480800 -0.03992223218673984800
-0.03842967610800939600 -0.00698965191281430890 -0.69671490798839375000 0.23365800035553733000 0.01784725145448362200 0.03903026050503891100 -0.06100904503814063400 0.21312882014790088000 -0.70542460761657810000 -0.02401597439434984000 0.01272565386400969900 0.01001337901485044100 -0.01750821709833871500 0.00449987269668383670
0.69457634088720965000 -0.06931503981899565800 -10.06847383244746700000 0.46932180364223086000 -0.13678780330650397000 0.09120667184466431400 0.21312882014790088000 -12.38068675357695300000 10.24474949318232600000 -0.18561448674366451000 -0.34162186968167413000 0.08297251774536107400 -0.04154662724964452300 0.52725879218229466000
1579.36089206778270000000 -77.84448502468065100000 6213.89129040750320000000 8.19744272462230580000 -6.07685013420677840000 0.24538889438948294000 -0.70542460761657810000 10.24474949318232600000 -6570.86651756344510000000 -1.84772567616657100000 -0.98881736132483400000 -2.05113148687985360000 -0.64424021672948573000 -423.39445838202039000000
0.17400710502120850000 0.01937431696556283400 1.37108842797791410000 -0.03979594431768873600 0.00068149189994907511 0.02641849701963868200 -0.02401597439434984000 -0.18561448674366451000 -1.84772567616657100000 -1.68289518123445860000 0.43500184935633485000 -0.83326524692574189000 0.08769457382484800700 0.43554789404727973000
0.47681164572210827000 0.03816721244609410500 0.75474071437042756000 -0.02920173362378856000 0.01583087827494722100 -0.00499808527898437570 0.01272565386400969900 -0.34162186968167413000 -0.98881736132483400000 0.43500184935633485000 -0.32984580344841413000 -0.03374971598487282200 0.00027394753132625732 10.65527388544040700000
-0.03224550256438382000 0.00344173763563067730 0.17694549529304973000 0.01743864312212887700 -0.01504449342881741600 -0.01218546868347263800 0.01001337901485044100 0.08297251774536107400 -2.05113148687985360000 -0.83326524692574189000 -0.03374971598487282200 0.03124775130005888100 -0.79111966977407622000 -6.42424817047052970000
-0.38525987955395630000 -0.01042395336714463200 0.88234974882084305000 0.00161407924063420640 -0.03803707349092632500 0.00932971292814480800 -0.01750821709833871500 -0.04154662724964452300 -0.64424021672948573000 0.08769457382484800700 0.00027394753132625732 -0.79111966977407622000 0.02997525838654979300 -10.10300454407086900000
-2.27792784635028060000 0.23676720306564644000 -405.46663249152459000000 -0.14854321476557666000 0.32327057697401068000 -0.03992223218673984800 0.00449987269668383670 0.52725879218229466000 -423.39445838202039000000 0.43554789404727973000 10.65527388544040700000 -6.42424817047052970000 -10.10300454407086900000 -98.01699035749678000000

gradient[14]: -0.002031011692160689, 3.870287978990916e-005, -0.00391933197008143, -4.165342145275493e-006, 3.936584391794895e-006, -4.114278547480884e-005, 3.338801457530849e-005, -3.847977492199561e-006, 0.005981095174689699, 6.980741170300082e-005, 4.713984758097922e-005, -2.486414777772931e-006, 4.213692172960747e-005, 0.0004124264263438704

initial_fitness: -2.98531279499578610000
inital_parameters[14]: 0.571713, 12.312119, -3.305187, 148.010257, 22.453902, 0.42035, -0.468858, 0.760579, -1.361644, 177.884238, 23.882892, 1.210639, -1.611974, 8.534378
result_fitness: -2.98531254274110090000
result_parameters[14]: 0.570715379831475, 12.31644736374576, -3.305076710884835, 148.0107118875137, 22.46190531351187, 0.4232579677285489, -0.4614999752737721, 0.7603950396852607, -1.361832050323567, 177.8835356891613, 23.88085791689922, 1.211781892132873, -1.610141258371338, 8.534315057698125
number_evaluations: 445
metadata: it: 5, ev: 588


By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;)
But as it is a two-stream WU with 80 x 800 x 350 spatial and 60 convolution steps, one can estimate the time for the bigger WUs (neglecting that the efficiency rises slightly with size). Current production WUs have up to 320 x 1600 x 700 spatial and 120 convolution steps. That is a factor of 4 x 2 x 2 x 2 = 32 bigger. Such a WU would take about two hours in total for about 445 * 3.6 TFlop ~ 1.6 Peta(!)Flop. That is about 220 GFlop/s with a 10% overclocked HD3870.
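
To make the arithmetic explicit, here is the same estimate as a few lines of C (the numbers are just the ones quoted above; nothing new is measured here):

#include <stdio.h>

int main(void)
{
    double eval_seconds = 0.5;    /* per evaluation, small test WU */
    double evaluations  = 445.0;  /* for a two-stream WU           */
    double size_factor  = (320.0 / 80.0) * (1600.0 / 800.0)
                        * (700.0 / 350.0) * (120.0 / 60.0);   /* = 32 */

    double total_hours = evaluations * eval_seconds * size_factor / 3600.0;
    double total_tflop = evaluations * 3.6;   /* ~1600, i.e. ~1.6 PFlop */

    printf("factor %.0f -> about %.1f hours and %.0f TFlop in total\n",
           size_factor, total_hours, total_tflop);
    return 0;
}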

I'd better not start thinking about what the newer GPUs or even the next generation can do. I would be prepared for north of 600 GFlop/s on a fast HD4800 :o
But the GT200-based Nvidia cards have a chance to get close in single precision. That is quite different from the double precision situation.
ID: 26112
Emanuel

Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 26153 - Posted: 21 Jun 2009, 19:55:58 UTC - in response to Message 26112.  

By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;)
But as it is a two-stream WU with 80 x 800 x 350 spatial and 60 convolution steps, one can estimate the time for the bigger WUs (neglecting that the efficiency rises slightly with size). Current production WUs have up to 320 x 1600 x 700 spatial and 120 convolution steps. That is a factor of 4 x 2 x 2 x 2 = 32 bigger. Such a WU would take about two hours in total

Could you give us a quick summary of what the GPUs are doing that the CPU code doesn't? It sounds like they're computing a great deal more!
ID: 26153
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 26159 - Posted: 21 Jun 2009, 20:41:42 UTC - in response to Message 26153.  

By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;)
But as it is a two-stream WU with 80 x 800 x 350 spatial and 60 convolution steps, one can estimate the time for the bigger WUs (neglecting that the efficiency rises slightly with size). Current production WUs have up to 320 x 1600 x 700 spatial and 120 convolution steps. That is a factor of 4 x 2 x 2 x 2 = 32 bigger. Such a WU would take about two hours in total

Could you give us a quick summary of what the GPUs are doing that the CPU code doesn't? It sounds like they're computing a great deal more!

The CPU code does just one "evaluation". That means the server sends a bunch of parameters (a WU) for a small volume of the Milky Way, and the CPU code checks how well these parameters fit reality, i.e. the observed stars in that region. The result (called "fitness" or "likelihood") is then sent back. From all the results, the server tries to determine (using different algorithms such as genetic search [gs_ WUs] or particle swarm search [ps_ WUs]) in which direction the parameters have to evolve to get a better fitness.

The difference with the GPU project is that not a single set of parameters is checked, but more or less a whole region of parameter sets. That's why there are so many numbers in the result file of the GPU code posted above ;) In principle the scientific app takes over a small part of the search algorithm: it does not do just one simple check, it looks around a bit to see in which direction things get better. In the case of the double-stream WUs this means the app does 445 evaluations of different parameter sets instead of a single one, so it really is a great deal more work. For triple-stream WUs it would actually be about 900 evaluations, as more parameter combinations are possible (but only about 150 for single-stream WUs).
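
Schematically, the difference looks like this (all names here are invented for illustration; the real search code is more involved):

/* hypothetical helper: one full integral over the WU volume */
extern double evaluate_fitness(const double params[14]);

/* Legacy CPU WU: check a single parameter set. */
double run_legacy_wu(const double params[14])
{
    return evaluate_fitness(params);
}

/* GPU project WU: check a whole neighborhood of parameter sets
   (~150 for one stream, ~445 for two, ~900 for three). */
double run_gpu_wu(double sets[][14], int n_sets, int *best_index)
{
    double best = -1e300;
    for (int i = 0; i < n_sets; i++) {
        double f = evaluate_fitness(sets[i]);  /* full integral each time */
        if (f > best) { best = f; *best_index = i; }
    }
    return best;
}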
ID: 26159
Emanuel

Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 26176 - Posted: 22 Jun 2009, 0:38:13 UTC - in response to Message 26159.  

Thanks :) I'm looking forward to the GPU apps even more now!
ID: 26176
Cluster Physik

Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 26257 - Posted: 22 Jun 2009, 21:10:59 UTC - in response to Message 26112.  

By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;)
But as it is a two-stream WU with 80 x 800 x 350 spatial and 60 convolution steps, one can estimate the time for the bigger WUs (neglecting that the efficiency rises slightly with size). Current production WUs have up to 320 x 1600 x 700 spatial and 120 convolution steps. That is a factor of 4 x 2 x 2 x 2 = 32 bigger. Such a WU would take about two hours in total for about 445 * 3.6 TFlop ~ 1.6 Peta(!)Flop. That is about 220 GFlop/s with a 10% overclocked HD3870.

I've just done a real count of all the operations on the GPU (the preparation of the lookup tables on the CPU can be neglected). One evaluation of the stripe20 test WU represents about 113.94 GFlop with the single precision code.

445 evaluations * 113.94 GFlop / 215.14 seconds = 235 GFlop/s for that HD3870@860MHz
(the theoretical peak would be 550 GFlop/s; a 3 GHz quad core manages only 96 GFlop/s).

I used quite simple counting rules: additions and multiplications count as 1 flop; divisions (only real ones, not those the compiler transforms into multiplications) and square roots count as 4 flops; and pow/exp/log also count as only 4 flops. This is a small deviation from the "standard" practice (if such a thing exists in this respect) of counting 8 flops for the latter. The reason is that the GPUs have hardware support for these instructions (as opposed to double precision) at a fourth to a fifth of the throughput of the simple instructions (the same throughput as divisions or square roots).

Furthermore, I was a bit conservative: exp(x), for instance, is executed as pow(2, x*log2(e)) and log(x) as log2(x)*log(2), but I counted each whole construct as 4 flops. The actual implementation depends on the exact GPU and may change in the future, so I decided to count the instructions in the C code and not in the GPU assembly as I did for the double precision version. This results in a considerably lower flop count than for double precision, but that should be compensated by the far higher execution speed of single precision.
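
As code, the counting rules boil down to this (an illustrative tally helper invented for this post, not part of the app):

/* Illustrative tally helper, invented for this post. */
enum op_class { OP_ADD_MUL, OP_DIV_SQRT, OP_POW_EXP_LOG };

static int flop_weight(enum op_class c)
{
    switch (c) {
    case OP_ADD_MUL:     return 1;  /* addition, multiplication        */
    case OP_DIV_SQRT:    return 4;  /* real division, square root      */
    case OP_POW_EXP_LOG: return 4;  /* "standard" practice would use 8 */
    default:             return 0;
    }
}
/* e.g. exp(x), executed as pow(2, x*log2(e)), is still counted as one
   4-flop transcendental, not as a multiplication plus a pow */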

@Travis:
You will get a PM.
ID: 26257