Message boards : Application Code Discussion : milkyway & milkywayGPU makefile
Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0
looks like the cpu is faster... As long as the credit is appropriate ;)
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0
looks like the cpu is faster... I guess the real reason is that the test units are much smaller than the real production units, so the performance is severely limited by the administrative overhead. The integrals of current production WUs are a factor of 16 larger than those of the test WUs. And I don't know what the CPU actually did to arrive at the numbers posted by trisf. Be assured that even mainstream GPUs (let alone high end ones) will be faster than the CPU.
Joined: 30 Nov 08 Posts: 11 Credit: 25,658 RAC: 0
How to make a GPU CUDA static binary? Dynamic linking: ok. Static linking: /usr/bin/ld: cannot find -lcudart
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0
How to make GPU CUDA static binary? You need to have the CUDA compiler (nvcc), runtime and drivers installed to be able to compile the application.
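A minimal sketch of the situation (paths and application names are illustrative, not from this thread): the CUDA toolkit of that era shipped libcudart only as a shared library, so a fully static link cannot resolve -lcudart, while a dynamic link works. A common workaround is to link dynamically and embed the library path in the binary:

```shell
# Illustrative commands, assuming a default CUDA install under /usr/local/cuda.
# Dynamic link: works, because libcudart.so is found on the library path.
nvcc -O2 -o milkyway_gpu app.cu -L/usr/local/cuda/lib -lcudart

# Fully static link: fails, since no libcudart.a is shipped:
#   /usr/bin/ld: cannot find -lcudart

# Workaround: link dynamically but bake the runtime library path into the
# binary, so it also runs on hosts where LD_LIBRARY_PATH is not set up.
nvcc -O2 -o milkyway_gpu app.cu -L/usr/local/cuda/lib -lcudart \
     -Xlinker -rpath -Xlinker /usr/local/cuda/lib
```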
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0
initial likelihood: -2.98530684176687044484 I have an update to this. Sorry for putting it here and not in the other thread, but the comparison values posted by trisf are here. The likelihood computation now also runs on the GPU. As the GPU project code does it some hundred times (as opposed to only once in the legacy version), it is definitely faster than the CPU. And I can now clearly state that the likelihood computation is not hurting the precision. After some tuning I now get an initial likelihood value of -2.985312794995786 (the double precision CPU arrives at -2.985312797571472). Not that bad for single precision, I would say :D The complete output file (sorry for the font size ;):
hessian [14 x 14]:
-390.01297330587551000 19.52481024081187000 -1533.2734804029969000 0.36626257582383914000 0.02027267242965535500 -0.05197370311904592200 -0.03842967610800939600 0.69457634088720965000 1579.36089206778270000 0.17400710502120850000 0.47681164572210827000 -0.03224550256438382000 -0.38525987955395630000 -2.27792784635028060000
19.52481024081187000 -0.89935716859890202000 79.88489592047896800000 -0.00695203154303195800 -0.00148870499261377610 -0.00311565588143973040 -0.00698965191281430890 -0.06931503981899565800 -77.84448502468065100000 0.01937431696556283400 0.03816721244609410500 0.00344173763563067730 -0.01042395336714463200 0.23676720306564644000
-1533.27348040299690000 79.88489592047896800000 -6506.29849996420310000 -10.64588417420964100000 0.71608552421054117000 -0.34682442103436034000 -0.69671490798839375000 -10.06847383244746700000 6213.89129040750320000 1.37108842797791410000 0.75474071437042756000 0.17694549529304973000 0.88234974882084305000 -405.46663249152459000
0.36626257582383914000 -0.00695203154303195800 -10.64588417420964100000 -0.35369263073903312000 0.14107354173731321000 0.13820025371005487000 0.23365800035553733000 0.46932180364223086000 8.19744272462230580000 -0.03979594431768873600 -0.02920173362378856000 0.01743864312212887700 0.00161407924063420640 -0.14854321476557666000
0.02027267242965535500 -0.00148870499261377610 0.71608552421054117000 0.14107354173731321000 -0.16454788920317040000 0.01043720665450109500 0.01784725145448362200 -0.13678780330650397000 -6.07685013420677840000 0.00068149189994907511 0.01583087827494722100 -0.01504449342881741600 -0.03803707349092632500 0.32327057697401068000
-0.05197370311904592200 -0.00311565588143973040 -0.34682442103436034000 0.13820025371005487000 0.01043720665450109500 -0.08778481028512727400 0.03903026050503891100 0.09120667184466431400 0.24538889438948294000 0.02641849701963868200 -0.00499808527898437570 -0.01218546868347263800 0.00932971292814480800 -0.03992223218673984800
-0.03842967610800939600 -0.00698965191281430890 -0.69671490798839375000 0.23365800035553733000 0.01784725145448362200 0.03903026050503891100 -0.06100904503814063400 0.21312882014790088000 -0.70542460761657810000 -0.02401597439434984000 0.01272565386400969900 0.01001337901485044100 -0.01750821709833871500 0.00449987269668383670
0.69457634088720965000 -0.06931503981899565800 -10.06847383244746700000 0.46932180364223086000 -0.13678780330650397000 0.09120667184466431400 0.21312882014790088000 -12.38068675357695300000 10.24474949318232600000 -0.18561448674366451000 -0.34162186968167413000 0.08297251774536107400 -0.04154662724964452300 0.52725879218229466000
1579.36089206778270000000 -77.84448502468065100000 6213.89129040750320000000 8.19744272462230580000 -6.07685013420677840000 0.24538889438948294000 -0.70542460761657810000 10.24474949318232600000 -6570.86651756344510000000 -1.84772567616657100000 -0.98881736132483400000 -2.05113148687985360000 -0.64424021672948573000 -423.39445838202039000000
0.17400710502120850000 0.01937431696556283400 1.37108842797791410000 -0.03979594431768873600 0.00068149189994907511 0.02641849701963868200 -0.02401597439434984000 -0.18561448674366451000 -1.84772567616657100000 -1.68289518123445860000 0.43500184935633485000 -0.83326524692574189000 0.08769457382484800700 0.43554789404727973000
0.47681164572210827000 0.03816721244609410500 0.75474071437042756000 -0.02920173362378856000 0.01583087827494722100 -0.00499808527898437570 0.01272565386400969900 -0.34162186968167413000 -0.98881736132483400000 0.43500184935633485000 -0.32984580344841413000 -0.03374971598487282200 0.00027394753132625732 10.65527388544040700000
-0.03224550256438382000 0.00344173763563067730 0.17694549529304973000 0.01743864312212887700 -0.01504449342881741600 -0.01218546868347263800 0.01001337901485044100 0.08297251774536107400 -2.05113148687985360000 -0.83326524692574189000 -0.03374971598487282200 0.03124775130005888100 -0.79111966977407622000 -6.42424817047052970000
-0.38525987955395630000 -0.01042395336714463200 0.88234974882084305000 0.00161407924063420640 -0.03803707349092632500 0.00932971292814480800 -0.01750821709833871500 -0.04154662724964452300 -0.64424021672948573000 0.08769457382484800700 0.00027394753132625732 -0.79111966977407622000 0.02997525838654979300 -10.10300454407086900000
-2.27792784635028060000 0.23676720306564644000 -405.46663249152459000000 -0.14854321476557666000 0.32327057697401068000 -0.03992223218673984800 0.00449987269668383670 0.52725879218229466000 -423.39445838202039000000 0.43554789404727973000 10.65527388544040700000 -6.42424817047052970000 -10.10300454407086900000 -98.01699035749678000000
gradient[14]: -0.002031011692160689, 3.870287978990916e-005, -0.00391933197008143, -4.165342145275493e-006, 3.936584391794895e-006, -4.114278547480884e-005, 3.338801457530849e-005, -3.847977492199561e-006, 0.005981095174689699, 6.980741170300082e-005, 4.713984758097922e-005, -2.486414777772931e-006, 4.213692172960747e-005, 0.0004124264263438704
initial_fitness: -2.98531279499578610000
initial_parameters[14]: 0.571713, 12.312119, -3.305187, 148.010257, 22.453902, 0.42035, -0.468858, 0.760579, -1.361644, 177.884238, 23.882892, 1.210639, -1.611974, 8.534378
result_fitness: -2.98531254274110090000
result_parameters[14]: 0.570715379831475, 12.31644736374576, -3.305076710884835, 148.0107118875137, 22.46190531351187, 0.4232579677285489, -0.4614999752737721, 0.7603950396852607, -1.361832050323567, 177.8835356891613, 23.88085791689922, 1.211781892132873, -1.610141258371338, 8.534315057698125
number_evaluations: 445
metadata: it: 5, ev: 588
By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;) But as it is a two stream WU with 80 x 800 x 350 spatial and 60 convolution steps, one can estimate the time for the bigger WUs (neglecting that the efficiency rises slightly with size). Current production WUs have up to 320 x 1600 x 700 spatial and 120 convolution steps. That is a factor of 4 x 2 x 2 x 2 = 32 bigger. It would take about two hours total for about 445 * 3.6 TFlop ~ 1.6 Peta(!)Flop. That is about 220 GFlop/s with an overclocked (10%) HD3870. I should not start thinking about what the newer GPUs or even the next generation can do. I would be prepared for north of 600 GFlop/s on a fast HD4800 :o But the GT200 based nvidia cards have a chance to get close in single precision, in contrast to the double precision performance.
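The scaling estimate above can be checked with a quick back-of-the-envelope calculation (all figures are taken from this thread; nothing else is assumed):

```python
# Back-of-the-envelope check of the big-WU estimate (numbers from the post).
spatial_scale = (320 / 80) * (1600 / 800) * (700 / 350)  # production vs test WU
conv_scale = 120 / 60                                    # convolution steps
scale = spatial_scale * conv_scale                       # factor 32 bigger

flop_per_eval = 3.6e12        # ~3.6 TFlop per evaluation for a production WU
evals = 445                   # evaluations per two-stream WU
total_flop = evals * flop_per_eval
runtime_s = 2 * 3600          # "about two hours"

print(f"{total_flop / 1e15:.2f} PFlop total")             # ~1.6 PFlop
print(f"{total_flop / runtime_s / 1e9:.0f} GFlop/s")      # ~220 GFlop/s sustained
```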
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0
By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;) Could you give us a quick summary of what the GPUs are doing that the CPU code doesn't? It sounds like they're computing a great deal more!
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0
By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;) The CPU code does just one "evaluation". That means the server sends a bunch of parameters (a WU) for a small volume of the Milky Way, and the CPU code checks how well these parameters fit reality, i.e. the observed stars in that region. The result (called "fitness" or "likelihood") is then sent back. From all the results the server tries to determine (using different algorithms like genetic search [gs_ WUs] or particle swarm search [ps_ WUs]) in which directions the parameters have to evolve to get a better fitness. The difference with the GPU project is that not a single set of parameters is checked, but more or less a region of parameter sets. That's why there are so many numbers in the result file of the GPU code posted above ;) In principle the scientific app takes over a small part of the search algorithm. It does not do just one simple check; it looks around a bit to see in which direction things get better. In the case of the double stream WUs this means the app does 445 evaluations of different parameter sets instead of a single one. So it is really a great deal more work. In the case of triple stream WUs it would actually be about 900 evaluations, as more parameter combinations are possible (but only about 150 for single stream WUs).
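A toy illustration of the difference (this is not the project code; the stand-in fitness function, step size and sampling scheme here are made up): the legacy app returns the fitness of the single parameter set it was sent, while the GPU app scores a whole neighborhood of parameter sets and can report the best one it found.

```python
import random

def fitness(params):
    """Stand-in for the real likelihood computation (made up for illustration)."""
    return -sum((p - 1.0) ** 2 for p in params)

def cpu_style(params):
    # legacy app: exactly one evaluation per WU
    return fitness(params)

def gpu_style(params, n_evals=445, step=0.01):
    # GPU app: probe many nearby parameter sets and keep the best
    # (a small local-search step folded into the science app)
    best = (fitness(params), params)
    for _ in range(n_evals - 1):
        cand = [p + random.uniform(-step, step) for p in params]
        best = max(best, (fitness(cand), cand))
    return best[0]

random.seed(42)
params = [0.9] * 14
# the neighborhood search can only match or improve the single evaluation
assert gpu_style(params) >= cpu_style(params)
```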
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0
Thanks :) I'm looking forward to the GPU apps even more now!
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0
By the way, that small test unit took about 4 minutes, i.e. half a second per evaluation ;) I have now done an actual count of all the operations on the GPU (the preparation of the lookup tables done on the CPU can be neglected). One evaluation for the stripe20 test WU represents about 113.94 GFlop with the single precision code. 445 evaluations * 113.94 GFlop / 215.14 seconds = 235 GFlop/s for that HD3870@860MHz (the theoretical peak would be 550 GFlop/s; that of a 3 GHz quad core is only 96 GFlop/s). I used quite simple counting rules: an addition or multiplication counts as 1 flop; a division (only real ones, not those transformed to a multiplication by the compiler) or square root counts as 4 flops; and pow/exp/log also count as only 4 flops. This is a small deviation from the "standard" practice (if such a thing exists in this respect) of counting 8 flops for the latter. The reason is that the GPUs have hardware support for these instructions (as opposed to double precision) with a fourth to a fifth of the throughput of the simple instructions (the same throughput as divisions or square roots). Furthermore I was a bit conservative: exp(x) for instance is executed as pow(2, x*log2(e)) and log(x) as log2(x)*log(2), but I counted each whole construct as 4 flops. The actual implementation depends on the exact GPU and may change in the future. Therefore I decided to count the instructions in the C code and not in the GPU assembly, as was done for the double precision version. This results in a considerably lower flop count than for double precision, but that should be compensated by the far higher execution speed of single precision. @Travis: You will get a PM.
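The counting rules and the resulting throughput figure can be sketched as follows (the per-operation costs are exactly the rules stated above; the small op mix in the example is invented for illustration):

```python
# Flop-counting rules from the post: add/mul = 1 flop; real divisions,
# square roots and pow/exp/log = 4 flops each (GPU hardware support).
FLOP_COST = {"add": 1, "mul": 1, "div": 4, "sqrt": 4, "pow": 4, "exp": 4, "log": 4}

def flops(op_counts):
    """Total flops for a mapping of op name -> occurrences per evaluation."""
    return sum(FLOP_COST[op] * n for op, n in op_counts.items())

# exp(x) counts as a single 4-flop op, even though the hardware may expand
# it to pow(2, x*log2(e)):
assert flops({"exp": 1}) == 4

# Throughput for the stripe20 test WU, using the figures from the post:
flop_per_eval = 113.94e9      # flops per evaluation, counted in the C code
evals, runtime_s = 445, 215.14
print(evals * flop_per_eval / runtime_s / 1e9)  # ~235.7 GFlop/s
```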
©2024 Astroinformatics Group