Message boards : Application Code Discussion : Suggestions for performance improvements
Joined: 1 Oct 08 Posts: 106 Credit: 24,162,445 RAC: 0
As Dave pointed out to me:

"Well, we're improving the code. We did in fact implement a few of the changes you pointed out. If you'd like to point out more, we'd be happy to take a look and implement those if they pan out as well. But we are looking into our code."

So I promised to make some further suggestions. But let's first summarize again the ones already given.

(i) In the function stPbxConvolved, the parameters numpoints and wedge were switched. With current WUs this leads to 82 instead of 30 iterations in the innermost loop (the actual numbers are supplied by the WUs). Although the numerical outcome barely changes with current WUs, this bug would loom large if the parameters were ever changed in the future. Furthermore, the unrequested extra iterations of course degrade performance by about a factor of two in the current official app.

(ii) The function qgaus performs a Gauss-Legendre integration. This numerical method needs a set of weights, which are currently recalculated on every call to qgaus (by the call to gaussLegendre). As the number of integration points (numpoints) is the same every time, these weights only need to be calculated once for any given WU. This brings up the performance by another factor of two. There are some other small things to note here. The values a and b passed to the function are calculated just before the calls to qgaus in the stPxxConvolved functions. Ironically, a and b are not used directly, only the midpoint xm = (a+b)*0.5, which is identical to gPrime in the calling functions (the very value a and b are derived from). It would be better to just pass gPrime instead, since xr = (a-b)*0.5 is a constant (3*stdev) that could also be folded into the arrays x and w in gaussLegendre. A sketch of the cached-weights version follows below.
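To illustrate (ii), here is a minimal sketch of the caching pattern. The name qgaus_cached and the gaussLegendre prototype shown here are assumptions for illustration only (the real integrands also take more arguments than a single double), but the structure shows the idea: compute the abscissas and weights once per WU and reuse them on every call.

```c
/* Assumed Numerical-Recipes-style prototype: fills x[0..n-1], w[0..n-1]
 * with the Gauss-Legendre abscissas and weights for the interval [x1, x2].
 * The project's actual gaussLegendre signature may differ.               */
extern void gaussLegendre(double x1, double x2, double x[], double w[], int n);

#define MAX_CONVOLVE 256   /* assumed upper bound on numpoints */

static double cached_x[MAX_CONVOLVE];
static double cached_w[MAX_CONVOLVE];
static int    cached_n = 0;    /* 0 = "weights not computed yet" */

/* Gauss-Legendre quadrature over [gPrime - xr, gPrime + xr].
 * gPrime replaces the old midpoint xm = (a + b) * 0.5, and xr = 3 * stdev
 * is the constant half-width discussed above. Because only the midpoint
 * changes between calls, the [-1, 1] abscissas can be reused and merely
 * shifted and scaled.                                                    */
double qgaus_cached(double (*func)(double), double gPrime, double xr, int numpoints)
{
    double sum = 0.0;
    int j;

    /* numpoints is constant within a WU, so this branch is taken once. */
    if (numpoints != cached_n) {
        gaussLegendre(-1.0, 1.0, cached_x, cached_w, numpoints);
        cached_n = numpoints;
    }

    for (j = 0; j < numpoints; j++)
        sum += cached_w[j] * (*func)(gPrime + xr * cached_x[j]);

    return sum * xr;   /* scale from [-1, 1] back to the real interval */
}
```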
===============================================

And now to some new ones. I should mention that some of these changes can only be applied when the functions are called from the calculate_integrals part of the app (and not when coming from calculate_likelyhood). But as the app spends more than 99% of its time in the former, these are very important ones and justify creating separate versions of the functions where necessary.

(iii) As stPsgConvolved and stPbxConvolved are called as a pair and with the same coordinates, one can think about merging these two functions, as they share some calculations. Performance potential: 20%.

(iv) The same is true for streamConvolve and backgroundConvolve (one has to create another version of qgaus for this, too). Here the potential for sharing is far bigger (the call to mag2r, the computation of N). And avoid constructs like this at all costs: exponent = pow( (g-gPrime), 2 ) / (2*stdev*stdev); For a square, use a multiplication, it is far faster than a call to pow(). Performance potential: 40%.

(v) Again, stPsgFunction and stPbxFunction share some things (like the call lbr2xyz(coordpar, xyz)) which don't need to be computed twice. Performance potential: 15%.

(vi) One notices that a lot of the computation in calculate_integrals, stPxxConvolved, streamConvolve, backgroundConvolve and stPxxFunction depends only on the parameters mu and nu (coordpar[0] and coordpar[1]) and not on r (coordpar[2]). Nevertheless, these computations are done every time in the innermost loop (where only r changes). They only need to be done the first time for a given combination of mu and nu; later passes through the loop can simply reuse the values. Together these changes improve the performance by more than a factor of 2. More specifically this includes:

(vi)a the calculation of id (quite costly, with two cosines in it) and the calls to atGCToEq and atEqToGal in calculate_integrals;

(vi)b in stPxxFunction (or the merged version), all the stuff pertaining to bpars[] and spars[] is redundant, as these are actually constant for the whole loop (not only for constant r); that includes the calls to atGCToEq, atEqToGal and lbr2xyz(lbr, c) and the calculation of a[] (again quite costly with all those trigonometric functions);

(vi)c the call lbr2xyz(coordpar, xyz) (or lbr2xyz(coordpar, xg) in stPbxFunction) can be "manually inlined" into those functions. Because only coordpar[2] changes between calls for constant mu and nu, the time-consuming two sin() and two cos() calls only have to be made the first time for a given combination of mu and nu (tremendously reducing the number of calls). The calculation of xyz[] then reduces to three multiplications and one addition.

(vii) The last suggestion for today is especially helpful for older CPUs like a P4, but also for Athlon XPs. Looking at a current 260-credit WU, one sees that the variable r can only take 700 different values (when integrating with qgaus, the number of distinct r values rises to 21,000). But a lot of calculations (like mag2r and r2mag) are done 180 million times, or even more than 8 billion times, with these values. It is therefore sensible to calculate the results of these calls only once, save them in an array, and simply look them up when needed. A memory access (especially a cached one) is far faster than those computations (which involve exp and pow calls). What I would suggest are r_steps (at most 700 with current WUs) values of:

rm = (next_r + r)*0.5 (from calculate_integrals)
ir = (next_r^3 - r^3)/3.0 (from calculate_integrals)
reff_r3 = reff(rm)/rm^3 (from stPxxConvolved)

Additionally one needs numpoints*r_steps values (at most 21,000 with current WUs, that is, numpoints values for every rm) of:

r[j] = mag2r(r2mag(rm) + x[j])*0.001 (x[j] being the one from numericalintegration.c)

With that in place, one can spare all calls to mag2r and r2mag (reff also calls these) within calculate_integrals, with all their costly pow() calls. Performance potential: depending on the CPU, about a factor of 2 to 3. A sketch of such a lookup table is given at the end of this post.

If all this is implemented well, one gets an app at least a factor of 30 faster than the current official Windows application. Depending on the CPU, the compiler used and all the small additional things(*), it could be even more than that.

(*) Like the use of the function norm. It calculates the norm of a vector using a square root, but what is actually used is the square of the norm. So instead of calling norm and then squaring the value, it would be more sensible to use dotp(xyz, xyz), which delivers the square of the norm directly and is of course significantly faster (as it doesn't use a square root).

The project should hurry to implement this. I guess some guys are able to do this in a day or so. If they choose to publish the binary of their version, the whole credit thing gets broken.
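To make (vii) concrete, here is a minimal sketch of such a table. The struct R_STEP, the helper precompute_r_steps and the r grid construction are illustrative assumptions; mag2r, r2mag, reff, r_steps, numpoints and the abscissa array x[] are the names used above, with their prototypes assumed.

```c
#include <stdlib.h>

/* Routines from the project's code; prototypes assumed for this sketch. */
extern double r2mag(double r);
extern double mag2r(double g);
extern double reff(double r);

/* One entry per r step: everything that depends only on r is computed
 * once per WU here and then just looked up inside the mu/nu loops.     */
typedef struct {
    double  rm;       /* (next_r + r) * 0.5              */
    double  ir;       /* (next_r^3 - r^3) / 3.0          */
    double  reff_r3;  /* reff(rm) / rm^3                 */
    double *r_conv;   /* numpoints convolution radii:
                         mag2r(r2mag(rm) + x[j]) * 0.001 */
} R_STEP;

R_STEP *precompute_r_steps(int r_steps, int numpoints,
                           double r_min, double r_step_size,
                           const double *x /* Gauss-Legendre abscissas */)
{
    R_STEP *tbl = (R_STEP *) malloc(r_steps * sizeof(R_STEP));
    int i, j;

    for (i = 0; i < r_steps; i++) {
        /* Assumed grid; adapt to how calculate_integrals actually steps r. */
        double r      = r_min + i * r_step_size;
        double next_r = r + r_step_size;
        double rm     = 0.5 * (r + next_r);
        double gm     = r2mag(rm);   /* magnitude of rm, computed once */

        tbl[i].rm      = rm;
        tbl[i].ir      = (next_r * next_r * next_r - r * r * r) / 3.0;
        tbl[i].reff_r3 = reff(rm) / (rm * rm * rm);

        tbl[i].r_conv = (double *) malloc(numpoints * sizeof(double));
        for (j = 0; j < numpoints; j++)
            tbl[i].r_conv[j] = mag2r(gm + x[j]) * 0.001;
    }
    return tbl;
}
```

The inner loops then index this table instead of calling mag2r, r2mag or reff at all, and, as noted in (iv), squares like pow(g - gPrime, 2) should simply become (g - gPrime) * (g - gPrime).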
Joined: 5 Feb 08 Posts: 236 Credit: 49,648 RAC: 0
Wow. This is really helpful. We'll implement these before we release the new binary.

Dave Przybylo
MilkyWay@home Developer
Department of Computer Science
Rensselaer Polytechnic Institute
Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0
I would certainly be interested in seeing the new client published and used, assuming no science gets compromised. If you are seeking volunteers to try the code, please bear me in mind.
Joined: 1 Oct 08 Posts: 106 Credit: 24,162,445 RAC: 0
"I would certainly be interested in seeing the new client published and used, assuming no science gets compromised."

It's being tried already. At the moment, a quarter of the WUs are processed with such a code ;)

Edit: on only 14 cores :o
Joined: 19 Apr 08 Posts: 13 Credit: 151,588 RAC: 0
Good to see this kind of collaboration moving forward!

Best regards,
EigenState
Joined: 10 Aug 08 Posts: 218 Credit: 41,846,854 RAC: 0
Just an idea, to see if this is feasible from both aspects. If the code gets optimized and WUs are done in, say, 1/10 of the time, with credits adjusted accordingly, is there a potential for the servers to be overloaded by the increased number of WUs being processed? If the code gets optimized, is there a way to keep the WUs running for the same amount of time but doing 10 times more work? Or some combination that works for both testers and the project? I guess what I'm asking is: if credit, and the amount of credit per unit of work, is not the primary reason for optimizing the code (besides the overhead of time being wasted), can the project benefit from having more science done in the same amount of time, or in a reduced amount, so that swamping the servers isn't a factor? Aside from all that, it is nice to see this kind of collaboration as well.
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0
As far as I know it should be possible (but perhaps not useful) to increase accuracy even further to push up calculation time. However, remember that a more advanced app is being worked on that should push calculation time up anyway, and which will benefit just as much from these optimizations :)
Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0
Dave Przybylo, as the Project Administrator and a developer: how is the redevelopment of the project code going, with the improvements in parts to speed it up without compromising the science and accuracy? Can we have an update on the software's progress, and an estimate for when the improved code might be released to us volunteer crunchers, please?
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0
The latest version of the application we have has all these optimizations, and is running (edit: 8x-9x) faster for me on our Linux machines. These are pretty high-end machines, so I'm hoping for even better performance on Windows and OS X.