Suggestions for performance improvements

Milksop at try

Joined: 1 Oct 08
Posts: 106
Credit: 24,162,445
RAC: 0
Message 5457 - Posted: 13 Oct 2008, 18:49:11 UTC
Last modified: 13 Oct 2008, 18:56:19 UTC

As Dave pointed out to me,
Well we're improving the code. We did in fact implement a few of the changes you pointed out. If you'd like to point out more we'd be happy to take a look and implement those if they pan out as well. But we are looking into our code.

So I promised to make some further suggestions. But let's first summarize again the ones already given.

(i)
In the function stPbxConvolved, the parameters numpoints and wedge were switched. With current WUs this leads to 82 instead of 30 iterations in the innermost loop (the actual numbers are supplied by the WUs). Despite the minor difference in the numerical outcome with current WUs, this bug would loom large if the parameters were changed in the future. Furthermore, the unrequested additional iterations of course lead to a performance degradation of roughly a factor of two in the current official app.

(ii)
The function qgaus performs a Gauss-Legendre integration. This numerical method needs a set of weights, which is recalculated every time qgaus is called (by the call to gaussLegendre). As the number of integration points (numpoints) is the same every time, these weights only need to be calculated once for any given WU. This improves performance by another factor of two (a sketch follows below).
There are some other small things to note here. The values a and b passed to the function are calculated just before the calls to qgaus in the stPxxConvolved functions. Ironically, a and b are not used directly, only the midpoint xm=(a+b)*0.5, which is identical to gPrime in the functions calling qgaus and from which the values a and b are derived in the first place. It would be better to just pass gPrime instead, as the half-width xr=(b-a)*0.5 is a constant (3*stdev), which could also be folded into the arrays x and w in gaussLegendre.
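
To illustrate (ii), here is a minimal sketch of caching the abscissas and weights once per WU instead of recomputing them on every qgaus call. The gaussLegendre signature is an assumption (in the style of Numerical Recipes' gauleg), and the names init_qgaus_cache and qgaus_cached are made up for illustration, not taken from the project code:

#include <stdlib.h>

/* Signature assumed; the real gaussLegendre may differ. */
extern void gaussLegendre(double x1, double x2, double x[], double w[], int n);

static double *cached_x = NULL;   /* abscissas, scaled by the half-width xr */
static double *cached_w = NULL;   /* weights, scaled by the half-width xr   */
static int cached_n = 0;

/* Build the tables once per WU; xr = 3*stdev is constant for a given WU. */
void init_qgaus_cache(int numpoints, double xr)
{
    cached_x = malloc(numpoints * sizeof(double));
    cached_w = malloc(numpoints * sizeof(double));
    cached_n = numpoints;

    gaussLegendre(-1.0, 1.0, cached_x, cached_w, numpoints);
    for (int i = 0; i < numpoints; i++) {
        cached_x[i] *= xr;   /* fold the constant half-width into the nodes    */
        cached_w[i] *= xr;   /* and into the weights (Jacobian of the mapping) */
    }
}

/* Integrate func over [gPrime - xr, gPrime + xr]; only gPrime is passed in. */
double qgaus_cached(double (*func)(double), double gPrime)
{
    double sum = 0.0;
    for (int i = 0; i < cached_n; i++)
        sum += cached_w[i] * func(gPrime + cached_x[i]);
    return sum;
}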

===============================================

And now to some new ones. I should mention that some of the changes can only be applied when the functions are called from the calculate_integrals part of the app (and not when coming from calculate_likelyhood). But as the app spends more than 99% of its time in the former, these are very important and justify creating separate versions of the functions if necessary.

(iii)
As stPsgConvolved and stPbxConvolved are called as a pair and with the same coordinates, one could think about merging these two functions, as they share some calculations. Performance potential: 20%.

(iv)
The same is true for streamConvolve and backgroundConvolve (one has to create another version of qgaus for this, too). Here the potential for sharing is far bigger (the call to mag2r, the computation of N).
And try to avoid constructs like this:
exponent = pow( (g-gPrime), 2 ) / (2*stdev*stdev);
at all costs. For a square, use a multiplication; it is far faster than a call to pow() (see the sketch below).
Performance potential: 40%.
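
As a concrete (if trivial) sketch of the pow() point, with the variable names taken from the snippet above (this is only an illustration, not the project's actual code):

/* Same quantity as pow((g-gPrime), 2) / (2*stdev*stdev), squared by a
 * plain multiplication instead of a pow() call. */
static double convolve_exponent(double g, double gPrime, double stdev)
{
    double dg = g - gPrime;                    /* compute the difference once */
    return (dg * dg) / (2.0 * stdev * stdev);  /* dg*dg instead of pow(dg, 2) */
}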

(v)
And again, stPsgFunction and stPbxFunction share some things (like the call lbr2xyz(coordpar, xyz)) which don't need to be computed twice. Performance potential: 15%.

(vi)
One notices that a lot of the computation in calculate_integrals, stPxxConvolved, streamConvolve, backgroundConvolve and stPxxFunction depends only on the parameters mu and nu (coordpar[0] and coordpar[1]) and not on r (coordpar[2]). Nevertheless, these computations are done every time in the innermost loop (where only r changes). They only need to be done the first time for a given combination of mu and nu; later runs through the loops can simply reuse the values. Together these changes improve performance by more than a factor of 2. More specifically, this includes:
(vi)a
the calculation of id (quite costly, with 2 cosines in it) and the calls to atGCToEq and atEqToGal in calculate_integrals
(vi)b
in stPxxFunction (or the merged version) all the stuff pertaining to bpars[] and spars[] is redundant, as these are actually constant for every loop iteration (not only for constant r); this includes the calls to atGCToEq, atEqToGal and lbr2xyz(lbr, c) and the calculation of a[] (again quite costly, with all those trigonometric functions).
(vi)c
the call lbr2xyz(coordpar, xyz) (or lbr2xyz(coordpar, xg) in stPbxFunction) can be "manually inlined" into that function (or functions). Because only coordpar[2] changes between calls for constant mu and nu, the time-consuming two sin() and two cos() calls only have to be made the first time for a given combination of mu and nu (tremendously reducing the number of calls here). The calculation of xyz[] then reduces to three multiplications and one addition, as sketched below.
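
A minimal sketch of (vi)c. The exact form of lbr2xyz is an assumption here (inferred from the "three multiplications and one addition" remark, i.e. a direction vector scaled by r plus a constant solar offset R0 on x); all names are illustrative:

#include <math.h>

/* Direction cosines cached per (mu, nu); recomputing them costs the two
 * sin() and two cos() calls mentioned above. */
typedef struct {
    double dirx, diry, dirz;
} lb_direction;

/* Call once per (mu, nu) combination, with l and b in radians. */
lb_direction precompute_direction(double l, double b)
{
    lb_direction d;
    double cosb = cos(b);
    d.dirx = cosb * cos(l);
    d.diry = cosb * sin(l);
    d.dirz = sin(b);
    return d;
}

/* Call inside the r loop: three multiplications and one addition.
 * R0 is the assumed constant offset applied by lbr2xyz. */
void lbr2xyz_fast(const lb_direction *d, double r, double R0, double xyz[3])
{
    xyz[0] = r * d->dirx - R0;
    xyz[1] = r * d->diry;
    xyz[2] = r * d->dirz;
}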

(vii)
The last suggestion for today is especially helpful for older CPUs like the P4, but also for Athlon XPs.
When looking at a current 260-credit WU, one sees that the variable r can take only 700 different values (when integrating with qgaus, the number of different r values increases to 21,000). But a lot of calculations (like mag2r and r2mag) are done 180 million times, or even more than 8 billion times, with these values. Therefore it is sensible to calculate the results of these calls only once, save them in an array and simply look them up when needed. A memory access (especially when cached) is far faster than those computations (which involve exp and pow calls).
What I would suggest are r_steps (maximum 700 with current WUs) values of:
rm = (next_r+r)*0.5 (from calculate_integrals)
ir = (next_r^3 - r^3)/3.0 (from calculate_integrals)
reff_r3 = reff(rm)/rm^3 (from stPxxConvolved)

Additionally, one needs numpoints*r_steps values (maximum 21,000 with current WUs, that is, numpoints values for every rm) of
r[j] = mag2r(r2mag(rm) + x[j])*0.001 (x[j] is the one from numericalintegration.c)

With that in place, one can avoid all the calls to mag2r and r2mag (reff also calls these) within calculate_integrals, with all their costly pow() calls. Performance potential: depending on the CPU, about a factor of 2 to 3. A sketch of such lookup tables follows below.
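
Here is that sketch, built once per WU. The double-in/double-out signatures of mag2r, r2mag and reff are assumed from the expressions above, the idea that the r grid can be described by r_steps+1 edge values is also an assumption, and all other names are illustrative:

#include <stdlib.h>

/* The project's own conversions; signatures assumed from the expressions above. */
extern double mag2r(double mag);
extern double r2mag(double r);
extern double reff(double r);

typedef struct {
    double *rm;       /* (next_r + r) * 0.5                                        */
    double *ir;       /* (next_r^3 - r^3) / 3.0                                    */
    double *reff_r3;  /* reff(rm) / rm^3                                           */
    double *r_conv;   /* numpoints values of mag2r(r2mag(rm) + x[j])*0.001 per rm  */
} r_tables;

/* r_edges holds the r_steps+1 integration boundaries; x[] are the abscissas
 * from numericalintegration.c. */
r_tables build_r_tables(const double *r_edges, int r_steps,
                        const double *x, int numpoints)
{
    r_tables t;
    t.rm      = malloc(r_steps * sizeof(double));
    t.ir      = malloc(r_steps * sizeof(double));
    t.reff_r3 = malloc(r_steps * sizeof(double));
    t.r_conv  = malloc((size_t)r_steps * numpoints * sizeof(double));

    for (int i = 0; i < r_steps; i++) {
        double r = r_edges[i], next_r = r_edges[i + 1];
        t.rm[i]      = (next_r + r) * 0.5;
        t.ir[i]      = (next_r * next_r * next_r - r * r * r) / 3.0;
        t.reff_r3[i] = reff(t.rm[i]) / (t.rm[i] * t.rm[i] * t.rm[i]);

        double gPrime = r2mag(t.rm[i]);              /* r2mag done once per rm */
        for (int j = 0; j < numpoints; j++)
            t.r_conv[i * numpoints + j] = mag2r(gPrime + x[j]) * 0.001;
    }
    return t;
}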

If all this is implemented well, one gets an app at least a factor of 30 faster than the current official Windows application. Depending on the CPU, the compiler used and all the small additional things(*), it could be even more than that.

(*) Like the use of the function norm. It calculates the norm of a vector using a square root, but what is actually used is the square of the norm. So instead of calling norm and then squaring the value, it would be more sensible to use dotp(xyz,xyz), as this directly delivers the square of the norm and is of course significantly faster (as it doesn't use a square root).
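
For completeness, the squared norm is just the dot product of a vector with itself, so something equivalent to dotp(xyz, xyz) can be written without any sqrt() at all (a sketch, assuming 3-component double vectors):

/* Minimal sketch: |v|^2 without a square root; equivalent to dotp(v, v). */
static double norm_squared(const double v[3])
{
    return v[0] * v[0] + v[1] * v[1] + v[2] * v[2];
}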

The project should hurry to implement this. I guess some people could do it in a day or so. If they choose to publish the binary of their version, the whole credit thing gets broken.
ID: 5457
Dave Przybylo

Joined: 5 Feb 08
Posts: 236
Credit: 49,648
RAC: 0
Message 5459 - Posted: 13 Oct 2008, 19:27:01 UTC - in response to Message 5457.  

Wow. This is really helpful. We'll implement these before we release the new binary.
Dave Przybylo
MilkyWay@home Developer
Department of Computer Science
Rensselaer Polytechnic Institute
ID: 5459
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 5460 - Posted: 13 Oct 2008, 20:50:38 UTC

I would certainly be interested in seeing the new client published and used, assuming no science gets compromised.

If you are seeking volunteers to try the code, please bear me in mind.
ID: 5460
Milksop at try

Joined: 1 Oct 08
Posts: 106
Credit: 24,162,445
RAC: 0
Message 5461 - Posted: 13 Oct 2008, 21:09:48 UTC - in response to Message 5460.  
Last modified: 13 Oct 2008, 21:41:31 UTC

I would certainly be interested in seeing the new client published and used, assuming no science gets compromised.

If you are seeking volunteers to try the code, please bear me in mind.

It has already been tried. At the moment a quarter of the WUs are processed with such code ;) Edit: On only 14 cores :o
ID: 5461
EigenState

Joined: 19 Apr 08
Posts: 13
Credit: 151,588
RAC: 0
Message 5462 - Posted: 14 Oct 2008, 2:41:08 UTC

Good to see this kind of collaboration moving forward!

Best regards,
EigenState
ID: 5462
Arion

Joined: 10 Aug 08
Posts: 218
Credit: 41,846,854
RAC: 0
Message 5463 - Posted: 14 Oct 2008, 3:27:02 UTC

Just an idea to see if this is feasible from both aspects.

If the code gets optimized and WUs are done in, say, 1/10 the time, and credits are adjusted accordingly, is there the potential that the servers will be overloaded by the increased number of WUs being processed?

If the code gets optimized, is there a way to keep the WUs running for the same amount of time but doing 10 times more work? Or a combination that works for both testers and the project?

I guess what I'm asking is: if credits and the amount of credit for work are not the primary reason for optimizing the code (besides the overhead of time being wasted), can the project benefit from having more science done in the same amount of time, or in a reduced amount, so that swamping the servers isn't a factor?

Aside from all that it is nice to see this kind of collaboration as well.


ID: 5463
Emanuel

Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 5468 - Posted: 14 Oct 2008, 14:31:16 UTC

As far as I know, it should be possible (but perhaps not useful) to increase the accuracy even further to push up the calculation time. However, remember that a more advanced app is being worked on that should push calculation time up anyway, and it will benefit just as much from these optimizations :)
ID: 5468
John Clark

Joined: 4 Oct 08
Posts: 1734
Credit: 64,228,409
RAC: 0
Message 5524 - Posted: 17 Oct 2008, 13:15:47 UTC

Dave Przybylo, as the Project Administrator and a developer:

How is the redevelopment of the project code going, with the improvements in parts to speed it up without compromising the science and accuracy?

Can we have an update on the software's progress, and an estimate for when the improved code might be released to us volunteer crunchers, please?
ID: 5524
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 6282 - Posted: 19 Nov 2008, 14:12:20 UTC - in response to Message 5524.  
Last modified: 19 Nov 2008, 14:52:28 UTC

The latest version of the application we have includes all of these optimizations, and is running *edit*8x-9x*/edit* faster for me on our Linux machines. These are pretty high-end machines, so I'm hoping for even better performance on Windows and OS X.
ID: 6282
