Welcome to MilkyWay@home

compiler optimization flags

ebahapo
Joined: 6 Sep 07
Posts: 66
Credit: 636,861
RAC: 0
Message 6802 - Posted: 26 Nov 2008, 20:43:56 UTC - in response to Message 6798.  

You don't seem to understand that in a chain of many operations (or worse: in a loop where each iteration uses the result of the previous one, as in sequences), your 15th-decimal error will grow to the 14th, then the 13th, and so on, with every dozen or so operations. In the end, the error might show up in the 5th, 4th or even 3rd decimal, depending on how many iterations you went through... and calculations such as BOINC's all rely on complex calculations done within numerous loops.

Don't use -ffast-math. Period.

An error is an error in both directions, sometimes up, sometimes down. So it stays in the 15th digit.

-ffast-math is used in SPEC benchmarks, which validate the results for acceptance, without any issue.

-ffast-math is not the Bogeyman you paint it.
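
Both claims above (the error grows with every operation; the error stays in the last digit) can be tested rather than argued. The following is a minimal Python sketch, not taken from the MilkyWay code: the value 0.1 and the count of one million are arbitrary choices, and `math.fsum` is used only as a correctly rounded reference for the same inputs.

```python
import math

def naive_sum(value, count):
    """Add `value` to an accumulator `count` times, rounding after every step."""
    total = 0.0
    for _ in range(count):
        total += value
    return total

N = 1_000_000                        # arbitrary number of chained additions
naive = naive_sum(0.1, N)
reference = math.fsum([0.1] * N)     # correctly rounded sum of the same inputs

rel_error = abs(naive - reference) / reference
print(f"naive          = {naive!r}")
print(f"reference      = {reference!r}")
print(f"relative error = {rel_error:.3e}")
```

On IEEE 754 doubles I would expect the relative error to land several orders of magnitude above a single rounding error (about 1e-16) and still far below the third decimal. Repeatedly adding the same value appears to be a case where the roundings do not cancel well, so this says nothing definitive about MilkyWay's own loops.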

ID: 6802
Dave Przybylo
Joined: 5 Feb 08
Posts: 236
Credit: 49,648
RAC: 0
Message 6804 - Posted: 26 Nov 2008, 20:53:10 UTC - in response to Message 6800.  

This isn't related to the -ffast-math discussion, but it looks like the x86_64 compile for Linux isn't actually doing x86_64. The i686 target has -m32 in the CXXFLAGS, but x86_64 doesn't have a -m64 flag anywhere. Good to see SSE2 is enabled though.


You don't need a flag if you're compiling on a 64-bit machine. We only used the -m32 flag because we were compiling on a 64-bit machine.
Dave Przybylo
MilkyWay@home Developer
Department of Computer Science
Rensselaer Polytechnic Institute
ID: 6804
jedirock
Joined: 8 Nov 08
Posts: 178
Credit: 6,140,854
RAC: 0
Message 6808 - Posted: 26 Nov 2008, 21:02:39 UTC - in response to Message 6804.  
Last modified: 26 Nov 2008, 21:03:25 UTC

You don't need a flag if you're compiling on a 64-bit machine. We only used the -m32 flag because we were compiling on a 64-bit machine.

Doesn't mean that it will be compiled on a 64-bit machine. I specify the architecture on my Mac all the time because the Makefile may be used on a different machine. And what happens if someone tries cross-compiling from Linux or Windows or something else?
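
Whether a given Linux binary really came out as 64-bit can be checked directly instead of being inferred from the Makefile. A small Python sketch under the assumption that the binary is an ELF file: byte 4 of the ELF identification header (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The function names and any path passed in are hypothetical.

```python
ELF_MAGIC = b"\x7fELF"
EI_CLASS_OFFSET = 4                       # position of EI_CLASS in the ELF ident bytes
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}  # ELFCLASS32, ELFCLASS64

def elf_class(header: bytes) -> str:
    """Classify an ELF image from (at least) its first five bytes."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    return ELF_CLASSES.get(header[EI_CLASS_OFFSET], "unknown")

def elf_class_of_file(path: str) -> str:
    """Read the first five bytes of the file at `path` and classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))
```

Calling `elf_class_of_file` on the freshly built application would then show whether the x86_64 target produced a 64-bit object, whatever flags the Makefile did or did not pass.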
ID: 6808
Dave Przybylo
Joined: 5 Feb 08
Posts: 236
Credit: 49,648
RAC: 0
Message 6814 - Posted: 26 Nov 2008, 21:36:32 UTC - in response to Message 6808.  
Last modified: 26 Nov 2008, 23:48:48 UTC

Well, we're not going out of our way to make things portable here so that people can compile it on many different machines. The makefile is basically just for us to compile the application for others to use. I think it's safe to assume that someone who knows how to cross-compile also knows how to change the makefile accordingly. If you'd like to make changes to the makefile, you can post them in a new thread here and we'll incorporate them into the version if they're valid.
Dave Przybylo
MilkyWay@home Developer
Department of Computer Science
Rensselaer Polytechnic Institute
ID: 6814
Thierry Godefroy
Joined: 29 Jul 08
Posts: 9
Credit: 2,200,784
RAC: 0
Message 6815 - Posted: 26 Nov 2008, 21:37:57 UTC - in response to Message 6802.  
Last modified: 26 Nov 2008, 21:39:27 UTC


An error is an error in both directions, sometimes up, sometimes down. So it stays in the 15th digit.

This would be true for addition (if the problem were not one of truncation rather than 5/5 rounding), but it is no longer true for multiplication.


-ffast-math is used in SPEC benchmarks, which validate the results for acceptance, without any issue.

We are not talking about benchmarking here, but about a science application where accuracy does matter.

Running optimized apps is a good thing, because you can compute more results in less time, which also means that you consume less power per result (good for the planet).

But this must not come at the cost of invalid results, as the calculations you did are then a pure waste for the project itself.

Worse: unlike SETI and many other BOINC projects, Milkyway does not compare your results with those of others, meaning that if you send a slightly wrong result, their validator will not be able to notice the problem and this actually invalid result will pollute the science project.

Optimization must not be done at the cost of poorer science results. Period.
ID: 6815
speedimic
Joined: 22 Feb 08
Posts: 260
Credit: 57,387,048
RAC: 0
Message 6816 - Posted: 26 Nov 2008, 21:42:12 UTC - in response to Message 6808.  

You don't need a flag if you're compiling on a 64-bit machine. We only used the -m32 flag because we were compiling on a 64-bit machine.

Doesn't mean that it will be compiled on a 64-bit machine. I specify the architecture on my Mac all the time because the Makefile may be used on a different machine. And what happens if someone tries cross-compiling from Linux or Windows or something else?


Just imagine a newbie trying to compile it... :-)
mic.


ID: 6816
ebahapo
Joined: 6 Sep 07
Posts: 66
Credit: 636,861
RAC: 0
Message 6820 - Posted: 26 Nov 2008, 22:07:44 UTC - in response to Message 6815.  
Last modified: 26 Nov 2008, 22:10:07 UTC

This would be true for addition (if the problem were not one of truncation rather than 5/5 rounding), but it is no longer true for multiplication.

The error goes both ways for multiplication too. As I said before, given that the finite-precision math of floating-point calculations on computers implies an error of 0.5 bit of the mantissa, by your rationale all calculations done on a computer would be increasingly wrong, which is patently incorrect.
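
How far a chain of multiplications drifts can be measured exactly. In the sketch below (Python, with factors chosen arbitrarily for illustration) the floating-point product is compared against an exact rational product built with `fractions.Fraction`; converting a float to a `Fraction` is exact, so the difference is purely the accumulated rounding error.

```python
from fractions import Fraction

UNIT_ROUNDOFF = 2.0 ** -53   # half a unit in the last place for IEEE 754 doubles

def product_relative_error(factors):
    """Relative error of a left-to-right float product against the exact product."""
    float_product = 1.0
    exact_product = Fraction(1)
    for x in factors:
        float_product *= x               # rounded after every multiplication
        exact_product *= Fraction(x)     # exact rational arithmetic
    error = abs(Fraction(float_product) - exact_product) / exact_product
    return float(error)

# 1000 arbitrary factors slightly above 1, so nothing overflows or underflows.
factors = [1.0 + 1.0 / k for k in range(1, 1001)]
rel_err = product_relative_error(factors)
print(f"relative error: {rel_err:.3e}")
print(f"in units of 2**-53: {rel_err / UNIT_ROUNDOFF:.1f}")
```

The standard worst-case bound grows roughly linearly with the number of multiplications (about n times 2**-53), while roundings that partly cancel usually give much less. Which of the two regimes a real application sits in is something to measure, not assume.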

We are not talking about benchmarking here, but about a science application where accuracy does matter.

I mentioned SPEC because it's a benchmark that REQUIRES AND VERIFIES correct results. So, if -ffast-math doesn't affect the results of over 20 scientific applications from SPEC CPU2006, I doubt that it would affect Milkyway.

But if you prefer to live with your preconceptions and ignore the facts, fine. Enough said.
ID: 6820
Thierry Godefroy
Joined: 29 Jul 08
Posts: 9
Credit: 2,200,784
RAC: 0
Message 6827 - Posted: 26 Nov 2008, 22:49:51 UTC - in response to Message 6820.  
Last modified: 26 Nov 2008, 22:51:20 UTC

This would be true for addition (if the problem were not one of truncation rather than 5/5 rounding), but it is no longer true for multiplication.

The error goes both ways for multiplication too. As I said before, given that the finite-precision math of floating-point calculations on computers implies an error of 0.5 bit of the mantissa, by your rationale all calculations done on a computer would be increasingly wrong, which is patently incorrect.


You obviously haven't studied numerical analysis... I did (even if it was looong ago).


We are not talking about benchmarking here, but about a science application where accuracy does matter.

I mentioned SPEC because it's a benchmark that REQUIRES AND VERIFIES correct results. So, if -ffast-math doesn't affect the results of over 20 scientific applications from SPEC CPU2006, I doubt that it would affect Milkyway.

I think the admins said it in this very thread: they WANT maximum accuracy. Deal with it.


But if you prefer to live with your preconceptions and ignore the facts, fine.

These are not preconceptions but actual facts I could verify several times by myself.

If you don't trust me, perhaps you will trust someone else: here is one of the many examples (in this case for SETI) of -ffast-math problems (Google a bit and you'll find many others):
http://www.pperry.f2s.com/boinc-compile-seti.htm
(look at the very bottom, two paragraphs before the end).


Enough said.

Indeed !
ID: 6827
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 6832 - Posted: 26 Nov 2008, 23:45:43 UTC

In regards to all the -ffast-math discussion, I have to ask a question based on what I noticed with the difference between a K6 and my K8 (Athlon64 3700+). The K6 has an enormously inferior FPU, yet using Milksop's app it was able to come close to or hit the 108 cr/hr limit. This suggests, at least to me, that the application is not very FPU-intensive. If that is indeed the case, and if it still is the case with the new application, then I would think that using -ffast-math probably would not give much of a boost to the performance anyway... Combine that with any chance of polluting the results and I would agree with the project and with Thierry that the option should not be used...

Agree? Disagree? Feel free to discuss...

(I'll watch, because this is out of my league, just making a casual, yet perhaps incorrect, observation...)

-Brian
ID: 6832
Milksop at try
Joined: 1 Oct 08
Posts: 106
Credit: 24,162,445
RAC: 0
Message 6841 - Posted: 27 Nov 2008, 1:41:19 UTC - in response to Message 6832.  

In regards to all the -ffast-math discussion, I have to ask a question based on what I noticed with the difference between a K6 and my K8 (Athlon64 3700+). The K6 has an enormously inferior FPU, yet using Milksop's app it was able to come close to or hit the 108 cr/hr limit. This suggests, at least to me, that the application is not very FPU-intensive.

It uses quite a lot of double-precision math, but you have to understand that the K6 FPU isn't that bad for some tasks. The theoretical throughput is only half that of a Pentium II, if I remember right, but the latencies are very low. That may compensate for the lower throughput in some cases. But here a K6 reaches only roughly 60% of the performance of a P2 at the same clock.

Regarding the -ffast-math discussion, I'm definitely with Augustine. Thierry, you have to see that very few computational problems require the precision you are proposing here. If an algorithm required such measures, it would also be sensitive to the arrangement of the arguments in the code. Furthermore, it would be hard to get the same results with x87 math as on a PPC, simply because of the longer internal mantissa of the x87 FPU. You would need to flush it to memory after every operation to be really sure. Nobody does that (it is simply too slow).
You may be right that in the general case the output may change (most probably very slightly and unnoticeably; one really needs some special cases to see real changes), but I would regard such an algorithm as close to numerically unstable. And believe me, Milkyway isn't at that point. Hell, the bug with the number of integration points didn't change the (decimal) output!
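
The sensitivity to "the arrangement of the arguments" is easy to see on a tiny case. The Python sketch below regroups a three-term sum by hand; this is the kind of reassociation that GCC documents as allowed under -ffast-math (through -fassociative-math), although Python itself never reorders anything. The three values are the usual textbook example and have nothing to do with MilkyWay.

```python
import math

def regrouping_gap(a, b, c):
    """Return both groupings of a three-term sum and their gap in units in the last place."""
    left = (a + b) + c
    right = a + (b + c)
    gap_in_ulps = abs(left - right) / math.ulp(right)
    return left, right, gap_in_ulps

left, right, gap = regrouping_gap(0.1, 0.2, 0.3)
print(f"(a + b) + c = {left!r}")
print(f"a + (b + c) = {right!r}")
print(f"gap         = {gap} ulp")
```

A gap of about one unit in the last place per regrouping is harmless for a stable algorithm and potentially serious for an unstable one, which is essentially the disagreement in this thread.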
ID: 6841
Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 6848 - Posted: 27 Nov 2008, 3:08:49 UTC - in response to Message 6789.  

One or two bits of mantissa, perhaps, but for *each* operation: the result after many consecutive ops can be quite significant.
Let me give you an example. Let's assume we only have 7 decimal places of precision in an FPU (there are many more in modern FPUs, but that's just to keep this example simple), and take this simple operation:
15 * 10 / 1000000000 = 0.00000015 (truncated to 0.0000001 because of our 7-decimal limitation)
Should it be optimized (for example, because of out-of-order ops optimizations) as:
10 / 1000000000 * 15
then you get 10 / 1000000000 = 0.00000001 = 0.0000000 (7 decimals)
and 0.0000000 * 15 = 0.0000000 in the end...

Believe me, the above effect is far from negligible...
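
The 7-decimal example can be replayed literally, and next to it the same two orderings in ordinary double precision. This is a Python sketch: the 7-place truncating format is the hypothetical one from the example (modelled with the `decimal` module), not how a real FPU works, and the function names are made up for illustration.

```python
import math
from decimal import Decimal, ROUND_DOWN

SEVEN_PLACES = Decimal("0.0000001")   # hypothetical fixed format with 7 decimal places

def truncate7(x):
    """Truncate (not round) a Decimal to 7 decimal places."""
    return x.quantize(SEVEN_PLACES, rounding=ROUND_DOWN)

def multiply_then_divide():
    step = truncate7(Decimal(15) * Decimal(10))
    return truncate7(step / Decimal(1000000000))

def divide_then_multiply():
    step = truncate7(Decimal(10) / Decimal(1000000000))
    return truncate7(step * Decimal(15))

fixed_a = multiply_then_divide()
fixed_b = divide_then_multiply()

# The same two orderings in IEEE 754 double precision (floating point, not fixed point).
float_a = 15.0 * 10.0 / 1e9
float_b = 10.0 / 1e9 * 15.0

print("fixed 7 places:", fixed_a, "vs", fixed_b)
print("double        :", float_a, "vs", float_b)
print("doubles agree to 1e-15 relative:", math.isclose(float_a, float_b, rel_tol=1e-15))
```

In the fixed 7-place format the second ordering loses the result entirely, as described. In floating point the exponent moves with the value, so the two orderings should differ by at most a few units in the last place; that is the gap between "a decimal digit" and "a bit" being argued over here.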

Because that's one decimal digit, not one bit, of difference. Besides, all FP operations have an average error of 0.5 bit by definition.

We're talking about a difference smaller than 15 decimal digits! If the output of the application is truncated to the default 5 digits, it'll never even show up.

HTH


The output of the application is 15 digits right now; the earlier versions had a bug where only 5 digits were printed.
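
Why the number of printed digits matters for this argument: a difference in the 15th decimal is invisible at 5 digits and visible at 15. A small Python sketch with two made-up values (not MilkyWay output) that differ by 3e-15:

```python
a = 0.123456789012345
b = a + 3e-15            # a deviation in the 15th decimal place

five_a, five_b = f"{a:.5f}", f"{b:.5f}"
fifteen_a, fifteen_b = f"{a:.15f}", f"{b:.15f}"

print("5 digits :", five_a, five_b, "same" if five_a == five_b else "different")
print("15 digits:", fifteen_a, fifteen_b, "same" if fifteen_a == fifteen_b else "different")
```

With 15 digits in the output file, last-place differences between builds become observable, so they can no longer be dismissed on the grounds that they would be cut off when printing.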
ID: 6848
Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 6855 - Posted: 27 Nov 2008, 5:02:55 UTC - in response to Message 6841.  

In regards to all the -ffast-math discussion, I have to ask a question based on what I noticed with the difference between a K6 and my K8 (Athlon64 3700+). The K6 has an enormously inferior FPU, yet using Milksop's app it was able to come close to or hit the 108 cr/hr limit. This suggests, at least to me, that the application is not very FPU-intensive.

It uses quite a lot of double-precision math, but you have to understand that the K6 FPU isn't that bad for some tasks. The theoretical throughput is only half that of a Pentium II, if I remember right, but the latencies are very low. That may compensate for the lower throughput in some cases. But here a K6 reaches only roughly 60% of the performance of a P2 at the same clock.

Regarding the -ffast-math discussion, I'm definitely with Augustine. Thierry, you have to see that very few computational problems require the precision you are proposing here. If an algorithm required such measures, it would also be sensitive to the arrangement of the arguments in the code. Furthermore, it would be hard to get the same results with x87 math as on a PPC, simply because of the longer internal mantissa of the x87 FPU. You would need to flush it to memory after every operation to be really sure. Nobody does that (it is simply too slow).
You may be right that in the general case the output may change (most probably very slightly and unnoticeably; one really needs some special cases to see real changes), but I would regard such an algorithm as close to numerically unstable. And believe me, Milkyway isn't at that point. Hell, the bug with the number of integration points didn't change the (decimal) output!


If I were a manager, what I'd want to know is whether the results are reliable across a sufficiently large sample, and how much performance gain is observed by enabling the option.

IMO, if we're talking less than a few percent, I'm not sure it's worth the risk...
ID: 6855
Thierry Godefroy
Joined: 29 Jul 08
Posts: 9
Credit: 2,200,784
RAC: 0
Message 6877 - Posted: 27 Nov 2008, 16:39:56 UTC - in response to Message 6841.  

You may be right that in the general case the output may change (most probably very slightly and unnoticeably; one really needs some special cases to see real changes), but I would regard such an algorithm as close to numerically unstable. And believe me, Milkyway isn't at that point. Hell, the bug with the number of integration points didn't change the (decimal) output!


It is noticeable enough in SETI that -ffast-math is a no-no for it, as it does give INVALID results. See:
http://www.pperry.f2s.com/boinc-compile-seti.htm
(look at the very bottom, two paragraphs before the end).

What you guys don't seem to understand is that an error, even in the 15th decimal in one operation, can spread (especially during multiplications) until it becomes quite significant (in the 3rd decimal of the final result, for example).

If you still don't want to understand and insist on using -ffast-math, then my guess is that the project admins will end up turning on validation by result comparison (like SETI does), meaning the throughput of the project will be divided by three (as they will need to have each WU calculated by at least three different computers and then compare the results, only keeping the ones that are close enough to each other to indicate a non-crippled result).
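
The redundancy scheme described here (several hosts compute the same WU and results are accepted only when enough of them agree) can be sketched in a few lines of Python. This is an illustration only, not the BOINC validator API; the function name, the tolerance and the quorum size are all assumptions.

```python
import math

def validate_by_quorum(results, rel_tol=1e-12, quorum=2):
    """Return the first result that at least `quorum` results agree with, else None.

    `results` holds one floating-point result per host for the same work unit.
    """
    for candidate in results:
        agreeing = [r for r in results if math.isclose(r, candidate, rel_tol=rel_tol)]
        if len(agreeing) >= quorum:
            return candidate
    return None

# Three hypothetical hosts: two agree to within rounding, one is clearly off.
print(validate_by_quorum([1.0, 1.0 + 1e-15, 1.5]))
# No two hosts agree: nothing can be validated.
print(validate_by_quorum([1.0, 2.0, 3.0]))
```

The tolerance is the design decision that matters: too tight and legitimate cross-platform differences fail validation, too loose and a slightly wrong result passes.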

This is my last post on this topic.
ID: 6877
Milksop at try
Joined: 1 Oct 08
Posts: 106
Credit: 24,162,445
RAC: 0
Message 6886 - Posted: 27 Nov 2008, 17:38:38 UTC - in response to Message 6877.  
Last modified: 27 Nov 2008, 17:45:55 UTC

You may be right that in the general case the output may change (most probably very slightly and unnoticeably; one really needs some special cases to see real changes), but I would regard such an algorithm as close to numerically unstable. And believe me, Milkyway isn't at that point. Hell, the bug with the number of integration points didn't change the (decimal) output!

What you guys don't seem to understand is that an error, even in the 15th decimal in one operation, can spread (especially during multiplications) until it becomes quite significant (in the 3rd decimal of the final result, for example).

And what you don't seem to understand is that an algorithm can actually work around this problem. If major deviations occur, I would say the algorithm may have a problem.
I was talking about the official app using a wrong number of integration points (82) and still generating the same output file as when using the correct number (30). If that is the case, don't tell me the output will change if the compiler rearranges the calculations a bit (to a mathematically identical expression). Actually, I have already done such things by hand, and MW appears to be quite stable against that.

If you still don't want to understand and insist on using -ffast-math, then my guess is that the project admins will end up turning on validation by result comparison (like SETI does), meaning the throughput of the project will be divided by three (as they will need to have each WU calculated by at least three different computers and then compare the results, only keeping the ones that are close enough to each other to indicate a non-crippled result).

I'm not insisting on using it; in fact, I have not used this option in my published versions either (just -O2, that was all). The reason is quite simple: you only turn to the compiler flags once you have done the high-level stuff already. The effort of testing all the combinations of compiler options, and maybe different WU types, is higher than the effort of getting the five or ten percent improvement from other changes to the code.

All I was saying is that it should be safe to use it here. But one of course has to check the results for deviations offline first (by comparing the results of the official app and a self-compiled one on the exact same WU). Everyone who compiles their own version should do that either way. If it gives the same result, you can use it. It is really simple.
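
The offline check suggested here can be automated. The Python sketch below compares two result files number by number; it assumes, purely for illustration, that an output file is whitespace-separated numbers, which may not match the real MilkyWay output format, and the function names are hypothetical.

```python
import math

def parse_numbers(text):
    """Parse whitespace-separated numbers from the text of an output file."""
    return [float(token) for token in text.split()]

def compare_outputs(official_text, custom_text, rel_tol=1e-12):
    """Return (index, official, custom) for every value that differs beyond rel_tol."""
    official = parse_numbers(official_text)
    custom = parse_numbers(custom_text)
    if len(official) != len(custom):
        raise ValueError("outputs contain a different number of values")
    return [
        (index, x, y)
        for index, (x, y) in enumerate(zip(official, custom))
        if not math.isclose(x, y, rel_tol=rel_tol)
    ]

# Made-up outputs of the official app and a self-compiled build for the same WU.
print(compare_outputs("1.0 2.0 3.0", "1.0 2.0 3.0"))     # no deviations
print(compare_outputs("1.0 2.0 3.0", "1.0 2.001 3.0"))   # one deviation, at index 1
```

An empty list for a representative set of WUs is the evidence asked for earlier in the thread; a single clean comparison shows only that this particular WU was unaffected.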
ID: 6886
Raistmer*
Joined: 27 Jun 09
Posts: 85
Credit: 39,805,338
RAC: 0
Message 27465 - Posted: 10 Jul 2009, 19:33:02 UTC

Current options for SETI opt apps (that validate OK):
Akv8: /Qfp-speculation:fast (ICC)
AP: /fp:fast (MSVC)
Both are for Windows targets, but I assume they are rough equivalents of the -ffast-math discussed here.
So SETI is not a very good example here.
But surely, SETI doesn't need the 15th digit. Most calculations are even done in float, not double.
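
To put "float, not double" in numbers: single precision carries a 24-bit significand (roughly 7 significant decimal digits), double precision 53 bits (roughly 15 to 16). A Python sketch that rounds a double to the nearest single-precision value through the `struct` module; the helper name is made up.

```python
import struct

def to_float32(x):
    """Round a Python float (an IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

x = 0.1
single = to_float32(x)
rel_diff = abs(single - x) / x

print(f"double: {x:.17g}")
print(f"single: {single:.17g}")
print(f"relative difference: {rel_diff:.2e}")
```

An application working in single precision already accepts errors around the 8th significant digit, so fast-math effects in the 15th cannot affect its validation the way they might in a code that prints 15 digits.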
ID: 27465
Overtonesinger
Joined: 15 Feb 10
Posts: 63
Credit: 1,836,010
RAC: 0
Message 51669 - Posted: 11 Nov 2011, 22:12:22 UTC - in response to Message 27465.  

Can someone compile the MilkyWay separation app, 32-bit or x64, for Windows?
- with these flags, among others, please (in ICC):

-SSE3_ATOM -O3

I would love to try and run it on all the Atoms at home!

(For example, on my two old Intel Atom minibooks, the SETI Astropulse v505 compiled with these flags is 33 percent faster than the best optimized non-Atom binary of it. So, for a 100-hour workunit, it is a lot faster!)

I would really love to try an Atom-optimized MW app here. :)

P.S. There are approx. 40 million devices with Intel Atom around the world... yet probably no one knows how many of them are running MilkyWay... or how many will start to run it once it becomes available in an optimized form for them. ;)
ID: 51669