milkyway3 v0.02 source released

Travis (Project administrator)
Message 38249 - Posted: 7 Apr 2010, 1:28:25 UTC

As I said before, we're moving over to a new application. Here's a preliminary version that we'll be using to test its validator and assimilator. It will still have to be updated with a new fitness function that uses a different model of the Milky Way galaxy, but I'm waiting for John Vickers to give me the go-ahead to start using that code -- that should come later this week.


In the meantime, we can still start testing this code, since it uses the same output format that John's new code will use. This new application should really improve server performance: rather than using search parameter files to send out new workunits, it takes the parameters from the command line. It also reports the fitness in stderr, which gets copied into the result's XML.
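
To make the I/O model concrete, here is a minimal sketch of the shape of such an application (my illustration only, with a made-up evaluate() stand-in -- not the actual milkyway3 source):

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the real fitness function (hypothetical). */
static double evaluate(const double *p, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += p[i] * p[i];
    return -s;
}

int main(int argc, char **argv)
{
    int i, n = argc - 1;
    double *params;
    double fitness;

    if (n < 1) {
        fprintf(stderr, "usage: %s <param> [<param> ...]\n", argv[0]);
        return 1;
    }

    /* No search parameter file: every parameter comes from the command line. */
    params = malloc(n * sizeof(double));
    for (i = 0; i < n; i++)
        params[i] = atof(argv[i + 1]);

    fitness = evaluate(params, n);

    /* No output file either: the fitness goes to stderr, which the
       server picks up from the result's XML. */
    fprintf(stderr, "<search_likelihood>%.25lf</search_likelihood>\n", fitness);

    free(params);
    return 0;
}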


What this means is that the server will create one less file for every workunit made (it currently creates one file) and receive one less file for every result returned (it currently receives one file), so sending workunits out and getting results back won't hammer the filesystem nearly as hard. The recent server crashes have been happening because the file deletion daemon simply can't keep up with the sheer mass of WU files being generated; that brings the server to a screeching halt, which then crashes the other daemons. This new application should fix that problem, so we really want to get everyone swapped over to it ASAP.


Contained in this news post (if you go to the forums) are the download location for the new source and directions on how to compile and test it.


--Travis

Travis (Project administrator)
Message 38250 - Posted: 7 Apr 2010, 1:34:33 UTC - in response to Message 38249.  
Last modified: 7 Apr 2010, 7:19:11 UTC

You can get the source here: http://milkyway.cs.rpi.edu/milkyway/download/mw3_v0.02.zip and http://milkyway.cs.rpi.edu/milkyway/download/mw3_v0.02.tar

The makefile in milkyway/bin should now work for both Linux and OS X. For it to compile, the BOINC code should be in the parent directory of milkyway; i.e., if you have a software directory:

/software
/software/boinc/... <---- boinc code here
/software/milkyway/ <---- milkyway code here

To test the code, move your binary into milkyway/bin/test_files/.

There's a test_small.sh and a test_large.sh. Running test_small.sh should output the results for the different stripes to stderr.txt; they should look something like the following. You can also run the tests individually (with test_application_on_.sh), so before each block of output I note which stripe was being run:

stripe 11:
21:21:38 (12732): Can't open init data file - running in standalone mode
0.0003444046777667009217108
0 34.16259801226563297405
1 667.25896879726303723146
2 336.16783412734901048680
-3.1250978838032099638155614
stock_osx_i686: 0.02 double
21:22:20 (12732): called boinc_finish

stripe 12:
shmget in attach_shmem: Invalid argument
21:22:22 (12736): Can't set up shared mem: -1. Will run in standalone mode.
0.0004449968586906448138156
0 93.28291463972249175640
1 693.31476977137049289013
2 379.87099590291904860351
-3.2208300766762310018975768
stock_osx_i686: 0.02 double
21:23:05 (12736): called boinc_finish

stripe 20:
shmget in attach_shmem: Invalid argument
21:23:07 (12741): Can't set up shared mem: -1. Will run in standalone mode.
0.0004719392291151077410694
0 10.71234415081183222185
1 513.40698152693448719219
-2.9853612927614374683571441
stock_osx_i686: 0.02 double
21:23:16 (12741): called boinc_finish

stripe 21:
shmget in attach_shmem: Invalid argument
21:23:18 (12745): Can't set up shared mem: -1. Will run in standalone mode.
0.0002240358727967291284713
0 9.54103668867131560205
1 458.95295766568307271882
-2.8894040571271419892696031
stock_osx_i686: 0.02 double
21:23:24 (12745): called boinc_finish

stripe 79:
shmget in attach_shmem: Invalid argument
21:23:26 (12748): Can't set up shared mem: -1. Will run in standalone mode.
0.0000794168946569495708349
0 98.08330586161639530474
-2.9467337959002959379972708
stock_osx_i686: 0.02 double
21:23:32 (12748): called boinc_finish

stripe 82:
shmget in attach_shmem: Invalid argument
21:23:34 (12751): Can't set up shared mem: -1. Will run in standalone mode.
0.0001708114073352570109077
0 101.09116755805874277030
-2.9856176357361423612246654
stock_osx_i686: 0.02 double
21:23:41 (12751): called boinc_finish

stripe 86:
shmget in attach_shmem: Invalid argument
21:23:43 (12754): Can't set up shared mem: -1. Will run in standalone mode.
0.0006493793897383106934057
0 374.13823192297877540113
-3.0279737877182579808277296
stock_osx_i686: 0.02 double
21:23:58 (12754): called boinc_finish


You want to make sure that the background_integral, stream_integral, and search_likelihood fields are all within 10^-11 (and probably within 10^-12, to be really safe) of the ones given here, to ensure they'll validate correctly. If you're computing the results on a CPU, they should match exactly or nearly exactly -- between my different machines they do.
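
If you want to check your numbers programmatically rather than by eye, a comparison like this is enough (my sketch, not the project's actual validator code):

#include <math.h>
#include <stdio.h>

/* Absolute-difference check against a reference value; 1e-11 mirrors the
   tolerance suggested above, and 1e-12 is the "really safe" margin. */
static int close_enough(double computed, double reference, double tol)
{
    return fabs(computed - reference) <= tol;
}

int main(void)
{
    /* stripe 11 search_likelihood from the small test above */
    double reference = -3.1250978838032099638155614;
    double computed  = -3.125097883803223;   /* e.g. from a GPU run */

    printf("difference %g -> %s\n", fabs(computed - reference),
           close_enough(computed, reference, 1e-11) ? "OK" : "MISMATCH");
    return 0;
}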

I'm currently running test_large.sh. This uses workunits the server is currently sending out to people, and I should have those values to compare against shortly. It's very important that these match to the same number of decimal places, because this is what we're actually going to be using, and the added complexity of these WUs could cause differences that won't show up in test_small.sh, which is only for testing purposes.

When I have these results (hopefully tomorrow), I'll edit this post with them as well.

--Travis

Travis (Project administrator)
Message 38321 - Posted: 7 Apr 2010, 21:04:42 UTC - in response to Message 38250.  
Last modified: 9 Apr 2010, 16:06:07 UTC

Here are the test_large.sh results, starting with stripe 11:

0.0003443992660761982451019
0 34.16147717411647732888
1 667.21617282424983841338
2 336.15696917331467830081
-3.1250868172779200371280695
stock_osx_i686: 0.02 double


stripe 12:

0.0004449918961078170144899
0 93.27312262639343032333
1 693.27679438258951449825
2 379.86009835109649657170
-3.2208204239369941923598617
stock_osx_i686: 0.02 double


stripe 20:
0.0004719020737142438654682
0 10.71148904735921014719
1 513.26961290476504018443
-2.9853108349120298647960681
stock_osx_i686: 0.02 double


stripe 21:
0.0002240104908331347615281
0 9.54048581001507933763
1 458.82543260018468345152
-2.8893404822235160267496212
stock_osx_i686: 0.02 double


stripe 79:
0.0000794095014117974698888
0 98.04755369273112819428
-2.9466812925013314838906808
stock_osx_i686: 0.02 double

stripe 82:
0.0001707988764891311657748
0 101.05333869703883920010
-2.9855677973465843955125365
stock_osx_i686: 0.02 double

stripe 86:
0.0006493010440547261445665
0 374.00915239954110802501
-3.0279071739719700673276748
stock_osx_i686: 0.02 double

Travis (Project administrator)
Message 38352 - Posted: 8 Apr 2010, 0:49:10 UTC - in response to Message 38321.  
Last modified: 9 Apr 2010, 16:10:35 UTC

I just noticed that the integral sizes for stripes 20, 21, 79, 82, and 86 (the large versions) were not large enough to reflect the current workunit sizes. To make sure we're getting the right values, I've updated the link to the source code files.

For stripes 20, 21, 79, and 82, the integrals should be:
r[min,max,steps]: 16.0, 22.5, 1400
mu[min,max,steps]: 133, 249, 1600
nu[min,max,steps]: -1.25, 1.25, 640


and for 86 (which does two integrals), they should be:

r[min,max,steps]: 16.000000, 22.500000, 1400
mu[min,max,steps]: 310.000000, 420.000000, 1600
nu[min,max,steps]: -1.250000, 1.250000, 640
number_cuts: 1
r_cut[min,max,steps][3]: 16.0000000000, 22.5000000000, 700
mu_cut[min,max,steps][3]: 21.6000000000, 22.3000000000, 800
nu_cut[min,max,steps][3]: -1.2500000000, 1.2500000000, 320
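
(For a sense of scale: the main grid alone is 1400 × 1600 × 640 ≈ 1.43 × 10^9 integration points per workunit, before the per-point convolution -- which is why the large runs take many hours on a CPU.)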


Cluster Physik
Message 38431 - Posted: 9 Apr 2010, 0:22:56 UTC
Last modified: 9 Apr 2010, 0:25:58 UTC

I have just built some kind of hybrid between MW2 and the new MW3. It takes the values from the search parameter file if it exists, and otherwise from the command line. It should work for the "sub projects" (an out file is written if the search parameter file was there). I just have to check all the different test units. It's a bit late here already, so for today just the stripe 11 WUs.
Bear in mind that double-precision values carry only slightly less than 16 decimal digits of precision. That's why I output only those 16 significant digits. The binary encoding of the number is unambiguously determined by those digits; anything more is simply redundant.
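
A quick C illustration of that point (mine, not from any of the apps discussed here):

#include <stdio.h>

int main(void)
{
    double x = -3.125086817279327;   /* one of the likelihood values below */

    printf("%.25f\n", x);   /* digits past ~16 significant ones are noise */
    printf("%.16g\n", x);   /* the 16 significant digits quoted here */
    printf("%.17g\n", x);   /* 17 digits are always enough to round-trip */
    return 0;
}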

stripe 11-small:
<background_integral>0.0003444046777667009217108</background_integral>
<stream_integral>0 34.16259801226563297405</stream_integral>
<stream_integral>1 667.25896879726303723146</stream_integral>
<stream_integral>2 336.16783412734901048680</stream_integral>
<search_likelihood>-3.1250978838032099638155614</search_likelihood>
<search_application>stock_osx_i686: 0.02 double</search_application>

<background_integral>0.0003444046777666962</background_integral>
<stream_integral>0 34.16259801227059</stream_integral>
<stream_integral>1 667.2589687972669</stream_integral>
<stream_integral>2 336.167834127362</stream_integral>
<search_likelihood>-3.125097883803223</search_likelihood>
<search_application>Gipsel_GPU_CAL_x64: 0.24 double</search_application>


stripe 11-large:
<background_integral>0.0003443992660761982451019</background_integral>
<stream_integral>0 34.16147717411647732888</stream_integral>
<stream_integral>1 667.21617282424983841338</stream_integral>
<stream_integral>2 336.15696917331467830081</stream_integral>
<search_likelihood>-3.1250868172779200371280695</search_likelihood>
<search_application>stock_osx_i686: 0.02 double</search_application>

<background_integral>0.0003443992660766585</background_integral>
<stream_integral>0 34.16147717899386</stream_integral>
<stream_integral>1 667.2161728244141</stream_integral>
<stream_integral>2 336.15696917356</stream_integral>
<search_likelihood>-3.125086817279327</search_likelihood>
<search_application>Gipsel_GPU_CAL_x64: 0.24 double</search_application>


The difference in the final likelihood between the stock OSX app and the ATI app is about 10^-12 for the large (production-size) WU. I think that is roughly the same range as the difference between the stock CPU and the CUDA application. I haven't tested it yet (takes too long ;), but I think my CPU versions will return virtually the same likelihood values as the ATI GPUs (maybe with a 10^-15 difference).

PS:
The small WU took 2 seconds or so, the large one about 8.5 minutes, on an HD3870.

PPS
@Travis:

I'm quite sure that the stock CPU and CUDA versions would get quite a bit closer to my results if they used Kahan summation for everything outside of the convolution loops, as I do (in the integration as well as the likelihood calculation).
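
For reference, compensated (Kahan) summation amounts to this (a generic sketch, not code lifted from any of the apps discussed here):

#include <stdio.h>

/* Kahan summation: a running correction term recovers the low-order bits
   that each addition would otherwise lose. */
static double kahan_sum(const double *v, int n)
{
    double sum = 0.0, c = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        double y = v[i] - c;    /* apply the stored correction */
        double t = sum + y;     /* low-order bits of y may be lost here */
        c = (t - sum) - y;      /* ...but are recovered into c */
        sum = t;
    }
    return sum;
}

int main(void)
{
    /* A naive left-to-right loop returns exactly 1.0 here; Kahan
       summation keeps the tiny terms. */
    double v[] = { 1.0, 1e-16, 1e-16, 1e-16, 1e-16 };
    printf("%.17g\n", kahan_sum(v, 5));   /* 1.0000000000000004 */
    return 0;
}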

You told me the results can be anywhere in the stderr.txt and additional output doesn't hurt. If you want to test it, here is the complete output of the run:

Can't set up shared mem: -1
Will run in standalone mode.
Running Milkyway@home ATI GPU application version 0.24 (Win64, CAL 1.3) by Gipsel
Parsing search parameters from command line.
CPU: AMD Phenom(tm) 9750 Quad-Core Processor (4 cores/threads) 2.44807 GHz (294ms)

CAL Runtime: 1.3.145
Found 3 CAL devices

Device 0: ATI Radeon HD2350/2400/3200/4200 (RV610/RV620) 64 MB local RAM (remote 28 MB cached + 4 MB uncached)
GPU core clock: 837 MHz, memory clock: 397 MHz
40 shader units organized in 1 SIMDs with 8 VLIW units (5-issue), wavefront size 32 threads
not supporting double precision

Device 1: ATI Radeon HD3800 (RV670) 512 MB local RAM (remote 28 MB cached + 0 MB uncached)
GPU core clock: 837 MHz, memory clock: 397 MHz
320 shader units organized in 4 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

Device 2: ATI Radeon HD3800 (RV670) 512 MB local RAM (remote 28 MB cached + 0 MB uncached)
GPU core clock: 837 MHz, memory clock: 397 MHz
320 shader units organized in 4 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

0 WUs already running on GPU 1
0 WUs already running on GPU 2
Starting WU on GPU 1

main integral, 640 iterations (1600x1400), 3 streams
predicted runtime per iteration is 649 ms (33.3333 ms are allowed), dividing each iteration in 20 parts
borders of the domains at 0 80 160 240 320 400 480 560 640 720 800 880 960 1040 1120 1200 1280 1360 1440 1520 1600
Calculated about 3.28897e+013 floatingpoint ops on GPU, 2.47165e+008 on FPU. Approximate GPU time 502.078 seconds.

<background_integral>0.0003443992660766585</background_integral>
<stream_integral>0 34.16147717899386</stream_integral>
<stream_integral>1 667.2161728244141</stream_integral>
<stream_integral>2 336.15696917356</stream_integral>

probability calculation (97434 stars)
Calculated about 3.23676e+009 floatingpoint ops on FPU.

<search_likelihood>-3.125086817279327</search_likelihood>
<search_application>Gipsel_GPU_CAL_x64: 0.24 double</search_application>

WU completed.
CPU time: 5.875 seconds,  GPU time: 502.078 seconds,  wall clock time: 504.791 seconds,  CPU frequency: 2.44809 GHz


When the exact same application is used with a search parameter file, it ignores the command-line arguments (and suppresses the warnings seen with the current version). Unfortunately the results are not identical to the 0.23 version, as 0.24 uses the updated sgrToGal coordinate conversion in "atSurveyGeometry.c" (no hardcoded rotation matrix anymore).

Running Milkyway@home ATI GPU application version 0.24 (Win64, CAL 1.3) by Gipsel
Search parameter file found. Ignoring search parameters on command line.
CPU: AMD Phenom(tm) 9750 Quad-Core Processor (4 cores/threads) 2.44808 GHz (347ms)
[..]

Travis (Project administrator)
Message 38433 - Posted: 9 Apr 2010, 3:46:51 UTC - in response to Message 38431.  

I've updated the info with stripes 79 and 82, just one more to go. My poor old laptop has been crunching 24/7 trying to get the values out.

How does your application compare on the other stripes? There are some different settings which might affect the end result (especially between stripes 11,12 and 20,21 and 79,82 and 86).

I'm going to have to talk to Anthony about Kahan summation. I know I implemented it, but I'm not quite sure if he's using it.

Other than that, the results look really good :) Having things not be identical to 0.23 doesn't really matter, because we'll be using this for milkyway3 (not the currently running application), so it should match what we've got.

I'm just finishing up some work-generation changes and should hopefully start putting out WUs for the new application tomorrow.

Cluster Physik
Message 38460 - Posted: 9 Apr 2010, 12:35:56 UTC - in response to Message 38433.  
Last modified: 9 Apr 2010, 12:36:28 UTC

> I've updated the info with stripes 79 and 82, just one more to go. My poor old laptop has been crunching 24/7 trying to get the values out.
>
> How does your application compare on the other stripes? There are some different settings which might affect the end result (especially between stripes 11,12 and 20,21 and 79,82 and 86).

I just modified the batch files and set the integral sizes in the astronomy parameters to the values you mentioned above. It will take half an hour or so until I have the values for all stripes.

Cluster Physik
Message 38463 - Posted: 9 Apr 2010, 13:08:53 UTC - in response to Message 38250.  
Last modified: 9 Apr 2010, 13:10:03 UTC

Small tests ("stock app" is the OSX one, with the values Travis posted above):

stripe 11:
stock app: -3.125097883803210
ATI3 0.24: -3.125097883803223

stripe 12:
stock app: -3.220830076676231
ATI3 0.24: -3.220830076676198

stripe 20:
stock app: -2.985361292761437
ATI3 0.24: -2.985361292761449

stripe 21:
stock app: -2.889404057127142
ATI3 0.24: -2.889404057127163

stripe 79:
stock app: -2.946733795900296
ATI3 0.24: -2.946733795900271

stripe 82:
stock app: -2.985617635736142
ATI3 0.24: -2.985617635736161

stripe 86:
stock app: -3.027973787718258
ATI3 0.24: -3.027973787718241

====================================

Large Test:

stripe 11:
stock app: -3.125086817277920
ATI3 0.24: -3.125086817279327

stripe 12:
stock app: -3.220820423936994
ATI3 0.24: -3.220820423937265

stripe 20:
stock app: -2.985310834912030
ATI3 0.24: -2.985310834396576

stripe 21:
stock app: -2.889340482223516
ATI3 0.24: -2.889340481681885

stripe 79:
stock app: -2.946681292501331
ATI3 0.24: -2.946681291653775

stripe 82:
stock app: -2.985567797346584
ATI3 0.24: -2.985567796521971

stripe 86:
stock app: not available yet
ATI3 0.24: -3.027907173993795

==================================

All results are given with 16 significant decimal digits. The results of the large WUs for stripes 20, 21, 79, 82, and 86 deviate significantly. That coincides with the wrong integral sizes defined in the astronomy parameter file for those stripes, as mentioned by Travis above. I've run the calculations with the corrected values given in the linked post; I can only assume the values from the stock app are for the original (too small) sizes. Can you say something about it, Travis?

Travis (Project administrator)
Message 38480 - Posted: 9 Apr 2010, 16:07:47 UTC - in response to Message 38463.  
Last modified: 9 Apr 2010, 16:10:25 UTC


> All results are given with 16 significant decimal digits. The results of the large WUs for stripes 20, 21, 79, 82, and 86 deviate significantly. That coincides with the wrong integral sizes defined in the astronomy parameter file for those stripes, as mentioned by Travis above. I've run the calculations with the corrected values given in the linked post; I can only assume the values from the stock app are for the original (too small) sizes. Can you say something about it, Travis?


I just updated with stripe 86 (which seems to match yours), but I'm sure the other values (20, 21, 79, 82) are for the right integral sizes, because I fixed them all before running them, and they each took 10-12 hours on my laptop. Smaller sizes would have been much quicker.

*edit* Doh, I see the problem: 20, 21, 79, and 82 should be 640 for the nu value (not 320). That most likely explains the discrepancy. Going to fix the post about the sizes.

Cluster Physik
Message 38484 - Posted: 9 Apr 2010, 17:31:38 UTC - in response to Message 38463.  

Large tests, updated with the fixed integral sizes:

stripe 11:
stock app: -3.125086817277920
ATI3 0.24: -3.125086817279327

stripe 12:
stock app: -3.220820423936994
ATI3 0.24: -3.220820423937265

stripe 20:
stock app: -2.985310834912030
ATI3 0.24: -2.985310834913758


stripe 21:
stock app: -2.889340482223516
ATI3 0.24: -2.889340482226973

stripe 79:
stock app: -2.946681292501331
ATI3 0.24: -2.946681292507772

stripe 82:
stock app: -2.985567797346584
ATI3 0.24: -2.985567797359872

stripe 86:
stock app: -3.027907173971970
ATI3 0.24: -3.027907173993795

====================================

The differences are generally in the 10^-12 range, but for stripes 82 and 86 they reach 10^-11. Maybe one should really look into reducing that a bit.

Travis (Project administrator)
Message 38501 - Posted: 9 Apr 2010, 21:56:10 UTC - in response to Message 38484.  

> The differences are generally in the 10^-12 range, but for stripes 82 and 86 they reach 10^-11. Maybe one should really look into reducing that a bit.


I wonder if using Kahan summation on the CPU would bring things closer together at all.

Cluster Physik
Message 38502 - Posted: 9 Apr 2010, 22:10:32 UTC - in response to Message 38501.  
Last modified: 9 Apr 2010, 22:19:36 UTC

> > The differences are generally in the 10^-12 range, but for stripes 82 and 86 they reach 10^-11. Maybe one should really look into reducing that a bit.
>
> I wonder if using Kahan summation on the CPU would bring things closer together at all.

It definitely worked here ;)
You can see it when comparing the results of the ATI apps and my optimized CPU apps (starting with the 0.20 versions). They are extremely close (I took extra care to do some things as similarly as possible in the ATI and CPU versions). That is simply the effect of the Kahan summation and the "scale shift trick" in calculate_likelihood. The tests I did back in September last year with some WUs (1600x700x320, 120 convolution steps) yielded truly identical results for all my CPU and GPU versions.

Did you read my PM? It should be easy to put those things in and check what they do.

Travis (Project administrator)
Message 38512 - Posted: 10 Apr 2010, 0:29:51 UTC - in response to Message 38502.  

> Did you read my PM? It should be easy to put those things in and check what they do.


I did, and I'm gonna give it a shot. I'm just wondering whether doing Kahan summation and the scale shift trick will bring the CPUs closer to the GPUs or farther away.

Cluster Physik
Message 38513 - Posted: 10 Apr 2010, 1:49:04 UTC - in response to Message 38512.  
Last modified: 10 Apr 2010, 1:57:40 UTC

> > Did you read my PM? It should be easy to put those things in and check what they do.
>
> I did, and I'm gonna give it a shot. I'm just wondering whether doing Kahan summation and the scale shift trick will bring the CPUs closer to the GPUs or farther away.

You mean comparing a CPU version that has the changes to a CUDA app without them?

When it is implemented in both, it will bring them closer together (as can be seen from my versions starting with 0.20; the CPU and GPU versions both have it). When the results could still be seen in the task details for a short while, one could see that this also holds for current production WUs (just look at the results of the 0.21 GPU and the 64-bit 0.20 SSE3 CPU application; they are identical).

When implemented only in the CPU version, it may not help that much. It also depends on how you do the reduction (summing the results of all threads) in the CUDA version. The ATI version already did a tree-like summation before, which was better than what is currently done in the stock CPU application, so the Kahan summation during the integration was a less marked improvement than the changes to the likelihood calculation.
But today's larger WUs may shift that, as the summation method gets relatively more important with larger integrals. Since the deviation between the stock CPU version and my version implementing those changes grows with the integral sizes, one may conclude that a better summation outweighs the positive effect of the scale shift.

The scale shift is extremely easy to do. You just need to change two lines of code or so, and it should work exactly the same in the CUDA version. It simply buys you roughly one more digit of precision for that step (if the integral values are not already too "noisy" from accumulated numerical imprecision). Maybe you should start with that for both versions and see what happens.
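
To illustrate the idea (my own guess at the shape of the trick, not Cluster Physik's actual code): subtract a fixed offset near the expected per-star log-probability while accumulating, so the partial sums stay small, and add the offset back once at the end.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "scale shift" in a likelihood sum: accumulating
   (log p - shift) keeps the running sum near zero, so each addition
   absorbs less rounding error; shift = 0 gives the plain version. */
static double mean_log_likelihood(const double *prob, int n, double shift)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum += log(prob[i]) - shift;
    return sum / n + shift;
}

int main(void)
{
    enum { N = 97434 };              /* star count from the log above */
    double *p = malloc(N * sizeof(double));
    int i;
    for (i = 0; i < N; i++)          /* synthetic probabilities near e^-3 */
        p[i] = exp(-3.0) * (1.0 + 1e-6 * (i % 100));

    printf("plain  : %.17g\n", mean_log_likelihood(p, N, 0.0));
    printf("shifted: %.17g\n", mean_log_likelihood(p, N, -3.0));
    free(p);
    return 0;
}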

After that you can look into the better summation methods, as this may be a bit more tricky for the CUDA version than for the CPU version. The ATI cards do a combination of a Kahan summation and a tree-like Kahan sum (where 2 values with 2 correction terms are added together, resulting in 1 value with 1 correction term) as the very last step (only once per WU). As said, it didn't change too much compared to the earlier (simpler) approach, but this really depends on how it is done in the CUDA version now (I guess I have to look into that). From my experience, it moved the CPU results more than the ATI results, because the ATI version used a slightly better method (a pairwise/tree-like sum) from the beginning.
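
And the pairwise/tree-like reduction mentioned here, in generic form (again just an illustrative sketch):

#include <stdio.h>

/* Pairwise (tree-like) summation: recursively sum the two halves and
   add them; the rounding error grows like O(log n) instead of the O(n)
   of a simple left-to-right loop. */
static double pairwise_sum(const double *v, int n)
{
    double s = 0.0;
    int i, half;

    if (n <= 8) {                    /* small base case: plain loop */
        for (i = 0; i < n; i++)
            s += v[i];
        return s;
    }
    half = n / 2;
    return pairwise_sum(v, half) + pairwise_sum(v + half, n - half);
}

int main(void)
{
    double v[1000];
    int i;
    for (i = 0; i < 1000; i++)
        v[i] = 0.1;
    printf("%.17g\n", pairwise_sum(v, 1000));   /* very close to 100 */
    return 0;
}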

Anthony Waters
Message 38566 - Posted: 11 Apr 2010, 3:03:47 UTC

Large:

stripe 11:
CUDA new: -3.125086817279229
stock app: -3.125086817277920
ATI3 0.24: -3.125086817279327

stripe 12:
CUDA new: -3.220820423937230
stock app: -3.220820423936994
ATI3 0.24: -3.220820423937265

stripe 20:
CUDA new: -2.985310834913899
stock app: -2.985310834912030
ATI3 0.24: -2.985310834913758

stripe 21:
CUDA new: -2.889340482227060
stock app: -2.889340482223516
ATI3 0.24: -2.889340482226973

stripe 79:
CUDA new: -2.946681292507908
stock app: -2.946681292501331
ATI3 0.24: -2.946681292507772

stripe 82:
CUDA new: -2.985567797360019
stock app: -2.985567797346584
ATI3 0.24: -2.985567797359872

stripe 86:
CUDA new: -3.027907173993949
stock app: -3.027907173971970
ATI3 0.24: -3.027907173993795

The CUDA version has the added Kahan summation.


Travis (Project administrator)
Message 38568 - Posted: 11 Apr 2010, 4:38:16 UTC - in response to Message 38566.  

That's interesting. Looks like the two GPU apps are closer to each other than to the CPU one. Going to have to try the Kahan summation in the CPU app to see if it brings it closer to them.

Cluster Physik
Message 38577 - Posted: 11 Apr 2010, 10:34:48 UTC - in response to Message 38566.  

> [snipping the results]
>
> The CUDA version has the added Kahan summation.

Looks great. The differences between CUDA and ATI are now down to the 10^-13 range. Have you also tried the "scale shift" thing? It improves the accuracy of the likelihood calculation (are you using Kahan sums there too?), if that is the limit now.

Cluster Physik
Message 38784 - Posted: 16 Apr 2010, 18:02:06 UTC - in response to Message 38568.  

> That's interesting. Looks like the two GPU apps are closer to each other than to the CPU one. Going to have to try the Kahan summation in the CPU app to see if it brings it closer to them.

Any news on this?

Travis (Project administrator)
Message 38797 - Posted: 17 Apr 2010, 0:41:44 UTC - in response to Message 38784.  

> > That's interesting. Looks like the two GPU apps are closer to each other than to the CPU one. Going to have to try the Kahan summation in the CPU app to see if it brings it closer to them.
>
> Any news on this?


Getting there :) Had a couple things come up. Should have some results this weekend.

Travis (Project administrator)
Message 39018 - Posted: 22 Apr 2010, 19:40:20 UTC - in response to Message 38797.  
Last modified: 22 Apr 2010, 19:44:49 UTC

Some results from testing the Kahan summation method with the CPU application ("stock ksm" is the CPU app with Kahan summation; "stock app" is the old version):

stripe 11:
stock app: -3.125097883803210
stock ksm: -3.1250978838032299478300047
ATI3 0.24: -3.125097883803223

stripe 12:
stock app: -3.220830076676231
stock ksm: -3.2208300766762043565449858
ATI3 0.24: -3.220830076676198

stripe 20:
stock app: -2.985361292761437
stock ksm: -2.9853612927614427974276623
ATI3 0.24: -2.985361292761449

stripe 21:
stock app: -2.889404057127142
stock ksm: -2.8894040571271575323919478
ATI3 0.24: -2.889404057127163

stripe 79:
stock app: -2.946733795900296
stock ksm: -2.9467337959002648517525813
ATI3 0.24: -2.946733795900271

stripe 82:
stock app: -2.985617635736142
stock ksm: -2.9856176357361552398117510
ATI3 0.24: -2.985617635736161

stripe 86:
stock app: -3.027973787718258
stock ksm: -3.0279737877182353322780273
ATI3 0.24: -3.027973787718241


I'm gonna start crunching the large-sized workunits to get some values for those. I should be releasing updated code later today.

At least for the smaller WUs, it looks like it brought the values closer to the ATI application, sometimes overshooting :P which is kind of interesting. Generally it looks like we gained at least one significant digit of accuracy from it. It will be interesting to see the difference on the larger workunits.
