Milkyway Nbody ST vs. MT: real benchmarking

Link
Message 77668 - Posted: 17 Oct 2025, 16:33:03 UTC

As some of you might have read, I did some comparisons between MT and ST tasks in the past. However, those were based only on credits, and I only compared 2-thread tasks against single-thread tasks, so it wasn't a very good comparison. After reading a lot about multithreading and cache sizes on PrimeGrid, I decided to do a real benchmark, in particular since I had noticed that, compared to my wingmen, my computer sometimes needed a ridiculous amount of CPU time for the same amount of work.

So now I tried 7 threads per WU (I use 14 of the 16 threads of my CPU for CPU crunching; the other two feed the iGPU). First I ran the WU in ST mode during normal production, then I changed my settings to 7 threads per WU, set up a second instance of BOINC and ran the same WU there in 7-thread mode, alongside another 7-thread WU and one Einstein GPU WU in the first instance of BOINC.

The WU I have used is de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512.
           run time     CPU time
ST          11214 s      11205 s
MT (1)        387 s       2038 s
MT (2)        391 s       2046 s

So there's a huge difference in the CPU time required, which suggests we see the same cache effects here as on PrimeGrid; we just don't know how much cache is needed per WU. The runtime of the Einstein WU on the iGPU also went down from about 18.5k seconds to around 15.9k.


In case someone wants to run some benchmarks on their own computer, this is the WU (I added a number at the end to have several copies for testing, and extended the deadline):
<workunit>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>193</version_num>
    <rsc_fpops_est>40006000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>400060000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>104858000.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 156791432 -np 12 -p 3.45074 1 0.339685 0.446913 27.1153 0.706509 41.902 21.1847 -2.65629 -6.82502 -46.1851 95741.7
    </command_line>
    <file_ref>
        <file_name>EMD_v193_OCS_orbit_fitting_lmc_pm_old_eps2.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2025_pm.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01_0</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>193</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
    <received_time>1760283432.046444</received_time>
</result>

I'll do some more tests when I have time. I have some other WUs saved from the past, but they were for older application versions, so I have no idea whether they will still work with the current one. In any case, I think it's pretty obvious that one thread per task isn't optimal, as I had thought after my simple comparison in the past.
Link
Message 77674 - Posted: 19 Oct 2025, 16:49:40 UTC - in response to Message 77668.  
Last modified: 19 Oct 2025, 17:16:18 UTC

Now I tried all 14 threads for one WU; that's worse than 2× 7-thread WUs.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     11214.xx s   11205.xx s     107.8
MT             7       386.65 s    2038.38 s     446.5
MT             7       390.82 s    2046.41 s     442.1
MT            14       213.93 s    1990.48 s     403.9

So crunching this particular WU on half of the available threads instead of just one increases production by a factor of 4.12. I didn't expect that after my previous tests, and in particular not such a huge difference between ST and MT.
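
For reference, here is a minimal sketch of the arithmetic the WU/day column appears to be based on (parallel task slots × 86400 s ÷ run time); the 7-thread rows come out a few tenths above the table values, so the rounding of the measured run times probably differs slightly:

# Sketch only: assumes floor(14 / threads per WU) tasks run side by side,
# each finishing one WU per "run time" seconds.
def wu_per_day(run_time_s: float, threads_per_wu: int, crunching_threads: int = 14) -> float:
    parallel_tasks = crunching_threads // threads_per_wu
    return parallel_tasks * 86400.0 / run_time_s

print(round(wu_per_day(11214.0, 1), 1))    # ST,  1 thread   -> ~107.9
print(round(wu_per_day(386.65, 7), 1))     # MT,  7 threads  -> ~446.9
print(round(wu_per_day(213.93, 14), 1))    # MT, 14 threads  -> ~403.9
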
Link
Message 77675 - Posted: 20 Oct 2025, 11:14:13 UTC - in response to Message 77674.  

The final test was with 3× 5-thread WUs, i.e. running on 15 threads.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     11214.xx s   11205.xx s     107.8
MT             5       613.07 s    2350.41 s     422.8 (15 threads)
MT             7       386.65 s    2038.38 s     446.5
MT             7       390.82 s    2046.41 s     442.1
MT            14       213.93 s    1990.48 s     403.9

So 3×5 is slightly slower than 2×7 and will likely also have a negative impact on iGPU production, so it's definitely not worth it, at least not on my Ryzen 5700G.


I also tested another, longer-running WU, though with a slightly older application version. That shouldn't make much of a difference, I guess.

WU: de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362
Application version: 1.90
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     86598.57 s   86355.62 s     13.97
MT             7      7560.51 s   45338.19 s     22.86

So again a huge increase in production, though not as big as for the other WU, which is again surprising: I'd expect the single-core phase at the beginning of each WU to have a bigger impact on short WUs, but apparently there are other, more significant factors than that.
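
To make that expectation explicit, here is a generic Amdahl's-law sketch with made-up numbers (a fixed 60-second single-threaded start-up phase in a hypothetical short and long WU). The measured results above go the other way round (the short WU gained more), so this start-up phase is clearly not the dominant factor here:

# Sketch only: the serial phase and the ST totals are made-up, not measured.
def mt_run_time(serial_s: float, parallel_s: float, threads: int) -> float:
    # Only the parallel part scales with the thread count.
    return serial_s + parallel_s / threads

for st_total in (400.0, 12000.0):      # hypothetical short and long WU (ST seconds)
    serial = 60.0                      # assumed fixed single-threaded start-up phase
    mt = mt_run_time(serial, st_total - serial, 7)
    print(f"ST {st_total:7.0f} s -> MT(7) {mt:7.1f} s, speed-up {st_total / mt:.2f}x")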

This is the WU in case someone wants to test with it:
<workunit>
    <name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</name>
    <app_name>milkyway_nbody</app_name>
    <version_num>190</version_num>
    <rsc_fpops_est>20629100000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>206291000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>52428800.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 411592566 -np 11 -p 2.49417 1 0.0338769 0.440317 12.5767 0.965698 46.2133 23.2603 -191.888 75.0999 135.272
    </command_line>
    <file_ref>
        <file_name>EMD_v190_OCS_orbit_fitting.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2023_version2.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362_1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>190</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
    <received_time>1750690560.237772</received_time>
</result>

Link
Message 77728 - Posted: 20 Nov 2025, 12:51:11 UTC

More benchmarking results, now also without an Einstein WU running on the iGPU of my Ryzen 5700G.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day    relative speed
ST             1     11214.xx s   11205.xx s     14x CPU + 1x iGPU     107.8         22.15%
MT             2      1633.08 s    3089.80 s     16x CPU + 0x iGPU     423.2         86.98%
MT             4       742.80 s    2534.84 s     16x CPU + 0x iGPU     465.3         95.62%
MT             5       613.07 s    2350.41 s     15x CPU + 1x iGPU     422.8         86.89%
MT             7       386.65 s    2038.38 s     14x CPU + 1x iGPU     446.5         91.85%
MT             7       390.82 s    2046.41 s     14x CPU + 1x iGPU     442.1         90.87%
MT             7       368.25 s    2024.03 s     14x CPU + 0x iGPU     469.2         96.43%
MT             8       355.12 s    2143.25 s     16x CPU + 0x iGPU     486.6    --> 100.00% <--
MT            14       213.93 s    1990.48 s     14x CPU + 1x iGPU     403.9         82.00%
MT            16       204.70 s    2130.02 s     16x CPU + 0x iGPU     422.1         86.74%



And here is a somewhat longer-running WU: de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day    relative speed
MT             2      2736.81 s    5205.73 s     16x CPU + 0x iGPU     252.6         84.79%
MT             4      1208.70 s    4198.17 s     16x CPU + 0x iGPU     285.9         96.00%
MT             7       612.74 s    3286.55 s     14x CPU + 1x iGPU     282.0         94.68%
MT             7       601.14 s    3388.61 s     14x CPU + 0x iGPU     287.5         96.51%
MT             8       580.16 s    3608.00 s     16x CPU + 0x iGPU     297.8    --> 100.00% <--
MT            16       328.15 s    3518.88 s     16x CPU + 0x iGPU     263.3         88.40%



This is the WU in case someone wants to test with it:
<workunit>
    <name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>193</version_num>
    <rsc_fpops_est>30022500000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>3002250000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>104858000.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 233088231 -np 12 -p 5.28862 1 0.170294 0.298857 1.26323 0.0213518 45.8 21.5 -185.5 54.7 147.4 449866
    </command_line>
    <file_ref>
        <file_name>EMD_v193_OCS_north_no_orbit_old_eps2.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2025_north.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>

<result>
    <name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212__1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>193</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212_</wu_name>
    <report_deadline>1794443214.000000</report_deadline>
    <received_time>1763406414.476450</received_time>
</result>

Brian Nixon

Message 77742 - Posted: 22 Nov 2025, 18:40:33 UTC

Thanks for this. When I started MW@h I too observed that multi-threaded tasks use less than 100% of the CPU because of the synchronisation needed between threads, assumed that running multiple single-threaded tasks would therefore be more efficient overall, and set my preferences accordingly. And later I similarly noticed that wingmen almost invariably complete the same workunits in substantially less time (beyond differences explained by faster CPUs)…

Here are some numbers from my i5-3320M:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6281         6243           4× CPU                55.0
MT             2      2871         5502           4× CPU                60.2
MT             4      1440         5147           4× CPU                60.0


For reference, these are the results I got:
<search_likelihood>-27342.660233678041550</search_likelihood>
<search_likelihood_EMD>-14.106647272974579</search_likelihood_EMD>
<search_likelihood_Mass>-3771.532869044177005</search_likelihood_Mass>
<search_likelihood_Beta>-822.641452301281788</search_likelihood_Beta>
<search_likelihood_BetaAvg>-274.418758463499103</search_likelihood_BetaAvg>
<search_likelihood_VelAvg>-383.897846477010262</search_likelihood_VelAvg>
<search_likelihood_Dist>-1141.268630669829690</search_likelihood_Dist>
<search_likelihood_PM_dec>-19182.767423587025405</search_likelihood_PM_dec>
<search_likelihood_PM_ra>-1752.026605862245560</search_likelihood_PM_ra>

I will add numbers from some other machines in due course.
Brian Nixon

Message 77743 - Posted: 23 Nov 2025, 0:56:57 UTC
Last modified: 23 Nov 2025, 0:57:28 UTC

Here are some numbers from my i7-7700HQ:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6193         6117           8× CPU               111.6
MT             2      2500         4825           8× CPU               138.2
MT             4      1171         4184           8× CPU               147.6
MT             8       528         3407           8× CPU               163.7
Link
Message 77744 - Posted: 23 Nov 2025, 8:19:36 UTC - in response to Message 77743.  

Interesting; it seems like on older CPUs with fewer cores (and less cache), running one WU at a time on all threads might be best.
bobsmith18

Message 77745 - Posted: 23 Nov 2025, 19:36:08 UTC - in response to Message 77744.  

That statement is somewhat at odds with the figures in the final column of his table.
Perhaps the OP could explain how he ran his tests.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Brian Nixon

Message 77789 - Posted: 6 Dec 2025, 20:07:53 UTC

We can’t really draw conclusions from a sample size of 1. But it’s an interesting exercise.

Here are some numbers from my dual Xeon E5-2650 v2:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6613         6605           2× 16× CPU           418.1
MT             2      2906         5599           2× 16× CPU           475.7
MT             4      1437         5202           2× 16× CPU           480.9
MT             8       749         4932           2× 16× CPU           461.2
MT            16       451         4914           2× 16× CPU           382.8
MT            32       320         4977           2× 16× CPU           270.2

All these results seem to support the hypothesis that there’s a cache effect, but only up to a point.
Link
Message 77790 - Posted: 7 Dec 2025, 12:22:57 UTC - in response to Message 77789.  

We can’t really draw conclusions from a sample size of 1. But it’s an interesting exercise.
You can always choose any other WUs from your current cache; there's no need to limit yourself to this one. ;-)


All these results seem to support the hypothesis that there’s a cache effect, but only up to a point.
I guess the 32-thread example also shows the expected performance drop when the WU is split over two CPUs. Did you do something to make sure that the WUs with fewer threads ran all their threads on the same CPU?
Brian Nixon

Message 77791 - Posted: 7 Dec 2025, 13:17:51 UTC - in response to Message 77790.  

You can always choose any other WUs
Of course – but then we wouldn’t be comparing like with like for the purposes of this little experiment. I was more acknowledging that by reporting only one result per configuration I am ignoring the variance that a larger sample would reveal.

the 32-thread example also shows the expected performance drop when the WU is split over two CPUs
Indeed. I was expecting that one to be woeful (additional overhead from inter-CPU synchronisation combined with the reduced efficiency of adding more threads), and wouldn’t run “production” WUs like that, but I thought I’d try it anyway for completeness.

Did you do something to make sure that the WUs with fewer threads ran all their threads on the same CPU?
Yes: I wrote a program that pins each task’s threads to selected cores to prevent things moving about during the test.
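
For anyone who wants to try something similar, here is a minimal sketch of the idea using Python's psutil (not the actual program referred to above; the process-name filter "milkyway_nbody" and the core groups are assumptions to adapt to your own setup — cpu_affinity() restricts the whole process, and with it all of the task's threads, to the listed logical CPUs):

# Sketch only: assumes psutil is installed and the N-body processes are identifiable by name.
import psutil

CORE_SETS = [[0, 1, 2, 3], [4, 5, 6, 7]]   # e.g. two 4-thread tasks on separate core groups

tasks = [p for p in psutil.process_iter(["name"])
         if p.info["name"] and "milkyway_nbody" in p.info["name"].lower()]

for proc, cores in zip(tasks, CORE_SETS):
    proc.cpu_affinity(cores)               # pin this task (all its threads) to its own cores
    print(f"pid {proc.pid} pinned to logical CPUs {cores}")
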
Bigeagle

Message 77792 - Posted: 8 Dec 2025, 19:04:36 UTC

I don't know how to run workunits manually for benchmarking, but I had a look at the results from time to time and noticed that others running MT WUs usually needed a lot more CPU time. Sure, most of those CPUs are older, but it's still more than I would expect. I'm running a 9800X3D limited to 95 W PPT, usually with full CPU usage for BOINC.
For example:
a 9950X https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011801174 with only slightly more CPU time. Unless that one is throttled as well, I'd expect it to be faster.
a 9600X, the only one I found with less CPU time: https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011752654
Most of the time I see things like:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011828447
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011718606

All in all, I got the impression that MT WUs are slightly to considerably slower than running multiple single-threaded ones. I assumed issues due to synchronization or locking and switched back to non-MT.
Link
Message 77793 - Posted: 10 Dec 2025, 12:36:53 UTC - in response to Message 77792.  

I don't know how to run workunits manually for benchmarking
I used this guide to set up a second BOINC instance, copied everything needed to run Milkyway from the first to the second instance, edited client_state.xml and disabled network activity for the second instance before starting it. If anyone is interested, I can write more detailed instructions.
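
Very roughly, the idea looks like this (a sketch only, not the guide itself; the paths, the project directory name and the BOINC client options --allow_multiple_clients, --dir and --gui_rpc_port are assumptions to check against your own installation):

# Sketch only: paths and the project directory name are assumptions.
import shutil
import subprocess
from pathlib import Path

BOINC_EXE = Path(r"C:\Program Files\BOINC\boinc.exe")   # adjust to your installation
MAIN_DATA = Path(r"C:\ProgramData\BOINC")               # normal BOINC data directory
BENCH_DATA = Path(r"C:\BOINC_bench")                    # data directory for the 2nd instance
PROJECT = "milkyway.cs.rpi.edu_milkyway"                # assumed project directory name

# Copy the MilkyWay project directory (application + input files) into the 2nd instance.
(BENCH_DATA / "projects").mkdir(parents=True, exist_ok=True)
shutil.copytree(MAIN_DATA / "projects" / PROJECT,
                BENCH_DATA / "projects" / PROJECT, dirs_exist_ok=True)

# client_state.xml still has to be prepared by hand (workunit/result entries, network off).
subprocess.Popen([str(BOINC_EXE), "--allow_multiple_clients",
                  "--dir", str(BENCH_DATA), "--gui_rpc_port", "31417"])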


I'm running a 9800X3D limited to 95 W PPT, usually with full CPU usage for BOINC.
(...)
All in all, I got the impression that MT WUs are slightly to considerably slower than running multiple single-threaded ones. I assumed issues due to synchronization or locking and switched back to non-MT.
It's not a surprise that a 9800X3D with its 96 MB of L3 cache for 8 cores / 16 threads behaves differently than, for example, my 5700G with the same number of threads but only 16 MB of L3 cache (and slower RAM). One Milkyway WU needs about 20 MB of RAM, so 4-5 of them fit into your L3 cache completely. From the results posted so far, I guess Milkyway Nbody runs best if it gets around 5 MB of cache per task, so yes, it's possible that on your CPU single-thread tasks are the most efficient option.
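
As a small sketch, the rough cache arithmetic behind that comparison (the ~20 MB working set per WU is the estimate from above, not a precise measurement):

# Sketch only: the 20 MB per-WU working set is an estimate.
L3_CACHE_MB = {"Ryzen 7 5700G": 16, "Ryzen 7 9800X3D": 96}
WU_WORKING_SET_MB = 20

for cpu, l3 in L3_CACHE_MB.items():
    wus_in_l3 = l3 / WU_WORKING_SET_MB     # how many WUs fit into L3 at the same time
    mb_per_task_16 = l3 / 16               # L3 per task when running 16 ST tasks
    print(f"{cpu}: ~{wus_in_l3:.1f} WUs fit in L3, {mb_per_task_16:.1f} MB L3 per task at 16 ST tasks")
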
Link
Message 77803 - Posted: 18 Dec 2025, 20:46:22 UTC

Not a real benchmark, but interesting anyway: a WU completed by my Ryzen 5700G using the usual 7 threads/WU and by another 5700G using 1 thread/WU.

de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1763637002_213405

           Threads      run time       CPU time
ST             1      100,692.50 s   100,692.50 s
MT             7       11,088.50 s    63,695.55 s

This is pretty much in line with my result for another long-running WU, which I posted further up, in particular if we assume that the other 5700G is running at stock clock settings and using all cores.
bobsmith18

Message 77806 - Posted: 18 Dec 2025, 22:12:16 UTC - in response to Message 77803.  

It is also interesting to see the degradation in per-core performance as the number of cores used increases. In an ideal world this example would be returning about 77,000 seconds of CPU time for its roughly 11,000 seconds of clock time. I can think of several possible reasons for this, including running out of L3 cache, inter-process synchronisation delays, motherboard to CPU data bottlenecks (RAM read/write becoming swamped by demand), and L3 cache management. None of these are within our control as "mere crunchers", but the developers may have a better idea as to why this is happening.
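
Putting a number on that: a small sketch of the per-core efficiency implied by the 7-thread result quoted above, where the "ideal" CPU time is simply threads × run time:

# Sketch only: numbers taken from the 7-thread result above.
threads = 7
run_time_s = 11_088.50
cpu_time_s = 63_695.55

ideal_cpu_time = threads * run_time_s        # ~77,620 s if every thread were busy 100 %
efficiency = cpu_time_s / ideal_cpu_time     # ~0.82, i.e. roughly 18 % of core time lost
print(f"ideal {ideal_cpu_time:,.0f} s, measured {cpu_time_s:,.0f} s, efficiency {efficiency:.0%}")
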
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Link
Message 77807 - Posted: 19 Dec 2025, 14:07:04 UTC - in response to Message 77806.  

running out of L3 cache (...) motherboard to CPU data bottlenecks (RAM read/write becoming swamped by demand)
These two slow down the ST application, not the MT one. RAM bandwidth usage is a lot higher when running ST.
Brian Nixon

Message 77812 - Posted: 22 Dec 2025, 14:00:39 UTC

Digging out an even older machine: here are some numbers from my dual Xeon X5660 (2 sockets, 12 cores, 24 threads, 24 MB L3 cache):

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      8540         8515           2× 12× CPU           242.8
MT             2      3995         7700           2× 12× CPU           259.5
MT             3      2717         7540           2× 12× CPU           254.4
MT             4      2050         7413           2× 12× CPU           252.9
MT             6      1408         7243           2× 12× CPU           245.4
MT            12       803         7172           2× 12× CPU           215.2
MT            24       559         7866           2× 12× CPU           154.7

There’s more to it than cache, it seems…

With this one I found it particularly surprising that the 3-thread tasks did better than the 4-thread ones, given that half the cores end up running threads from two processes, which I expected would badly degrade the L1 and L2 cache efficiency.
Link
Message 77813 - Posted: 22 Dec 2025, 17:51:02 UTC - in response to Message 77812.  

There’s more to it than cache, it seems…
Triple-channel RAM for each CPU surely helps a lot. There's a reason why Intel and AMD don't release triple- or quad-channel memory controllers for consumer products. Considering when dual-channel became standard, quad-channel should have been standard for a few years now, but that would kill their server CPU sales (or they'd have to sell those a lot cheaper).
Brian Nixon

Message 77814 - Posted: 22 Dec 2025, 20:35:03 UTC - in response to Message 77813.  

Triple-channel RAM for each CPU surely helps a lot.
But the E5-2650 v2 has quad-channel RAM, and more L3 cache per core than the X5660 – yet peaked at 4 threads/WU rather than 2. That’s the part I don’t understand…