Milkyway Nbody ST vs. MT: real benchmarking
| Author | Message |
|---|---|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
As some of you might have read, I did some comparisons between MT and ST tasks in the past. However, those were based only on credits, and I only compared 2-thread tasks against single-thread tasks, so it wasn't a very good comparison. After reading a lot about multithreading and cache sizes over at PrimeGrid, I decided to do a real benchmark, especially since I had noticed that, compared to my wingmen, my computer sometimes needed a ridiculous amount of CPU time for the same amount of work. So now I tried 7 threads per WU (I use 14 of the 16 threads of my CPU for CPU crunching; the other two feed the iGPU). First I ran the WU in ST mode during normal production. Then I changed my settings to 7 threads per WU, set up a second instance of BOINC, and ran the same WU there in 7-thread mode, together with another 7-thread WU and one Einstein GPU WU in the first BOINC instance. The WU I used is de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512.
| | run time | CPU time |
|---|---|---|
| ST | 11214 s | 11205 s |
| MT (1) | 387 s | 2038 s |
| MT (2) | 391 s | 2046 s |
So there's a huge difference in the CPU time required, which means we see the same kind of cache effects here as on PrimeGrid; we just don't know how much cache is needed per WU. The runtime of Einstein on the iGPU also went down from about 18.5k seconds to around 15.9k. In case someone wants to run some benchmarks on their own computer, this is the WU (I added a number at the end of the name so I could have several copies for testing, and I extended the deadline):
<workunit>
<name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</name>
<app_name>milkyway_nbody_orbit_fitting</app_name>
<version_num>193</version_num>
<rsc_fpops_est>40006000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>400060000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>500000000.000000</rsc_memory_bound>
<rsc_disk_bound>104858000.000000</rsc_disk_bound>
<command_line>
-f nbody_parameters.lua -h histogram.txt --seed 156791432 -np 12 -p 3.45074 1 0.339685 0.446913 27.1153 0.706509 41.902 21.1847 -2.65629 -6.82502 -46.1851 95741.7
</command_line>
<file_ref>
<file_name>EMD_v193_OCS_orbit_fitting_lmc_pm_old_eps2.lua</file_name>
<open_name>nbody_parameters.lua</open_name>
</file_ref>
<file_ref>
<file_name>OCS_data_2025_pm.hist</file_name>
<open_name>histogram.txt</open_name>
</file_ref>
</workunit>
<result>
<name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01_0</name>
<final_cpu_time>0.000000</final_cpu_time>
<final_elapsed_time>0.000000</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>windows_x86_64</platform>
<version_num>193</version_num>
<plan_class>mt</plan_class>
<suspended_via_gui/>
<wu_name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</wu_name>
<report_deadline>1792749848.000000</report_deadline>
<received_time>1760283432.046444</received_time>
</result>
I'll do some more tests when I have time. I have saved some other WUs in the past, but they were for older application versions, and I have no idea whether they will work with the current one. But I think it's pretty obvious that one thread per task isn't optimal, as I had thought after my simple comparison in the past.
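To put rough numbers on the cache effect, here is a small Python sketch (just my own back-of-the-envelope calculation, using the run and CPU times from the table above; swap in your own measurements):

```python
# Back-of-the-envelope comparison of the ST run and the first 7-thread MT run above.
# Times are in seconds, taken from the table; adjust for your own runs.

st_run, st_cpu = 11214.0, 11205.0   # single-thread run (14 ST WUs were running side by side)
mt_run, mt_cpu = 387.0, 2038.0      # same WU with 7 threads in the second BOINC instance
threads = 7

speedup = st_run / mt_run                  # wall-clock speedup of one WU
cpu_ratio = st_cpu / mt_cpu                # how much more total CPU work the ST run burned
mt_usage = mt_cpu / (mt_run * threads)     # how busy the 7 threads actually were

print(f"wall-clock speedup : {speedup:.1f}x")   # ~29x, far more than 7 threads alone could give
print(f"CPU time ST vs. MT : {cpu_ratio:.1f}x") # ~5.5x less CPU work in MT mode -> cache effect
print(f"MT thread usage    : {mt_usage:.0%}")   # ~75%, so there is some synchronisation overhead
```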
|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
Now I tried all 14 threads for one WU; that's worse than 2x 7-thread WUs.
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
| | Threads | run time | CPU time | WU/day on 14 threads |
|---|---|---|---|---|
| ST | 1 | 11214.xx s | 11205.xx s | 107.8 |
| MT | 7 | 386.65 s | 2038.38 s | 446.5 |
| MT | 7 | 390.82 s | 2046.41 s | 442.1 |
| MT | 14 | 213.93 s | 1990.48 s | 403.9 |
So crunching this particular WU on half of the available threads instead of just one increases production by a factor of 4.12. I didn't expect that after my previous tests, and in particular not such a huge difference between ST and MT.
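For reference, the WU/day column assumes the machine crunches nothing but identical copies of this WU, so throughput works out as (concurrent WUs) x 86400 / run time. A small Python sketch of that calculation, which reproduces the numbers above:

```python
# WU/day on my 14-thread CPU budget, assuming the box runs nothing but
# identical copies of this WU. Reproduces the numbers in the tables above.
SECONDS_PER_DAY = 86400
CPU_THREADS = 14   # threads reserved for CPU crunching on my 5700G

def wu_per_day(threads_per_wu, run_time_s, cpu_threads=CPU_THREADS):
    concurrent = cpu_threads // threads_per_wu      # WUs running side by side
    return concurrent * SECONDS_PER_DAY / run_time_s

runs = [   # (threads/WU, run time in s) from the table above
    (1, 11214.0),
    (7, 387.0),
    (7, 390.82),
    (14, 213.93),
]

for threads, run_time in runs:
    print(f"{threads:>2} threads/WU: {wu_per_day(threads, run_time):6.1f} WU/day")

# 7-thread throughput (average of both runs) relative to single-thread:
factor = (wu_per_day(7, 387.0) + wu_per_day(7, 390.82)) / 2 / wu_per_day(1, 11214.0)
print(f"7 threads/WU vs. 1 thread/WU: {factor:.2f}x")   # ~4.12
```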
|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
Final test was with 3x 5-thread WUs, so running on 15 threads.
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
| | Threads | run time | CPU time | WU/day on 14 threads |
|---|---|---|---|---|
| ST | 1 | 11214.xx s | 11205.xx s | 107.8 |
| MT | 5 | 613.07 s | 2350.41 s | 422.8 (15 threads) |
| MT | 7 | 386.65 s | 2038.38 s | 446.5 |
| MT | 7 | 390.82 s | 2046.41 s | 442.1 |
| MT | 14 | 213.93 s | 1990.48 s | 403.9 |
So 3x5 is slightly slower than 2x7 and will likely also have a negative impact on the iGPU production, so it's definitely not worth it, at least not on my Ryzen 5700G. I also tested another, long-running WU, although with a slightly older application version; that shouldn't make much difference, I guess.
WU: de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362
Application version: 1.90
| | Threads | run time | CPU time | WU/day on 14 threads |
|---|---|---|---|---|
| ST | 1 | 86598.57 s | 86355.62 s | 13.97 |
| MT | 7 | 7560.51 s | 45338.19 s | 22.86 |
So again a huge increase in production, although not as big as for the other WU, which is again surprising: I'd expect the single-core phase at the beginning of each WU to have a bigger impact on short WUs, but apparently there are other, more significant factors. This is the WU in case someone wants to test with it:
<workunit>
<name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</name>
<app_name>milkyway_nbody</app_name>
<version_num>190</version_num>
<rsc_fpops_est>20629100000000.000000</rsc_fpops_est>
<rsc_fpops_bound>206291000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>500000000.000000</rsc_memory_bound>
<rsc_disk_bound>52428800.000000</rsc_disk_bound>
<command_line>
-f nbody_parameters.lua -h histogram.txt --seed 411592566 -np 11 -p 2.49417 1 0.0338769 0.440317 12.5767 0.965698 46.2133 23.2603 -191.888 75.0999 135.272
</command_line>
<file_ref>
<file_name>EMD_v190_OCS_orbit_fitting.lua</file_name>
<open_name>nbody_parameters.lua</open_name>
</file_ref>
<file_ref>
<file_name>OCS_data_2023_version2.hist</file_name>
<open_name>histogram.txt</open_name>
</file_ref>
</workunit>
<result>
<name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362_1</name>
<final_cpu_time>0.000000</final_cpu_time>
<final_elapsed_time>0.000000</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>windows_x86_64</platform>
<version_num>190</version_num>
<plan_class>mt</plan_class>
<suspended_via_gui/>
<wu_name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</wu_name>
<report_deadline>1792749848.000000</report_deadline>
<received_time>1750690560.237772</received_time>
</result>
|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
More benchmarking results, now also without an Einstein WU running on the iGPU of my Ryzen 5700G.
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93

| | threads/WU | run time | CPU time | threads in use | WU/day | relative speed |
|---|---|---|---|---|---|---|
| ST | 1 | 11214.xx s | 11205.xx s | 14x CPU + 1x iGPU | 107.8 | 22.15% |
| MT | 2 | 1633.08 s | 3089.80 s | 16x CPU + 0x iGPU | 423.2 | 86.98% |
| MT | 4 | 742.80 s | 2534.84 s | 16x CPU + 0x iGPU | 465.3 | 95.62% |
| MT | 5 | 613.07 s | 2350.41 s | 15x CPU + 1x iGPU | 422.8 | 86.89% |
| MT | 7 | 386.65 s | 2038.38 s | 14x CPU + 1x iGPU | 446.5 | 91.85% |
| MT | 7 | 390.82 s | 2046.41 s | 14x CPU + 1x iGPU | 442.1 | 90.87% |
| MT | 7 | 368.25 s | 2024.03 s | 14x CPU + 0x iGPU | 469.2 | 96.43% |
| MT | 8 | 355.12 s | 2143.25 s | 16x CPU + 0x iGPU | 486.6 | --> 100.00% <-- |
| MT | 14 | 213.93 s | 1990.48 s | 14x CPU + 1x iGPU | 403.9 | 82.00% |
| MT | 16 | 204.70 s | 2130.02 s | 16x CPU + 0x iGPU | 422.1 | 86.74% |

And here a bit longer running WU:
WU: de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212
Application version: 1.93

| | threads/WU | run time | CPU time | threads in use | WU/day | relative speed |
|---|---|---|---|---|---|---|
| MT | 2 | 2736.81 s | 5205.73 s | 16x CPU + 0x iGPU | 252.6 | 84.79% |
| MT | 4 | 1208.70 s | 4198.17 s | 16x CPU + 0x iGPU | 285.9 | 96.00% |
| MT | 7 | 612.74 s | 3286.55 s | 14x CPU + 1x iGPU | 282.0 | 94.68% |
| MT | 7 | 601.14 s | 3388.61 s | 14x CPU + 0x iGPU | 287.5 | 96.51% |
| MT | 8 | 580.16 s | 3608.00 s | 16x CPU + 0x iGPU | 297.8 | --> 100.00% <-- |
| MT | 16 | 328.15 s | 3518.88 s | 16x CPU + 0x iGPU | 263.3 | 88.40% |

This is the WU in case someone wants to test with it:
<workunit>
<name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212</name>
<app_name>milkyway_nbody_orbit_fitting</app_name>
<version_num>193</version_num>
<rsc_fpops_est>30022500000000.000000</rsc_fpops_est>
<rsc_fpops_bound>3002250000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>500000000.000000</rsc_memory_bound>
<rsc_disk_bound>104858000.000000</rsc_disk_bound>
<command_line>
-f nbody_parameters.lua -h histogram.txt --seed 233088231 -np 12 -p 5.28862 1 0.170294 0.298857 1.26323 0.0213518 45.8 21.5 -185.5 54.7 147.4 449866
</command_line>
<file_ref>
<file_name>EMD_v193_OCS_north_no_orbit_old_eps2.lua</file_name>
<open_name>nbody_parameters.lua</open_name>
</file_ref>
<file_ref>
<file_name>OCS_data_2025_north.hist</file_name>
<open_name>histogram.txt</open_name>
</file_ref>
</workunit>
<result>
<name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212__1</name>
<final_cpu_time>0.000000</final_cpu_time>
<final_elapsed_time>0.000000</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>windows_x86_64</platform>
<version_num>193</version_num>
<plan_class>mt</plan_class>
<suspended_via_gui/>
<wu_name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212_</wu_name>
<report_deadline>1794443214.000000</report_deadline>
<received_time>1763406414.476450</received_time>
</result>
|
|
Send message Joined: 18 Nov 23 Posts: 10 Credit: 403,852 RAC: 2,903 |
Thanks for this. When I started MW@h I too observed that multi-threaded tasks use less than 100% of the CPU because of the synchronisation needed between threads, assumed that running multiple single-threaded tasks would therefore be more efficient overall, and set my preferences accordingly. And later I similarly noticed that wingmen almost invariably complete the same workunits in substantially less time (beyond differences explained by faster CPUs)…
Here are some numbers from my i5-3320M:
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
| | threads/WU | run time (s) | CPU time (s) | threads in use | WU/day |
|---|---|---|---|---|---|
| ST | 1 | 6281 | 6243 | 4× CPU | 55.0 |
| MT | 2 | 2871 | 5502 | 4× CPU | 60.2 |
| MT | 4 | 1440 | 5147 | 4× CPU | 60.0 |
For reference, these are the results I got:
<search_likelihood>-27342.660233678041550</search_likelihood>
<search_likelihood_EMD>-14.106647272974579</search_likelihood_EMD>
<search_likelihood_Mass>-3771.532869044177005</search_likelihood_Mass>
<search_likelihood_Beta>-822.641452301281788</search_likelihood_Beta>
<search_likelihood_BetaAvg>-274.418758463499103</search_likelihood_BetaAvg>
<search_likelihood_VelAvg>-383.897846477010262</search_likelihood_VelAvg>
<search_likelihood_Dist>-1141.268630669829690</search_likelihood_Dist>
<search_likelihood_PM_dec>-19182.767423587025405</search_likelihood_PM_dec>
<search_likelihood_PM_ra>-1752.026605862245560</search_likelihood_PM_ra>
I will add numbers from some other machines in due course. |
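If it helps anyone compare their own runs against these, here is a rough Python sketch that pulls the <search_likelihood…> values out of a saved task output file and compares them with reference values. The file name, the subset of reference values and the idea of printing a relative difference are all just examples, not anything official:

```python
import re
import sys

# Rough checker: extract the <search_likelihood...> values from a saved task
# output file (e.g. the stderr/result text kept from a local run) and compare
# them with reference values from another run.
REFERENCE = {
    "search_likelihood": -27342.660233678041550,
    "search_likelihood_EMD": -14.106647272974579,
    "search_likelihood_Mass": -3771.532869044177005,
    # ...add the remaining values from the list above as needed
}

TAG_RE = re.compile(r"<(search_likelihood\w*)>\s*([-0-9.eE+]+)\s*</\1>")

def read_likelihoods(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return {name: float(value) for name, value in TAG_RE.findall(f.read())}

if __name__ == "__main__":
    mine = read_likelihoods(sys.argv[1])   # e.g. python check_likelihoods.py stderr.txt
    for name, ref in REFERENCE.items():
        got = mine.get(name)
        if got is None:
            print(f"{name}: not found in output")
        else:
            rel = abs(got - ref) / abs(ref)
            print(f"{name}: {got:.6f}  (reference {ref:.6f}, rel. diff {rel:.2e})")
```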
|
Send message Joined: 18 Nov 23 Posts: 10 Credit: 403,852 RAC: 2,903 |
Here are some numbers from my i7-7700HQ:
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
| | threads/WU | run time (s) | CPU time (s) | threads in use | WU/day |
|---|---|---|---|---|---|
| ST | 1 | 6193 | 6117 | 8× CPU | 111.6 |
| MT | 2 | 2500 | 4825 | 8× CPU | 138.2 |
| MT | 4 | 1171 | 4184 | 8× CPU | 147.6 |
| MT | 8 | 528 | 3407 | 8× CPU | 163.7 |
|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
Interesting. It seems like on older CPUs with fewer cores (and less cache), running one WU at a time on all threads might be best.
|
|
Send message Joined: 1 Nov 10 Posts: 40 Credit: 2,644,912 RAC: 4,080 |
That statement is somewhat at odds with the figures in the final column of his table. Perhaps the OP could explain how he ran his tests.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe? |
|
Send message Joined: 18 Nov 23 Posts: 10 Credit: 403,852 RAC: 2,903 |
We can’t really draw conclusions from a sample size of 1. But it’s an interesting exercise. Here are some numbers from my dual Xeon E5-2650 v2:
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
| | threads/WU | run time (s) | CPU time (s) | threads in use | WU/day |
|---|---|---|---|---|---|
| ST | 1 | 6613 | 6605 | 2× 16× CPU | 418.1 |
| MT | 2 | 2906 | 5599 | 2× 16× CPU | 475.7 |
| MT | 4 | 1437 | 5202 | 2× 16× CPU | 480.9 |
| MT | 8 | 749 | 4932 | 2× 16× CPU | 461.2 |
| MT | 16 | 451 | 4914 | 2× 16× CPU | 382.8 |
| MT | 32 | 320 | 4977 | 2× 16× CPU | 270.2 |
All these results seem to support the hypothesis that there’s a cache effect, but only up to a point. |
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
> We can’t really draw conclusions from a sample size of 1. But it’s an interesting exercise.

You can always choose any other WUs from your current cache; there's no need to limit yourself to this one. ;-)

> All these results seem to support the hypothesis that there’s a cache effect, but only up to a point.

I guess the 32-thread example also shows the expected performance drop when the WU is split over two CPUs. Did you do something to make sure that the WUs with fewer threads run all of their threads on the same CPU?
|
|
Send message Joined: 18 Nov 23 Posts: 10 Credit: 403,852 RAC: 2,903 |
> You can always choose any other WUs

Of course – but then we wouldn’t be comparing like with like for the purposes of this little experiment. I was more acknowledging that by reporting only one result per configuration I am ignoring the variance that a larger sample would reveal.

> the 32-thread example also shows the expected performance drop when the WU is split over two CPUs

Indeed. I was expecting that one to be woeful (additional overhead from inter-CPU synchronisation combined with the reduced efficiency of adding more threads), and wouldn’t run “production” WUs like that, but I thought I’d try it anyway for completeness.

> Did you do something to make sure that the WUs with fewer threads run all of their threads on the same CPU?

Yes: I wrote a program that pins each task’s threads to selected cores to prevent things moving about during the test. |
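Not the exact program I used, but the idea is roughly this; a minimal Python sketch using psutil, where the process-name match and the CPU lists are only examples and need adjusting to your own topology and executable names:

```python
import psutil  # pip install psutil; works on Windows and Linux

# Minimal sketch of the idea: pin each running Milkyway nbody task to its own
# fixed set of logical CPUs so the scheduler can't migrate it between sockets
# mid-benchmark. The CPU lists below are just an example for a machine where
# logical CPUs 0-15 sit on the same socket.
CPU_SETS = [
    list(range(0, 8)),    # first task  -> first 8 logical CPUs
    list(range(8, 16)),   # second task -> next 8 logical CPUs
]

def milkyway_tasks():
    """Yield running processes whose name looks like a Milkyway nbody task."""
    for proc in psutil.process_iter(["name"]):
        name = (proc.info["name"] or "").lower()
        if "milkyway" in name and "nbody" in name:
            yield proc

if __name__ == "__main__":
    tasks = sorted(milkyway_tasks(), key=lambda p: p.pid)
    for proc, cpus in zip(tasks, CPU_SETS):
        # On Windows the affinity mask applies to every thread of the process;
        # on other systems already-running threads may need extra care.
        proc.cpu_affinity(cpus)
        print(f"pinned PID {proc.pid} ({proc.name()}) to CPUs {cpus}")
```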
|
Send message Joined: 12 Mar 09 Posts: 1 Credit: 31,214,555 RAC: 15,453 |
I don't know how to run workunits manually for benchmarking, but I have looked at the results from time to time and noticed that the others running MT WUs usually had a lot more CPU time. Sure, most of those CPUs are older, but it's still more than I would expect. I'm running a 9800X3D limited to 95W PPT, usually with full CPU usage for BOINC. For example:

A 9950X with only slightly more CPU time; unless that one is throttled as well, I'd expect it to be faster:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011801174

A 9600X, the only one I found with less CPU time:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011752654

Most of the time I see things like:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011828447
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011718606

All in all I got the impression that MT WUs are slightly to considerably slower than running multiple single-threaded ones. I assumed issues due to synchronization or locking and switched back to non-MT. |
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
> I don't know how to run workunits manually for benchmarking

I used this guide to set up a second BOINC instance, copied everything that's needed to run Milkyway from the first to the second instance, edited client_state.xml, and disabled network activity for the second instance before starting it. If interested, I can write more detailed instructions.

> I'm running a 9800X3D limited to 95W PPT, usually with full CPU usage for BOINC.

It's no surprise that a 9800X3D with its 96MB of L3 cache for 8 cores / 16 threads behaves differently than, for example, my 5700G with the same number of threads but only 16MB of L3 cache (and slower RAM). One Milkyway WU needs about 20MB of RAM, so 4-5 of them fit into your L3 cache completely. From the results posted so far, I guess Milkyway Nbody runs best if it gets around 5MB of cache per task, so yes, it's possible that on your CPU single-core is most efficient.
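Until I write up proper instructions, here is the rough shape of it as a Python sketch. The paths and the RPC port are only examples; the relevant BOINC client options are --dir, --allow_multiple_clients and --gui_rpc_port:

```python
import subprocess
from pathlib import Path

# Rough sketch of starting a second BOINC client for offline benchmarking.
# Paths and port are examples for a Windows install; the second data directory
# is assumed to already contain the copied project files and the edited
# client_state.xml, with networking disabled before any WU is started.
BOINC_EXE = Path(r"C:\Program Files\BOINC\boinc.exe")   # example install path
SECOND_DATA_DIR = Path(r"C:\BOINC2")                    # example second data directory

cmd = [
    str(BOINC_EXE),
    "--dir", str(SECOND_DATA_DIR),   # run against the separate data directory
    "--allow_multiple_clients",      # let this client run alongside the normal one
    "--gui_rpc_port", "31418",       # non-default RPC port so a manager can attach to it
]

proc = subprocess.Popen(cmd)   # leaves the second client running in the background
print(f"second BOINC instance started (PID {proc.pid}) from {SECOND_DATA_DIR}")
# To keep it offline you can also point boinccmd at port 31418
# (e.g. --set_network_mode never), or use the manager's network settings.
```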
|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
Not a real benchmark, but interesting anyway: a WU completed by my Ryzen 5700G using the usual 7 threads/WU, and by another 5700G using 1 thread/WU.
WU: de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1763637002_213405
| | Threads | run time | CPU time |
|---|---|---|---|
| ST | 1 | 100,692.50 s | 100,692.50 s |
| MT | 7 | 11,088.50 s | 63,695.55 s |
This is pretty much in line with my result for another long-running WU, which I posted further up, in particular if we assume that the other 5700G is running at stock clock settings and using all cores.
|
|
Send message Joined: 1 Nov 10 Posts: 40 Credit: 2,644,912 RAC: 4,080 |
It is also interesting to see the degradation in per-core performance as the number of cores used increases. In an ideal world this example would be returning about 77,000 seconds of CPU time for its roughly 11,000 seconds of clock time. I can think of several reasons for this, including: running out of L3 cache, inter-process synchronisation delays, motherboard-to-CPU data bottlenecks (RAM reads/writes becoming swamped by demand), and L3 cache management. None of these are within our control as "mere crunchers", but the developers may have a better idea as to why this is happening.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe? |
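To put a number on that, one can treat CPU time divided by (run time x threads) as a rough parallel-efficiency figure; a quick sketch using the 7-thread values quoted above (my own calculation, nothing official):

```python
# Rough "parallel efficiency" of the 7-thread run quoted above: if all 7 threads
# were busy for the whole run, CPU time would equal run time x threads.
run_time = 11_088.50    # s, wall clock for the 7-thread task
cpu_time = 63_695.55    # s, summed over all threads
threads = 7

ideal_cpu_time = run_time * threads        # ~77,600 s, the "ideal world" figure
efficiency = cpu_time / ideal_cpu_time     # fraction of that actually achieved

print(f"ideal CPU time : {ideal_cpu_time:,.0f} s")
print(f"actual CPU time: {cpu_time:,.0f} s")
print(f"efficiency     : {efficiency:.0%}")   # ~82%; the rest is lost to the effects listed above
```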
|
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
> running out of L3 cache (...) motherboard-to-CPU data bottlenecks (RAM reads/writes becoming swamped by demand)

These two slow down the ST application, not MT. RAM bandwidth usage is a lot higher when running ST.
|
|
Send message Joined: 18 Nov 23 Posts: 10 Credit: 403,852 RAC: 2,903 |
Digging out an even older machine: here are some numbers from my dual Xeon X5660 (2 sockets, 12 cores, 24 threads, 24 MB L3 cache):
WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
| | threads/WU | run time (s) | CPU time (s) | threads in use | WU/day |
|---|---|---|---|---|---|
| ST | 1 | 8540 | 8515 | 2× 12× CPU | 242.8 |
| MT | 2 | 3995 | 7700 | 2× 12× CPU | 259.5 |
| MT | 3 | 2717 | 7540 | 2× 12× CPU | 254.4 |
| MT | 4 | 2050 | 7413 | 2× 12× CPU | 252.9 |
| MT | 6 | 1408 | 7243 | 2× 12× CPU | 245.4 |
| MT | 12 | 803 | 7172 | 2× 12× CPU | 215.2 |
| MT | 24 | 559 | 7866 | 2× 12× CPU | 154.7 |
There’s more to it than cache, it seems… With this one I found it particularly surprising that the 3-thread tasks did better than the 4-thread ones, given that half the cores end up running threads from two processes, which I expected would badly degrade the L1 and L2 cache efficiency. |
|
Send message Joined: 19 Jul 10 Posts: 812 Credit: 20,951,030 RAC: 6,123 |
> There’s more to it than cache, it seems…

Triple-channel RAM for each CPU surely helps a lot. There's a reason why Intel and AMD do not release triple- or quad-channel memory controllers for consumer products. Considering when dual-channel became standard, quad-channel should have been standard for a few years now, but that would kill their server CPU sales (or they'd have to sell those a lot cheaper).
|
|
Send message Joined: 18 Nov 23 Posts: 10 Credit: 403,852 RAC: 2,903 |
> Triple-channel RAM for each CPU surely helps a lot.

But the E5-2650 v2 has quad-channel RAM, and more L3 cache per core than the X5660 – yet peaked at 4 threads/WU rather than 2. That’s the part I don’t understand… |