Milkyway Nbody ST vs. MT: real benchmarking

Link
Message 77668 - Posted: 17 Oct 2025, 16:33:03 UTC

As some of you might have read, I did some comparisons between MT and ST tasks in the past. However, those were based only on credits, and I only compared 2-thread tasks against single-thread tasks, so it wasn't a very good comparison. After reading a lot about multithreading and cache sizes on PrimeGrid, I decided to do a real benchmark, in particular since I had noticed that, compared to my wingmen, my computer sometimes needed a ridiculous amount of CPU time for the same amount of work.

So now I tried 7 threads per WU (I use 14 of the 16 threads of my CPU for CPU crunching; the other two feed the iGPU). First I ran the WU in ST mode during normal production, then I changed my settings to 7 threads per WU, set up a second instance of BOINC and ran the same WU there in 7-thread mode, alongside another 7-thread WU and one Einstein GPU WU in the first instance of BOINC.

The WU I have used is de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512.
           run time     CPU time
ST          11214 s      11205 s
MT (1)        387 s       2038 s
MT (2)        391 s       2046 s

So there's a huge difference in the CPU time required, which suggests we see the same cache effects here as on PrimeGrid; we just don't know how much cache is needed per WU. The runtime of the Einstein WU on the iGPU also went down from about 18.5k seconds to around 15.9k.


In case someone wants to run some benchmarks on their own computer, this is the WU (I added a number at the end to have several copies for testing, and extended the deadline):
<workunit>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>193</version_num>
    <rsc_fpops_est>40006000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>400060000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>104858000.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 156791432 -np 12 -p 3.45074 1 0.339685 0.446913 27.1153 0.706509 41.902 21.1847 -2.65629 -6.82502 -46.1851 95741.7
    </command_line>
    <file_ref>
        <file_name>EMD_v193_OCS_orbit_fitting_lmc_pm_old_eps2.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2025_pm.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01_0</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>193</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
    <received_time>1760283432.046444</received_time>
</result>

I'll do some more tests when I have time. I have some other WUs saved from the past, but they were for older application versions, so I have no idea whether they will still work with the current one. In any case, I think it's pretty obvious that one thread per task isn't optimal, as I had thought after my simple comparison in the past.
Link
Message 77674 - Posted: 19 Oct 2025, 16:49:40 UTC - in response to Message 77668.  
Last modified: 19 Oct 2025, 17:16:18 UTC

Now I tried all 14 threads for one WU; that's worse than 2× 7-thread WUs.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     11214.xx s   11205.xx s     107.8
MT             7       386.65 s    2038.38 s     446.5
MT             7       390.82 s    2046.41 s     442.1
MT            14       213.93 s    1990.48 s     403.9

So crunching this particular WU on half of the available threads instead of just one increases production by a factor of 4.12. I didn't expect that after my previous tests, and in particular not such a huge difference between ST and MT.
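
For reference, here is a minimal sketch of the arithmetic the WU/day column appears to be based on (parallel task slots × 86400 s ÷ run time); the 7-thread rows come out a few tenths above the table values, so the rounding of the measured run times probably differs slightly:

# Sketch only: assumes floor(14 / threads per WU) tasks run side by side,
# each finishing one WU per "run time" seconds.
def wu_per_day(run_time_s: float, threads_per_wu: int, crunching_threads: int = 14) -> float:
    parallel_tasks = crunching_threads // threads_per_wu
    return parallel_tasks * 86400.0 / run_time_s

print(round(wu_per_day(11214.0, 1), 1))    # ST,  1 thread   -> ~107.9
print(round(wu_per_day(386.65, 7), 1))     # MT,  7 threads  -> ~446.9
print(round(wu_per_day(213.93, 14), 1))    # MT, 14 threads  -> ~403.9
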
Link
Message 77675 - Posted: 20 Oct 2025, 11:14:13 UTC - in response to Message 77674.  

The final test was with 3× 5-thread WUs, i.e. running on 15 threads.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     11214.xx s   11205.xx s     107.8
MT             5       613.07 s    2350.41 s     422.8 (15 threads)
MT             7       386.65 s    2038.38 s     446.5
MT             7       390.82 s    2046.41 s     442.1
MT            14       213.93 s    1990.48 s     403.9

So 3×5 is slightly slower than 2×7 and will likely also have a negative impact on iGPU production, so it's definitely not worth it, at least not on my Ryzen 5700G.


I also tested another, longer-running WU, though with a slightly older application version. That shouldn't make much of a difference, I guess.

WU: de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362
Application version: 1.90
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     86598.57 s   86355.62 s     13.97
MT             7      7560.51 s   45338.19 s     22.86

So again a huge increase in production, though not as big as for the other WU, which is again surprising: I'd expect the single-core phase at the beginning of each WU to have a bigger impact on short WUs, but apparently there are other, more significant factors than that.
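
To make that expectation explicit, here is a generic Amdahl's-law sketch with made-up numbers (a fixed 60-second single-threaded start-up phase in a hypothetical short and long WU). The measured results above go the other way round (the short WU gained more), so this start-up phase is clearly not the dominant factor here:

# Sketch only: the serial phase and the ST totals are made-up, not measured.
def mt_run_time(serial_s: float, parallel_s: float, threads: int) -> float:
    # Only the parallel part scales with the thread count.
    return serial_s + parallel_s / threads

for st_total in (400.0, 12000.0):      # hypothetical short and long WU (ST seconds)
    serial = 60.0                      # assumed fixed single-threaded start-up phase
    mt = mt_run_time(serial, st_total - serial, 7)
    print(f"ST {st_total:7.0f} s -> MT(7) {mt:7.1f} s, speed-up {st_total / mt:.2f}x")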

This is the WU in case someone wants to test with it:
<workunit>
    <name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</name>
    <app_name>milkyway_nbody</app_name>
    <version_num>190</version_num>
    <rsc_fpops_est>20629100000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>206291000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>52428800.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 411592566 -np 11 -p 2.49417 1 0.0338769 0.440317 12.5767 0.965698 46.2133 23.2603 -191.888 75.0999 135.272
    </command_line>
    <file_ref>
        <file_name>EMD_v190_OCS_orbit_fitting.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2023_version2.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362_1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>190</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
    <received_time>1750690560.237772</received_time>
</result>

Link
Message 77728 - Posted: 20 Nov 2025, 12:51:11 UTC

More benchmarking results, now also without an Einstein WU running on the iGPU of my Ryzen 5700G.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day    relative speed
ST             1     11214.xx s   11205.xx s     14x CPU + 1x iGPU     107.8         22.15%
MT             2      1633.08 s    3089.80 s     16x CPU + 0x iGPU     423.2         86.98%
MT             4       742.80 s    2534.84 s     16x CPU + 0x iGPU     465.3         95.62%
MT             5       613.07 s    2350.41 s     15x CPU + 1x iGPU     422.8         86.89%
MT             7       386.65 s    2038.38 s     14x CPU + 1x iGPU     446.5         91.85%
MT             7       390.82 s    2046.41 s     14x CPU + 1x iGPU     442.1         90.87%
MT             7       368.25 s    2024.03 s     14x CPU + 0x iGPU     469.2         96.43%
MT             8       355.12 s    2143.25 s     16x CPU + 0x iGPU     486.6    --> 100.00% <--
MT            14       213.93 s    1990.48 s     14x CPU + 1x iGPU     403.9         82.00%
MT            16       204.70 s    2130.02 s     16x CPU + 0x iGPU     422.1         86.74%



And here is a somewhat longer-running WU: de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day    relative speed
MT             2      2736.81 s    5205.73 s     16x CPU + 0x iGPU     252.6         84.79%
MT             4      1208.70 s    4198.17 s     16x CPU + 0x iGPU     285.9         96.00%
MT             7       612.74 s    3286.55 s     14x CPU + 1x iGPU     282.0         94.68%
MT             7       601.14 s    3388.61 s     14x CPU + 0x iGPU     287.5         96.51%
MT             8       580.16 s    3608.00 s     16x CPU + 0x iGPU     297.8    --> 100.00% <--
MT            16       328.15 s    3518.88 s     16x CPU + 0x iGPU     263.3         88.40%



This is the WU in case someone wants to test with it:
<workunit>
    <name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>193</version_num>
    <rsc_fpops_est>30022500000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>3002250000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>104858000.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 233088231 -np 12 -p 5.28862 1 0.170294 0.298857 1.26323 0.0213518 45.8 21.5 -185.5 54.7 147.4 449866
    </command_line>
    <file_ref>
        <file_name>EMD_v193_OCS_north_no_orbit_old_eps2.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2025_north.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>

<result>
    <name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212__1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>193</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212_</wu_name>
    <report_deadline>1794443214.000000</report_deadline>
    <received_time>1763406414.476450</received_time>
</result>

Brian Nixon

Message 77742 - Posted: 22 Nov 2025, 18:40:33 UTC

Thanks for this. When I started MW@h I too observed that multi-threaded tasks use less than 100% of the CPU because of the synchronisation needed between threads, assumed that running multiple single-threaded tasks would therefore be more efficient overall, and set my preferences accordingly. And later I similarly noticed that wingmen almost invariably complete the same workunits in substantially less time (beyond differences explained by faster CPUs)…

Here are some numbers from my i5-3320M:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6281         6243           4× CPU                55.0
MT             2      2871         5502           4× CPU                60.2
MT             4      1440         5147           4× CPU                60.0


For reference, these are the results I got:
<search_likelihood>-27342.660233678041550</search_likelihood>
<search_likelihood_EMD>-14.106647272974579</search_likelihood_EMD>
<search_likelihood_Mass>-3771.532869044177005</search_likelihood_Mass>
<search_likelihood_Beta>-822.641452301281788</search_likelihood_Beta>
<search_likelihood_BetaAvg>-274.418758463499103</search_likelihood_BetaAvg>
<search_likelihood_VelAvg>-383.897846477010262</search_likelihood_VelAvg>
<search_likelihood_Dist>-1141.268630669829690</search_likelihood_Dist>
<search_likelihood_PM_dec>-19182.767423587025405</search_likelihood_PM_dec>
<search_likelihood_PM_ra>-1752.026605862245560</search_likelihood_PM_ra>

I will add numbers from some other machines in due course.
Brian Nixon

Message 77743 - Posted: 23 Nov 2025, 0:56:57 UTC
Last modified: 23 Nov 2025, 0:57:28 UTC

Here are some numbers from my i7-7700HQ:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6193         6117           8× CPU               111.6
MT             2      2500         4825           8× CPU               138.2
MT             4      1171         4184           8× CPU               147.6
MT             8       528         3407           8× CPU               163.7
Link
Message 77744 - Posted: 23 Nov 2025, 8:19:36 UTC - in response to Message 77743.  

Interesting; it seems like on older CPUs with fewer cores (and less cache), running one WU at a time on all threads might be best.
bobsmith18

Message 77745 - Posted: 23 Nov 2025, 19:36:08 UTC - in response to Message 77744.  

That statement is somewhat at odds with the figures in the final column of his table.
Perhaps the OP could explain how he ran his tests.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Brian Nixon

Message 77789 - Posted: 6 Dec 2025, 20:07:53 UTC

We can’t really draw conclusions from a sample size of 1. But it’s an interesting exercise.

Here are some numbers from my dual Xeon E5-2650 v2:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6613         6605           2× 16× CPU           418.1
MT             2      2906         5599           2× 16× CPU           475.7
MT             4      1437         5202           2× 16× CPU           480.9
MT             8       749         4932           2× 16× CPU           461.2
MT            16       451         4914           2× 16× CPU           382.8
MT            32       320         4977           2× 16× CPU           270.2

All these results seem to support the hypothesis that there’s a cache effect, but only up to a point.
Link
Message 77790 - Posted: 7 Dec 2025, 12:22:57 UTC - in response to Message 77789.  

We can’t really draw conclusions from a sample size of 1. But it’s an interesting exercise.
You can always choose any other WUs from your current cache; there's no need to limit yourself to this one. ;-)


All these results seem to support the hypothesis that there’s a cache effect, but only up to a point.
I guess the 32-thread example also shows the expected performance drop when the WU is split over two CPUs. Did you do something to make sure that the WUs with fewer threads ran all their threads on the same CPU?
Brian Nixon

Message 77791 - Posted: 7 Dec 2025, 13:17:51 UTC - in response to Message 77790.  

You can always choose any other WUs
Of course – but then we wouldn’t be comparing like with like for the purposes of this little experiment. I was more acknowledging that by reporting only one result per configuration I am ignoring the variance that a larger sample would reveal.

the 32-thread example also shows the expected performance drop when the WU is split over two CPUs
Indeed. I was expecting that one to be woeful (additional overhead from inter-CPU synchronisation combined with the reduced efficiency of adding more threads), and wouldn’t run “production” WUs like that, but I thought I’d try it anyway for completeness.

Did you do something to make sure that the WUs with fewer threads ran all their threads on the same CPU?
Yes: I wrote a program that pins each task’s threads to selected cores to prevent things moving about during the test.
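
For anyone who wants to try something similar, here is a minimal sketch of the idea using Python's psutil (not the actual program referred to above; the process-name filter "milkyway_nbody" and the core groups are assumptions to adapt to your own setup — cpu_affinity() restricts the whole process, and with it all of the task's threads, to the listed logical CPUs):

# Sketch only: assumes psutil is installed and the N-body processes are identifiable by name.
import psutil

CORE_SETS = [[0, 1, 2, 3], [4, 5, 6, 7]]   # e.g. two 4-thread tasks on separate core groups

tasks = [p for p in psutil.process_iter(["name"])
         if p.info["name"] and "milkyway_nbody" in p.info["name"].lower()]

for proc, cores in zip(tasks, CORE_SETS):
    proc.cpu_affinity(cores)               # pin this task (all its threads) to its own cores
    print(f"pid {proc.pid} pinned to logical CPUs {cores}")
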
Bigeagle

Message 77792 - Posted: 8 Dec 2025, 19:04:36 UTC

I don't know how to run workunits manually for benchmarking, but I had a look at the results from time to time and noticed that others running MT WUs usually needed a lot more CPU time. Sure, most of those CPUs are older, but it's still more than I would expect. I'm running a 9800X3D limited to 95 W PPT, usually with full CPU usage for BOINC.
For example:
a 9950X https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011801174 with only slightly more CPU time. Unless that one is throttled as well, I'd expect it to be faster.
a 9600X, the only one I found with less CPU time: https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011752654
Most of the time I see things like:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011828447
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1011718606

All in all, I got the impression that MT WUs are slightly to considerably slower than running multiple single-threaded ones. I assumed issues due to synchronization or locking and switched back to non-MT.
Link
Message 77793 - Posted: 10 Dec 2025, 12:36:53 UTC - in response to Message 77792.  

I don't know how to run workunits manually for benchmarking
I used this guide to set up a second BOINC instance, copied everything needed to run Milkyway from the first to the second instance, edited client_state.xml and disabled network activity for the second instance before starting it. If anyone is interested, I can write more detailed instructions.
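
Very roughly, the idea looks like this (a sketch only, not the guide itself; the paths, the project directory name and the BOINC client options --allow_multiple_clients, --dir and --gui_rpc_port are assumptions to check against your own installation):

# Sketch only: paths and the project directory name are assumptions.
import shutil
import subprocess
from pathlib import Path

BOINC_EXE = Path(r"C:\Program Files\BOINC\boinc.exe")   # adjust to your installation
MAIN_DATA = Path(r"C:\ProgramData\BOINC")               # normal BOINC data directory
BENCH_DATA = Path(r"C:\BOINC_bench")                    # data directory for the 2nd instance
PROJECT = "milkyway.cs.rpi.edu_milkyway"                # assumed project directory name

# Copy the MilkyWay project directory (application + input files) into the 2nd instance.
(BENCH_DATA / "projects").mkdir(parents=True, exist_ok=True)
shutil.copytree(MAIN_DATA / "projects" / PROJECT,
                BENCH_DATA / "projects" / PROJECT, dirs_exist_ok=True)

# client_state.xml still has to be prepared by hand (workunit/result entries, network off).
subprocess.Popen([str(BOINC_EXE), "--allow_multiple_clients",
                  "--dir", str(BENCH_DATA), "--gui_rpc_port", "31417"])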


I'm running a 9800X3D limited to 95 W PPT, usually with full CPU usage for BOINC.
(...)
All in all, I got the impression that MT WUs are slightly to considerably slower than running multiple single-threaded ones. I assumed issues due to synchronization or locking and switched back to non-MT.
It's not a surprise that a 9800X3D with its 96 MB of L3 cache for 8 cores / 16 threads behaves differently than, for example, my 5700G with the same number of threads but only 16 MB of L3 cache (and slower RAM). One Milkyway WU needs about 20 MB of RAM, so 4-5 of them fit into your L3 cache completely. From the results posted so far, I guess Milkyway Nbody runs best if it gets around 5 MB of cache per task, so yes, it's possible that on your CPU single-thread tasks are the most efficient option.
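
As a small sketch, the rough cache arithmetic behind that comparison (the ~20 MB working set per WU is the estimate from above, not a precise measurement):

# Sketch only: the 20 MB per-WU working set is an estimate.
L3_CACHE_MB = {"Ryzen 7 5700G": 16, "Ryzen 7 9800X3D": 96}
WU_WORKING_SET_MB = 20

for cpu, l3 in L3_CACHE_MB.items():
    wus_in_l3 = l3 / WU_WORKING_SET_MB     # how many WUs fit into L3 at the same time
    mb_per_task_16 = l3 / 16               # L3 per task when running 16 ST tasks
    print(f"{cpu}: ~{wus_in_l3:.1f} WUs fit in L3, {mb_per_task_16:.1f} MB L3 per task at 16 ST tasks")
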
Link
Message 77803 - Posted: 18 Dec 2025, 20:46:22 UTC

Not a real benchmark, but interesting anyway: a WU completed by my Ryzen 5700G using the usual 7 threads/WU and by another 5700G using 1 thread/WU.

de_nbody_orbit_fitting_10_23_2025_v193_OCS_north_MW2014__data__3_1763637002_213405

           Threads      run time       CPU time
ST             1      100,692.50 s   100,692.50 s
MT             7       11,088.50 s    63,695.55 s

This is pretty much in line with my result for another long-running WU, which I posted further up, in particular if we assume that the other 5700G is running at stock clock settings and using all cores.
bobsmith18

Message 77806 - Posted: 18 Dec 2025, 22:12:16 UTC - in response to Message 77803.  

It is also interesting to see the degradation in per-core performance as the number of cores used increases. In an ideal world this example would be returning about 77,000 seconds of CPU time for its roughly 11,000 seconds of clock time. I can think of several possible reasons for this, including running out of L3 cache, inter-process synchronisation delays, motherboard to CPU data bottlenecks (RAM read/write becoming swamped by demand), and L3 cache management. None of these are within our control as "mere crunchers", but the developers may have a better idea as to why this is happening.
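
Putting a number on that: a small sketch of the per-core efficiency implied by the 7-thread result quoted above, where the "ideal" CPU time is simply threads × run time:

# Sketch only: numbers taken from the 7-thread result above.
threads = 7
run_time_s = 11_088.50
cpu_time_s = 63_695.55

ideal_cpu_time = threads * run_time_s        # ~77,620 s if every thread were busy 100 %
efficiency = cpu_time_s / ideal_cpu_time     # ~0.82, i.e. roughly 18 % of core time lost
print(f"ideal {ideal_cpu_time:,.0f} s, measured {cpu_time_s:,.0f} s, efficiency {efficiency:.0%}")
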
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Link
Message 77807 - Posted: 19 Dec 2025, 14:07:04 UTC - in response to Message 77806.  

running out of L3 cache (...) motherboard to CPU data bottlenecks (RAM read/write becoming swamped by demand)
These two slow down the ST application, not the MT one. RAM bandwidth usage is a lot higher when running ST.
Brian Nixon

Message 77812 - Posted: 22 Dec 2025, 14:00:39 UTC

Digging out an even older machine: here are some numbers from my dual Xeon X5660 (2 sockets, 12 cores, 24 threads, 24 MB L3 cache):

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      8540         8515           2× 12× CPU           242.8
MT             2      3995         7700           2× 12× CPU           259.5
MT             3      2717         7540           2× 12× CPU           254.4
MT             4      2050         7413           2× 12× CPU           252.9
MT             6      1408         7243           2× 12× CPU           245.4
MT            12       803         7172           2× 12× CPU           215.2
MT            24       559         7866           2× 12× CPU           154.7

There’s more to it than cache, it seems…

With this one I found it particularly surprising that the 3-thread tasks did better than the 4-thread ones, given that half the cores end up running threads from two processes, which I expected would badly degrade the L1 and L2 cache efficiency.
Link
Message 77813 - Posted: 22 Dec 2025, 17:51:02 UTC - in response to Message 77812.  

There’s more to it than cache, it seems…
Triple-channel RAM for each CPU surely helps a lot. There's a reason why Intel and AMD don't release triple- or quad-channel memory controllers for consumer products. Considering when dual-channel became standard, quad-channel should have been standard for a few years now, but that would kill their server CPU sales (or they'd have to sell those a lot cheaper).
Brian Nixon

Message 77814 - Posted: 22 Dec 2025, 20:35:03 UTC - in response to Message 77813.  

Triple-channel RAM for each CPU surely helps a lot.
But the E5-2650 v2 has quad-channel RAM, and more L3 cache per core than the X5660 – yet peaked at 4 threads/WU rather than 2. That’s the part I don’t understand…