Milkyway Nbody ST vs. MT: real benchmarking

Link
Joined: 19 Jul 10
Posts: 775
Credit: 20,504,165
RAC: 9,827
Message 77668 - Posted: 17 Oct 2025, 16:33:03 UTC

As some of you might have read, I did some comparisons between MT and ST tasks in the past. However, those were based only on credits, and I only compared 2-thread tasks with single-thread tasks, so that was not a very good comparison. After reading a lot about multithreading and cache sizes on PrimeGrid, I decided to do a real benchmark, in particular because I have seen that my computer sometimes needed a ridiculous amount of CPU time compared to my wingmen for the same amount of work.

So now I tried 7 threads per WU (I use 14 of the 16 threads of my CPU for CPU crunching; the other two are for feeding the iGPU). First I ran a WU in ST mode during normal production. Then I changed my settings to 7 threads per WU, set up a second instance of BOINC, and ran the same WU there in 7-thread mode, together with another 7-thread WU and one GPU WU from Einstein in the first instance of BOINC.

The WU I have used is de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512.
           run time     CPU time
ST          11214 s      11205 s
MT (1)        387 s       2038 s
MT (2)        391 s       2046 s

So there's a huge difference in the CPU time required, which suggests we have the same cache effects here as on PrimeGrid; we just don't know how much cache is needed per WU. Also, the runtime of the Einstein WU on the iGPU went down from about 18.5k seconds to around 15.9k.
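To put numbers on that difference, here is a minimal Python sketch using only the values from the table above. The function name and the "busy threads" measure (CPU time divided by run time) are my own illustration, not something the application reports.

```python
def cpu_time_stats(st_cpu_s, mt_run_s, mt_cpu_s, threads):
    """Compare one MT run against the ST run of the same WU."""
    cpu_saving = st_cpu_s / mt_cpu_s      # how many times less CPU time than ST
    busy_threads = mt_cpu_s / mt_run_s    # average number of threads kept busy
    utilization = busy_threads / threads  # fraction of the assigned threads in use
    return cpu_saving, busy_threads, utilization

# Values from the table above (seconds); both MT runs used 7 threads.
ST_CPU = 11205.0
MT_RUNS = [(387.0, 2038.0), (391.0, 2046.0)]

for run_s, cpu_s in MT_RUNS:
    saving, busy, util = cpu_time_stats(ST_CPU, run_s, cpu_s, threads=7)
    print(f"{saving:.2f}x less CPU time, {busy:.2f} of 7 threads busy ({util:.0%})")
```

So the MT runs need roughly 5.5 times less CPU time than the ST run, even though they keep only a bit more than 5 of their 7 threads busy on average.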


In case someone wants to run some benchmarks on their own computer, this is the WU (I added a number at the end of the name to have several copies for testing, and extended the deadline):
<workunit>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>193</version_num>
    <rsc_fpops_est>40006000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>400060000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>104858000.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 156791432 -np 12 -p 3.45074 1 0.339685 0.446913 27.1153 0.706509 41.902 21.1847 -2.65629 -6.82502 -46.1851 95741.7
    </command_line>
    <file_ref>
        <file_name>EMD_v193_OCS_orbit_fitting_lmc_pm_old_eps2.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2025_pm.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01_0</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>193</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
    <received_time>1760283432.046444</received_time>
</result>
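Before reusing a fragment like this, a quick consistency check can catch copy-and-paste mistakes in the renamed copies. The following is only a sketch in Python: the helper name and the trimmed sample are my own, and it checks nothing beyond the three things named in the comments.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def check_fragment(fragment: str) -> dict:
    """Check that a <workunit>/<result> pair refers to the same WU."""
    # The fragment has two top-level elements, so wrap it in a dummy root.
    root = ET.fromstring(f"<root>{fragment}</root>")
    wu = root.find("workunit")
    res = root.find("result")
    wu_name = wu.findtext("name").strip()
    deadline = float(res.findtext("report_deadline"))
    return {
        # The result must point at the workunit it belongs to.
        "names_match": res.findtext("wu_name").strip() == wu_name,
        # Both blocks should name the same application version.
        "versions_match": wu.findtext("version_num") == res.findtext("version_num"),
        # report_deadline is a Unix timestamp; show it as a UTC date.
        "deadline_utc": datetime.fromtimestamp(deadline, tz=timezone.utc),
    }

# Trimmed copy of the fragment above (only the fields that are checked).
SAMPLE = """
<workunit>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</name>
    <version_num>193</version_num>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01_0</name>
    <version_num>193</version_num>
    <wu_name>de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512_01</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
</result>
"""

report = check_fragment(SAMPLE)
print(report["names_match"], report["versions_match"], report["deadline_utc"].date())
```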

I'll do some more tests when I have time. I have saved some other WUs in the past, but they were for older application versions, so I have no idea whether they will work with the current one. But I think it's pretty obvious that one thread per task isn't optimal, as I had thought after my simple comparison in the past.
Link
Message 77674 - Posted: 19 Oct 2025, 16:49:40 UTC - in response to Message 77668.  
Last modified: 19 Oct 2025, 17:16:18 UTC

Now I tried all 14 threads for one WU; that's worse than 2x 7-thread WUs.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     11214.xx s   11205.xx s     107.8
MT             7       386.65 s    2038.38 s     446.5
MT             7       390.82 s    2046.41 s     442.1
MT            14       213.93 s    1990.48 s     403.9

So crunching this particular WU on half of the available threads instead of just one increases production by a factor of 4.12. I didn't expect that after my previous tests, and in particular not such a huge difference between ST and MT.
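For anyone who wants to compute the same column for their own machine: the "WU/day on 14 threads" values are consistent with (number of WUs running at once) x 86400 / run time. The Python sketch below is my reading of that column, not a formula taken from the project, and the function name is made up for illustration. It reproduces the table to within rounding; the first 7-thread row matches if the run time is rounded to 387 s.

```python
SECONDS_PER_DAY = 86_400

def wu_per_day(run_time_s, threads_per_wu, threads_available=14):
    """Estimated WUs finished per day when all available threads are used."""
    concurrent = threads_available // threads_per_wu  # WUs running side by side
    return concurrent * SECONDS_PER_DAY / run_time_s

# Run times from the table above (the ST run time is given there as 11214.xx s).
for label, run_s, threads in [("ST", 11214.0, 1),
                              ("MT 7", 386.65, 7),
                              ("MT 7", 390.82, 7),
                              ("MT 14", 213.93, 14)]:
    print(f"{label:6s} {wu_per_day(run_s, threads):7.1f} WU/day")
```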
Link
Message 77675 - Posted: 20 Oct 2025, 11:14:13 UTC - in response to Message 77674.  

The final test was with 3x 5-thread WUs, so running on 15 threads.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     11214.xx s   11205.xx s     107.8
MT             5       613.07 s    2350.41 s     422.8 (15 threads)
MT             7       386.65 s    2038.38 s     446.5
MT             7       390.82 s    2046.41 s     442.1
MT            14       213.93 s    1990.48 s     403.9

So 3x5 is slightly slower than 2x7 and will likely also have a negative impact on the iGPU production, so definitely not worth it, at least not on my Ryzen 5700G.


I also tested another, long-running WU, though with a slightly older application version. That shouldn't make much difference, I guess.

WU: de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362
Application version: 1.90
           Threads     run time     CPU time     WU/day on 14 threads
ST             1     86598.57 s   86355.62 s     13.97
MT             7      7560.51 s   45338.19 s     22.86

So again a huge increase in production, though not as large as for the other WU. That is surprising again, as I'd expect the single-core phase at the beginning of each WU to have a higher impact on short WUs, but apparently other factors are more significant than that.

This is the WU in case someone wants to test with it:
<workunit>
    <name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</name>
    <app_name>milkyway_nbody</app_name>
    <version_num>190</version_num>
    <rsc_fpops_est>20629100000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>206291000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>52428800.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 411592566 -np 11 -p 2.49417 1 0.0338769 0.440317 12.5767 0.965698 46.2133 23.2603 -191.888 75.0999 135.272
    </command_line>
    <file_ref>
        <file_name>EMD_v190_OCS_orbit_fitting.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2023_version2.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>
<result>
    <name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362_1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>190</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_orbit_fitting_05_23_2025_v190_OCS__data__02_1748458520_896362</wu_name>
    <report_deadline>1792749848.000000</report_deadline>
    <received_time>1750690560.237772</received_time>
</result>

Link
Message 77728 - Posted: 20 Nov 2025, 12:51:11 UTC

More benchmarking results, now also without an Einstein WU running on the iGPU of my Ryzen 5700G.

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day    relative speed
ST             1     11214.xx s   11205.xx s     14x CPU + 1x iGPU     107.8         22.15%
MT             2      1633.08 s    3089.80 s     16x CPU + 0x iGPU     423.2         86.98%
MT             4       742.80 s    2534.84 s     16x CPU + 0x iGPU     465.3         95.62%
MT             5       613.07 s    2350.41 s     15x CPU + 1x iGPU     422.8         86.89%
MT             7       386.65 s    2038.38 s     14x CPU + 1x iGPU     446.5         91.85%
MT             7       390.82 s    2046.41 s     14x CPU + 1x iGPU     442.1         90.87%
MT             7       368.25 s    2024.03 s     14x CPU + 0x iGPU     469.2         96.43%
MT             8       355.12 s    2143.25 s     16x CPU + 0x iGPU     486.6    --> 100.00% <--
MT            14       213.93 s    1990.48 s     14x CPU + 1x iGPU     403.9         83.00%
MT            16       204.70 s    2130.02 s     16x CPU + 0x iGPU     422.1         86.74%
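
The "relative speed" column is each row's WU/day divided by the best WU/day in the table. A short Python sketch of that normalization, using the ST row and the rows that ran on all 16 CPU threads (the dictionary and function names are only for illustration):

```python
# WU/day values copied from the table above, keyed by threads per WU.
WU_PER_DAY = {1: 107.8, 2: 423.2, 4: 465.3, 8: 486.6, 16: 422.1}

def relative_speed(wu_per_day):
    """Return (best setting, {setting: percent of the best WU/day})."""
    best = max(wu_per_day, key=wu_per_day.get)
    top = wu_per_day[best]
    return best, {k: 100.0 * v / top for k, v in wu_per_day.items()}

best, rel = relative_speed(WU_PER_DAY)
for threads, pct in rel.items():
    marker = "  <-- best" if threads == best else ""
    print(f"{threads:2d} threads/WU: {pct:6.2f}%{marker}")
```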



And here is a slightly longer-running WU: de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day    relative speed
MT             2      2736.81 s    5205.73 s     16x CPU + 0x iGPU     252.6         84.79%
MT             4      1208.70 s    4198.17 s     16x CPU + 0x iGPU     285.9         96.00%
MT             7       612.74 s    3286.55 s     14x CPU + 1x iGPU     282.0         94.68%
MT             7       601.14 s    3388.61 s     14x CPU + 0x iGPU     287.5         96.51%
MT             8       580.16 s    3608.00 s     16x CPU + 0x iGPU     297.8    --> 100.00% <--
MT            16       328.15 s    3518.88 s     16x CPU + 0x iGPU     263.3         88.40%



This is the WU in case someone wants to test with it:
<workunit>
    <name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>193</version_num>
    <rsc_fpops_est>30022500000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>3002250000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>104858000.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 233088231 -np 12 -p 5.28862 1 0.170294 0.298857 1.26323 0.0213518 45.8 21.5 -185.5 54.7 147.4 449866
    </command_line>
    <file_ref>
        <file_name>EMD_v193_OCS_north_no_orbit_old_eps2.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>OCS_data_2025_north.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>

<result>
    <name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212__1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>193</version_num>
    <plan_class>mt</plan_class>
    <suspended_via_gui/>
    <wu_name>de_nbody_10_27_2025_v193_OCS_north__data__07_1762888230_2212_</wu_name>
    <report_deadline>1794443214.000000</report_deadline>
    <received_time>1763406414.476450</received_time>
</result>

Brian Nixon

Joined: 18 Nov 23
Posts: 2
Credit: 243,227
RAC: 1,042
Message 77742 - Posted: 22 Nov 2025, 18:40:33 UTC

Thanks for this. When I started MW@h I too observed that multi-threaded tasks use less than 100% of the CPU because of the synchronisation needed between threads, assumed that running multiple single-threaded tasks would therefore be more efficient overall, and set my preferences accordingly. And later I similarly noticed that wingmen almost invariably complete the same workunits in substantially less time (beyond differences explained by faster CPUs)…

Here are some numbers from my i5-3320M:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6281         6243           4× CPU                55.0
MT             2      2871         5502           4× CPU                60.2
MT             4      1440         5147           4× CPU                60.0


For reference, these are the results I got:
<search_likelihood>-27342.660233678041550</search_likelihood>
<search_likelihood_EMD>-14.106647272974579</search_likelihood_EMD>
<search_likelihood_Mass>-3771.532869044177005</search_likelihood_Mass>
<search_likelihood_Beta>-822.641452301281788</search_likelihood_Beta>
<search_likelihood_BetaAvg>-274.418758463499103</search_likelihood_BetaAvg>
<search_likelihood_VelAvg>-383.897846477010262</search_likelihood_VelAvg>
<search_likelihood_Dist>-1141.268630669829690</search_likelihood_Dist>
<search_likelihood_PM_dec>-19182.767423587025405</search_likelihood_PM_dec>
<search_likelihood_PM_ra>-1752.026605862245560</search_likelihood_PM_ra>

I will add numbers from some other machines in due course.
Brian Nixon

Message 77743 - Posted: 23 Nov 2025, 0:56:57 UTC
Last modified: 23 Nov 2025, 0:57:28 UTC

Here are some numbers from my i7-7700HQ:

WU: de_nbody_orbit_fitting_10_01_2025_v193_OCS_lmc_pm__data__02_1758550502_1411512
Application version: 1.93
        threads/WU     run time     CPU time     threads in use        WU/day
ST             1      6193         6117           8× CPU               111.6
MT             2      2500         4825           8× CPU               138.2
MT             4      1171         4184           8× CPU               147.6
MT             8       528         3407           8× CPU               163.7
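
To compare the i5-3320M and i7-7700HQ tables directly, the WU/day columns can be reduced to a best setting and its gain over ST. This is just a small Python sketch over the posted numbers; the names are illustrative.

```python
# WU/day by threads per WU, copied from the two tables (i5-3320M and i7-7700HQ).
WU_PER_DAY = {
    "i5-3320M": {1: 55.0, 2: 60.2, 4: 60.0},
    "i7-7700HQ": {1: 111.6, 2: 138.2, 4: 147.6, 8: 163.7},
}

def best_setting(table):
    """Return (threads per WU with the highest WU/day, gain over ST)."""
    threads = max(table, key=table.get)
    gain = table[threads] / table[1]
    return threads, gain

for cpu, table in WU_PER_DAY.items():
    threads, gain = best_setting(table)
    print(f"{cpu}: best at {threads} threads/WU, {gain:.2f}x the ST throughput")
```

On these numbers the i7-7700HQ does best with one WU on all 8 threads, while on the i5-3320M the 2-thread and 4-thread settings are nearly tied.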
Link
Message 77744 - Posted: 23 Nov 2025, 8:19:36 UTC - in response to Message 77743.  

Interesting. It seems that on older CPUs with fewer cores (and less cache), running one WU at a time on all threads might be best.
bobsmith18

Joined: 1 Nov 10
Posts: 18
Credit: 2,336,632
RAC: 6,154
Message 77745 - Posted: 23 Nov 2025, 19:36:08 UTC - in response to Message 77744.  

That statement is somewhat at odds with the figures in the final column of his table.
Perhaps the OP could explain how he ran his tests.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
