Welcome to MilkyWay@home

Can the app run on 32 threads?

Message boards : Number crunching : Can the app run on 32 threads?
Ryan Munro

Joined: 22 Jun 09
Posts: 16
Credit: 81,413,008
RAC: 383
Message 77356 - Posted: 6 Mar 2025, 15:05:16 UTC

As above: I have a 32-thread CPU (9950X), and currently BOINC only seems to run one MW instance at a time. This hits a single CCD for whatever reason and pushes the temps into the 90s. Is there any way to have one WU use all 32 threads? That should balance the load over both CCDs and stop the temps going so high.
I run a number of other apps on this CPU and only MW pushes temps that high. With MW suspended the CPU never goes above 80 °C; even Prime95 can't get temps that high.
Link
Joined: 19 Jul 10
Posts: 832
Credit: 21,815,555
RAC: 7,094
Message 77359 - Posted: 7 Mar 2025, 16:29:08 UTC - in response to Message 77356.  

As above: I have a 32-thread CPU (9950X), and currently BOINC only seems to run one MW instance at a time. This hits a single CCD for whatever reason and pushes the temps into the 90s. Is there any way to have one WU use all 32 threads? That should balance the load over both CCDs and stop the temps going so high.
Yes, you can set it in your MilkyWay@home preferences. However, I have no idea how good an idea it is to run it on two CCDs; the reason it runs on one is the far faster communication between the threads.

No idea, though, why MilkyWay is heating your CPU up like this; for everyone else it's AFAIK by far the coolest-running application. On my 5700G, for example, it uses around 20 watts less than MCM from WCG, resulting in about 10° lower temperature.

Actually, BOINC should run two 16-thread WUs on your CPU if you allow it to use 100% of the CPU cores. Considering your temperature issues, however, I'd suggest setting that to 50%, then running two 8-thread WUs and seeing how they spread over the two CCDs and what temperatures you get.
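
The arithmetic behind that suggestion can be sketched in a few lines. This is only an illustration, not BOINC's actual scheduling code: the function name is made up, and it assumes the client rounds the usable thread count down and starts as many WUs as fit completely into that budget.

```python
def concurrent_wus(total_threads: int, cpu_percent: float, threads_per_wu: int) -> int:
    """Estimate how many multithreaded WUs fit into the allowed CPU budget.

    Assumption: the client may use floor(total_threads * cpu_percent / 100)
    threads and only starts WUs that fit completely into that budget.
    """
    usable_threads = int(total_threads * cpu_percent / 100)
    return usable_threads // threads_per_wu


# 32-thread CPU, 100% of cores allowed, 16-thread WUs
print(concurrent_wus(32, 100, 16))
# Same CPU limited to 50%, with 8-thread WUs
print(concurrent_wus(32, 50, 8))
```

Both settings give two WUs at a time; the 50% setting simply halves the number of busy threads, which is the point of the suggestion for a temperature problem.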
ahorek's team

Joined: 8 Sep 07
Posts: 13
Credit: 2,581,897
RAC: 1,214
Message 77978 - Posted: 14 May 2026, 17:38:21 UTC

There’s a hard limit of 16 cores, and while it might be possible to override it, it’s not really worth it. Spreading the workload across two CCDs consistently performs worse, since the current app doesn’t scale well with higher core counts.

It’s better to do the opposite: reduce the number of threads per WU and run more WUs in parallel. This lets you increase the number of independent tasks and avoid multithreading & cross-CCD overhead.

A single task per thread is generally ideal, as long as there’s enough cache. MW workloads aren’t as demanding as PrimeGrid, so a CPU like the 9950X (even without 3D) should handle them well.
That said, running tasks in multithreaded mode can make them appear faster, but it doesn’t necessarily improve efficiency. With 1T mode, you’ll complete more tasks over the same period compared to 16T mode, even though each individual task will take longer.
Ian&Steve C.
Joined: 18 Nov 22
Posts: 101
Credit: 655,270,899
RAC: 119,190
Message 77979 - Posted: 14 May 2026, 17:55:50 UTC - in response to Message 77978.  
Last modified: 14 May 2026, 17:56:09 UTC

Definitely agree with Pavel here.

1T is the most efficient; scaling to more cores has diminishing returns. When running on CPU, though, I like to run 4T, just because some of the very long tasks take ages to run on 1T, more than 24 hrs in some cases. 4T keeps most tasks completing in a reasonable amount of time without being too inefficient.

You could also combat the inefficiency by oversubscribing the tasks to help fill the gaps: mark them in BOINC as using 4 threads, but pass --nthreads 5 on the command line instead.
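
One way to express that oversubscription is through BOINC's app_config.xml mechanism, where avg_ncpus sets how many threads the client budgets per task and cmdline is passed to the application. The sketch below only generates such a file as text. The application name and plan class in it are placeholders, not confirmed MilkyWay@home values, so check your own client for the real ones before using anything like this.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str,
                     reserved_threads: int, actual_threads: int) -> str:
    """Return app_config.xml text that reserves one thread count in BOINC
    but asks the application for another via its command line."""
    root = ET.Element("app_config")
    app_version = ET.SubElement(root, "app_version")
    ET.SubElement(app_version, "app_name").text = app_name
    ET.SubElement(app_version, "plan_class").text = plan_class
    # Threads BOINC budgets for each task when scheduling.
    ET.SubElement(app_version, "avg_ncpus").text = str(reserved_threads)
    # Threads the application is actually told to start.
    ET.SubElement(app_version, "cmdline").text = f"--nthreads {actual_threads}"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# "example_app" and "mt" are placeholders, not confirmed MilkyWay@home values.
print(build_app_config("example_app", "mt", reserved_threads=4, actual_threads=5))
```

With 4 reserved and 5 actual threads, a 32-thread CPU would still run 8 tasks at once but with 40 worker threads in total, which is the gap-filling effect described above.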

Link
Joined: 19 Jul 10
Posts: 832
Credit: 21,815,555
RAC: 7,094
Message 77980 - Posted: 14 May 2026, 18:22:00 UTC - in response to Message 77978.  

With 1T mode, you’ll complete more tasks over the same period compared to 16T mode, even though each individual task will take longer.
According to the results submitted by me and others in my benchmarking thread, this isn't true. On my Ryzen 5700G, for example, running two 8-thread WUs is most efficient and gets about 2-4x more work done per day, depending on the WU "size"; short WUs seem to profit most from MT. For older CPUs with fewer cores/threads the "single thread penalty" isn't as large as for newer ones, but running either one or two WUs seems to be most efficient for them too. So far nobody has found single-thread WUs to be most efficient on their system, even if on some older Xeon CPUs single thread was pretty close to the optimum.
ahorek's team

Joined: 8 Sep 07
Posts: 13
Credit: 2,581,897
RAC: 1,214
Message 77981 - Posted: 15 May 2026, 0:24:40 UTC

@Link - interesting! I ran my own offline benchmark on 9950x, and here are the results:

Total cores utilized  Threads / WU  WUs in parallel  IPC   Core throughput  Time  Efficiency     Speedup
1                     1             1                2.01  13.7             529   529            1
2                     2             1                2.02  11.99            296   296            1.78
4                     4             1                2.05  10.93            171   171            3.09
8                     8             1                1.92  8.448            106   106            4.99
16                    16            1                1.76  6.336            83    83             6.37
32                    32            1                0.83  3.071            82    82             6.45

load (+SMT)
32                    1             32               0.98  4.704            1535  47.97
32                    2             16               1.15  5.175            736   46
32                    4             8                1.24  5.208            342   *** 42.75 ***
32                    8             4                1.46  5.694            173   43.25
32                    16            2                1.52  5.016            104   52

load (-SMT)
16                    1             16               1.9   9.5              772   48.25
16                    2             8                1.95  9.36             391   48.875
16                    4             4                2.03  9.338            202   50.5
16                    8             2                1.89  7.938            124   62


A single WU without load doesn’t scale well beyond 8 threads (5× runtime improvement).
Between 16 -> 32 threads (ST), there's basically no improvement.
8 WU with 4 threads seems to be the most efficient option - 12% faster than ST (16x1 / 32x1)
Using SMT improves the performance by about 18%
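
The derived columns can be recomputed from the raw times with a short script. This is a sketch under my reading of the table, not necessarily how the numbers were originally produced: "efficiency" is taken to be time divided by WUs in parallel (lower is better), and "speedup" the 1-thread time divided by the n-thread time. The time unit is not stated in the post.

```python
# (threads per WU, WUs in parallel, time) copied from the benchmark table.
SMT_ON = [(1, 32, 1535), (2, 16, 736), (4, 8, 342), (8, 4, 173), (16, 2, 104)]
SMT_OFF = [(1, 16, 772), (2, 8, 391), (4, 4, 202), (8, 2, 124)]
# (threads, time) for a single WU on an otherwise idle CPU.
SINGLE_WU = [(1, 529), (2, 296), (4, 171), (8, 106), (16, 83), (32, 82)]


def time_per_wu(time, wus_in_parallel):
    """Assumed meaning of the 'efficiency' column: lower is better."""
    return time / wus_in_parallel


def best_config(rows):
    """Row with the lowest time per completed WU."""
    return min(rows, key=lambda row: time_per_wu(row[2], row[1]))


def speedups(rows):
    """Speedup of each thread count relative to the first (1-thread) row."""
    base_time = rows[0][1]
    return {threads: base_time / time for threads, time in rows}


best = best_config(SMT_ON)
print("best +SMT config (threads, WUs, time):", best)
print("time per WU:", time_per_wu(best[2], best[1]))
print("1T x 32 vs best:", round(time_per_wu(1535, 32) / time_per_wu(342, 8), 3))
print("single-WU speedups:", {t: round(s, 2) for t, s in speedups(SINGLE_WU).items()})
```

Under these assumptions the 8 x 4T row comes out about 12% ahead of 32 x 1T, matching the figure above. Comparing the two 4-thread rows gives 50.5 / 42.75, roughly 1.18, which would match the 18% SMT gain if that is the comparison meant.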

It’s worth mentioning that I tested it on Linux using a relatively short WU. Not all computations are the same, and not every part of the algorithm is parallelized, so longer WUs should benefit more from higher thread counts. Theoretically.
Results may also vary on Windows or with different CPUs. Data locality, L3 cache, or memory bandwidth limits are likely responsible for this behavior.
According to my results, 8 MB of L3 cache per task seems to be enough, though estimating the actual requirements is much harder than with PrimeGrid tasks, where the data size is fixed and can be calculated more easily. The optimal cache size may also vary across computation phases and with the WU size/type.
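
To put a rough number on "not every part of the algorithm is parallelized", one option is the Karp-Flatt metric, which estimates an effective serial fraction from a measured speedup. This is a standard textbook formula applied to the single-WU speedups from the table, not something the app or the benchmark itself reports. The result also lumps real serial code together with overheads such as cross-CCD traffic or SMT sharing, so it is only a coarse indicator.

```python
def karp_flatt(speedup: float, threads: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if threads < 2:
        raise ValueError("the metric needs at least 2 threads")
    return (1 / speedup - 1 / threads) / (1 - 1 / threads)


# Single-WU speedups taken from the benchmark table in this post.
MEASURED = {2: 1.78, 4: 3.09, 8: 4.99, 16: 6.37, 32: 6.45}

for threads, speedup in MEASURED.items():
    fraction = karp_flatt(speedup, threads)
    print(f"{threads:>2} threads: effective serial fraction ~ {fraction:.3f}")
```

A fraction that stays roughly flat as threads increase points to a fixed serial part; one that grows points to overhead that increases with thread count.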
Link
Joined: 19 Jul 10
Posts: 832
Credit: 21,815,555
RAC: 7,094
Message 77982 - Posted: 15 May 2026, 7:42:21 UTC - in response to Message 77981.  

A single WU without load doesn’t scale well beyond 8 threads (5× runtime improvement).
Between 16 -> 32 threads (ST), there's basically no improvement.
This suggests that the option to run more than 16 threads should be removed, at least from the web preferences. Currently it's possible to set up to 256 threads there, which is insane.


8 WU with 4 threads seems to be the most efficient option - 12% faster than ST (16x1 / 32x1)
This seems to be in line with the results for my CPU with 16 threads and 16MB L3 cache, so 1MB/thread, for which 2 WUs with 8 threads seem to be best. You have 2MB/thread, so 2x WUs with half the threads is exactly what is expected to be best (and also that the "single thread penalty" isn't as high as on my CPU).
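
The cache arithmetic can be checked quickly. The sketch below assumes the L3 is shared evenly between the WUs running at the same time, which is a simplification: on a two-CCD CPU each CCD has its own L3 slice. The 64 MB figure for the 9950X follows from the 32 threads x 2 MB/thread mentioned above.

```python
def l3_per_task_mb(total_l3_mb: float, concurrent_wus: int) -> float:
    """L3 cache available per running WU, assuming an even split."""
    return total_l3_mb / concurrent_wus


def l3_per_thread_mb(total_l3_mb: float, threads: int) -> float:
    """L3 cache per hardware thread."""
    return total_l3_mb / threads


# 16-thread CPU with 16 MB L3, best result with 2 WUs x 8 threads
print(l3_per_thread_mb(16, 16), l3_per_task_mb(16, 2))
# 32-thread 9950X with 64 MB L3, best result with 8 WUs x 4 threads
print(l3_per_thread_mb(64, 32), l3_per_task_mb(64, 8))
```

Both best-performing configurations work out to 8 MB of L3 per WU, which is consistent with the "8 MB of L3 cache per task" estimate from the benchmark post.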


It’s worth mentioning that I tested it on Linux using a relatively short WU. Not all computations are the same, and not every part of the algorithm is parallelized, so longer WUs should benefit more from higher thread counts. Theoretically.
I tested a long WU and the result was the opposite of that, which I also expected because of the single-threaded startup phase, which makes up a significant part of a short WU. While the speedup for a short WU was 4.12, for the long-running WU it was just 1.63 (in both cases with 14 CPU threads in use + 1 GPU WU on the iGPU, which is my standard configuration).


©2026 Astroinformatics Group