Can the app run on 32 threads?
| Author | Message |
|---|---|
|
Joined: 22 Jun 09 Posts: 16 Credit: 81,413,008 RAC: 157 |
As above: I have a 32-thread CPU (9950X), and currently BOINC only seems to run one MW instance at a time. For whatever reason this hits a single CCD and pushes the temps into the 90s. Is there any way to have one WU use all 32 threads? That should balance the load over both CCDs and stop the temps going so high. I run a number of other apps on this CPU and only MW pushes temps that high. With MW suspended the CPU never goes above 80 °C; even Prime95 can't get temps that high. |
|
Joined: 19 Jul 10 Posts: 832 Credit: 21,889,871 RAC: 7,261 |
> As above, I have a 32 thread CPU (9950x) and currently Boinc only seems to run one MW instance at a time, this hits a single CCD for whatever reason, pushes the temps into the 90's, is there any way to have one WU use all 32 threads, this should balance the load over both CCD's and stop the temps going so high?

Yes, you can set it in your MilkyWay@home preferences. However, I have no idea how good an idea it is to run it on two CCDs; the reason it runs on one is the far faster communication between the threads. I also have no idea why MilkyWay is heating your CPU up like this; for everyone else it's AFAIK by far the coolest-running application. On my 5700G, for example, it uses around 20 watts less than MCM from WCG, resulting in about 10° lower temperature. Actually, BOINC should run two 16-thread WUs on your CPU if you allow it to use 100% of CPU cores. Considering your temperature issues, however, I'd suggest setting that to 50%, then running two 8-thread WUs to see how they spread over the two CCDs and what temperatures you get.
|
|
Joined: 8 Sep 07 Posts: 13 Credit: 2,582,140 RAC: 549 |
There’s a hard limit of 16 cores, and while it might be possible to override it, it’s not really worth it. Spreading the workload across two CCDs consistently performs worse, since the current app doesn’t scale well with higher core counts. It’s better to do the opposite: reduce the number of threads per WU and run more WUs in parallel. This lets you increase the number of independent tasks and avoid multithreading & cross-CCD overhead. A single task per thread is generally ideal, as long as there’s enough cache. MW workloads aren’t as demanding as PrimeGrid, so a CPU like the 9950X (even without 3D) should handle them well. That said, running tasks in multithreaded mode can make them appear faster, but it doesn’t necessarily improve efficiency. With 1T mode, you’ll complete more tasks over the same period compared to 16T mode, even though each individual task will take longer. |
|
Joined: 18 Nov 22 Posts: 102 Credit: 658,236,260 RAC: 235,704 |
Definitely agree with Pavel here. 1T is the most efficient; scaling to more cores has diminishing returns. When running on CPU, though, I like to run 4T, just because some of the very long tasks take ages on 1T, more than 24 hours in some cases. 4T keeps most tasks completing in a reasonable amount of time without being too inefficient. You could also combat the inefficiency by oversubscribing to help fill the gaps: mark the task in BOINC as using 4 threads, but pass --nthreads 5 on the command line instead.
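For anyone who wants to try the oversubscription trick, here is a rough sketch of how it could be expressed in BOINC's `app_config.xml`, generated with a few lines of Python. The element names (`app_version`, `app_name`, `plan_class`, `avg_ncpus`, `cmdline`) are standard `app_config.xml` fields, but the app name `milkyway_nbody` and the plan class `mt` are assumptions on my part; check your own `client_state.xml` for the real values before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, avg_ncpus, nthreads):
    """Build an app_config.xml string that budgets `avg_ncpus` CPUs per task
    in BOINC's scheduler but tells the app itself to start `nthreads` threads."""
    root = ET.Element("app_config")
    app_version = ET.SubElement(root, "app_version")
    ET.SubElement(app_version, "app_name").text = app_name
    ET.SubElement(app_version, "plan_class").text = plan_class
    # What BOINC counts against the CPU budget for each running task.
    ET.SubElement(app_version, "avg_ncpus").text = str(avg_ncpus)
    # What the application actually receives on its command line.
    ET.SubElement(app_version, "cmdline").text = f"--nthreads {nthreads}"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# ASSUMPTION: "milkyway_nbody" and "mt" are illustrative values, not confirmed
# in this thread. 4 budgeted CPUs with 5 real threads is the 4T/5T example above.
xml_text = build_app_config("milkyway_nbody", "mt", avg_ncpus=4, nthreads=5)
print(xml_text)
```

The output would go into the project's folder inside the BOINC data directory as `app_config.xml`, and the client has to re-read its config files (or be restarted) before it takes effect.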
|
|
Joined: 19 Jul 10 Posts: 832 Credit: 21,889,871 RAC: 7,261 |
> With 1T mode, you’ll complete more tasks over the same period compared to 16T mode, even though each individual task will take longer.

According to the results submitted by me and others in my benchmarking thread, this isn't true. On my Ryzen 5700G, for example, running two 8-thread WUs is the most efficient setup and gets about 2-4x more work done per day, depending on the WU "size"; short WUs seem to profit most from MT. For older CPUs with fewer cores/threads the "single thread penalty" isn't as large as for newer ones, but running either one or two WUs seems to be most efficient for them too. So far nobody has found single-thread WUs to be the most efficient on their system, even if on some older Xeon CPUs single thread was pretty close to the optimum.
|
|
Joined: 8 Sep 07 Posts: 13 Credit: 2,582,140 RAC: 549 |
@Link - interesting! I ran my own offline benchmark on a 9950X, and here are the results:

| Total cores utilized | Threads / WU | WUs in parallel | IPC | Core throughput | Time | Efficiency | Speedup |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2.01 | 13.7 | 529 | 529 | 1 |
| 2 | 2 | 1 | 2.02 | 11.99 | 296 | 296 | 1.78 |
| 4 | 4 | 1 | 2.05 | 10.93 | 171 | 171 | 3.09 |
| 8 | 8 | 1 | 1.92 | 8.448 | 106 | 106 | 4.99 |
| 16 | 16 | 1 | 1.76 | 6.336 | 83 | 83 | 6.37 |
| 32 | 32 | 1 | 0.83 | 3.071 | 82 | 82 | 6.45 |

load (+SMT)

| Total cores utilized | Threads / WU | WUs in parallel | IPC | Core throughput | Time | Efficiency |
|---|---|---|---|---|---|---|
| 32 | 1 | 32 | 0.98 | 4.704 | 1535 | 47.97 |
| 32 | 2 | 16 | 1.15 | 5.175 | 736 | 46 |
| 32 | 4 | 8 | 1.24 | 5.208 | 342 | **42.75** |
| 32 | 8 | 4 | 1.46 | 5.694 | 173 | 43.25 |
| 32 | 16 | 2 | 1.52 | 5.016 | 104 | 52 |

load (-SMT)

| Total cores utilized | Threads / WU | WUs in parallel | IPC | Core throughput | Time | Efficiency |
|---|---|---|---|---|---|---|
| 16 | 1 | 16 | 1.9 | 9.5 | 772 | 48.25 |
| 16 | 2 | 8 | 1.95 | 9.36 | 391 | 48.875 |
| 16 | 4 | 4 | 2.03 | 9.338 | 202 | 50.5 |
| 16 | 8 | 2 | 1.89 | 7.938 | 124 | 62 |

A single WU without load doesn’t scale well beyond 8 threads (5× runtime improvement).

Between 16 -> 32 threads (ST), there's basically no improvement.

8 WU with 4 threads seems to be the most efficient option - 12% faster than ST (16x1 / 32x1).

Using SMT improves the performance by about 18%.

It’s worth mentioning that I tested it on Linux using a relatively short WU. Not all computations are the same, and not every part of the algorithm is parallelized, so longer WUs should benefit more from higher thread counts. Theoretically. Results may also vary on Windows or with different CPUs.

Data locality, L3 cache, or memory bandwidth limits are likely responsible for this behavior. According to my results, 8 MB of L3 cache per task seems to be enough, though estimating the actual requirements is much harder than with PrimeGrid tasks, where the data size is fixed and can be calculated more easily. The optimal cache size may also vary across computation phases and with the WU size/type. |
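For anyone wanting to redo this comparison with their own timings, the "efficiency" and "speedup" columns above appear to be simple ratios: efficiency is the batch time divided by the number of WUs running in parallel, and speedup is the 1-thread time divided by the N-thread time. A minimal sketch using the numbers from the post (the variable and function names are my own, and the time unit is whatever the benchmark reported):

```python
# (threads_per_wu, wus_in_parallel, batch_time), copied from the loaded runs.
LOADED_SMT = [(1, 32, 1535), (2, 16, 736), (4, 8, 342), (8, 4, 173), (16, 2, 104)]
LOADED_NO_SMT = [(1, 16, 772), (2, 8, 391), (4, 4, 202), (8, 2, 124)]
# (threads, time) for a single WU on an otherwise idle CPU.
SINGLE_WU = [(1, 529), (2, 296), (4, 171), (8, 106), (16, 83), (32, 82)]


def time_per_wu(batch_time, wus_in_parallel):
    """Effective time per WU when several WUs run side by side (lower is better)."""
    return batch_time / wus_in_parallel


def best_config(rows):
    """Return the row with the lowest effective time per WU."""
    return min(rows, key=lambda row: time_per_wu(row[2], row[1]))


def speedups(rows):
    """Speedup of each thread count relative to the first (1-thread) row."""
    base_time = rows[0][1]
    return {threads: round(base_time / t, 2) for threads, t in rows}


best_smt = best_config(LOADED_SMT)
best_no_smt = best_config(LOADED_NO_SMT)
single_wu_speedups = speedups(SINGLE_WU)

print("best with SMT:", best_smt, time_per_wu(best_smt[2], best_smt[1]))
print("best without SMT:", best_no_smt, time_per_wu(best_no_smt[2], best_no_smt[1]))
print("single-WU speedups:", single_wu_speedups)
```

Swapping in your own rows should show quickly whether 4 threads per WU is also the sweet spot on a different CPU or WU size.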
|
Joined: 19 Jul 10 Posts: 832 Credit: 21,889,871 RAC: 7,261 |
> A single WU without load doesn’t scale well beyond 8 threads (5× runtime improvement).

This suggests that the options to run more than 16 threads should be removed, at least from the web preferences. Currently it's possible to set up to 256 threads there, which is insane.

> 8 WU with 4 threads seems to be the most efficient option - 12% faster than ST (16x1 / 32x1)

This seems to be in line with the results for my CPU with 16 threads and 16 MB L3 cache, so 1 MB/thread, for which 2 WUs with 8 threads seem to be best. You have 2 MB/thread, so 2x the WUs with half the threads each is exactly what is expected to be best (and also that the "single thread penalty" isn't as high as on my CPU).

> It’s worth mentioning that I tested it on Linux using a relatively short WU. Not all computations are the same, and not every part of the algorithm is parallelized, so longer WUs should benefit more from higher thread counts. Theoretically.

I tested a long WU and the result was the opposite of that, which I also expected because of the single-threaded start-up phase, which makes up a significant part of a short WU. While the speedup for a short WU was 4.12, for the long-running WU it was just 1.63 (in both cases with 14 CPU threads in use + 1 GPU WU on the iGPU, which is the standard configuration for me).
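The cache-per-thread reasoning can be put into a few lines for checking other CPUs. This is only a back-of-the-envelope sketch: it assumes the L3 is shared evenly between all running WUs, which ignores that on a two-CCD part each CCD has its own L3, and the 64 MB figure for the 9950X is inferred from the 2 MB/thread mentioned above. The function name is made up for this example.

```python
def l3_per_wu_mb(l3_total_mb, hw_threads, threads_per_wu):
    """Rough L3 share per WU when every hardware thread is busy.

    ASSUMPTION: L3 is split evenly across concurrently running WUs.
    """
    wus_in_parallel = hw_threads // threads_per_wu
    return l3_total_mb / wus_in_parallel


# 5700G: 16 threads, 16 MB L3, best result reported with 2 WUs x 8 threads.
print(l3_per_wu_mb(16, 16, 8))
# 9950X: 32 threads, 64 MB L3 (2 MB/thread), best result with 8 WUs x 4 threads.
print(l3_per_wu_mb(64, 32, 4))
# 9950X with one WU per thread, for comparison.
print(l3_per_wu_mb(64, 32, 1))
```

Both reported optima work out to the same share per WU, which fits the "8 MB of L3 cache per task" estimate from the benchmark post, while 1T on the 9950X would leave each task with a quarter of that.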
|
©2026 Astroinformatics Group