Message boards : Number crunching : More on multi-thread performance
| Author | Message |
|---|---|
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
Recent discussions in one of the news threads got me thinking about the real impact of multi-threading on performance: would running, say, four tasks each on a single thread be better than running one task on four threads, and is there any advantage in running a task on an even rather than an odd number of threads?

So, a first cut at the answers. Using each individual's declaration of how they are running tasks, I have considered three fairly similar computers:

1050100 (my own), running two tasks, each using two threads
1024003, running two tasks, each using seven threads
1039422, running seven tasks, one thread each

Taking a sample of between 200 and 300 tasks, and discarding tasks taking less than 10 seconds to run:

1039422 = 24019 seconds per task
1050100 = 6263 seconds per task
1024003 = 2869 seconds per task

I'll take 1039422 as the baseline, so what runtime would be reasonable to expect?

For two cores -> 24019/2 = 12009.5
For seven cores -> 24019/7 = 3431.3

And likewise for the CPU time (baseline 23921 seconds):

For two cores -> 23921/2 = 11960.5
For seven cores -> 23921/7 = 3417.3

First glance suggests there is something wrong with the times for the single-core data: compared with both the 2- and 7-core times it is far too slow. We need to know more about the configuration of that computer.

Now moving on to the ratio of clock vs. CPU times. In a perfect world we could reasonably expect the CPU time to be the run time multiplied by the number of cores being used, but there are losses in the system, both at the computer level and within the application, so we can expect the actual ratio to be a bit lower than the ideal.

Computer 1050100: expect 2, observe 1.865
Computer 1024003: expect 7, observe 5.326

And what about the ratio between 2 and 7 cores? Expect 3.5, observe 2.856.

So, on this it would appear that running two cores is "nearer ideal" than running seven cores. There are two possible reasons: first, that the applications being run are optimised for some sort of symmetrical multi-thread model; or second, that the computer running 7-threaded tasks is actually saturated in some way or other. Both would lead to a loss of performance.

What next? More data on a wider range of computer configurations. All three of the computers considered here are fairly recent AMD Ryzen based. We need to fill in the gaps between 2 and 7 cores being used, and maybe extend up to 16 cores. Preferably this should be with only one (MilkyWay) task running, and at least 200 tasks being run.

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
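For anyone who wants to re-run this arithmetic with other hosts, here is a minimal sketch in Python. The numbers are simply the per-task figures quoted above; nothing is pulled from the project database, and the host IDs are used only as labels.

```python
# Rough scaling check for the three hosts discussed above.
baseline_run_s = 24019   # host 1039422, 1 thread per task
baseline_cpu_s = 23921

observed = {
    # host id: (threads per task, run seconds per task, observed CPU/run ratio)
    1050100: (2, 6263, 1.865),
    1024003: (7, 2869, 5.326),
}

for host, (threads, run_s, ratio) in observed.items():
    ideal_run = baseline_run_s / threads   # perfect scaling of the 1-thread run time
    ideal_cpu = baseline_cpu_s / threads   # and of the 1-thread CPU time
    efficiency = ratio / threads           # how close CPU/run gets to the thread count
    print(f"host {host}: ideal run {ideal_run:.1f}s (CPU {ideal_cpu:.1f}s), "
          f"observed run {run_s}s, CPU/run {ratio} of {threads} ({efficiency:.1%})")
```

If a host's observed run time comes out below the ideal figure scaled from the single-thread baseline, the baseline itself is suspect, which is the conclusion drawn above for host 1039422.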
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
> First glance suggests there is something wrong with the times for the single-core data: compared with both the 2- and 7-core times it is far too slow.

Yes, single core is very slow.

> 1050100 (my own), running two tasks, each using two threads

If you are running just 2x 2-thread tasks you are using just 4 threads of your 8C/16T Ryzen 7 3700X, so on average 0.5 threads per real core; the other (my) computer is using 14 of its 16 threads, i.e. each real core runs 1.75 threads. Of course it needs more CPU time per task. To make it comparable you would need to run 7 2-thread tasks; currently half of your CPU is idle.

> or second, that the computer running 7-threaded tasks is actually saturated in some way or other. Both would lead to a loss of performance.

That computer is limited to 4 GHz (while I assume yours is boosting constantly at 4.4 GHz, at least on those 4 cores which are in use) and is running Einstein on the iGPU. So yes, it's saturated quite a bit more than yours. ;-)

> We need to fill in the gaps between 2 and 7 cores being used, and maybe extend up to 16 cores. Preferably this should be with only one (MilkyWay) task running, and at least 200 tasks being run.

I have not done it without the iGPU in use, since that's useless for me, but I posted my results for 1, 5, 7 and 14 threads. 3x 5-thread was already slightly slower than 2x 7-thread, even though one more thread was used for CPU processing. I might make benchmarks with my test WU with no GPU tasks running when I have time; I expect 2x 8-thread to be most efficient, 1x 16-thread and 4x 4-thread to be about the same, and 8x 2-thread already significantly slower.
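The per-core loading described here follows directly from the task layout. A minimal sketch, using the 8-core/16-thread counts stated in this thread (the helper name is just for illustration):

```python
def threads_per_real_core(tasks, threads_per_task, physical_cores):
    """Average number of busy hardware threads per physical core."""
    return tasks * threads_per_task / physical_cores

# 2 tasks x 2 threads on the 8-core/16-thread Ryzen 7 3700X
print(threads_per_real_core(2, 2, 8))   # 0.5  -> half the CPU is idle
# 2 tasks x 7 threads on the other 8-core/16-thread host (Einstein on its iGPU)
print(threads_per_real_core(2, 7, 8))   # 1.75 -> SMT heavily used, more CPU time per task
```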
|
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
Mine is also running at 4 GHz (+/- 0.03 GHz), so there's got to be something adrift on yours.

In addition to the four cores running MilkyWay I have 4 cores running various other BOINC-based work, and the rest (8) are looking after general computing tasks such as streaming, browsing and audio production. I've almost always got a few spreadsheet and word processing tasks going on just to add to the mix; indeed the only time I suspend BOINC is when I'm doing any CAD work.

You may have noticed I don't have an iGPU. This was a conscious decision when I bought the current processor: two fairly old, but at the time of purchase fairly high-end, GPUs work very nicely when it comes to manipulating large CAD models, indeed far better than I would have obtained from an iGPU at the time. Perhaps it's the iGPU that's dragging your performance down? (In a past life I noticed that even when not being actively used they did tend to impose a load on the CPU due to directly shared resources.)

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
> In addition to the four cores running MilkyWay I have 4 cores running various other BOINC-based work, and the rest (8) are looking after general computing tasks such as streaming, browsing and audio production. I've almost always got a few spreadsheet and word processing tasks going on just to add to the mix; indeed the only time I suspend BOINC is when I'm doing any CAD work.

That's still just 50% BOINC usage; I'm at 87.5% + whatever the GPU task needs and of course + "general computing", since that's the only computer I use right now.

> Perhaps it's the iGPU that's dragging your performance down?

According to my tests with WCG, where the tasks have nearly identical runtimes, not really, and in particular less than running two additional CPU tasks on the remaining threads. I tried different configurations when I bought this computer, and 14 CPU tasks + 1 GPU task seems to be the sweet spot, at least at 4.0 GHz. At the standard boost clock (4.65 GHz) the iGPU tasks run a lot slower, even if it's not hitting the thermal limit (but it's constantly at the EDC limit, maybe that's the cause). But like I said, I'll run some benchmarks without the GPU, now I'm curious too, then we will know more, just not right now.
|
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
Testing completed. As expected, 2x 8-thread WUs on all 16 threads is best. 4x 4-thread WUs is next best, with around a 4% loss. 8x 2-thread and 1x 16-thread are already significantly slower, about a 12-14% loss; in fact that's as slow as running 3x 5-thread Nbody + Einstein on the iGPU, and significantly slower than my standard 2x 7-thread + Einstein, but without getting anything done on the iGPU, so both are definitely a bad choice, even if a lot better than running single-thread. The iGPU slows down MilkyWay by about 6-10%, which is acceptable IMHO.
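A hedged sketch of how results like these can be compared: convert each configuration's average per-task runtime into WUs completed per day and look at the relative loss against the best configuration. The runtimes below are arbitrary placeholders, not the measured values behind this post:

```python
def tasks_per_day(concurrent_tasks, runtime_s):
    """Throughput of a configuration: WUs completed per 24 hours."""
    return concurrent_tasks * 86400 / runtime_s

# Arbitrary placeholder runtimes (seconds per WU) -- substitute your own measurements.
configs = {
    "2x 8-thread": tasks_per_day(2, 3000),
    "4x 4-thread": tasks_per_day(4, 6200),
    "8x 2-thread": tasks_per_day(8, 13500),
    "1x 16-thread": tasks_per_day(1, 1700),
}

best = max(configs.values())
for name, throughput in configs.items():
    print(f"{name}: {throughput:.1f} WUs/day, {1 - throughput / best:.1%} loss vs best")
```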
|
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
Your results are incomplete due to the use of a single selected task in your test; this could be the best or the worst case task to base your theory on. The way you are overloading your computer, I'm not surprised at your results so far.

Run ONE two-core task at a time and collect the times for over 200 tasks. Then repeat for ONE 4-core task, then for ONE 5-core task, then for ONE 6-core task, then ONE 8-core task. Do not jump to conclusions; wait, it will take many days. Keep the base load exactly as you had for your 7-core tasks.

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
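A minimal sketch of the aggregation this method calls for, assuming the collected task times have been exported to a CSV with one row per completed WU; the file name and the run_time_s column are hypothetical, since the project does not provide such a file directly:

```python
import csv
from statistics import mean

def mean_runtime(path, min_seconds=10, min_sample=200):
    """Average run time for one configuration, discarding very short tasks."""
    with open(path, newline="") as f:
        times = [float(row["run_time_s"]) for row in csv.DictReader(f)]
    usable = [t for t in times if t >= min_seconds]
    if len(usable) < min_sample:
        raise ValueError(f"only {len(usable)} usable tasks, need at least {min_sample}")
    return mean(usable)

# e.g. one file per configuration: tasks_2thread.csv, tasks_4thread.csv, ...
print(mean_runtime("tasks_2thread.csv"))
```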
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
I used two different WUs for my latest tests, plus a longer one before, which was however only tested single-thread vs. 7-thread. Feel free to do your own tests; those are enough for me.

What's the point of running a single 2/4/5/etc.-thread WU on a 16-thread CPU? We know it will complete fastest if we use all 16 threads. Even when running highly optimized applications like PrimeGrid's LLR you need at least as many threads as you have real cores; everything else is slower.

Also, there's no point in running 200 tasks and then comparing them to 200 tasks with a completely different amount of work. You must pick a WU (or a few of them) and run it in different ways, at least on projects with completely random runtimes.
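The "pick a WU and run it in different ways" approach can be scripted with a small timing wrapper. This is only a sketch: run_test_wu.sh is a hypothetical script standing in for however you launch the application standalone with a given thread count; no particular MilkyWay command-line options are assumed.

```python
import subprocess
import time

def time_command(cmd):
    """Wall-clock time for one standalone run of the given command."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Hypothetical wrapper script: replace with however you run the test WU standalone.
runs = {
    "7 threads": ["./run_test_wu.sh", "7"],
    "14 threads": ["./run_test_wu.sh", "14"],
}
for label, cmd in runs.items():
    print(label, f"{time_command(cmd):.1f} s")
```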
|
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
You do not understand what you are talking about. You have selected two tasks, yes, but do you KNOW that they represent the whole range of tasks? Even 200 tasks is a small sample, given the number of tasks run in one day over all the hosts, but it covers a far better selection.

Now try to do a larger sample of tasks, ranging from those that run for less than 10 seconds to those that take at least double your average processing time. In order to select your representative tasks you must first run a very large sample, then select tasks that fall into each of at least three groups: the very fast, those of medium runtime, and those that take a very long time.

DO NOT compare the performance of PrimeGrid with MilkyWay. PrimeGrid employs predominantly integer arithmetic, whereas MilkyWay employs a lot of floating-point arithmetic. With AMD processors there is one float calculation unit for two other calculation units; the two other units may be considered a "master" and a "sub". The master has first call on the float unit, so the sub will tend to be slower; thus when you exceed the number of "master" units the overall performance will drop rapidly, which is exactly what you have observed in trying to run a 16-core task on a 16-core CPU (that is, 8 master + 8 sub).

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
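A minimal sketch of the selection step described above, assuming a large list of runtimes has already been collected: split it into fast/medium/slow thirds and take one representative from each (terciles are just one possible cut):

```python
from statistics import quantiles

def pick_representatives(runtimes):
    """Split a runtime sample into fast/medium/slow thirds and pick one value from each."""
    ordered = sorted(runtimes)
    t1, t2 = quantiles(ordered, n=3)               # tercile boundaries
    groups = {
        "fast":   [t for t in ordered if t <= t1],
        "medium": [t for t in ordered if t1 < t <= t2],
        "slow":   [t for t in ordered if t > t2],
    }
    # middle element of each group as its representative runtime
    return {name: grp[len(grp) // 2] for name, grp in groups.items()}

# usage: pick_representatives(list_of_run_seconds)
```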
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
> With AMD processors there is one float calculation unit for two other calculation units; the two other units may be considered a "master" and a "sub". The master has first call on the float unit, so the sub will tend to be slower; thus when you exceed the number of "master" units the overall performance will drop rapidly, which is exactly what you have observed in trying to run a 16-core task on a 16-core CPU (that is, 8 master + 8 sub).

One FPU shared between two cores was before Ryzen. And even if this type of limitation were all that mattered, then 2x 8-thread would be just as slow. It's not.

Like I said, feel free to run your own tests; I'm not going to spend days or weeks on benchmarking different types of WUs, it's simply not worth it, and in particular tests without the GPU are completely irrelevant to me. But don't compare just 200 random tasks with another 200 random tasks; that doesn't work here and is completely pointless, just like my comparison of credits, which made me think single-thread is fastest. There are a lot more long-running v1.93 WUs than v1.94 WUs (as usual at the beginning of a new batch AFAICT), so according to your method my PC suddenly got a lot faster. That's not true, of course.
|
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
Well, yes and no. The Ryzen family have a number of "real" cores (the ones I called "master") and an equal number of "virtual" or "hyper" cores (the ones I called "sub"). They have one FPU for each "real" core, thus each FPU is shared between the "real" and the "virtual" (what I called "sub"). By virtue of the silicon, the "real" core will have first call on its associated FPU.

As for your views on sample size: have you ever looked at a plot of tasks per time slot vs. time slot? I've just done that and the result is not what I would expect. Using about 300 tasks from one PC and 100-second time slots, I found that there were more very short (sub-100-second) and very long (over 6000-second) tasks than in any 100-second slot in between. I was expecting a "bell" curve, with very few short or long tasks but a distinctly greater number somewhere in between; instead the line between the two extremes just wobbles up and down.

This basically means that no single task (or even pair of tasks) time can be taken as representative of the whole sample; rather, one has to use a large sample, well into the hundreds, to get a fair (not even a good) representation of the real range of task durations.

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
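A minimal sketch of the tasks-per-time-slot count described here, assuming the runtimes are already in a Python list and using the same 100-second slots:

```python
from collections import Counter

def runtime_histogram(runtimes, bin_s=100):
    """Count tasks per runtime slot, e.g. 0-99 s, 100-199 s, ..."""
    counts = Counter(int(t // bin_s) for t in runtimes)
    for b in range(max(counts) + 1):
        lo = b * bin_s
        print(f"{lo:6d}-{lo + bin_s - 1:6d} s: {counts.get(b, 0)}")

# usage: runtime_histogram(list_of_run_seconds)
# a "bell" curve would show the counts peaking somewhere in the middle slots
```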
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
> The Ryzen family have a number of "real" cores (the ones I called "master") and an equal number of "virtual" or "hyper" cores (the ones I called "sub"). They have one FPU for each "real" core, thus each FPU is shared between the "real" and the "virtual" (what I called "sub"). By virtue of the silicon, the "real" core will have first call on its associated FPU.

I don't understand why you invent your own terms here. There are physical/real cores and there are virtual/logical cores. If you turn SMT off in the BIOS, you have just the real cores and can execute one thread at a time on each of them. If you turn it on, every real core becomes two virtual cores and it can execute two threads at a time (at slower speed, however, since they have to share most of the resources). Both virtual cores are equal in that case, no "master" or "sub", and the real core will try to execute the commands for each thread simultaneously, hence the name "Simultaneous Multithreading".

The reason why SMT speeds up processing is that it interleaves instructions from the two threads across the shared execution resources, so that situations like doing nothing while waiting for data occur less often. This wouldn't work as well as it does if the physical core prioritised the thread running on the "master virtual core". SMT might not help for highly optimized applications like LLR, since they will rarely get into such situations (however it doesn't seem to slow them down either, as far as I've read on their forums), but for average applications it helps a lot, and Milkyway's Nbody and nearly all other project applications fall into that category.

> This basically means that no single task (or even pair of tasks) time can be taken as representative of the whole sample; rather, one has to use a large sample, well into the hundreds, to get a fair (not even a good) representation of the real range of task durations.

Yes, but we are not searching for the perfect representation of task durations; we just want to know with how many threads per WU the application performs best on the CPU under typical load, so in my case, for example, while running Einstein on the iGPU. Since, unlike on PrimeGrid, here we can't just start Prime95 and get the result in a few seconds, and we don't even know how much cache the Milkyway application needs, we need to experiment, but within some sensible limits. And of course with selected WUs, since every one is different and the mix isn't constant either.

There's no way that a single 2-thread WU is going to get the most work done per day, so I don't need to test it. The WUs I've chosen were short ones, for obvious reasons, plus some limited testing on a long one to verify that the application does not behave completely differently in such a case. It does not, and that's good enough IMHO.

I started testing after I noticed that my system running single-thread was way too slow compared to similar systems running the same WUs. If I notice again that in some cases my system is way too slow, I might test that type of WU, but right now I don't see the point. I tried running on 50% of the cores with Einstein's FGRP5 and WCG's MCM1; in both cases I got a huge drop in production. Here on Milkyway, even going from 16 to 14 threads (without the GPU in use) means 3-4% less work done per day, and with fewer threads it will only get worse.
|
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
> SMT might not help for highly optimized applications like LLR, since they will rarely get into such situations (however it doesn't seem to slow them down either, as far as I've read on their forums), but for average applications it helps a lot, and Milkyway's Nbody and nearly all other project applications fall into that category.

Sorry, but you just do not understand what you are talking about. One can optimise an application to use a single processor thread, or multiple threads. There is a vast difference between the two, and, while a distributed application may actually contain both, they are very different, and the correct code segments can be selected at runtime. Anyway, that is now a long way off the subject of this thread. END OF ANY DISCUSSION OF LLR here.

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
> I don't understand why you invent your own terms here. There are physical/real cores and there are virtual/logical cores.

Quite a simple answer: for the last few years I've been working on a large system which uses the terms I used. (When I say "large": there are 256 "master" processors, each of which is a 16-core Intel chip, and each of these has a TMS320 and an i960 "sub". Additionally there are 512 Nvidia GPUs; not sure what model they are as I only feed them with data...)

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
The acronym "SMT" has several meanings; the one you describe is, strictly, "Sequential Multi Threading". Then there's Symmetric Multi Tasking and Synchronous Multi Tasking, which are similar, but not the same. Consider a two-threaded process. In Symmetric, the two threads each handle half the data, and do so at their own pace. In Synchronous Multi Tasking, the two threads march through the process totally in sync with each other. Both have their advantages and disadvantages, and which is used really depends on the problem being solved. (And for a real headache, have a heterogeneous mix of processor types in use. How I pity the people who have to sync the data on the system I'm working on.)

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
|
Joined: 1 Nov 10 Posts: 41 Credit: 2,879,221 RAC: 8,855 |
> Yes, but we are not searching for the perfect representation of task durations ...

Too true, but we need to make sure that our sample tasks are representative of the whole set of tasks. Unless the set is very homogeneous, a single task cannot fulfil that, so a large number of tasks is needed. Thus we have a choice: develop a set of a couple of hundred tasks coming from each of the time buckets, or assume that a random set of tasks will be sufficiently representative of the whole set. (Whole set: all the tasks being offered by MilkyWay for a particular analysis.)

Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
|
Joined: 19 Jul 10 Posts: 819 Credit: 21,098,268 RAC: 5,537 |
> When I say "large": there are 256 "master" processors, each of which is a 16-core Intel chip, and each of these has a TMS320 and an i960 "sub". Additionally there are 512 Nvidia GPUs; not sure what model they are as I only feed them with data...

I thought we were talking about a simple PC with a single Ryzen CPU running Milkyway and possibly some other BOINC projects. Anyway, like I said, if you think my tests are not good enough, feel free to do your own. I got the information I needed; this is not supposed to be a PhD thesis about running Milkyway on a Ryzen 5700G.
|