Welcome to MilkyWay@home

High thread count applications

rz5rqt

Joined: 5 Sep 09
Posts: 9
Credit: 556,746,368
RAC: 40,306
Message 76906 - Posted: 10 Feb 2024, 16:36:40 UTC

In another forum I read an article saying that applications with high thread counts are not as efficient. For example, a 16-thread application will not finish twice as fast as an 8-thread application. This got me thinking about how I might improve my throughput by running multiple applications at a lower thread count, and whether I can increase CPU utilization on some systems. One of my computers is a Xeon E5 2678 with 24 "CPUs". Milkyway uses 16 by default, but with a config file I can run two applications at a time with 12 CPUs apiece. Seems like a "no brainer". But how much did I gain? To "prove" that, I would need 2 test files that had equal run times: first run them sequentially, then run them concurrently. Has anyone here ever seen any data like this? Pointers to other articles?
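One common way to reason about this question is Amdahl's law. The following is only a rough sketch: the parallel fraction of 0.95 is a hypothetical value chosen for illustration, not a measurement of the Milkyway application, and the model ignores any interference between tasks running at the same time.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Speedup over a single thread as predicted by Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def relative_throughput(parallel_fraction: float, threads_per_task: int,
                        total_threads: int) -> float:
    """Tasks completed per unit time, relative to one single-threaded task.

    Assumes concurrent tasks do not slow each other down.
    """
    concurrent_tasks = total_threads // threads_per_task
    return concurrent_tasks * amdahl_speedup(parallel_fraction, threads_per_task)


if __name__ == "__main__":
    p = 0.95  # hypothetical parallel fraction, for illustration only
    for threads in (24, 12, 8, 1):
        print(threads, "threads per task:",
              round(relative_throughput(p, threads, 24), 2))
```

Under these assumptions, fewer threads per task always gives more total throughput, which is why a real measurement is still needed to see how large the effect is on a given machine.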
ID: 76906 · Rating: 0
mikey

Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 12
Message 76912 - Posted: 11 Feb 2024, 10:45:26 UTC - in response to Message 76906.  

Why not run some regular tasks one way and then some the other way, say for about 24 hours each, and then take the average times of each and see which is better? The problem with choosing just one task is that it could be a one-off and not representative of real-life tasks. If you do want to run just one task, you can run it outside of BOINC, but I don't know how to do that.
ID: 76912 · Rating: 0
xii5ku

Joined: 1 Jan 17
Posts: 37
Credit: 109,936,261
RAC: 2,664
Message 76913 - Posted: 11 Feb 2024, 14:21:08 UTC

My personal recipe to answer this question, whenever a project with variable workunit sizes is involved, is to test-run tasks outside of BOINC from a little script, with all tasks generated from one and the same workunit. Using a fixed workunit for the tests makes this process fully repeatable and very precise. The script launches as many tasks as I want, with as many software threads as I want, and measures the time each such test run takes. (Or it kills the test run after a set length of time and measures the progress that the tasks made within that time.)

I haven't looked into applying this recipe to the NBody application yet. If NBody tasks receive only deterministic input parameters from the workunit, then it may be doable. But if there are also randomized initial values per task, this recipe won't be applicable.
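A minimal sketch of such a script might look like the following. It is not the script described above: the function name run_batch and its parameters are made up for illustration, and it assumes the application honors the OMP_NUM_THREADS environment variable, which may not hold for every build.

```python
import os
import subprocess
import sys
import time


def run_batch(command, n_tasks, n_threads, workdirs=None):
    """Start n_tasks copies of command at once and time the whole batch.

    command   -- argument list for the application (hypothetical placeholder)
    n_threads -- passed to each copy via OMP_NUM_THREADS (assumed to be honored)
    workdirs  -- optional list of one working directory per task, so that
                 output files of concurrent tasks do not collide
    Returns (elapsed_seconds, exit_codes).
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    start = time.perf_counter()
    procs = []
    for i in range(n_tasks):
        cwd = workdirs[i] if workdirs else None
        procs.append(subprocess.Popen(command, env=env, cwd=cwd))
    exit_codes = [p.wait() for p in procs]  # wait for every task to finish
    elapsed = time.perf_counter() - start
    return elapsed, exit_codes


if __name__ == "__main__":
    # Stand-in workload; replace with the real executable and its arguments.
    dummy = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    for tasks, threads in ((1, 16), (2, 12)):
        elapsed, codes = run_batch(dummy, tasks, threads)
        print(f"{tasks} x {threads} threads: {elapsed:.2f} s, exit codes {codes}")
```

Dividing the number of tasks by the elapsed time of each batch gives a throughput figure that can be compared directly across configurations.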
ID: 76913 · Rating: 0
rz5rqt

Joined: 5 Sep 09
Posts: 9
Credit: 556,746,368
RAC: 40,306
Message 76914 - Posted: 11 Feb 2024, 15:33:59 UTC - in response to Message 76912.  

True. But using real data would mean slightly different results if tested multiple times over multiple days, because the data packets we crunch are of different lengths. That is an unknown variable. When we get through this big backlog of _01 tasks and start getting credit again, I could use recent credit numbers to see an improvement. It would be very close, but because of my background I would kind of like to know to the .01%. Now, I don't expect to go down some rabbit hole for these numbers, but I do like to look at things like this. I used to do it for a living. I just thought that if anyone else out there had done some "all variables controlled" testing, I would take advantage of their work. Thanks for responding. I see we both started back in 2009.
ID: 76914 · Rating: 0
Keith Myers

Joined: 24 Jan 11
Posts: 712
Credit: 551,958,389
RAC: 45,625
Message 76915 - Posted: 11 Feb 2024, 22:06:42 UTC - in response to Message 76906.  

Grab a task file from a local computer and copy it to a temp folder along with the MW MT app. Stop BOINC. Open a command window, cd to the temp folder, and run the MT executable with the task file as its input. The application will run the task file with the default 16 threads.

Save a copy of the stderr.txt output in the folder for later comparison. Then delete the output result file, the boinc_finish file, the lockfile if any, and the stderr.txt file. Don't delete the task file.

Then run the application again, but this time with the num_threads 12 parameter after the executable name and before the task file name.

If the application won't accept the num_threads parameter directly, then just export the value to your environment.
export OMP_NUM_THREADS=12

Then compare the stderr.txt files: for each run, subtract the start time from the finish time to find the elapsed time. If the elapsed time of the 12-thread run is less than double the elapsed time of the 16-thread run, then running two 12-thread tasks at once will be best for production.
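The comparison can be written out as a small calculation. This is a sketch only: the elapsed times below are hypothetical placeholders, not measurements, and it assumes that two 12-thread tasks running side by side each take as long as a single 12-thread task run alone.

```python
def tasks_per_hour(elapsed_seconds: float, concurrent_tasks: int) -> float:
    """Throughput when concurrent_tasks tasks each finish in elapsed_seconds."""
    return concurrent_tasks * 3600.0 / elapsed_seconds


if __name__ == "__main__":
    t16 = 600.0  # hypothetical elapsed time of the 16-thread run, in seconds
    t12 = 750.0  # hypothetical elapsed time of the 12-thread run, in seconds

    one_by_16 = tasks_per_hour(t16, 1)  # one 16-thread task at a time
    two_by_12 = tasks_per_hour(t12, 2)  # two 12-thread tasks at a time

    print(f"1 x 16 threads: {one_by_16:.1f} tasks/hour")
    print(f"2 x 12 threads: {two_by_12:.1f} tasks/hour")
    print("2 x 12 is better" if two_by_12 > one_by_16 else "1 x 16 is better")
```

Because concurrent tasks can compete for cache and memory bandwidth, it is worth confirming the result by actually running two 12-thread tasks together.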
ID: 76915 · Rating: 0
dchhbfx

Joined: 11 Feb 24
Posts: 1
Credit: 33,281
RAC: 0
Message 76959 - Posted: 6 Mar 2024, 12:43:09 UTC - in response to Message 76915.  

hello
ID: 76959 · Rating: 0
mikey

Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 12
Message 76961 - Posted: 7 Mar 2024, 11:47:42 UTC - in response to Message 76959.  

Welcome!! Nice bunch of computers you have there and they seem to be doing great!!
ID: 76961 · Rating: 0
rz5rqt

Joined: 5 Sep 09
Posts: 9
Credit: 556,746,368
RAC: 40,306
Message 77108 - Posted: 27 Apr 2024, 19:54:47 UTC

I took a different path in my investigation of high thread counts. I acquired three Chinese X99 dual-CPU motherboards and populated each with two Xeon E5 2680 V4 CPUs, 32 GB of memory and a 256 GB NVMe drive, with fresh installs of Windows 10. On the first one I did not create an app_config file, and it ran 3 tasks at a time using 16 CPUs each (48 total). The second computer was configured to run 4 tasks at a time using 12 CPUs each (48 total). The third was configured to run 6 tasks at a time using 8 CPUs each (48 total). After 21 days of collecting data every day, it was apparent that the computer running 3 tasks with 16 CPUs did not produce the same amount of credit, neither over the total time period nor in RAC. The other two computers averaged more than 40,500 RAC, but the computer with 3 tasks was consistently under 38,000 RAC. So 10 days ago I reconfigured the 3-task computer to run 6 tasks with 8 CPUs each. The daily numbers came up immediately, and after 4 days the RAC is well over 40,500. I am confident that, for this computer configuration, running 6 tasks with 8 CPUs is more efficient than 3 tasks with 16 CPUs. The daily numbers and RAC were not much different for the 12-CPU computer, but I will change that one also to see if it improves anything.

Thanks to everyone who commented below.
ID: 77108 · Rating: 0
Link

Joined: 19 Jul 10
Posts: 618
Credit: 19,254,980
RAC: 22
Message 77109 - Posted: 28 Apr 2024, 8:56:29 UTC - in response to Message 77108.  
Last modified: 28 Apr 2024, 9:29:21 UTC

I'm currently testing single-thread tasks vs. two-thread tasks on my Ryzen CPU (I think anything with more than two threads will be even more inefficient). I collected data for 300 consecutive tasks for each application. Not all tasks have been validated yet, but the current state is ~5.2 million credits per year (RAC of around 14,000) for the MT application running on two threads and 7.5 million credits per year (RAC ~20,500) for the single-core application (if the computer ran 24/7). That's a huge difference IMHO, especially since (I don't know why) the CPU is using around 4 watts less when running 16x the single-thread application than when running 8x dual-thread, in both cases constantly boosting at 4.6 GHz.

I will do the same for the current 1.87, which is the MT application running on a single thread, but it seems to be pretty much identical to the 1.86 ST application regarding runtimes and energy consumption.
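The yearly figures follow from treating RAC as credit per day at steady state. A quick check of that arithmetic, using the rounded RAC values quoted above (so the results are approximate):

```python
def credits_per_year(rac: float, days: int = 365) -> float:
    """Approximate yearly credit, treating RAC as steady-state credit per day."""
    return rac * days


if __name__ == "__main__":
    rac_two_thread = 14_000     # rounded RAC quoted for the two-thread MT runs
    rac_single_thread = 20_500  # rounded RAC quoted for the single-thread runs

    print(f"two-thread:    {credits_per_year(rac_two_thread):,.0f} credits/year")
    print(f"single-thread: {credits_per_year(rac_single_thread):,.0f} credits/year")

    gain = rac_single_thread / rac_two_thread - 1.0
    print(f"relative gain of single-thread: {gain:.1%}")
```

With these rounded inputs, the single-thread configuration comes out roughly 46% ahead.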
ID: 77109 · Rating: 0
mikey

Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 12
Message 77112 - Posted: 28 Apr 2024, 10:48:21 UTC - in response to Message 76906.  

There is a way to run the exact same file multiple ways, but you would have to copy the file to another place and do a 'sneaker net' type thing: copy the file back into BOINC and run it with 2 CPUs, then copy it back into BOINC again and run it with 3 CPUs, etc. But the result would most likely only be applicable to your system and that file. It would be easier to run a batch of files with 2 CPUs, then stop getting tasks and change your config to use 3 CPUs, etc. That would give you a more real-life scenario, given the various file sizes, which we users can't control.
ID: 77112 · Rating: 0
Link

Joined: 19 Jul 10
Posts: 618
Credit: 19,254,980
RAC: 22
Message 77113 - Posted: 28 Apr 2024, 12:59:41 UTC - in response to Message 77112.  
Last modified: 28 Apr 2024, 13:02:25 UTC

In the case of Milkyway, we can't copy any files; the WU is just a command line for the application, saved in client_state.xml, like for example this one:

<workunit>
    <name>de_nbody_04_08_2024_v186_OCS__data__9_1713924693_672099</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <version_num>187</version_num>
    <rsc_fpops_est>47099200000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>470992000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>52428800.000000</rsc_disk_bound>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 423091676 -np 11 -p 3.63393 1 0.141287 0.148999 1.27413 0.014464 1 1 1 1 1
    </command_line>
    <file_ref>
        <file_name>EMD_v186_OCS.lua</file_name>
        <open_name>nbody_parameters.lua</open_name>
    </file_ref>
    <file_ref>
        <file_name>data_hist_summer_2020_beta_disp7.hist</file_name>
        <open_name>histogram.txt</open_name>
    </file_ref>
</workunit>

And the result part:
<result>
    <name>de_nbody_04_08_2024_v186_OCS__data__9_1713924693_672099_1</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
    <platform>windows_x86_64</platform>
    <version_num>187</version_num>
    <wu_name>de_nbody_04_08_2024_v186_OCS__data__9_1713924693_672099</wu_name>
    <report_deadline>1715345332.000000</report_deadline>
    <received_time>1714308532.311998</received_time>
</result>

So basically you can get some WUs, disable network access, and create a copy of your client_state.xml. From there you can select your set of test WUs and run them in different ways: either modify the copy of client_state.xml containing the test WUs and place it in your BOINC data dir, or modify the app_info.xml.
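To pick out test WUs from such a copy, something like the following sketch could list each workunit's command line. The function name is made up for illustration, the embedded sample is a trimmed version of the workunit shown above wrapped in an assumed client_state root element, and it assumes the file parses as well-formed XML, which may not always be the case.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the workunit shown above; a real file has many more fields.
SAMPLE = """
<client_state>
<workunit>
    <name>de_nbody_04_08_2024_v186_OCS__data__9_1713924693_672099</name>
    <app_name>milkyway_nbody_orbit_fitting</app_name>
    <command_line>
-f nbody_parameters.lua -h histogram.txt --seed 423091676 -np 11 -p 3.63393 1 0.141287 0.148999 1.27413 0.014464 1 1 1 1 1
    </command_line>
</workunit>
</client_state>
"""


def workunit_command_lines(client_state_xml: str) -> dict:
    """Map each workunit name to its command line, whitespace normalized."""
    root = ET.fromstring(client_state_xml)
    commands = {}
    for wu in root.iter("workunit"):
        name = wu.findtext("name", default="").strip()
        command = " ".join(wu.findtext("command_line", default="").split())
        commands[name] = command
    return commands


if __name__ == "__main__":
    # For a real run, read the text of your own copy of client_state.xml instead.
    for name, command in workunit_command_lines(SAMPLE).items():
        print(name)
        print("   ", command)
```

Always work on a copy of the file with BOINC stopped, since the client rewrites client_state.xml while it is running.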
ID: 77113 · Rating: 0
rz5rqt

Joined: 5 Sep 09
Posts: 9
Credit: 556,746,368
RAC: 40,306
Message 77115 - Posted: 28 Apr 2024, 13:20:02 UTC - in response to Message 77109.  

Interesting. I had planned to try a system running 12 tasks with 4 CPUs at the end of the current three week test period but I may just bump that up and configure one computer that way now.

Thanks for the input.
ID: 77115 · Rating: 0
Link

Joined: 19 Jul 10
Posts: 618
Credit: 19,254,980
RAC: 22
Message 77117 - Posted: 28 Apr 2024, 16:05:18 UTC - in response to Message 77115.  

I had planned to try a system running 12 tasks with 4 CPUs at the end of the current three week test period but I may just bump that up and configure one computer that way now.
Better to try single-core tasks; I'm pretty sure you will get a similar increase in production as I did. Milkyway tasks need very little memory, so you should be able to run 56 N-Body tasks even on a system with just 16 GB of RAM.
ID: 77117 · Rating: 0

©2024 Astroinformatics Group