Welcome to MilkyWay@home

Posts by Captiosus

1) Message boards : Number crunching : Splitting MT N-body? (Message 66825)
Posted 1 Dec 2017 by Captiosus
Ha yeah, I probably should have mentioned that your initial copy over from Cosmology@home completely mangled the code.
2) Message boards : Number crunching : Splitting MT N-body? (Message 66821)
Posted 30 Nov 2017 by Captiosus
Cheers! Took a little bit to get it right, but I got it, and now hopefully it will help improve CPU utilization a bit.
3) Message boards : Number crunching : Splitting MT N-body? (Message 66814)
Posted 26 Nov 2017 by Captiosus
I was wondering, is there a way to split MT N-Body units so that instead of, say, 1 unit taking up all 12 threads, I can have 2 units using 6 threads each?
4) Message boards : Number crunching : m2050 (Message 66208)
Posted 20 Feb 2017 by Captiosus
Will these work? OpenCL 1.1, but they are DP.

If you can cool it, yeah it will work. Before I accidentally broke mine I was using my own M2050 for milkyway (amongst other projects), and am currently using a GTX 460 (same core architecture; Fermi) for milkyway and it does well enough.

A word of warning though, it IS very first generation Fermi and thus will run hot and power hungry. Make sure you have a strong enough PSU and ample airflow to keep it fed and cooled.

Finally, I recommend you set up your client to run multiple workunits on it; it is a powerful card after all.
5) Message boards : Number crunching : Just had a BOINC unexpectedly quit when starting an Nbody unit (Message 66151)
Posted 1 Feb 2017 by Captiosus
Has anyone been having issues with BOINC crashing while computing Milkyway@home tasks? I just had BOINC quit while it was in the process of starting a workunit, and it gave no warning, no error, nothing.

The last line of the log says it was starting an Nbody Tighter Constraints MT unit (de_nbody_11_7_16_v162_20k_tighterconstraints_1_1484858102_344533_1).
The only other behavior I can add is that it appears BOINC had just paused the CPU modfit programs (all 8 of them) to start the MT unit when it happened.

I'm not sure if it's nbody being derp again, or I just got a bad unit, or what is going on. This rig had successfully gone through easily a dozen or two Nbody units before the crash.

This machine is overclocked, but I don't think that has anything to do with it, as it has successfully passed a 12-hour run of Prime95 large in-place FFTs and has processed a number of both Collatz and Milkyway@home units, with all units being considered valid by their respective projects. The Collatz CPU units in particular took the machine about 5 days to chew through (8 units, one per thread), and all 8 were considered valid.

I dunno, I'll cook the remaining Milkyway units I have left then try other projects, see if anything can get BOINC to quit again.
6) Message boards : News : Scheduled Maintenance Concluded (Message 65731)
Posted 13 Nov 2016 by Captiosus
My GPU units are not executing on the GPUs; they're running on the CPU and sucking up CPU time. My machine is running Win7, and I typically run 2 units per card.

Oh well, looks like I'm waiting for the next fix. Off to SETI!
7) Message boards : Number crunching : Massive server issues and Wu validation delays (Message 65549)
Posted 28 Oct 2016 by Captiosus

I have so many tasks pending the server has trouble loading the "tasks" page. I agree with the aforementioned solution of increasing WU size. The sheer number of these tiny WUs is probably the majority of the server side problems. Since there is such a discrepancy between a high DP card and a low DP card, doubling the runtime on a tahiti with its 1/4 DP would increase a 1/32nd DP card to 16x? A 750ti looks to run 1 Wu in about 100 sec, and a 1080 in about 25 sec. If the tahiti time doubled I think that would mean 1600 seconds for a 750ti and 400 seconds for a 1080.

If you double the length of the WU, the runtime doubles for all cards.

i.e. right now all cards are "running a 100 yard dash". If you made it a 200 yard dash, all the cards' times would go up by a factor of 2.
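To put numbers on the "200 yard dash" point, here is a minimal sketch using the runtimes quoted above (about 100 sec for a 750 Ti, about 25 sec for a 1080). It assumes runtime scales linearly with workunit length, which is what the reply claims; the function name is made up for illustration.

```python
# Approximate per-workunit runtimes in seconds, as quoted in the post above.
current_runtime_s = {"750 Ti": 100, "GTX 1080": 25}


def scaled_runtimes(runtimes, length_factor):
    """Scale every card's runtime by the same workunit length factor.

    Assumption: runtime grows linearly with workunit length.
    """
    return {card: seconds * length_factor for card, seconds in runtimes.items()}


doubled = scaled_runtimes(current_runtime_s, 2)
for card, seconds in doubled.items():
    print(f"{card}: {current_runtime_s[card]} s -> {seconds} s")
```

Under that assumption the 750 Ti goes to about 200 sec and the 1080 to about 50 sec, so the gap between the two cards stays at 4x rather than growing.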

Well, I'm wondering if it would be possible to hand out longer and shorter workunits to different video cards according to their processing capability (in gflops) and manufacturer, so that cards that fall within certain performance metrics get adequately sized workunits.

I mean, I have a mismatched pair of video cards in my main rig that struggle to get above 150 Gflops double precision, and they each chew through a pair of units in about 3-4 minutes tops. Why the hell am I getting the same sized units as those getting sent out to machines equipped with R9 280x or other TFLOP-class DP cards, cards that as I understand it have to run 8 or more concurrent workunits to maximize utilization due to how fast they burn through them?
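As a rough illustration of the idea, here is a sketch of picking a workunit size multiplier from a card's double precision throughput. Nothing here reflects how the MilkyWay@home scheduler actually works; the 150 GFLOPS baseline is simply the figure from my own cards, and the function name and rounding rule are assumptions.

```python
# Hypothetical baseline: a card around this DP throughput gets today's unit size.
BASELINE_DP_GFLOPS = 150


def workunit_size_factor(dp_gflops, baseline=BASELINE_DP_GFLOPS):
    """Return how many times larger than the current unit a card's work should be.

    Slower cards never drop below the current size (factor 1).
    """
    return max(1, int(dp_gflops // baseline))


for name, gflops in [("my cards", 150), ("TFLOP-class DP card", 1000), ("low DP card", 40)]:
    print(f"{name}: {gflops} GFLOPS DP -> {workunit_size_factor(gflops)}x unit size")
```

With these numbers a 1000 GFLOPS card would get units 6 times the current size, while anything at or below the baseline keeps the current size.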
8) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 65070)
Posted 24 Aug 2016 by Captiosus
Well, the unit hasn't locked up, but I now have an Nbody WU that is suffering a memory leak as it progresses. Idle for this rig is about 1GB. In the past hour and a half, this unit has pushed the computer from 1GB memory usage to 1.9GB, and the climb started in earnest about 10 minutes in.
The memory increase rate looks to be about 100MB every 10 minutes or so, and growth is linear after the first 10 minutes.

WU that is leaking is de_nbody_8_1_16_v162_2k_3_1471352127_585064_0, although shorter workunits also show the same memory behavior until they end (very slow growth for the first 10 minutes, then 100MB every following 10 minutes).
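As a sanity check on those figures, here is a small sketch of the growth pattern described above: roughly flat for the first 10 minutes, then about 100MB per 10 minutes. It treats the slow early growth as zero, which is a simplification, and the names are made up for illustration.

```python
IDLE_MB = 1000            # idle memory usage for this rig, about 1GB
FLAT_PERIOD_MIN = 10      # growth is very slow for the first 10 minutes
LEAK_MB_PER_MIN = 100 / 10  # about 100MB every 10 minutes afterwards


def estimated_memory_mb(minutes_running):
    """Estimate total memory in use after the unit has run for some minutes."""
    leaking_minutes = max(0, minutes_running - FLAT_PERIOD_MIN)
    return IDLE_MB + leaking_minutes * LEAK_MB_PER_MIN


print(estimated_memory_mb(90))  # an hour and a half in
```

This gives about 1.8GB at the 90 minute mark, which is in the same ballpark as the 1.9GB I observed once the slow early growth is added back in.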
9) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 65069)
Posted 24 Aug 2016 by Captiosus
Adding in: I just had an Nbody unit lock up on my secondary rig. Made it to 1.516% complete then froze.

WU is de_nbody_8_1_16_v162_2k_3_147135217_585063_0

Not entirely sure if this particular one froze because of the secondary rig being under the effects of a fairly significant overclock (+1.1GHz on a C2Q), so I will back it down to 3.6GHz (which I know is stable on this rig) just to be sure.
10) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 65017)
Posted 11 Aug 2016 by Captiosus
Kewl, I'll be looking forward to the response, if any.
11) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 65001)
Posted 8 Aug 2016 by Captiosus
Would like to add (since it won't let me edit):

I've also been getting a number of errors in nbody, both for MT and ST tasks, all for v1.62. 2 MT units (the worse of the two highlighted above), and 5 (and counting) ST Nbody units have all failed with a computation error. Checking the log shows that all of the failed units have experienced an "Exceeded disk limit xx.xx MB > 50MB". I have 4GB of disk space set aside for BOINC units, so there's no way that can be it.
12) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 64998)
Posted 7 Aug 2016 by Captiosus
So, I thought I'd cook a few units last night, and left my CPU and GPUs to finish running the batch I pulled from the servers. For the most part, everything worked out fine (some of the CPU MT units went by really fast), and I left it running overnight.

I wake up this morning and notice that I've still got CPU activity. I check BOINC and I see one of the SMT units has frozen at 91.019%, and has been in this state for the past 11 hours or so.
There was also a fairly considerable memory leak in progress, and when I aborted the unit I ended up freeing up about 3GB of memory even though Task Manager, Resource Monitor, and Process Explorer all showed it was using about 800MB.

The WU in question is de_nbody_8_1_16_v162_2k_3_1470395169_81723_4
13) Message boards : News : Nbody Release 1.54 (Message 64178)
Posted 15 Dec 2015 by Captiosus
Please read in the other topic for announcement of n-body 1.52 what I wrote (one month ago and now again), nobody ever answered me but they are not working on my Mac : running on 1 core out of 8, blocking the other boinc apps to run (pretending it is "mt"), and in 1.52 they would not even complete normally (AND they started running again now when my setup does not allow nbody)...


EDIT : there is a change now, after 15mn of running they actually start to run on all 8 cores and really using 100% of my CPU (instead of 1/8th).

EDIT2 : they don't completely run 100%, it's more like 90% of available CPU, letting a 10% idle.

EDIT3 : OK so the task did finish on estimated time and was sent successfully back to the servers, so things look better with this 1.54, even though not perfect (running 15mn with 1/8 of CPU and not running a full 100% for the remaining time).

It's because, as Sidd said, the initialization period cannot be made multi-threaded. It screws up the math needed to set the initial values of each body in the model, and when it gets turned in along with other workunits from the same batch, the results correlate poorly due to the bad math.

My suggestion in response to that was to use the idle initialization period to prime a batch of workunits for compute, then once enough are primed and ready to go, process them one by one.
Alternatively, interleave it by setting 87.5% of the available threads for MT (on my CPU, it'd be 14 threads of 16 total, on yours it'd be 7 of 8 total), and use the remaining thread(s) to initialize the next unit, so when an MT workunit is done, another one can immediately start processing.
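The interleaved split works out as follows. This is only a sketch of the arithmetic; the 7/8 share comes from the 87.5% figure above, and the function name is made up.

```python
def split_threads(total_threads):
    """Split threads into (MT compute threads, initialization threads).

    Gives 7/8 (87.5%) of the threads to MT compute, rounded down,
    and always keeps at least one thread for initializing the next unit.
    """
    mt_threads = total_threads * 7 // 8
    init_threads = max(1, total_threads - mt_threads)
    return total_threads - init_threads, init_threads


print(split_threads(16))  # my CPU
print(split_threads(8))   # the 8-core Mac from the quoted post
```

For 16 threads this yields 14 for MT and 2 for initialization; for 8 threads, 7 and 1.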
14) Message boards : News : New Release- Nbody version 1.52 (Message 63948)
Posted 22 Sep 2015 by Captiosus
I've got one that seems to be causing problems. This one has been running for 78 hours and it's been at 100% since about 9 hours and it never did save a checkpoint. The original estimate was for about 1.5 hours. BOINC indicates it is using 8 cpus, but the activity monitor shows it is only using 1.3% cpu with 2 threads. I've seen some with the previous version of N-body work like this, so I'll probably watch it for a little while longer to see if it finishes.

I'm running a Mac Mini with OS 10.10.5

The work unit is de_nbody_9_09_15_orphan_sim_0_1437561602_56033_0

This app is supposed to use X cores, but it seems that, regardless of the CPU type, it uses only up to 25% of the CPU during 50% of the WU's length, and then up to 100% for the remaining 50% of the time.

I think the long period of low CPU utilization (1-2 cores at most) is the initialization period, the setting up of the work so that computation can actually proceed. The problem with it is that it is one of those serial tasks that can't easily be split up into multiple threads for processing. I get the same thing as well: a long batch of single threaded work (3-5 min) running on a single core, and then a short burst using all of the set cores to do the actual compute.

What I was thinking about suggesting was the splitting of the initialization period and compute period into 2 distinct tasks. Uninitialized work is sent out in batches, and the units get prepped for computing in groups (with my Xeon it'd be 15 tasks at once getting initialized). Once initialized, the workunit is passed through an SHA hash function, and the resulting hash is sent to the MW@H servers for comparison. If the results from (n) number of clients match, the workunits on those computers are flagged as ready for compute and will be processed at the next task switch. Once the MT tasks are complete, they are sent in for standard end of work processing and credit is awarded.
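A minimal sketch of that hash comparison step, assuming SHA-256 (the post only says "SHA") and assuming the initialized state can be serialized to bytes. The client names, the sample state, and both function names are hypothetical; this is not how the MW@H servers validate anything today.

```python
import hashlib
from collections import Counter


def state_digest(initialized_state: bytes) -> str:
    """Hash the serialized initialized workunit (SHA-256 assumed)."""
    return hashlib.sha256(initialized_state).hexdigest()


def clients_ready(digests_by_client: dict, quorum: int) -> list:
    """Return the clients whose digest matches the most common digest.

    Returns an empty list when fewer than `quorum` clients agree.
    """
    if not digests_by_client:
        return []
    winner, count = Counter(digests_by_client.values()).most_common(1)[0]
    if count < quorum:
        return []
    return sorted(c for c, d in digests_by_client.items() if d == winner)


# Made-up example: two clients agree, one produced a different initial state.
reports = {
    "client_a": state_digest(b"bodies: 1.0 2.0 3.0"),
    "client_b": state_digest(b"bodies: 1.0 2.0 3.0"),
    "client_c": state_digest(b"bodies: 1.0 2.0 3.1"),
}
print(clients_ready(reports, quorum=2))
```

One caveat with this approach: a hash only matches when the initialized data is identical bit for bit, so any tiny floating point difference between machines would fail the comparison even if the results are close enough scientifically.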

Alternatively, work that is initialized is sent to the MW@H servers (not just the hash) for comparison to ensure it is initialized properly. A small block of credit is awarded if it is, and then the initialized work is sent back out to begin the actual compute process like any other work unit. Once complete, it's sent back in and normal end of work processing is done.

A third alternative would be to have the MW@H program take uninitialized workunits, initialize them in batches, checkpoint them at the end of the initialization period, then once a number are ready, switch to MT mode and rip through them before sending the work in.

Now, I am aware this would increase overhead for the project by a not insignificant amount, but in the end the whole idea would be to minimize idle time on the clients (the major issue at the moment) so more work can be done. Any ideas to improve it would be nice.
15) Message boards : News : New Release- Nbody version 1.52 (Message 63938)
Posted 18 Sep 2015 by Captiosus

That would be an awesome thing to get working. We have been tossing ideas around about how to do it. But it is still, unfortunately, a work in progress.

Also, right now we want to focus on making sure the application can actually return results before tinkering with the code again. But, again, we have that on our to do list!


Oh goodie. Any ideas on doing that that seem viable?
16) Message boards : News : New Release- Nbody version 1.52 (Message 63934)
Posted 17 Sep 2015 by Captiosus
It works! Awesome! I just re-enabled it and my CPU is chewing through MT tasks like they're candy.

I would like to ask though: Is there anything that can be done to further optimize CPU usage so there aren't large periods of low (single thread) CPU utilization?

As it stands right now, on my CPU with the MT tasks, there's about a minute of single thread activity (which as I understand it is the initialization period that cannot be multi-threaded), and once the initialization period is complete there's a quick (30sec) burst where the task uses all of the designated threads (in my case, 15) and completes itself.
Is there any way that this could be altered without breaking NBody again?
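To show how much of the CPU that pattern leaves idle, here is a quick sketch of the arithmetic using the figures above (about 60 sec on one thread, then about 30 sec on 15 threads). It assumes the burst phase keeps all 15 threads fully busy, which is probably optimistic.

```python
def average_utilization(init_seconds, compute_seconds, threads):
    """Fraction of the reserved thread-time that does useful work.

    Assumes initialization uses exactly one thread and the compute
    phase uses every reserved thread at 100%.
    """
    used = init_seconds * 1 + compute_seconds * threads
    available = (init_seconds + compute_seconds) * threads
    return used / available


utilization = average_utilization(init_seconds=60, compute_seconds=30, threads=15)
print(f"{utilization:.1%} of the reserved thread-time is used")
```

Under those assumptions only about 38% of the thread-time reserved for the task is actually used, with the rest lost while the other 14 threads wait for initialization.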
17) Message boards : News : Nbody Status Update (Message 63863)
Posted 10 Aug 2015 by Captiosus
Groovy. Looking forward to it.
18) Message boards : News : Nbody Status Update (Message 63861)
Posted 9 Aug 2015 by Captiosus
Hey Sidd, can we get a possible ETA on the new N-body being ready?
19) Message boards : News : New Nbody Version 1.50 (Message 63582)
Posted 15 May 2015 by Captiosus
Hmm, seems the problem with Milkyway Nbody still remains even though it's been updated. M0CZY says it works in Linux. That begs the question: if Nbody won't run right in Windows, why not do it in a Linux VM like some other projects do?

And for me both single thread and MT Nbody exhibit the stalling bug.
20) Message boards : News : New Nbody version 1.48 (Message 63166)
Posted 20 Feb 2015 by Captiosus
Interesting thoughts. I wonder if Sidd can enlighten us where the parallelisation 'sweet spot' is, or if they'd like us to try and find it. Considering the MT phase only, I'd imagine that there comes a point where the overhead of managing and synchronising multiple threads exceeds the benefit - but I wouldn't know whether the tipping point is above or below 15 threads.

I'm currently verifying that the Application configuration tools available in BOINC v7.4.36 allow thread control - initially, limiting the active thread count to 3, so that other projects can continue to make progress on one core while nbody runs.

The next test is to run a bundle of tasks to the first checkpoint with an app_config thread limit of one, with the intention of changing app_config when they've all been prepped, and running the MT phase with a 3-thread app_config. Very labour intensive, and not amenable to scripted automation, but might be an interesting proof-of-concept. If it works, the project might consider splitting the app at the end of the initialisation phase.

a) Send out a single threaded task to perform initialisation
b) Return the initialisation data generated as an output file
c) Send out the initialisation data as an input file to a new, multithreaded, simulation task.

I think with the new version the parallelization of the work tops off at about 10 active threads. Once my CPU gets going, it only uses about 2/3rds of the available core count.

As for your tests, that is precisely what I was thinking of. Here's what I cooked up for an earlier post but cut out:

What I would like to know is if the initialization and actual computation of a run can be split for more effective use of available resources. Instead of doing it like this:
Download work units
Initialize one on one thread (blocking all other cores/threads from useful work)
Process the workunit
Send in completed work

I propose the work be done like this (2 methods):
Batch mode
1. Download workunits
2. Initialize a number of workunits simultaneously. Each unit that is initialized has its state saved to await open resources, and joins a work queue in the order its prep completes. This way the execution resources aren't sitting there doing nothing.
3A. When there are sufficient workunits ready, the app switches to multi-threaded mode, grabs an initialized unit, and begins processing.
3B. When complete, the unit is turned in and another initialized unit is pulled from the client side work queue for processing.
4. When the initialized work queue is exhausted, it switches back to initialization mode and preps another batch of work.
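The batch mode steps above can be sketched roughly like this. It is a toy model, not N-body code: `initialize` and `process` are hypothetical stand-ins for the single threaded setup and the MT compute, and the queue keeps download order rather than prep completion order to keep the example deterministic.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def initialize(unit):
    """Stand-in for the single threaded initialization of one workunit."""
    return {"unit": unit, "state": f"initialized-{unit}"}


def process(ready_unit):
    """Stand-in for the multi-threaded compute phase of one workunit."""
    return f"result-{ready_unit['unit']}"


def run_batch(units, init_workers):
    # Step 2: initialize several units at once, one per worker thread.
    # pool.map returns results in input order (a simplification of
    # "in the order its prep completes").
    with ThreadPoolExecutor(max_workers=init_workers) as pool:
        ready_queue = deque(pool.map(initialize, units))

    # Steps 3A/3B: pull initialized units off the queue one at a time.
    results = []
    while ready_queue:
        results.append(process(ready_queue.popleft()))
    # Step 4 would loop back here and prep the next batch.
    return results


print(run_batch(["wu1", "wu2", "wu3"], init_workers=3))
```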

Stream mode
1. Download workunits
2. Initialize a workunit, then begin processing the moment it is done, using the specified thread count like it's currently done, but minus 1 thread.
3. While the ready unit is getting chewed through, the open thread is used to prep another unit for processing. Units that are ready before resources open up have their states saved to disk. The open thread is then used to ready another unit.
4. As each unit completes, a ready unit is slotted in and begins processing. This keeps going until the work dries up (either from nothing coming from the project, or the user has set no new work for the project).
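Stream mode can be sketched in the same toy style. Again, `initialize` and `process` are hypothetical stand-ins, and this version only preps one unit ahead instead of saving multiple ready units to disk as described in step 3.

```python
from concurrent.futures import ThreadPoolExecutor


def initialize(unit):
    """Stand-in for the single threaded initialization of one workunit."""
    return f"initialized-{unit}"


def process(state, threads):
    """Stand-in for the MT compute phase running on `threads` threads."""
    return f"{state} done on {threads} threads"


def run_stream(units, total_threads):
    compute_threads = total_threads - 1  # one thread is reserved for prep
    results = []
    if not units:
        return results
    with ThreadPoolExecutor(max_workers=1) as init_pool:
        pending = init_pool.submit(initialize, units[0])
        for next_unit in list(units[1:]) + [None]:
            state = pending.result()  # wait for the prepped unit
            if next_unit is not None:
                # Prep the next unit in the background while this one computes.
                pending = init_pool.submit(initialize, next_unit)
            results.append(process(state, compute_threads))
    return results


for line in run_stream(["wu1", "wu2"], total_threads=15):
    print(line)
```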

In either case, 2 programs are necessary. One to initialize, and one to actually process.

Now, I'll admit I know very little about programming and thus don't know the viability of switching back and forth between prepping a batch of workunits and chewing through them one at a time when they're ready.

A variation of this would have it going from Batch to stream mode as workflow gets moving, and instead of one monolithic MT unit, it does 2 or 3 at once depending on thread allocation count. On my rig for example, 2 blocks of 7 threads would run MT, with the 15th thread initializing units.

Batch mode would be suitable for systems that run under the effective thread count (11), while batch to stream mode would be better suited for machines with high core counts. Machines with extremely high thread counts (>=24 threads) would dedicate more than one thread for maintaining the work queue (1 init thread per every work block). So a crazy person running an i4P loaded with 16c/32t Xeons (120T) could end up with 14 8-thread work blocks, with the remaining 8 threads feeding the beast so to speak. Tuning that for optimal workflow would take some time though.
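The thread allocation in those two examples can be checked with a few lines. This is just the arithmetic, with the block size and the number of init threads passed in by hand; the function name is made up.

```python
def plan_blocks(total_threads, block_size, init_threads):
    """Return (number of MT work blocks, leftover threads).

    `init_threads` are set aside to maintain the work queue; the rest
    are divided into MT work blocks of `block_size` threads each.
    """
    compute_threads = total_threads - init_threads
    blocks = compute_threads // block_size
    leftover = compute_threads - blocks * block_size
    return blocks, leftover


print(plan_blocks(15, block_size=7, init_threads=1))   # my rig
print(plan_blocks(120, block_size=8, init_threads=8))  # the 120-thread example
```

Both cases divide evenly: 2 blocks of 7 on my rig with 1 init thread, and 14 blocks of 8 on a 120-thread machine with 8 threads feeding the queue.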

I wonder if BOINC allows running apps to have their own daughter processes.


©2022 Astroinformatics Group