Welcome to MilkyWay@home

Posts by Captiosus

21) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 65017)
Posted 11 Aug 2016 by Captiosus
Post:
Kewl, I'll be looking forward to the response, if any.
22) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 65001)
Posted 8 Aug 2016 by Captiosus
Post:
I'd like to add (since it won't let me edit):

I've also been getting a number of errors in Nbody, both for MT and ST tasks, all for v1.62. Two MT units (the worse of the two is highlighted above) and 5 (and counting) ST Nbody units have all failed with a computation error. Checking the log shows that all of the failed units hit an "Exceeded disk limit xx.xx MB > 50MB" error. I have 4 GB of disk space set aside for BOINC units, so there's no way actual disk usage can be the cause.
23) Message boards : Number crunching : MT Nbody 1.62 workunit locked up (Message 64998)
Posted 7 Aug 2016 by Captiosus
Post:
So, I thought I'd cook a few units last night and left my CPU and GPUs to finish running the batch I pulled from the servers. For the most part everything worked out fine (some of the CPU MT units went by really fast), and I left it running overnight.

I wake up this morning and notice that I've still got CPU activity. I check BOINC and see one of the MT units has frozen at 91.019%, and has been in this state for the past 11 hours or so.
There was also a fairly considerable memory leak in progress: when I aborted the unit I ended up freeing about 3 GB of memory, even though Task Manager, Resource Monitor, and Process Explorer all showed it using about 800 MB.

The WU in question is de_nbody_8_1_16_v162_2k_3_1470395169_81723_4
24) Message boards : News : Nbody Release 1.54 (Message 64178)
Posted 15 Dec 2015 by Captiosus
Post:
Please see what I wrote (one month ago, and now again) in the other topic announcing n-body 1.52. Nobody ever answered me, and the tasks are still not working on my Mac: they run on 1 core out of 8 while blocking the other BOINC apps from running (despite claiming to be "mt"), and under 1.52 they would not even complete normally (AND they have started running again now, even though my setup does not allow Nbody)...

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3819&postid=64164


EDIT: there is a change now; after 15 min of running they actually start to run on all 8 cores, really using 100% of my CPU (instead of 1/8th).


EDIT2: they don't run at a full 100%; it's more like 90% of available CPU, leaving 10% idle.


EDIT3: OK, so the task did finish on the estimated time and was sent back to the servers successfully, so things look better with 1.54, even if not perfect (15 min running on 1/8 of the CPU, and not a full 100% for the remaining time).

It's because, as Sidd said, the initialization period cannot be made multi-threaded. Doing so breaks the math needed to set the initial values of each body in the model, and when such a unit gets turned in along with other workunits from the same batch, the results correlate poorly due to the bad math.

My suggestion in response to that was to use the idle initialization period to prime a batch of workunits for compute, then once enough are primed and ready to go, process them one by one.
Alternatively, interleave it by dedicating 87.5% of the available threads to MT (on my CPU that'd be 14 threads of 16; on yours, 7 of 8) and using the remaining thread(s) to initialize the next unit, so that when an MT workunit finishes, another can immediately start processing.
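The interleaving idea can be sketched in toy form (hypothetical Python with made-up names; the real nbody app is native code and works differently): one reserved thread keeps priming units into a queue while the remaining threads would chew through whatever is already primed.

```python
import queue
import threading

TOTAL_THREADS = 8   # e.g. an 8-thread CPU (assumption for the sketch)
INIT_THREADS = 1    # reserved initializer thread(s)
MT_THREADS = TOTAL_THREADS - INIT_THREADS  # threads left for the MT phase

ready = queue.Queue()  # primed workunits waiting for compute

def initialize(wu):
    # Stand-in for the serial, single-threaded initialization phase.
    return {"id": wu, "bodies": [wu * 10 + i for i in range(3)]}

def init_worker(workunits):
    # The reserved thread primes upcoming units one after another.
    for wu in workunits:
        ready.put(initialize(wu))
    ready.put(None)  # sentinel: nothing left to prime

def compute_mt(primed):
    # Stand-in for the compute phase that would use MT_THREADS threads.
    return sum(primed["bodies"])

primer = threading.Thread(target=init_worker, args=(range(4),))
primer.start()

results = []
while True:
    unit = ready.get()  # the next primed unit is available as soon as MT frees up
    if unit is None:
        break
    results.append(compute_mt(unit))
primer.join()
print(results)  # [3, 33, 63, 93]
```

The point of the queue is simply that initialization of unit N+1 overlaps with the MT compute of unit N, so the compute threads never wait on the serial phase.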
25) Message boards : News : New Release- Nbody version 1.52 (Message 63948)
Posted 22 Sep 2015 by Captiosus
Post:
I've got one that seems to be causing problems. It has been running for 78 hours, has been stuck at 100% for about 9 hours, and never saved a checkpoint. The original estimate was about 1.5 hours. BOINC indicates it is using 8 CPUs, but Activity Monitor shows it is only using 1.3% CPU with 2 threads. I've seen some tasks behave like this with the previous version of N-body, so I'll probably watch it a little longer to see if it finishes.

I'm running a Mac Mini with OS 10.10.5

The work unit is de_nbody_9_09_15_orphan_sim_0_1437561602_56033_0

This app is supposed to use X cores, but it seems that, regardless of the CPU's type, it uses only up to 25% of the CPU for 50% of the WU's length, and then up to 100% for the remaining 50% of the time.


I think the long period of low CPU utilization (1-2 cores at most) is the initialization period: the setting up of the work so that computation can actually proceed. The problem is that it's one of those serial tasks that can't easily be split into multiple threads. I get the same thing as well: a long stretch of single-threaded work (3-5 min) running on one core, and then a short burst using all of the set cores to do the actual compute.

What I was thinking of suggesting was splitting the initialization period and the compute period into 2 distinct tasks. Uninitialized work is sent out in batches and gets prepped for computing in groups (with my Xeon it'd be 15 tasks getting initialized at once). Once initialized, the workunit is passed through an SHA hash function, and the hash is sent to the MW@H servers for comparison. If the results from (n) clients match, the workunits on those computers are flagged as ready for compute and processed at the next task switch. Once the MT tasks are complete, they are sent in for standard end-of-work processing and credit is awarded.
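The hash-comparison step could look something like this (a minimal sketch with invented names and a toy state layout, not anything the project actually implements): serialize the initialized state deterministically, hash it, and let the server compare digests from (n) clients.

```python
import hashlib
import json

def state_digest(workunit_state):
    # Serialize the initialized body table deterministically, then hash it.
    # Clients would send this short digest (not the full state) to the server.
    canonical = json.dumps(workunit_state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two clients initializing the same unit with the same seed should agree:
client_a = {"seed": 42, "bodies": [[1.0, 2.0], [3.0, 4.0]]}
client_b = {"seed": 42, "bodies": [[1.0, 2.0], [3.0, 4.0]]}
print(state_digest(client_a) == state_digest(client_b))  # True

# A client whose initialization raced (same bodies, different order)
# would produce a different digest and be flagged:
client_c = {"seed": 42, "bodies": [[3.0, 4.0], [1.0, 2.0]]}
print(state_digest(client_a) == state_digest(client_c))  # False
```

The key requirement is the canonical serialization: if two correctly initialized clients can write the same state in different byte orders, the digests won't match even when the physics agrees.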

Alternatively, the initialized work itself (not just the hash) is sent to the MW@H servers for comparison to ensure it was initialized properly. A small block of credit is awarded if it was, and the initialized work is then sent back out to begin the actual compute process like any other workunit. Once complete, it's sent back in and normal end-of-work processing is done.

A third alternative would be to have the MW@H program take uninitialized workunits, initialize them in batches, checkpoint each at the end of its initialization period, and then, once a number are ready, switch to MT mode and rip through them before sending the work in.


Now, I am aware this would increase overhead for the project by a not-insignificant amount, but the whole idea is to minimize idle time on the clients (the major issue at the moment) so more work can get done. Any ideas to improve it would be welcome.
26) Message boards : News : New Release- Nbody version 1.52 (Message 63938)
Posted 18 Sep 2015 by Captiosus
Post:
Hey,


That would be an awesome thing to get working. We have been tossing ideas around about how to do it. But it is still, unfortunately, a work in progress.

Also, right now we want to focus on making sure the application can actually return results before tinkering with the code again. But, again, we have that on our to do list!

Cheers,
Sidd

Oh goodie. Any ideas on doing that that seem viable?
27) Message boards : News : New Release- Nbody version 1.52 (Message 63934)
Posted 17 Sep 2015 by Captiosus
Post:
It works! Awesome! I just re-enabled it and my CPU is chewing through MT tasks like they're candy.


I would like to ask, though: is there anything that can be done to further optimize CPU usage so there aren't long periods of low (single-threaded) CPU utilization?

As it stands right now with the MT tasks on my CPU, there's about a minute of single-threaded activity (which, as I understand it, is the initialization period that cannot be multi-threaded), and once initialization completes there's a quick (30 sec) burst where the task uses all of the designated threads (in my case, 15) and completes itself.
Is there any way this could be altered without breaking Nbody again?
28) Message boards : News : Nbody Status Update (Message 63863)
Posted 10 Aug 2015 by Captiosus
Post:
Groovy. Looking forward to it.
29) Message boards : News : Nbody Status Update (Message 63861)
Posted 9 Aug 2015 by Captiosus
Post:
Hey Sidd, can we get a possible ETA on the new N-body being ready?
30) Message boards : News : New Nbody Version 1.50 (Message 63582)
Posted 15 May 2015 by Captiosus
Post:
Hmm, it seems the problem with MilkyWay Nbody remains even though it's been updated. M0CZY says it works in Linux. That begs the question: if Nbody won't run right in Windows, why not run it in a Linux VM like some other projects do?

And for me, both single-threaded and MT Nbody exhibit the stalling bug.
31) Message boards : News : New Nbody version 1.48 (Message 63166)
Posted 20 Feb 2015 by Captiosus
Post:
Interesting thoughts. I wonder if Sidd can enlighten us where the parallelisation 'sweet spot' is, or if they'd like us to try and find it. Considering the MT phase only, I'd imagine that there comes a point where the overhead of managing and synchronising multiple threads exceeds the benefit - but I wouldn't know whether the tipping point is above or below 15 threads.

I'm currently verifying that the Application configuration tools available in BOINC v7.4.36 allow thread control - initially, limiting the active thread count to 3, so that other projects can continue to make progress on one core while nbody runs.

The next test is to run a bundle of tasks to the first checkpoint with an app_config thread limit of one, with the intention of changing app_config when they've all been prepped, and running the MT phase with a 3-thread app_config. Very labour intensive, and not amenable to scripted automation, but might be an interesting proof-of concept. If it works, the project might consider splitting the app at the end of the initialisation phase.

a) Send out a single threaded task to perform initialisation
b) Return the initialisation data generated as an output file
c) Send out the initialisation data as an input file to a new, multithreaded, simulation task.
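For anyone wanting to reproduce the thread-limiting part of this experiment: BOINC reads a per-project app_config.xml from the project's directory. A minimal sketch for a 3-thread limit follows; the app_name and plan_class values here are assumptions, so check the names your own client reports in client_state.xml before using them.

```xml
<!-- app_config.xml in the milkyway.cs.rpi.edu project directory.
     app_name and plan_class below are assumed; verify against
     client_state.xml on your machine. -->
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>3</avg_ncpus>
    <cmdline>--nthreads 3</cmdline>
  </app_version>
</app_config>
```

The client has to re-read its config files (or be restarted) before the change takes effect on newly started tasks.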

I think with the new version the parallelization of the work tops out at about 10 active threads. Once my CPU gets going, it only uses about two-thirds of the available core count.

As for your tests, that is precisely what I was thinking of. Here's what I cooked up for an earlier post but cut out:

What I would like to know is if the initialization and actual computation of a run can be split for more effective use of available resources. Instead of doing it like this:
Download work units
Initialize one on one thread (blocking all other cores/threads from useful work)
Process the workunit
Send in completed work
Repeat

I propose the work be done like this (2 methods):
Batch mode
1. Download workunits.
2. Initialize a number of workunits simultaneously. Each unit that finishes initializing has its state saved to await open resources, forming a work queue in the order prep completes. This way the execution resources aren't sitting there doing nothing.
3A. When there are sufficient workunits ready, the app switches to multi-threaded mode, grabs an initialized unit, and begins processing.
3B. When complete, the unit is turned in and another initialized unit is pulled from the client-side work queue for processing.
4. When the initialized work queue is exhausted, the app switches back to initialization mode and preps another batch of work.
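The batch-mode loop above, in toy form (hypothetical Python with stand-in functions, not the project's code): prime a batch, drain it in MT mode, repeat.

```python
def initialize(wu):
    # Stand-in for the serial initialization phase (step 2).
    return {"id": wu, "bodies": [wu + i for i in range(3)]}

def compute_mt(unit):
    # Stand-in for the multi-threaded compute phase (steps 3A/3B).
    return sum(unit["bodies"])

def run_batch_mode(workunits, batch_size=4):
    """Prime a batch of units, then switch to MT mode and drain the
    queue one unit at a time before priming the next batch (step 4)."""
    results = []
    pending = list(workunits)
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        primed = [initialize(wu) for wu in batch]  # init phase
        for unit in primed:                        # MT phase, one unit at a time
            results.append(compute_mt(unit))
    return results

print(run_batch_mode(range(6), batch_size=2))  # [3, 6, 9, 12, 15, 18]
```

The saved "primed" state is what would be checkpointed to disk in the real scheme, so a batch survives a client restart.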

Stream mode
1. Download workunits.
2. Initialize a workunit, then begin processing it the moment it is done, using the specified thread count (as is currently done) minus 1 thread.
3. While the ready unit is being chewed through, the open thread preps another unit for processing. Units that are ready before resources open up have their states saved to disk, and the open thread moves on to ready another unit.
4. As each unit completes, a ready unit is slotted in and begins processing. This keeps going until the work dries up (either nothing is coming from the project, or the user has set "no new work" for the project).

In either case, 2 programs are necessary: one to initialize, and one to actually process.

Now, I'll admit I know very little about programming, and thus don't know the viability of switching back and forth between prepping a batch of workunits and chewing through them one at a time as they become ready.

A variation of this would go from batch to stream mode as workflow gets moving, and instead of one monolithic MT unit, it would do 2 or 3 at once depending on the thread allocation count. On my rig, for example, 2 blocks of 7 threads would run MT, with the 15th thread initializing units.

Batch mode would be suitable for systems running under the effective thread count (11), while batch-to-stream mode would be better suited to machines with high core counts. Machines with extremely high thread counts (>=24 threads) would dedicate more than one thread to maintaining the work queue (1 init thread per work block). So a crazy person running a quad-socket board loaded with 16c/32t Xeons (120T) could end up with 14 8-thread work blocks, with the remaining 8 threads feeding the beast, so to speak. Tuning that for optimal workflow would take some time, though.

I wonder if BOINC allows running apps to have their own daughter processes.
32) Message boards : News : New Nbody version 1.48 (Message 63164)
Posted 19 Feb 2015 by Captiosus
Post:
Therefore, for now, we removed all multithreading of the initialization of the dwarf galaxy. This is why it will run only on one thread until the initialization completes. However, after performing a speed profile on the code, it was determined that a majority of the run time of the previous code was spent on a single function (thanks to Roland Judd for catching that!). We optimized this function, leading to a great decrease in the run time.

Cheers,
Sidd

To put some figures on that: I'm running on an i5 laptop (2 cores, 4 threads). The current task was estimated at 5 hours 26 mins, and its single-threaded initialisation phase lasted 18 minutes. (I think the initialisation lasted roughly as long for the previous task, estimated at 50 minutes, but I didn't have Process Explorer open to monitor it.) Would I be right in assuming that the initialisation should have a roughly constant duration for any given host, no matter how long the expected task duration?


I think the initialization period varies between workunits. IMO the big problem right now is that the initialization period blocks the other cores/threads from doing meaningful work. I have an 8c/16t CPU in my main rig (E5-2690; 2.9 GHz stock, 3.3 GHz all-core turbo; a very powerful chip), and I have BOINC set to use 15 of those 16 threads explicitly for CPU tasks, with the remaining thread set aside to feed the GPUs. What happens when BOINC runs Nbody 1.48 is that it allocates the 15 threads like it should, but then runs in single-threaded mode for about 10-30 minutes depending on the unit.

IMO the way MilkyWay runs needs further tweaking. I had a couple of ideas for helping prevent excessive idle time, but getting them implemented would require a lot of extra work on the part of the devs. If anyone's interested I'll post those ideas, though they may seem a bit outlandish since I know little about programming.
33) Message boards : News : New Nbody version 1.48 (Message 63155)
Posted 18 Feb 2015 by Captiosus
Post:
Andrew-PC and cowboy2199: check the version of the CPU clients you are running. If it reads version 1.46, that's the bugged version and those units will never finish. Abort those units, then force an update for MilkyWay@home to get v1.48.

Heads up, though: v1.48 has a lengthy initialization period during which it runs single-threaded but ties up all available cores.
34) Message boards : News : New Nbody version 1.48 (Message 63149)
Posted 17 Feb 2015 by Captiosus
Post:
Hey,

We had multithreaded the assignment of radii and velocities to bodies. Both were done through rejection sampling, using random numbers. However, when that code ran with multiple threads, the assignments of radii and velocities differed between runs even with the same random number seed and parameters. This was because the order in which the threads ran was indeterminate, meaning which body was assigned which radius and velocity was unpredictable. This was a very nasty bug, made nastier because it did not present itself: runs would complete normally. However, because of the indeterminate nature of the algorithm, a poorer likelihood was reported than would be expected for a set of parameters, even if they were close. Therefore, overall, it led to poor convergence.
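The scheduling hazard described here can be shown with a toy illustration (hypothetical Python, not the project's actual code): with a shared seeded RNG, the sequence of draws is reproducible, but which body receives which draw depends on the order the threads happen to run in.

```python
import random

def assign_radii(body_order, seed=1234):
    """Draw one 'radius' per body from a shared seeded RNG, in the order
    the bodies are visited. body_order stands in for the indeterminate
    order in which threads reached the RNG."""
    rng = random.Random(seed)
    radii = {}
    for body in body_order:
        radii[body] = rng.random()
    return radii

# Two runs with the same seed but different "thread scheduling":
run1 = assign_radii(["body0", "body1", "body2"])
run2 = assign_radii(["body2", "body0", "body1"])

print(run1 == run2)                                    # False: assignments differ
print(sorted(run1.values()) == sorted(run2.values()))  # True: same draws, shuffled
```

Both runs complete normally and use the same random numbers, which is exactly why the bug "did not present itself": only the body-to-value mapping, and hence the reported likelihood, silently changes.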

Therefore, for now, we removed all multithreading of the initialization of the dwarf galaxy. This is why it will run only on one thread until the initialization completes. However, after performing a speed profile on the code, it was determined that a majority of the run time of the previous code was spent on a single function (thanks to Roland Judd for catching that!). We optimized this function, leading to a great decrease in the run time.

Cheers,
Sidd

So, as I understand it: on startup, 1.46 was multi-threading the assignment of some of the starting parameters for the dwarf galaxy's constituent stars. However, the way the code worked, those starting parameters would differ between machines even with the same settings, because the thread ordering made the multi-threaded part effectively random. This in turn produced poor overall results even though the run itself completed.

Right/wrong?


And is there anything you can do to lower the amount of time resources sit idle? On my machine the initialization stage takes a good 5-10 minutes, during which only a single thread is in use, with the other 14 allocated threads sitting there twiddling their thumbs, unable to do anything else. When it does go to multi-threaded mode, it only uses about 10 threads, leaving 5 free to sit there doing nothing. That's about a third of my CPU idle.
35) Message boards : News : New Nbody version 1.48 (Message 63139)
Posted 14 Feb 2015 by Captiosus
Post:
Oh, I was wondering why it was only using one thread out of the 15 (of 16) I have allocated for CPU tasks. Cycling BOINC and setting it to use all cores and threads had no effect in getting the app to use everything.

Don't know if the "stalling at 100%" bug is in effect like it was with 1.46, but there's a whole new can of worms opened with this one.

The app is obviously v1.48, the workunit it's chewing through is ps_nbody_2_13_orphan_sim_1_1422013803_462792_0, and the BOINC client is 7.4.36 x64.

If there's any more information you need, let me know and when I wake up you'll get it to the best of my abilities.

Quick edit: about 8 minutes into running single-threaded, the program finally realized there's more than one thread available and started running on another 10 threads. Still not fully utilizing what's been made available, but it's better than nothing. I'll let it chew through units overnight and see how it goes.



©2024 Astroinformatics Group