
NBody 1.04 Beta Run; Interesting Observations

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56878 - Posted: 14 Jan 2013, 11:56:17 UTC

As most are aware, we are currently in a beta test run for the new nBody application. One of the parameters of the test is that the project team is using 'real' science work for the run (as opposed to a 'busy work' set of 'known' tasks created purely for testing purposes).

Another of the parameters is that the majority of hosts are running a relatively recent version of Windows (Vista and higher) along with a 7.x version of BOINC in a more or less default configuration with respect to how project tasks are processed. This results in most hosts running nBody in single-threaded (ST) mode, as well as periodically running more than one nBody task at a time as the host works through its WU cache. That alone is different from what we saw during the last production run for nBody.

However, with certain other combinations of Windows and BOINC versions, and/or specifically user-configured BOINC installations, it is possible to run the app in multi-threaded (MT) mode.
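
For anyone who wants to try this deliberately rather than by accident, recent 7.x clients let you steer it with an app_config.xml in the project directory. This is only a sketch: the app name, plan class, and numbers below are my assumptions, so check your client_state.xml for the exact names, and note that the app_version section needs a newer client than plain max_concurrent does.

<app_config>
  <app>
    <name>milkyway_nbody</name>
    <!-- cap how many nBody tasks run at once -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <!-- CPUs BOINC budgets for each MT task -->
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>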

One of the interesting observations I have had so far during the run is that if your host is running XP (or at least XP Pro x64) and BOINC 6.x (6.12.34, in my case), the host will run nBody in what I have chosen to call 'unregulated' MT mode. What this means is that the host will run the app MT, and will start and run more than one nBody task at a time if that is what regular BOINC task scheduling says needs to happen.

This fact led to the other interesting observation I just had.

Since we are using 'real' science as the test target work, the question arises: just how useful is MT for the various types of nBody workunits we see going through the host?

Is it uniformly useful for all nBody work, or does it help certain types more than others?

As it turns out, I was observing one of my hosts which happened to be running two nBody tasks concurrently. I noticed that one of them was an nBody_105 type and the other was an nBody_orphan_real one.

Now for the interesting part. While both of them were running, I observed that the 105 was getting ~50% of the total CPU, the orphan_real was getting ~25%, and the rest was split more or less evenly between the other regular ST tasks running at the time. Not really unexpected, except for the question of why one nBody task would be drawing significantly more CPU than another.

So as a side experiment, I diddled the host a bit in order to get the 105 to go idle while letting the orphan_real continue processing. The result was that the orphan_real began using 30-40% CPU on average, with the remainder split roughly evenly between the other ST tasks running.

So I leave consideration of this (and the implications of it) as a thought exercise for the reader. ;-)

Al
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56880 - Posted: 14 Jan 2013, 13:14:06 UTC
Last modified: 14 Jan 2013, 13:15:09 UTC

Update:

I was double-checking my observations and discovered I had mistakenly switched which main thread was which when viewing nBody with Process Explorer (which, unfortunately, doesn't give you any clearly obvious information about the actual WU each app instance is running).

So basically, just switch 105 for orphan_real and vice versa in my first post.

The actual numbers are a little different when you look at all three cases (105 only, orphan only, and both together), but the gist of the matter is the same regardless.

Sorry about that, my bad! ;-)
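
For anyone else trying to keep multiple instances straight, here is a minimal sketch of the check I should have done. It assumes Python with the psutil package installed and that the executable name contains 'milkyway_nbody'; adjust the match string to whatever the binary is actually called on your host. The slot directory it prints is where BOINC keeps that task's files, so boinc_task_state.xml or stderr.txt in that slot tells you which WU it is.

# Map each running nBody process to its BOINC slot directory and
# sample its CPU share over a short window. Run with enough
# privileges to query the BOINC processes.
import time
import psutil

def nbody_processes():
    for p in psutil.process_iter():
        try:
            if "milkyway_nbody" in p.name().lower():
                yield p
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

procs = list(nbody_processes())
for p in procs:
    p.cpu_percent(None)      # prime the per-process CPU counters
time.sleep(5)                # sample over a 5-second window
for p in procs:
    # cpu_percent is relative to one core, so an MT task can exceed 100
    print(p.pid, p.cwd(), "%.1f%% CPU" % p.cpu_percent(None))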
Richard Haselgrove

Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56882 - Posted: 14 Jan 2013, 13:32:24 UTC

I've been getting several orphan_real too, mostly sub-type orphan_real_CHISQ.

Most have ended in seconds, but the current one (ps_nbody_orphan_real_CHISQ_1356215205_306270_0) is on course to be a record-breaker: 10 hours elapsed, 25 hours CPU so far, and still less than 30% done. BOINC is only estimating 2 hours remaining, which makes a change.

I'm still running three cores, so the ratio of CPU to elapsed time suggests a degree of inefficiency in the multithreading. When I looked at it with Process Explorer, the thread CPU usages seemed to be bouncing around in the 9% - 10% - 11% range, not up to the 12.5% they should theoretically be.

I also saw one thread drop down to 1% CPU momentarily, which reminds me that multithreaded apps usually have to pause periodically for synchronisation if the thread tasklets finish at different times. That cuts down the overall efficiency gains, the more so the more cores/threads you run.
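
To put a very rough number on that, here is a back-of-the-envelope sketch using the figures above. Which baseline is the fair one is debatable: the three cores I'm letting BOINC use, or the eight worker threads implied by the 12.5% theoretical per-thread figure.

# Rough parallel-efficiency estimate from BOINC's reported times
elapsed_hours = 10.0   # wall-clock time so far (from the task above)
cpu_hours = 25.0       # total CPU time so far
avg_busy_cores = cpu_hours / elapsed_hours   # ~2.5 cores kept busy on average

for cores in (3, 8):
    print("vs %d cores: %.0f%% efficient" % (cores, 100 * avg_busy_cores / cores))
# vs 3 cores: 83% efficient
# vs 8 cores: 31% efficient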
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56883 - Posted: 14 Jan 2013, 13:39:31 UTC - in response to Message 56882.  
Last modified: 14 Jan 2013, 13:45:38 UTC

Interesting.

Mine is a subtype 112_2013, and it's on pace to be a real chewy one too. I'm looking at a little over 92 hours total at this point.

Although, after I double-checked as mentioned, it seems to get more bang for the buck from MT than the 105 (which finished a little while ago) did in my case.
Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56978 - Posted: 21 Jan 2013, 17:53:08 UTC
Last modified: 21 Jan 2013, 17:54:02 UTC

For those who are interested and still running nBody during the Beta testing:

It is possible to complete the long-running 100K type tasks, especially if you have been paying attention in the other threads regarding the specific problems and the workarounds for them.

Here's an example of a very chewy one which has made it to completion successfully twice.

To the Project Team: you had better keep a closer eye on what's going on with validation for some of these tasks. I know I won't be too happy if tasks like this go down the toilet with no credit just because you didn't think all the test run parameters through all the way (no offense intended, even if that does sound a little harsh). ;-)
archeye

Joined: 19 Aug 12
Posts: 4
Credit: 3,679,178
RAC: 0
Message 57259 - Posted: 16 Feb 2013, 8:50:50 UTC - in response to Message 56978.  
Last modified: 16 Feb 2013, 9:06:13 UTC

Hi,

I was not aware I was running this until I checked up on a Computing Error status.

The error log for me reported "Maximum disk usage exceeded" at 35K secs runtime.

I have now de-selected this application; I think it's too big for my Mac :)

To help with this beta run, here is the task in question (I originally attached a screenshot of it), which also shows another WU with a Computing Error status, "Maximum disk usage exceeded", at 2M+ secs runtime:

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=291588876
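
For anyone who wants to see how much disk a running task is actually using before it trips the limit, here is a minimal sketch. The path below is the default BOINC data directory on OS X and is only an assumption for other setups; on Windows the slots directory normally lives under C:\ProgramData\BOINC.

# Report how much disk each active BOINC slot directory is using
import os

SLOTS = "/Library/Application Support/BOINC Data/slots"

for slot in sorted(os.listdir(SLOTS)):
    slot_dir = os.path.join(SLOTS, slot)
    if not os.path.isdir(slot_dir):
        continue
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    print("slot %s: %.1f MB" % (slot, total / 1e6))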

Something to consider: maybe you could do an upload at preset percent-complete points during the WU to reduce the local disk load. This is the approach adopted by the ClimatePrediction project, where they do an upload at each 25% point.

Alternatively, you could allow the WU to auto-suspend with a suitable message while waiting for user input to either abort it or increase the disk allocation.

Regards,