N-Body 1.18

Author	Message
Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 58882 - Posted: 15 Jun 2013, 8:37:06 UTC - in response to Message 58880. Interesting WU, which crashed the 1.18 (mt) Win-x64 application on my very, very stable mobile Core i7 CPU "720QM": -1073741571 (0xffffffffc00000fd) Unknown error number. According to http://support.microsoft.com/kb/315937, 0xc00000fd is 'stack overflow'.
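For readers decoding the exit status quoted above: the negative decimal value and the 0xc00000fd code are the same 32-bit pattern, read once as a signed integer and once as an unsigned one. A minimal Python sketch of that conversion follows; the helper name `to_ntstatus` is made up for illustration and is not part of BOINC or the N-Body application.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status code.

    Hypothetical helper for illustration only.
    """
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns the
    # two's-complement negative value into its unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(to_ntstatus(-1073741571))  # prints 0xc00000fd
```

The longer form in the post, 0xffffffffc00000fd, is the same value sign-extended to 64 bits.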

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 58890 - Posted: 15 Jun 2013, 14:49:47 UTC Thank you for the information on the WU. We are researching the issues, and hopefully we will get a resolution up along with improved time estimates. Jeff

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 58966 - Posted: 18 Jun 2013, 23:08:16 UTC Bad 1.18 task here

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 58975 - Posted: 19 Jun 2013, 16:41:17 UTC We have a fix for the stack overflow in the code base and are testing it, and we are hoping to get it released. It also appears to be more efficient when calculating certain functions, which I am trying to take into account while changing the estimated-time code. I expect to be done with the testing early next week, barring any major bugs exposed by the change. Jeff

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 59249 - Posted: 5 Jul 2013, 20:31:35 UTC OK, now that the dust is settling fairly well from the rollouts of the new stuff, I have a couple of observations on some server-side refinements you could make in order to get MW back to being more multi-project friendly. 1.) Eliminate the 'default' plan_class application for nBody. I assume you aren't intending for it to run on a single-core CPU, so having it there is redundant and has an unintended consequence on Winboxes running 2k or XP. It turns out anything less than Vista (Longhorn code base) has very limited native support in the OS for multi-threaded applications. However, since you're using OpenMP and apparently have XP assigned to the default plan_class, the net result is you always end up having nBody running with other CPU tasks, or worse, running multiple nBody tasks. This isn't a great situation for a couple of reasons. The better of the two cases is when running with other CPU tasks. Since nBody can only grab 'slack' time from the other CPU tasks running, the MT speedup is small to none. The worse case is when you get two or more nBodies running at the same time. Not only is each app limited to any 'slack' time from the other CPU tasks running, they are also 'fighting' with each other since they want to use all the cores as much as they can (up to all the time if they could). Either way, the end result is that the elapsed time to run the task increases, so MW ends up 'looking' like it's hogging the machine at times when it gets to the point where BOINC has to get rid of the nBody tasks quickly due to scheduling/deadline issues. 2.) The other problem is that the default MT configuration for nBody is to use all the CPU cores available. The problem with that is that once one starts, GPUs are cut off for the duration.
Now if all a user runs is MW, that's not such a big deal, but most people aren't going to be happy about their GPUs going completely idle if they run other projects. I'm speculating here that this is the reason you reconfigured the MWS(MF) apps to use 0.9 CPUs, but the drawback to that is every instance will grab a CPU core, and thus that's one less which could be running a CPU task from another project. Unfortunately, the only solution for both these issues at the moment is to use the anonymous platform (read that as difficult for most folks) to limit the CPU usage for nBody, and thanks to a 'brilliant' design decision on BOINC's part, there is no way to do it the easy way (read that as with app_config). :-( Another possible solution would be to customize the scheduler at your end so the task gets sent to the host with the max CPUs and the nthreads command switch set to (<p_ncpus> - 1), or something along those lines. Although that would obviously add complexity at your end and would be something which would need pretty thorough testing before getting rolled out.
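The scheduler-side idea at the end of the post above, sending nBody with a thread count one less than the host's CPU count, comes down to a small calculation. The following is only a rough Python sketch of that arithmetic: `nbody_thread_count` and its `reserve` parameter are hypothetical names, and this is not BOINC scheduler code.

```python
def nbody_thread_count(p_ncpus: int, reserve: int = 1) -> int:
    """Threads to request for an MT task, leaving `reserve` cores free.

    Hypothetical helper: `p_ncpus` stands for the CPU count the host
    reports, and `reserve` is the number of cores kept back (for
    example, one core to feed a GPU).
    """
    # Never drop below one thread, so a single-core host still runs
    # the task in single-threaded mode.
    return max(1, p_ncpus - reserve)


for cores in (1, 2, 4, 8):
    print(cores, "cores ->", nbody_thread_count(cores), "threads")
```

Clamping at one thread matters because a plain `p_ncpus - 1` would yield zero on a single-core host.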

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59303 - Posted: 10 Jul 2013, 14:40:13 UTC These are excellent points, and yes, we need to address them. For the first point: would compiling the application without OpenMP for the default class be a better option for older operating systems? For the second point, we need to look at the ramifications of changes in the mechanisms. I don't have a solid comment on it currently. Jeff

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 59304 - Posted: 10 Jul 2013, 19:56:40 UTC - in response to Message 59303. Last modified: 10 Jul 2013, 20:01:45 UTC Regarding recompiling without OpenMP, no, I don't think that's necessary. The BOINC default for the MT plan_class is to have a minimum of two cores (CPUs), and from experiments Richard and I ran earlier in this nBody beta, OpenMP doesn't have a problem running in ST mode if that's all that's available. So all you'd have to do to open nBody to single cores is change the default value to 1. Also, considering you still may have that intermittent stack overflow bug kicking around, recompiling and then having to debug a new app sounds like extra work you really don't need at the moment. As far as dealing with older OSes, that's probably better handled at the individual user level at this point. For example, a really easy way to work around part of the problem on XP is to just limit the number of cores that BOINC can use to something less than the maximum available. Nowadays a lot of people have high-performance GPUs onboard and have already figured out that leaving one core free to feed them gives the best overall performance, and that takes care of the GPU problem regardless of OS. This is probably the main reason there hasn't been more complaining about nBody in NC than there has been, considering there are still a lot of XP hosts in the field. The issue with running more than one nBody task at a time is a matter of what version of BOINC you're running as much as OS version. On Vista and higher, when run under the MT plan_class and BOINC 7x, this doesn't happen. On XP (on BOINC 6.12.34), once I went to the anonymous platform and specified nBody to use 1 less than the max number of CPUs, it stopped trying to run more than one at a time, even when it had more than one in the queue. I'm not quite sure why it did that, but that's been my observation so far.
I haven't tried to see if BOINC 7x helps with MT issues on XP yet. So basically on these matters, IMO, this is something which really needs to be addressed at the BOINC level. Unfortunately, you are the only project currently using a multi-threaded app I'm aware of, so there hasn't been much pressure on the BOINC dev team to look into or do anything to improve matters when it comes to issues with MT CPU apps.

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 59305 - Posted: 10 Jul 2013, 20:22:46 UTC - in response to Message 59304. "... you still may have that intermittent stack overflow bug kicking around, recompiling and then having to debug a new app sounds like extra work you really don't need at the moment." Yup, got one today. http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=520735402

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 59310 - Posted: 11 Jul 2013, 2:13:28 UTC - in response to Message 59305. Last modified: 11 Jul 2013, 2:15:32 UTC Hmmm... Yep, intermittent faults like that can be tough little nuts to crack. :-/ I must have been having a senior moment earlier. :-D All of a sudden it dawned on me what the root cause is of why running an MT-capable app under the default plan_class is not desirable. The reason is that the default plan_class explicitly tells BOINC the app is single threaded, so it treats it that way when it comes to task and work fetch scheduling. I remembered in our earlier experiments that on Win7/BOINC 7 setups you would get multiple instances running but they would run in ST mode, whereas on XP they would run as I described above. Most likely something to do with the mysterious ktmw32.dll file! ;-)

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59313 - Posted: 11 Jul 2013, 13:44:17 UTC We have isolated the function that is causing the segmentation faults, and we have a fix in place to resolve this issue. We also had to correct a function in the mathematics of the likelihood calculation. The binaries are on the server currently and Jake is in the process of releasing them. So I am hoping that we can continue the conversation on changes to n-body in a new thread shortly. Jeff

Jeffery M. Thompson Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 23 Sep 12 Posts: 159 Credit: 16,977,106 RAC: 0	Message 59315 - Posted: 11 Jul 2013, 14:02:22 UTC I have started a thread in Number Crunching to collect the issues. I also believe that, like the est_fpops changes, it will exist outside of specific binary versions, so having a place to discuss the changes and effects would be worthwhile. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3311&postid=59314#59314 Jeff

Alinator Joined: 7 Jun 08 Posts: 464 Credit: 56,639,936 RAC: 0	Message 59316 - Posted: 11 Jul 2013, 14:10:00 UTC - in response to Message 59313. Cool... Overall a lot of progress has been made on all fronts the last month and a half or so. Not bad, especially considering it's summertime and you're most likely even more shorthanded than usual! ;-)