Nbody 1.68 release

Author	Message
Sidd (Project developer, Project tester, Project scientist) Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0	Message 67017 - Posted: 31 Jan 2018, 15:51:18 UTC
Hi All,
A new version of nbody, v1.68, has just been released. I have not yet released the Mac multi-threaded (OpenMP) version; I will release it at a later date.
In this release we have added a new way of constraining the width of the stream. Previously, we used a measure of the velocity dispersion in each histogram bin. This allowed us to fit our parameters quite well. Unfortunately, we found that this may not be the best method in the long run. I have added a measure of the beta coordinate dispersion which, from initial findings, will (hopefully) make our parameters easier to fit.
As always, please let me know if there are issues.
Thank you all for your continuing support,
Sidd
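For readers wondering what a per-bin dispersion measure might look like, here is a minimal sketch. It is not the project's implementation: it assumes the histogram bins are equal-width bins along a stream longitude (called `lambdas` here), that the dispersion is the sample standard deviation of beta within each bin, and the function name and toy data are hypothetical.

```python
import statistics

def beta_dispersion_per_bin(lambdas, betas, lambda_min, lambda_max, n_bins):
    """Sample standard deviation of beta in each equal-width lambda bin.

    Bins holding fewer than two bodies yield None, because a sample
    standard deviation is undefined for them.
    """
    width = (lambda_max - lambda_min) / n_bins
    bins = [[] for _ in range(n_bins)]
    for lam, beta in zip(lambdas, betas):
        if lambda_min <= lam < lambda_max:
            # Clamp guards against floating-point rounding at the upper edge.
            index = min(int((lam - lambda_min) / width), n_bins - 1)
            bins[index].append(beta)
    return [statistics.stdev(b) if len(b) >= 2 else None for b in bins]

# Hypothetical toy data: five bodies, three bins spanning lambda in [0, 3).
dispersions = beta_dispersion_per_bin(
    lambdas=[0.5, 0.6, 1.5, 1.7, 2.5],
    betas=[1.0, 3.0, -1.0, 1.0, 0.0],
    lambda_min=0.0, lambda_max=3.0, n_bins=3,
)
print(dispersions)
```

A wider stream gives larger values in the bins it occupies, which is the sense in which such a measure could constrain the stream width.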

Jake Weiss (Volunteer moderator, Project developer, Project tester, Project scientist) Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0	Message 67018 - Posted: 31 Jan 2018, 19:05:19 UTC
Congrats on the new version!

Tom* Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0	Message 67019 - Posted: 31 Jan 2018, 21:38:17 UTC Last modified: 31 Jan 2018, 21:38:55 UTC
Sidd,
All three of my systems running 1.68 get the following error, some more than others.
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
The data is invalid. (0xd) - exit code 13 (0xd)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.68 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 4 max threads on a system with 8 processors
Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 Siddhartha Shelton..."]:81: bad argument #1 to 'create' (Missing required named argument 'BetaSigma')
Failed to read input parameters file
14:52:09 (8888): called boinc_finish(13)
My i7 systems get very few, but my i5 system gets a ton. Yet they are completed OK by other systems on a resend.

Sidd (Project developer, Project tester, Project scientist) Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0	Message 67020 - Posted: 1 Feb 2018, 0:44:44 UTC - in response to Message 67019.
Thanks for letting me know!! I'm checking it out right now.

Sidd (Project developer, Project tester, Project scientist) Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0	Message 67022 - Posted: 1 Feb 2018, 1:10:13 UTC - in response to Message 67019.
I believe I found the workunit that error came from. Because I added an entirely new calculation, some new parameters were needed for future flexibility. Therefore, if you use the old parameter files with the new binary, it gives that error. It seems that, for some reason, that workunit did exactly that: the binary being used is nbody v168, but the workunit is from the v166 runs, using the v166 parameter files. Before releasing, I took down the older runs, so I was not expecting the work units to do this, and for that I apologize.
Fortunately, this error occurs right at the beginning, before anything begins to run, so it will not waste any computational time. If you have any v166 runs in your queue, you can go ahead and cancel them so they do not give this error.

Tom* Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0	Message 67023 - Posted: 1 Feb 2018, 4:01:47 UTC
Thanks Sidd

MossyRock Joined: 27 Sep 17 Posts: 6 Credit: 11,419,438 RAC: 29	Message 67025 - Posted: 1 Feb 2018, 22:55:56 UTC
Sidd, Most of my v168 runs are blowing up. Do I abort the v168 runs in queue? Thanks.

Tom* Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0	Message 67026 - Posted: 2 Feb 2018, 1:48:00 UTC Last modified: 2 Feb 2018, 1:53:07 UTC
Mossy,
The problem occurs when the v168 application tries to process v166 data. You have the same issue as I do. If you look inside the stderr, the name de_nbody_1_13_2018_v166_20k__optimizerparameters_diff_seedruns_3_1516211024_96183_4 shows the data version (v166).
As Sidd says, it only takes 2 or 3 seconds to fail, so if there are no other ramifications (like not getting new tasks :-)), just let them run. Otherwise, you have to highlight the task in the task list in BOINC and then choose Properties to see the version, v166 or v168.
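The check Tom describes (reading the data version out of the task name and comparing it with the application version) can be sketched in a few lines. This is only an illustration based on the two task names quoted in this thread; the `_v<digits>_` naming pattern and both function names are assumptions, not a documented convention.

```python
import re

def data_version(task_name):
    """Return the digits after '_v' in a task name, e.g. '166', or None."""
    match = re.search(r"_v(\d+)_", task_name)
    return match.group(1) if match else None

def matches_app(task_name, app_version):
    """True when the task's data version equals the app version (dot removed)."""
    version = data_version(task_name)
    return version is not None and version == app_version.replace(".", "")

old_task = "de_nbody_1_13_2018_v166_20k__optimizerparameters_diff_seedruns_3_1516211024_96183_4"
new_task = "de_nbody_4_19_2018_v168_20k__Data_1_1523906284_51420"

print(data_version(old_task))         # 166
print(matches_app(old_task, "1.68"))  # False
print(matches_app(new_task, "1.68"))  # True
```

A task for which `matches_app` returns False is one that would fail in the first few seconds with the errors quoted in this thread.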

MossyRock Joined: 27 Sep 17 Posts: 6 Credit: 11,419,438 RAC: 29	Message 67027 - Posted: 2 Feb 2018, 4:21:24 UTC - in response to Message 67026.
Tom, Gotcha. Thanks for letting me know how to find the mismatches in the "ready to start" state. I just aborted a few.

Schwerrechner Joined: 9 Feb 17 Posts: 1 Credit: 71,380 RAC: 0	Message 67028 - Posted: 2 Feb 2018, 12:23:56 UTC
Hey, I am new here. I am getting errors on nbody calculating the optimizerparameter with a Ryzen 1700. The CPU is prime stable, though; I get the errors only on nbody-optimizertasks. Everything else works fine.

mmonnin Joined: 2 Oct 16 Posts: 167 Credit: 1,014,814,418 RAC: 1,033	Message 67042 - Posted: 9 Feb 2018, 13:54:41 UTC
These are still being sent out. :( https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1566643057

mmonnin Joined: 2 Oct 16 Posts: 167 Credit: 1,014,814,418 RAC: 1,033	Message 67047 - Posted: 9 Feb 2018, 22:44:59 UTC
And another: https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=2255257329

Yavanius Joined: 27 Jan 15 Posts: 10 Credit: 1,514,844 RAC: 9	Message 67050 - Posted: 10 Feb 2018, 5:26:00 UTC - in response to Message 67017.
I keep getting the intermittent N-body that runs and runs... the last one I aborted at 15 hours... https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1573440465
Sometimes if I restart BOINC, they'll run properly, but I've seen them bite the dust shortly after, too...

Yavanius Joined: 27 Jan 15 Posts: 10 Credit: 1,514,844 RAC: 9	Message 67056 - Posted: 10 Feb 2018, 18:05:49 UTC - in response to Message 67017.
Went searching for an answer to this, but couldn't find one: Why is the N-Body credit such a pittance, with double-digit credit, even when run time is the same as the regular WU?

Mr McGill Joined: 13 Nov 17 Posts: 4 Credit: 3,239,591 RAC: 0	Message 67058 - Posted: 10 Feb 2018, 22:26:24 UTC
Had great hopes the new model Nbody would resolve faults: still getting Nbody trapped. Sometimes suspend helps (so far I am around 5 successes to 20 failures); restart so far has not.
Second thought: with our Nbody fails being a pain in the proverbial, are they responsible for some of the failed reporting stuff? In particular, their runtime can exceed their report times, which could also cause failures on single-processor tasks paused to complete a multicore Nbody that never sees completion?

ritterm Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0	Message 67060 - Posted: 11 Feb 2018, 2:36:29 UTC
How does this happen?
Stderr output
<core_client_version>7.8.6</core_client_version>
<![CDATA[
<message>
process exited with code 13 (0xd, -243)</message>
<stderr_txt>
<search_application> milkyway_nbody 1.66 Darwin x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 8 max threads on a system with 8 processors
Application version too old. Workunit requires version 1.68, but this is 1.66
Failed to read input parameters file
04:21:59 (82996): called boinc_finish(13)
</stderr_txt>
]]>
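Two distinct mismatch messages have been quoted in this thread so far: a new application reading old parameters ("Missing required named argument") and an old application reading a new workunit ("Application version too old"). The sketch below tells them apart in a pasted stderr log. The regular expressions are derived only from those two quoted lines, and the function name and return labels are hypothetical.

```python
import re

# Patterns taken from the two stderr lines quoted in this thread.
APP_TOO_OLD = re.compile(r"Workunit requires version (\d+\.\d+), but this is (\d+\.\d+)")
MISSING_ARGUMENT = re.compile(r"Missing required named argument '(\w+)'")

def classify_stderr(stderr_text):
    """Return a tuple describing which kind of version mismatch the log shows."""
    match = APP_TOO_OLD.search(stderr_text)
    if match:
        # Old application binary, newer workunit.
        return ("app_too_old", match.group(1), match.group(2))
    match = MISSING_ARGUMENT.search(stderr_text)
    if match:
        # Newer application binary, older parameter file.
        return ("parameters_too_old", match.group(1))
    return ("unknown",)

print(classify_stderr(
    "Application version too old. Workunit requires version 1.68, but this is 1.66"
))
print(classify_stderr(
    "bad argument #1 to 'create' (Missing required named argument 'BetaSigma')"
))
```

Both cases end with "Failed to read input parameters file" and exit code 13, so the line before it is what distinguishes them.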

ritterm Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0	Message 67062 - Posted: 11 Feb 2018, 12:28:20 UTC - in response to Message 67022.
Sidd wrote: It seems for some reason, that workunit did exactly that, the binary being used is the nbody v168 but the workunit is from the v166 runs, using the v166 parameter files. Before releasing, I took down the older runs...
Maybe I misunderstand, but v166 tasks are still being sent out.

Tom* Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0	Message 67068 - Posted: 11 Feb 2018, 18:05:24 UTC Last modified: 11 Feb 2018, 18:16:12 UTC
Think we need a new version of the application that can process both v166 and v168 data file formats. PLEASE.
I have only been getting v166 lately; is there a pointer to the v166 application?

ritterm Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0	Message 67093 - Posted: 16 Feb 2018, 19:41:05 UTC
It looked good for a few days, but I've picked up some v166 tasks recently (see examples).

Mr McGill Joined: 13 Nov 17 Posts: 4 Credit: 3,239,591 RAC: 0	Message 67421 - Posted: 2 May 2018, 21:22:37 UTC
Sorry for the necro, but a thought came to me. Back when we started the 'new' N-Body version, faults in N-Body processes seemed reduced! It looked like the problem of them getting stuck with no further gain in progress was improved. Now I have seen only 3 reach completion in as many weeks, with all the others having, say, 3 or 4 hours of processing time and a variable number of hours until completion (from 10 hours to 3 weeks), depending on where the process got stuck. A worst case quoted something like 157 days when it got stuck at ~0.05% for a few hours (just a little beyond the deadline, cough).
A thought came to me this morning as I cleared an overnight frozen Nbody: if everyone hitting a faulty work unit passes it on, will it be passed back into the pool without examination for other machines to attempt? If a packet of data strikes such a bug, will it move up the priority chain to be solved more quickly, increasing the 'density' of faulty work units to be computed? It just seems odd that so many of these units in particular are faulting when none of the other variants strike errors (which I would hope suggests there is nothing wrong with the logical processors of this device; otherwise I'll need to look into fixing it).
TL;DR:
- The number of Nbody work units locking up seems to be getting worse: are they being prioritized by your distribution system?
- Do your systems have a way to measure or monitor how many times such work units are being passed back and forth without reaching completion?
Side query A: At one point your server was no longer sending Nbodies to my machine (which solves the problem well enough), presumably due to the 'high error' solution you rolled out, but now they are back and as glitchy as ever. How many processes do I need to abort to be thrown back onto that list (if this is at all how this happened)?
Side gripe B: As a personal complaint, I do wish BOINC could reduce the total CPU% being utilized. This insistence on using 100% of the processor X% of the time causes thermal spikes that can lead to thermal throttling. Poor laptop.
This morning's Nbody lockup happened at 1 hr 24 minutes, at 15ish%, in de_nbody_4_19_2018_v168_20k__Data_1_1523906284_51420