milkyway nbody applications updated

Message boards : News : milkyway nbody applications updated

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0

Message 41985 - Posted: 7 Sep 2010, 22:06:08 UTC

I've updated the nbody applications to v0.04. Let us know how they're doing here. I think they have quite a few updates that Matt A did that will make them more stable. I'm going to start up some longer WUs for them tonight.

MAGIC Quantum Mechanic
Joined: 26 Sep 09
Posts: 5
Credit: 10,327
RAC: 0

Message 42007 - Posted: 9 Sep 2010, 11:59:11 UTC - in response to Message 41985.

I just saw this and decided to run some on one of my triple-core PCs ... first one in 4 min 27 sec ... 2nd one in 5 min 30 sec ... 3rd one in 6 min 49 sec

(I suspended my Einsteins for a few minutes)

The second batch of 3 finished in 6 min 52 sec ... 6 min 14 sec ... 6 min 35 sec


Ok I will let it run 3 more triples and then some of the "de_11,12,13,14"

Then back to finishing the Einsteins I have loaded up.

.....goodnight


Phil
Joined: 28 Aug 09
Posts: 2
Credit: 1,400,068
RAC: 1,756

Message 42016 - Posted: 9 Sep 2010, 21:23:43 UTC

O OK

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0

Message 42018 - Posted: 9 Sep 2010, 22:15:20 UTC

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.

<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.41049. time since last = 857586s
Checkpoint: tnow = 0.767259. time since last = 61.3717s
Checkpoint: tnow = 0.942433. time since last = 65.6709s
Checkpoint: tnow = 1.12286. time since last = 65.5763s
Checkpoint: tnow = 1.30796. time since last = 65.6308s
Checkpoint: tnow = 1.55554. time since last = 65.6332s
Checkpoint: tnow = 2.07989. time since last = 65.6152s
Checkpoint: tnow = 2.58789. time since last = 65.6147s
Checkpoint: tnow = 3.11341. time since last = 65.6048s
Checkpoint: tnow = 3.79834. time since last = 65.625s
Making final checkpoint
Simulation complete
<search_likelihood>-447.00890499361077</search_likelihood>
<search_application>milkywayathome nbody 0.04 Windows x86 double</search_application>
Removing checkpoint file 'nbody_checkpoint'

</stderr_txt>

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42019 - Posted: 9 Sep 2010, 22:51:20 UTC - in response to Message 42018.

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.


I thought I fixed this, but apparently I didn't for the Windows version.

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42020 - Posted: 9 Sep 2010, 23:36:41 UTC - in response to Message 42018.

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.


Starting fresh nbody run
Starting nbody system
-2.4192994637763 15.263676398397 5.3555785339933
202.3421311544 97.729366506234 -185.7096944583
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
-2.4192994637763 15.263676398397 5.3555785339933
202.3421311544 97.729366506234 -185.7096944583
Checkpoint: tnow = 0.41049. time since last = 857586s
Checkpoint: tnow = 0.767259. time since last = 61.3717s
Checkpoint: tnow = 0.942433. time since last = 65.6709s
Checkpoint: tnow = 1.12286. time since last = 65.5763s
Checkpoint: tnow = 1.30796. time since last = 65.6308s
Checkpoint: tnow = 1.55554. time since last = 65.6332s
Checkpoint: tnow = 2.07989. time since last = 65.6152s
Checkpoint: tnow = 2.58789. time since last = 65.6147s
Checkpoint: tnow = 3.11341. time since last = 65.6048s
Checkpoint: tnow = 3.79834. time since last = 65.625s
Making final checkpoint
Simulation complete
-447.00890499361077
milkywayathome nbody 0.04 Windows x86 double
Removing checkpoint file 'nbody_checkpoint'


Actually this isn't what I thought it was. This is fine and expected, although the bug I found in the Windows checkpointing is still there. This happens if you interrupt the run before the first checkpoint is actually made. The application opens the checkpoint file at the beginning and then continually writes to it as needed. If it's interrupted before the first write happens, the checkpoint is empty, and the run starts over since it never actually checkpointed.

MAGIC Quantum Mechanic
Joined: 26 Sep 09
Posts: 5
Credit: 10,327
RAC: 0

Message 42021 - Posted: 9 Sep 2010, 23:51:27 UTC

Yeah, I noticed that problem when one of my batch of 3 that was almost finished started over again after I started my Einsteins back up; since they were running "high priority", they stopped the one we are testing.

That was on a triple core AMD w/XP Pro

I'm on my quad Win 7 right now so I will try this one when I get a chance.

(have to go watch the 1st NFL game of the season right now)

NorPer
Joined: 20 Mar 09
Posts: 7
Credit: 1,016,945
RAC: 0

Message 42022 - Posted: 10 Sep 2010, 0:02:37 UTC - in response to Message 41985.

It still crashes with an error on start on Mac OS X 10.5.8.

This WU gives:

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
process got signal 5
</message>
<stderr_txt>
dyld: unknown required load command 0x80000022
</stderr_txt>
]]>

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42023 - Posted: 10 Sep 2010, 0:21:21 UTC - in response to Message 42022.

It still crashes with an error on start on Mac OS X 10.5.8.


dyld: unknown required load command 0x80000022


Yeah, there was an issue with linking on 10.5. I tried something that might fix it, but I don't have a way to test on 10.5.

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0

Message 42027 - Posted: 10 Sep 2010, 12:08:32 UTC - in response to Message 42020.

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.

<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
...
</stderr_txt>

Actually this isn't what I thought it was. This is fine and expected, although the bug I found in the Windows checkpointing is still there. This happens if you interrupt the run before the first checkpoint is actually made. The application opens the checkpoint file at the beginning and then continually writes to it as needed. If it's interrupted before the first write happens, the checkpoint is empty, and the run starts over since it never actually checkpointed.


Doesn't the bold-marked line indicate a checkpoint was written? Other WUs showed more of those lines before being interrupted. Additionally, I looked at the checkpoint file and it contained what appeared to be binary data.

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0

Message 42033 - Posted: 10 Sep 2010, 21:51:14 UTC
Last modified: 10 Sep 2010, 21:53:09 UTC

Ok, just did another test.
Stopped BOINC and looked into the slot directories.

stderr.txt:
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.409331. time since last = 940402s
Checkpoint: tnow = 0.757109. time since last = 60.5788s
Checkpoint: tnow = 1.10284. time since last = 60.3344s
Checkpoint: tnow = 1.49832. time since last = 60.3642s
Checkpoint: tnow = 2.00819. time since last = 60.487s
Checkpoint: tnow = 2.56063. time since last = 60.3326s
Checkpoint: tnow = 3.18796. time since last = 60.4464s

That should be 7 checkpoints, right?

The file nbody_checkpoint is untouched (7 minutes older).
The only files updated are stderr.txt and boinc_task_state.xml.

boinc_task_state.xml:
<active_task>
<project_master_url>http://milkyway.cs.rpi.edu/milkyway/</project_master_url>
<result_name>de_nbody_test_10_104310_1282760160_3</result_name>
<checkpoint_cpu_time>365.500000</checkpoint_cpu_time>
<checkpoint_elapsed_time>424.223515</checkpoint_elapsed_time>
<fraction_done>0.828105</fraction_done>
</active_task>

Started BOINC again and nbody_checkpoint got a new time.

boinc_task_state.xml:
<active_task>
<project_master_url>http://milkyway.cs.rpi.edu/milkyway/</project_master_url>
<result_name>de_nbody_test_10_104310_1282760160_3</result_name>
<checkpoint_cpu_time>522.750000</checkpoint_cpu_time>
<checkpoint_elapsed_time>608.937537</checkpoint_elapsed_time>
<fraction_done>0.305077</fraction_done>
</active_task>

stderr.txt:
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.409331. time since last = 940402s
Checkpoint: tnow = 0.757109. time since last = 60.5788s
Checkpoint: tnow = 1.10284. time since last = 60.3344s
Checkpoint: tnow = 1.49832. time since last = 60.3642s
Checkpoint: tnow = 2.00819. time since last = 60.487s
Checkpoint: tnow = 2.56063. time since last = 60.3326s
Checkpoint: tnow = 3.18796. time since last = 60.4464s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.432927. time since last = 943034s
Checkpoint: tnow = 0.805839. time since last = 60.4502s
Checkpoint: tnow = 1.16644. time since last = 60.4015s


And finally it errored out with
Run time 859.165505
CPU time 756.5469
<message>
Maximum elapsed time exceeded
</message>

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42034 - Posted: 11 Sep 2010, 0:38:53 UTC - in response to Message 42033.


The file nbody_checkpoint is untouched (7 minutes older).
Only files updated are stderr.txt and boinc_task_state.xml.


I found the problem. In the first release I wasn't using the BOINC functions for resolving filenames when opening the checkpoints. I fixed that for the POSIX versions of the checkpointing functions, but apparently forgot to fix the win32 ones.


Copyright © 2018 AstroInformatics Group