milkyway nbody applications updated

Message boards : News : milkyway nbody applications updated

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0

Message 41985 - Posted: 7 Sep 2010, 22:06:08 UTC

I've updated the nbody applications to v0.04. Let us know how they're doing here. I think they have quite a few updates that Matt A did that will make them more stable. I'm going to start up some longer WUs for them tonight.

MAGIC Quantum Mechanic
Joined: 26 Sep 09
Posts: 5
Credit: 10,327
RAC: 0

Message 42007 - Posted: 9 Sep 2010, 11:59:11 UTC - in response to Message 41985.

I just saw this and decided to run some on one of my triple-core PCs ... first one in 4 min 27 sec ... 2nd one in 5 min 30 sec ... 3rd one in 6 min 49 sec

(I suspended my Einsteins for a few minutes)

The second batch of 3 finished in 6 min 52 sec ... 6 min 14 sec ... 6 min 35 sec


Ok I will let it run 3 more triples and then some of the "de_11,12,13,14"

Then back to finishing the Einsteins I have loaded up.

.....goodnight


Phil
Joined: 28 Aug 09
Posts: 2
Credit: 1,400,068
RAC: 1,756

Message 42016 - Posted: 9 Sep 2010, 21:23:43 UTC

O OK

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0

Message 42018 - Posted: 9 Sep 2010, 22:15:20 UTC

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.

<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.41049. time since last = 857586s
Checkpoint: tnow = 0.767259. time since last = 61.3717s
Checkpoint: tnow = 0.942433. time since last = 65.6709s
Checkpoint: tnow = 1.12286. time since last = 65.5763s
Checkpoint: tnow = 1.30796. time since last = 65.6308s
Checkpoint: tnow = 1.55554. time since last = 65.6332s
Checkpoint: tnow = 2.07989. time since last = 65.6152s
Checkpoint: tnow = 2.58789. time since last = 65.6147s
Checkpoint: tnow = 3.11341. time since last = 65.6048s
Checkpoint: tnow = 3.79834. time since last = 65.625s
Making final checkpoint
Simulation complete
<search_likelihood>-447.00890499361077</search_likelihood>
<search_application>milkywayathome nbody 0.04 Windows x86 double</search_application>
Removing checkpoint file 'nbody_checkpoint'

</stderr_txt>

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42019 - Posted: 9 Sep 2010, 22:51:20 UTC - in response to Message 42018.

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.


I thought I fixed this, but apparently I didn't for the Windows version.

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42020 - Posted: 9 Sep 2010, 23:36:41 UTC - in response to Message 42018.

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.


Starting fresh nbody run
Starting nbody system
-2.4192994637763 15.263676398397 5.3555785339933
202.3421311544 97.729366506234 -185.7096944583
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
-2.4192994637763 15.263676398397 5.3555785339933
202.3421311544 97.729366506234 -185.7096944583
Checkpoint: tnow = 0.41049. time since last = 857586s
Checkpoint: tnow = 0.767259. time since last = 61.3717s
Checkpoint: tnow = 0.942433. time since last = 65.6709s
Checkpoint: tnow = 1.12286. time since last = 65.5763s
Checkpoint: tnow = 1.30796. time since last = 65.6308s
Checkpoint: tnow = 1.55554. time since last = 65.6332s
Checkpoint: tnow = 2.07989. time since last = 65.6152s
Checkpoint: tnow = 2.58789. time since last = 65.6147s
Checkpoint: tnow = 3.11341. time since last = 65.6048s
Checkpoint: tnow = 3.79834. time since last = 65.625s
Making final checkpoint
Simulation complete
-447.00890499361077
milkywayathome nbody 0.04 Windows x86 double
Removing checkpoint file 'nbody_checkpoint'


Actually this isn't what I thought it was. This is fine and expected, although the bug I found in the Windows checkpointing is still there. This happens if you interrupt the run before the first checkpoint is actually made. The application opens the checkpoint file at the beginning and then continually writes to it as needed. If it's interrupted before the first write happens, the checkpoint is empty, and the run starts over since it never actually checkpointed.

MAGIC Quantum Mechanic
Joined: 26 Sep 09
Posts: 5
Credit: 10,327
RAC: 0

Message 42021 - Posted: 9 Sep 2010, 23:51:27 UTC

Yeah, I noticed that problem when one of my batch of 3 that was almost finished started over again after I started my Einsteins back up; since they were running "high priority", they stopped the one we are testing.

That was on a triple core AMD w/XP Pro

I'm on my quad Win 7 right now so I will try this one when I get a chance.

(have to go watch the 1st NFL game of the season right now)

NorPer
Joined: 20 Mar 09
Posts: 7
Credit: 1,016,945
RAC: 0

Message 42022 - Posted: 10 Sep 2010, 0:02:37 UTC - in response to Message 41985.

It still crashes with an error on start on Mac OS X 10.5.8.

This WU gives:

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
process got signal 5
</message>
<stderr_txt>
dyld: unknown required load command 0x80000022
</stderr_txt>
]]>

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42023 - Posted: 10 Sep 2010, 0:21:21 UTC - in response to Message 42022.

It still crashes with an error on start on Mac OS X 10.5.8.


dyld: unknown required load command 0x80000022


Yeah, there was an issue with linking on 10.5. I tried something that might fix it, but I don't have a way to test on 10.5.

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0

Message 42027 - Posted: 10 Sep 2010, 12:08:32 UTC - in response to Message 42020.

While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.

<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
...
</stderr_txt>

Actually this isn't what I thought it was. This is fine and expected, although the bug I found in the Windows checkpointing is still there. This happens if you interrupt the run before the first checkpoint is actually made. The application opens the checkpoint file at the beginning and then continually writes to it as needed. If it's interrupted before the first write happens, the checkpoint is empty, and the run starts over since it never actually checkpointed.


Doesn't the bold-marked line indicate a checkpoint was written? Other WUs showed more of those lines before being interrupted. Additionally, I looked at the checkpoint file and it contained what appeared to be binary data.

Len LE/GE
Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0

Message 42033 - Posted: 10 Sep 2010, 21:51:14 UTC
Last modified: 10 Sep 2010, 21:53:09 UTC

Ok, just did another test.
Stopped BOINC and looked into the slot directories.

stderr.txt:
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.409331. time since last = 940402s
Checkpoint: tnow = 0.757109. time since last = 60.5788s
Checkpoint: tnow = 1.10284. time since last = 60.3344s
Checkpoint: tnow = 1.49832. time since last = 60.3642s
Checkpoint: tnow = 2.00819. time since last = 60.487s
Checkpoint: tnow = 2.56063. time since last = 60.3326s
Checkpoint: tnow = 3.18796. time since last = 60.4464s

That should be 7 checkpoints, right?

The file nbody_checkpoint is untouched (7 minutes older).
The only files updated are stderr.txt and boinc_task_state.xml.

boinc_task_state.xml:
<active_task>
<project_master_url>http://milkyway.cs.rpi.edu/milkyway/</project_master_url>
<result_name>de_nbody_test_10_104310_1282760160_3</result_name>
<checkpoint_cpu_time>365.500000</checkpoint_cpu_time>
<checkpoint_elapsed_time>424.223515</checkpoint_elapsed_time>
<fraction_done>0.828105</fraction_done>
</active_task>

Started BOINC again and nbody_checkpoint got a new time.

boinc_task_state.xml:
<active_task>
<project_master_url>http://milkyway.cs.rpi.edu/milkyway/</project_master_url>
<result_name>de_nbody_test_10_104310_1282760160_3</result_name>
<checkpoint_cpu_time>522.750000</checkpoint_cpu_time>
<checkpoint_elapsed_time>608.937537</checkpoint_elapsed_time>
<fraction_done>0.305077</fraction_done>
</active_task>

stderr.txt:
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.409331. time since last = 940402s
Checkpoint: tnow = 0.757109. time since last = 60.5788s
Checkpoint: tnow = 1.10284. time since last = 60.3344s
Checkpoint: tnow = 1.49832. time since last = 60.3642s
Checkpoint: tnow = 2.00819. time since last = 60.487s
Checkpoint: tnow = 2.56063. time since last = 60.3326s
Checkpoint: tnow = 3.18796. time since last = 60.4464s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.432927. time since last = 943034s
Checkpoint: tnow = 0.805839. time since last = 60.4502s
Checkpoint: tnow = 1.16644. time since last = 60.4015s


And finally it errored out with
Run time 859.165505
CPU time 756.5469
<message>
Maximum elapsed time exceeded
</message>

Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0

Message 42034 - Posted: 11 Sep 2010, 0:38:53 UTC - in response to Message 42033.


The file nbody_checkpoint is untouched (7 minutes older).
Only files updated are stderr.txt and boinc_task_state.xml.


I found the problem. In the first release I wasn't using the BOINC functions for resolving filenames when opening the checkpoints. I fixed that for the POSIX versions of the checkpointing functions, but apparently forgot to fix the win32 ones.


Copyright © 2018 AstroInformatics Group