milkyway nbody applications updated
Author | Message |
---|---|
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I've updated the nbody applications to v0.04. Let us know how they're doing here. I think they have quite a few updates that Matt A did that will make them more stable. I'm going to start up some longer WUs tonight for them. |
Joined: 26 Sep 09 Posts: 5 Credit: 10,327 RAC: 0 |
I just saw this and decided to run some on one of my triple-core pc's ....first one in 4min 27sec ....2nd one in 5min 30sec.....3rd one in 6min 49sec (I suspended my Einsteins for a few minutes) The second batch of 3 finished in 6min 52sec ....6min 14sec...6min 35sec Ok I will let it run 3 more triples and then some of the "de_11,12,13,14" Then back to finishing the Einsteins I have loaded up. .....goodnight |
Joined: 28 Aug 09 Posts: 2 Credit: 1,688,008 RAC: 0 |
O OK |
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
While they all seem to validate, the checkpointing still needs some work to function properly.
Simple test: Exit BOINC while a WU is running and then start BOINC again.
4 WUs were running when I stopped BOINC, all 4 got the same checkpoint problem and started again from the beginning.
<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.505667. time since last = 857362s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -2.4192994637763 15.263676398397 5.3555785339933 </plummer_r>
<plummer_v> 202.3421311544 97.729366506234 -185.7096944583 </plummer_v>
Checkpoint: tnow = 0.41049. time since last = 857586s
Checkpoint: tnow = 0.767259. time since last = 61.3717s
Checkpoint: tnow = 0.942433. time since last = 65.6709s
Checkpoint: tnow = 1.12286. time since last = 65.5763s
Checkpoint: tnow = 1.30796. time since last = 65.6308s
Checkpoint: tnow = 1.55554. time since last = 65.6332s
Checkpoint: tnow = 2.07989. time since last = 65.6152s
Checkpoint: tnow = 2.58789. time since last = 65.6147s
Checkpoint: tnow = 3.11341. time since last = 65.6048s
Checkpoint: tnow = 3.79834. time since last = 65.625s
Making final checkpoint
Simulation complete
<search_likelihood>-447.00890499361077</search_likelihood>
<search_application>milkywayathome nbody 0.04 Windows x86 double</search_application>
Removing checkpoint file 'nbody_checkpoint'
</stderr_txt> |
Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
While they all seem to validate, the checkpointing still needs some work to function properly. I thought I fixed this, but apparently I didn't for the Windows version. |
Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
While they all seem to validate, the checkpointing still needs some work to function properly. Actually this isn't what I thought it was. This is fine and expected, although the bug I found for the Windows checkpointing is still there. This happens if you interrupt it before the first checkpoint is actually made. It opens the checkpoint file at the beginning, and then continually writes to it as needed. If it's interrupted before the first write happens, the checkpoint file is empty and the run starts over, since it never actually checkpointed before. |
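The behaviour described in this reply (an empty checkpoint file being rejected on resume) can be sketched roughly as follows. This is only an illustration in Python, not the nbody application's code: the magic string `HEADER`, the end marker `TAIL`, the file layout and both function names are assumptions made up for the example. Only the checks themselves (header, body count, end marker) mirror the messages in the stderr output quoted above.

```python
import os
import struct
import tempfile

HEADER = b"NBODYCHK"  # hypothetical magic string, not the real file format
TAIL = b"END"         # hypothetical end marker


def write_checkpoint(path, values):
    """Write header, body count, the values as doubles, then the end marker."""
    payload = struct.pack(f"<I{len(values)}d", len(values), *values)
    with open(path, "wb") as f:
        f.write(HEADER + payload + TAIL)


def read_checkpoint(path, expected_count):
    """Return the stored values, or None if the file cannot be resumed from."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    # An empty file (opened at startup, never written) fails both checks.
    if not data.startswith(HEADER) or not data.endswith(TAIL):
        return None
    body = data[len(HEADER):len(data) - len(TAIL)]
    if len(body) < 4:
        return None
    (count,) = struct.unpack_from("<I", body)
    if count != expected_count or len(body) != 4 + 8 * count:
        return None
    return list(struct.unpack_from(f"<{count}d", body, 4))
```

With this layout, a file that was created but never written to returns `None`, so the caller starts a fresh run, which is the "fine and expected" case described above.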
Joined: 26 Sep 09 Posts: 5 Credit: 10,327 RAC: 0 |
Yeah I noticed that problem when I had one of my batch of 3 that was almost finished started over again after I started my Einsteins back up and since they were running "high priority" it stopped the one we are testing. That was on a triple core AMD w/XP Pro I'm on my quad Win 7 right now so I will try this one when I get a chance. (have to go watch the 1st NFL game of the season right now) |
Joined: 20 Mar 09 Posts: 7 Credit: 1,016,945 RAC: 0 |
It still crashes with an error on start on MacOS 10.5.8. This WU gives: <core_client_version>6.10.56</core_client_version> <![CDATA[ <message> process got signal 5 </message> <stderr_txt> dyld: unknown required load command 0x80000022 </stderr_txt> ]]> |
Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
It still crashes with an error on start on MacOS 10.5.8. dyld: unknown required load command 0x80000022 Yeah, there was an issue with linking on 10.5. I tried something that would maybe fix it, but I don't have a way to test on 10.5. |
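For readers wondering what the number in the dyld message means, here is a small sketch that splits a Mach-O load command value into its parts. The constant names and values are taken from Apple's `mach-o/loader.h` as far as I know (`LC_REQ_DYLD` is the "dyld must understand this" flag, `0x22` is `LC_DYLD_INFO`); the function name is made up for the example.

```python
LC_REQ_DYLD = 0x80000000  # flag bit: the loader must understand this command
LC_DYLD_INFO = 0x22       # compressed dyld info command


def decode_load_command(cmd):
    """Split a load command value into (base command, required flag)."""
    required = bool(cmd & LC_REQ_DYLD)
    base = cmd & ~LC_REQ_DYLD
    return base, required


base, required = decode_load_command(0x80000022)
print(hex(base), required)
```

So `0x80000022` would be `LC_DYLD_INFO` with the required bit set, a command that the 10.5 loader apparently does not recognise. That is consistent with the linking issue mentioned in the reply, though the actual fix that was tried is not stated in the thread.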
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
While they all seem to validate, the checkpointing still needs some work to function properly. Doesn't the bold marked line indicate a checkpoint was written? Other WUs showed more of those lines before being interrupted. Additionally I looked for the checkpoint file and it contained all binary(?) data. |
Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
Ok, just did another test. Stopped BOINC and looked into the slots directories.
stderr.txt:
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.409331. time since last = 940402s
Checkpoint: tnow = 0.757109. time since last = 60.5788s
Checkpoint: tnow = 1.10284. time since last = 60.3344s
Checkpoint: tnow = 1.49832. time since last = 60.3642s
Checkpoint: tnow = 2.00819. time since last = 60.487s
Checkpoint: tnow = 2.56063. time since last = 60.3326s
Checkpoint: tnow = 3.18796. time since last = 60.4464s
That should be 7 checkpoints, right? The file nbody_checkpoint is untouched (7 minutes older). Only files updated are stderr.txt and boinc_task_state.xml.
boinc_task_state.xml:
<active_task>
<project_master_url>http://milkyway.cs.rpi.edu/milkyway/</project_master_url>
<result_name>de_nbody_test_10_104310_1282760160_3</result_name>
<checkpoint_cpu_time>365.500000</checkpoint_cpu_time>
<checkpoint_elapsed_time>424.223515</checkpoint_elapsed_time>
<fraction_done>0.828105</fraction_done>
</active_task>
Started BOINC again and nbody_checkpoint got a new time.
boinc_task_state.xml:
<active_task>
<project_master_url>http://milkyway.cs.rpi.edu/milkyway/</project_master_url>
<result_name>de_nbody_test_10_104310_1282760160_3</result_name>
<checkpoint_cpu_time>522.750000</checkpoint_cpu_time>
<checkpoint_elapsed_time>608.937537</checkpoint_elapsed_time>
<fraction_done>0.305077</fraction_done>
</active_task>
stderr.txt:
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.409331. time since last = 940402s
Checkpoint: tnow = 0.757109. time since last = 60.5788s
Checkpoint: tnow = 1.10284. time since last = 60.3344s
Checkpoint: tnow = 1.49832. time since last = 60.3642s
Checkpoint: tnow = 2.00819. time since last = 60.487s
Checkpoint: tnow = 2.56063. time since last = 60.3326s
Checkpoint: tnow = 3.18796. time since last = 60.4464s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -9.4255262133787 10.709813623551 11.337814022642 </plummer_r>
<plummer_v> 180.90224287396 147.20579827731 -143.34594028339 </plummer_v>
Checkpoint: tnow = 0.432927. time since last = 943034s
Checkpoint: tnow = 0.805839. time since last = 60.4502s
Checkpoint: tnow = 1.16644. time since last = 60.4015s
And finally it errored out with
Run time 859.165505
CPU time 756.5469
<message> Maximum elapsed time exceeded </message> |
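Counting checkpoint lines by hand, as done in this test, is easy to automate. Below is a rough Python sketch that summarises a stderr.txt like the one quoted above: how many fresh starts there were, how many "Checkpoint:" lines each run logged, and how many resumes failed. The function name and the returned dictionary keys are invented for this example; the matched strings are copied from the log.

```python
import re

# Matches lines such as "Checkpoint: tnow = 0.757109. time since last = 60.5788s"
CHECKPOINT_RE = re.compile(
    r"Checkpoint: tnow = ([\d.]+?)\.\s+time since last = ([\d.]+)s"
)


def summarize(stderr_text):
    """Summarise checkpoint activity in an nbody stderr log."""
    # Everything after each "Starting fresh nbody run" belongs to one run.
    runs = stderr_text.split("Starting fresh nbody run")[1:]
    return {
        "fresh_starts": len(runs),
        "checkpoints_per_run": [len(CHECKPOINT_RE.findall(r)) for r in runs],
        "failed_resumes": stderr_text.count("Failed to resume checkpoint"),
    }
```

Run against the second stderr.txt in this post, it should report two fresh starts with 7 and 3 checkpoint lines and one failed resume. Seven logged checkpoints followed by a resume that finds no header suggests the data never reached the file that was later read back.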
Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
I found the problem. In the first release I wasn't using the BOINC functions for resolving filenames when opening the checkpoints. I fixed it for the POSIX versions of the checkpointing functions, but I apparently forgot to fix the Win32 ones. |
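To illustrate what "resolving filenames" means here: BOINC applications are expected to map a logical file name to a physical path before opening it (the C/C++ API provides `boinc_resolve_filename` for this). As I understand it, a logical file may be a small link file containing a `<soft_link>` element that points to the real location. The Python sketch below only imitates that idea under those assumptions. It is not BOINC's code, and `resolve_filename` is a made-up name.

```python
import os
import re
import tempfile

SOFT_LINK_RE = re.compile(r"<soft_link>(.*?)</soft_link>")


def resolve_filename(logical_name):
    """Return the physical path for a logical name (illustrative only)."""
    try:
        with open(logical_name, "r", errors="replace") as f:
            first_line = f.readline()
    except OSError:
        # No such file yet: the logical name is used as-is.
        return logical_name
    match = SOFT_LINK_RE.search(first_line)
    return match.group(1).strip() if match else logical_name
```

Code that opens the logical name directly on one platform and the resolved name on another can end up writing and reading different files, which would fit the symptoms reported earlier in the thread.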
©2024 Astroinformatics Group