Welcome to MilkyWay@home

started a new nbody search: de_nbody_model1_1

Message boards : News : started a new nbody search: de_nbody_model1_1

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 42036 - Posted: 11 Sep 2010, 2:21:42 UTC

The workunits should take much longer to complete. Let me know how they are doing here (and I suppose you can complain if the credit is too much/too little). This should hopefully fix the problem with the workunits terminating prematurely as well.
ID: 42036 · Rating: 0
Profile Toppie*

Joined: 28 Mar 09
Posts: 68
Credit: 1,003,982,681
RAC: 0
Message 42045 - Posted: 11 Sep 2010, 18:25:51 UTC - in response to Message 42036.  

ummm...how much longer?
I have a couple running.
One after 4 hrs only 6% complete.
Another after 4 hrs 15% complete.
Another after 3 hrs 70% complete.
ID: 42045 · Rating: 0
TLSI2000

Joined: 15 Mar 10
Posts: 17
Credit: 1,221,936,867
RAC: 0
Message 42052 - Posted: 11 Sep 2010, 23:21:34 UTC - in response to Message 42036.  
Last modified: 11 Sep 2010, 23:58:45 UTC

Most (about 70%) abort in the first second.

On my two systems, the few that don't abort immediately are taking 20-40 minutes.

And I have had a couple of 'runaways', that completed less than 1% after 20-25 minutes, with an ever increasing estimated time of completion well over an hour.
I aborted these manually
ID: 42052 · Rating: 0
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 42056 - Posted: 12 Sep 2010, 3:36:28 UTC - in response to Message 42045.  

ummm...how much longer?
I have a couple running.
One after 4 hrs only 6% complete.
Another after 4 hrs 15% complete.
Another after 3 hrs 70% complete.


The runtimes are probably going to vary pretty drastically depending on the input parameters.
ID: 42056 · Rating: 0
Profile Danger

Joined: 20 Feb 10
Posts: 3
Credit: 5,030,476
RAC: 0
Message 42069 - Posted: 12 Sep 2010, 14:28:51 UTC - in response to Message 42036.  

These are indeed taking quite a bit longer. I have one that has been running for 25 hours and is just about completed and I have 8 cores running 90% utilized.

Most are estimated at about 10-12 hours though.




ID: 42069 · Rating: 0
Profile Marcin

Joined: 18 Jun 09
Posts: 34
Credit: 10,888,570
RAC: 19,852
Message 42078 - Posted: 12 Sep 2010, 20:10:27 UTC

Well, for me the new workunits go a hell of a lot faster.
A normal unit runs for 15-18 hours and the nbody takes just TWO hours. Isn't that weird?
ID: 42078 · Rating: 0
Phil1911

Joined: 8 Jun 10
Posts: 2
Credit: 109,213
RAC: 0
Message 42083 - Posted: 12 Sep 2010, 21:59:45 UTC

My workunits abort either immediately or after no more than 5 seconds. What's going on?
ID: 42083 · Rating: 0
(retired account)

Joined: 17 Oct 08
Posts: 36
Credit: 411,744
RAC: 0
Message 42085 - Posted: 13 Sep 2010, 1:08:44 UTC
Last modified: 13 Sep 2010, 1:10:15 UTC

I got my first workunits tonight for N-Body Simulation v0.04. The outcome was rather odd: three workunits were from the de_nbody_test_10 series and they were all completed and validated. The five others were from the de_nbody_model1_1 series and they all crashed after one or two seconds. Looking at the wingmen, I cannot see any pattern. Sometimes they also crash on a wingman, sometimes they seem to finish without error.

Well, I try to get some more, maybe I catch a good one *g* ...

Regards

List of Error results
ID: 42085 · Rating: 0
(retired account)

Joined: 17 Oct 08
Posts: 36
Credit: 411,744
RAC: 0
Message 42088 - Posted: 13 Sep 2010, 9:02:00 UTC
Last modified: 13 Sep 2010, 9:06:13 UTC

Update: Five more workunits, all de_nbody_model1_1, and for a change all now completed, four of them already validated. Run time between 30 and 60 minutes. Still don't have a clue why some crash and others not.
ID: 42088 · Rating: 0
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 42093 - Posted: 13 Sep 2010, 14:19:40 UTC - in response to Message 42088.  

Update: Five more workunits, all de_nbody_model1_1, and for a change all now completed, four of them already validated. Run time between 30 and 60 minutes. Still don't have a clue why some crash and others not.


The Windows checkpointing is currently broken (it will always restart from the beginning), but I think I've fixed all the problems with it.

There were some things I fixed a long time ago in the POSIX version of the checkpointing, which I apparently didn't also fix in the Win32 version, as well as a few Windows-specific problems.

I think some of the problems are because I was using some temporary file flag when opening the checkpoint file on Windows, even though it shouldn't count as one for the Windows checkpointing. Also, weird permission problems seem to sometimes happen on Windows 7. I think some workunits might end up crashing with a permission-related error when they attempt to open the checkpoint after restarting.
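
One general way to keep a restart from finding a half-written checkpoint is to write the new state to a scratch file and rename it over the old one only after the write has finished. The Python sketch below illustrates that idea under assumptions; it is not the code the N-body application uses, and the function name `write_checkpoint_atomically` and the `.ckpt-` prefix are hypothetical.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, payload: bytes) -> None:
    """Write payload to path so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the scratch file next to the target so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        # Replace the old checkpoint in one step (atomic on POSIX).
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the scratch file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern the old checkpoint stays in place until the new one is complete, so an interrupted write costs at most one checkpoint interval rather than the whole run.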

There's also the linking problem which causes it to crash on OS X 10.5, which I might have fixed (again), but I don't have a way to test on 10.5 so I'm not sure.

I'll try to update the binaries sometime today.
ID: 42093 · Rating: 0
w1hue

Joined: 13 Feb 09
Posts: 49
Credit: 72,372,187
RAC: 0
Message 42096 - Posted: 13 Sep 2010, 16:42:19 UTC - in response to Message 42088.  

Well, I had one yesterday that had run 20-some hours and showed 137 hours remaining! I aborted it. Another has been running about 8 hours and shows another 23 hours to go. Guess I'll leave that one alone and see what happens. I'm running a dual core 2.4 GHz AMD CPU. My GPU won't handle milkyway WUs.

ID: 42096 · Rating: 0
Profile Werkstatt

Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 42097 - Posted: 13 Sep 2010, 17:24:26 UTC

Could someone please post a download-link for the actual version?

THX
Alexander
ID: 42097 · Rating: 0
Profile Marcin

Joined: 18 Jun 09
Posts: 34
Credit: 10,888,570
RAC: 19,852
Message 42100 - Posted: 13 Sep 2010, 18:59:40 UTC - in response to Message 42088.  

Update: Five more workunits, all de_nbody_model1_1, and for a change all now completed, four of them already validated. Run time between 30 and 60 minutes. Still don't have a clue why some crash and others not.

Ooh, I'm running MilkyWay on Mac OS X 10.6.4, so maybe the short nbody run times are because of that, rather than Windows?
ID: 42100 · Rating: 0
Brian Priebe

Joined: 27 Nov 09
Posts: 108
Credit: 430,760,953
RAC: 0
Message 42101 - Posted: 13 Sep 2010, 19:45:45 UTC - in response to Message 42096.  

Well, I had one yesterday that had run 20-some hours and showed 137 hours remaining!
I also had one of these "model1" WU's self-abort with "maximum time exceeded" after 29.6 hours of processing time. I hope this doesn't become a habit.
ID: 42101 · Rating: 0
(retired account)

Joined: 17 Oct 08
Posts: 36
Credit: 411,744
RAC: 0
Message 42102 - Posted: 13 Sep 2010, 20:17:49 UTC - in response to Message 42093.  


The Windows checkpointing is currently broken (it will always restart from the beginning), but I think I've fixed all the problems with it. (...) I'll try to update the binaries sometime today.


Hello Matt, is this fix already included in the current version 0.04? Or will it be in the upcoming one?

I currently have my longest-running workunit so far. 7 h of run time were already done and approx. 8 h were still to go when I had to close BOINC. After the restart, it started again at 0 % progress, but the run time continued from the approx. 7 h where I had stopped it. So currently I am at 5.4 % again and the total run time has risen from 15 h to approx. 22 h now. So something is wrong with checkpointing, I guess.

Regards
Alex
ID: 42102 · Rating: 0
Paul Forsdick

Joined: 19 Feb 09
Posts: 29
Credit: 5,452,255
RAC: 0
Message 42104 - Posted: 13 Sep 2010, 22:25:31 UTC

Hi, my latest one is a de_12. It has run 2 hours and is showing 9.259% done, so this looks like it is going to take over 200 hours. It is due by 21/9, so I would need to run 24 hours a day to get it done in time. Or should I abort it?

regards Paul
ID: 42104 · Rating: 0
(retired account)

Joined: 17 Oct 08
Posts: 36
Credit: 411,744
RAC: 0
Message 42105 - Posted: 13 Sep 2010, 22:46:43 UTC - in response to Message 42104.  

has run 2 hours and is showing 9.259% done


Hi Paul, 10% in 2 hours should be 100% in 20 hours, right? So this should be fine.

Brian has also reported here that the workunits get terminated with a "max. time exceeded" error at some point (which should depend on the system they run on). I guess that means they cannot really run into the 8-day deadline unless you have a very slow system.
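
The estimate above is a straight-line extrapolation from the progress so far. Here is a minimal Python sketch of it, assuming progress grows linearly with run time (which this thread suggests is not always true, for example after a restart from 0 %). The function name is made up for illustration.

```python
def estimate_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linearly extrapolate the total run time from the progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_hours * 100.0 / percent_done


# Paul's numbers: 2 hours elapsed, 9.259 % done.
total = estimate_total_hours(2.0, 9.259)
print(f"estimated total: {total:.1f} h, remaining: {total - 2.0:.1f} h")
# prints "estimated total: 21.6 h, remaining: 19.6 h"
```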
ID: 42105 · Rating: 0
Paul Forsdick

Joined: 19 Feb 09
Posts: 29
Credit: 5,452,255
RAC: 0
Message 42106 - Posted: 13 Sep 2010, 22:52:09 UTC

Hi, it is nearly midnight in the UK, so my maths has gone up the creek today.

thanks
ID: 42106 · Rating: 0
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 42107 - Posted: 13 Sep 2010, 22:56:04 UTC - in response to Message 42102.  
Last modified: 14 Sep 2010, 16:19:36 UTC

Hello Matt, is this fix already included in the current version 0.04? Or will it be in the upcoming one?

The upcoming one.

So currently I am at 5.4 % again and the total run time has risen from 15 h to approx. 22 h now.

Also, the run times vary widely with the parameters. In the worst possible case for 10,000 bodies, it took about 12.5 hours to run on my Core 2 Q6600 @ 3 GHz, 64-bit. I'm not sure about some of the other sizes.

Edit: Remove comment about 64 bit version being faster. I'm not sure it's true anymore; it was last time I checked months ago.
ID: 42107 · Rating: 0
(retired account)

Joined: 17 Oct 08
Posts: 36
Credit: 411,744
RAC: 0
Message 42113 - Posted: 14 Sep 2010, 13:16:33 UTC - in response to Message 42102.  


I currently have my longest-running workunit so far. 7 h of run time were already done and approx. 8 h were still to go when I had to close BOINC. After the restart, it started again at 0 % progress, but the run time continued from the approx. 7 h where I had stopped it. So currently I am at 5.4 % again and the total run time has risen from 15 h to approx. 22 h now.


Just for the record (because we have now moved to a new app version): the workunit mentioned above finished this morning and is now validated. The stderr output has some interesting info about the checkpointing problem, excerpt:


Checkpoint: tnow = 1.20291. time since last = 360.466s
Checkpoint: tnow = 1.22032. time since last = 361.073s
Checkpoint: tnow = 1.238. time since last = 360.637s
Checkpoint: tnow = 1.25557. time since last = 362.311s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -38.146212235604 2.2104695431195 32.223568725294 </plummer_r>
<plummer_v> 69.480777935001 95.95483517654 -100.99755377651 </plummer_v>
Checkpoint: tnow = 0.0197762. time since last = 903435s
Checkpoint: tnow = 0.0406272. time since last = 394.064s
Checkpoint: tnow = 0.0593286. time since last = 366.626s

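
The messages after "Thawing state" suggest that the resume step checks a header, the number of bodies, sizeof(real), and an end marker before it trusts the file. A minimal Python sketch of that kind of validation follows. The byte layout, the tag values `NBODYCKP` and `ENDCKPT!`, and the function names are invented for illustration and are not the real nbody_checkpoint format.

```python
import struct
from typing import List

# Hypothetical layout (NOT the real nbody_checkpoint format):
# 8-byte header tag, uint32 body count, uint32 sizeof(real), payload, 8-byte end tag.
HEADER = b"NBODYCKP"
TAIL = b"ENDCKPT!"
PREFIX = struct.Struct("<8sII")


def pack_checkpoint(nbody: int, real_size: int, payload: bytes) -> bytes:
    """Build a checkpoint blob in the hypothetical layout above."""
    return PREFIX.pack(HEADER, nbody, real_size) + payload + TAIL


def validate_checkpoint(blob: bytes, expected_nbody: int,
                        expected_real_size: int = 8) -> List[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    if len(blob) < PREFIX.size:
        return ["file too short to contain a header"]
    problems = []
    tag, nbody, real_size = PREFIX.unpack_from(blob)
    if tag != HEADER:
        problems.append("header not found")
    if nbody != expected_nbody:
        problems.append(f"body count {nbody} != expected {expected_nbody}")
    if real_size != expected_real_size:
        problems.append(f"sizeof(real) {real_size} != expected {expected_real_size}")
    if not blob.endswith(TAIL):
        problems.append("end marker not found")
    return problems


# A zero-filled file fails every check, much like the excerpt above.
for problem in validate_checkpoint(bytes(64), expected_nbody=10000):
    print(problem)
```

Collecting every failure rather than stopping at the first one mirrors the log, where each check reports separately before the run falls back to a fresh start.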


Btw, claimed credit 495.43, granted credit 65.73 is a bit disappointing. Never mind. ;)
ID: 42113 · Rating: 0

©2024 Astroinformatics Group