nbody

Profile verstapp
Joined: 26 Jan 09
Posts: 589
Credit: 497,834,261
RAC: 0
Message 41631 - Posted: 21 Aug 2010, 10:28:46 UTC

Is there any special incantation required to get nbody wus flowing [assuming the assimilator is allowing any wus out at all, of course], or should they just flow freely, intermingled with normal ones?
Cheers,

PeterV

Profile Werkstatt

Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 41632 - Posted: 21 Aug 2010, 10:47:59 UTC - in response to Message 41631.  

Hello Verstapp,
This problem can be solved with an app_info.xml, posted by Benzini here: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1880&nowrap=true#41580
It works fine, but most nbody tasks failed, so I've turned it off. I'm waiting for a reworked app.

Regards,
Alexander

Profile apohawk
Joined: 13 Sep 08
Posts: 12
Credit: 131,420,119
RAC: 0
Message 41633 - Posted: 21 Aug 2010, 12:18:52 UTC - in response to Message 41632.  

When will an ATI application for those be available?
Profile Werkstatt

Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 41634 - Posted: 21 Aug 2010, 12:43:25 UTC - in response to Message 41633.  

When will an ATI application for those be available?

Hi,
Your answer is here: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1505&nowrap=true#41596

Alexander
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 41639 - Posted: 21 Aug 2010, 15:35:49 UTC - in response to Message 41634.  

When will an ATI application for those be available?

here http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1505&nowrap=true#41596 is your answer


No, that's not for the nbody. GPU nbody will come later.
Profile Werkstatt

Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 41644 - Posted: 21 Aug 2010, 16:56:35 UTC - in response to Message 41639.  


No, that's not for the nbody. GPU nbody will come later.


Hi Matt,

So does this mean the CUDA app will be replaced by an OpenCL one (for the Fermi cards)?

BTW, could you kick the validator please?

Alexander
Ironworker16
Joined: 31 Jan 09
Posts: 31
Credit: 69,908,565
RAC: 0
Message 41651 - Posted: 21 Aug 2010, 20:24:14 UTC - in response to Message 41631.  

Is there any special incantation required to get nbody wus flowing [assuming the assimilator is allowing any wus out at all, of course], or should they just flow freely, intermingled with normal ones?


I had to detach from Milkyway and reattach, and the work units started flowing, CPU only.
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 41653 - Posted: 21 Aug 2010, 20:45:45 UTC - in response to Message 41644.  

So does this mean the CUDA app will be replaced by an OpenCL one (for the Fermi cards)?

Hopefully, assuming it ends up being better, which I'm quite hopeful it will. I don't have the actual hardware available right now to benchmark it.


BTW, could you kick the validator please?


I'm not sure I can do that. I seem to not have any permissions on the server.

Running the N-body on the GPU is more complicated and I haven't quite decided how to do it best yet, but it should work eventually.
Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 41666 - Posted: 22 Aug 2010, 14:08:47 UTC
Last modified: 22 Aug 2010, 14:15:36 UTC

Thought I'd give the nbody simulation a try and ran a few of them.
All validated, but ...

While the nbody test was running, I stopped BOINC, started it again, and a little later found this:

<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -12.634466571333 7.8542996334764 13.79073292416 </plummer_r>
<plummer_v> 165.9741794224 159.98384005138 -122.37229274443 </plummer_v>
Checkpoint: tnow = 0.550334. time since last = 686193s
Checkpoint: tnow = 1.06401. time since last = 65.6174s
Checkpoint: tnow = 1.62975. time since last = 65.601s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint

Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -12.634466571333 7.8542996334764 13.79073292416 </plummer_r>
<plummer_v> 165.9741794224 159.98384005138 -122.37229274443 </plummer_v>
Checkpoint: tnow = 0.429749. time since last = 686747s
Checkpoint: tnow = 0.827625. time since last = 65.6608s
Checkpoint: tnow = 1.24675. time since last = 65.5823s
Checkpoint: tnow = 1.71687. time since last = 65.6785s
Checkpoint: tnow = 2.33148. time since last = 65.6073s
Checkpoint: tnow = 2.9769. time since last = 65.5902s
Checkpoint: tnow = 3.8518. time since last = 65.6181s
Simulation complete
<search_likelihood>-1096.6110382107113</search_likelihood>
<search_application>milkywayathome nbody 0.1 Windows x86 double</search_application>
Removing checkpoint file 'nbody_checkpoint'

</stderr_txt>

So the checkpoint/resume procedure needs a little more work?
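
(Going by the messages above, the resume code apparently checks for a header, the body count, the size of the real type, and an end marker before trusting the file; a rough sketch of that kind of check is below. The file layout, magic strings, and function names here are guesses, not the actual checkpoint format.)

/* Illustrative resume-time validation, matching the spirit of the messages in
 * the log above ("Didn't find header...", "Failed to find end marker...").
 * The file layout, magic strings, and names are guesses, not the real format. */
#include <stdio.h>
#include <string.h>

#define CP_HEADER "nbody_checkpoint"   /* hypothetical magic string */
#define CP_END    "end_checkpoint"     /* hypothetical end marker */

int checkpoint_is_usable(const char* path, unsigned expected_bodies)
{
    char header[32] = { 0 };
    char end[32] = { 0 };
    unsigned nbody = 0, size_of_real = 0;
    int ok = 0;

    FILE* f = fopen(path, "rb");
    if (!f)
        return 0;

    /* Header block: magic string, body count, sizeof(real) used by the writer. */
    ok =    fscanf(f, "%31s %u %u", header, &nbody, &size_of_real) == 3
         && strcmp(header, CP_HEADER) == 0      /* else: "Didn't find header ..."               */
         && nbody == expected_bodies            /* else: "Number of bodies ... does not match"  */
         && size_of_real == sizeof(double);     /* else: "Expected sizeof(real) = 8, got 0"     */

    if (ok)
    {
        /* A checkpoint interrupted mid-write is caught by the missing end marker. */
        fseek(f, -(long) strlen(CP_END), SEEK_END);
        ok =    fread(end, 1, strlen(CP_END), f) == strlen(CP_END)
             && memcmp(end, CP_END, strlen(CP_END)) == 0;
    }

    fclose(f);
    return ok;
}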


EDIT: And the credits given do not seem to correspond to the runtimes or claimed credits.
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 41670 - Posted: 22 Aug 2010, 14:43:27 UTC - in response to Message 41666.  

Checkpoint: tnow = 0.550334. time since last = 686193s
Checkpoint: tnow = 0.429749. time since last = 686747s

These don't mean anything. It's just from subtracting an arbitrary time from 0.0 when the first checkpoint happens (which I added here: http://github.com/Milkyway-at-home/milkywayathome_client/commit/b8ea7ee37035eb2e69403cc8c4767f7a58111c54). It's just debug printing, since it seems like on some systems the checkpointing is happening way too often. The BOINC default interval is supposedly 300 seconds, but most systems seem to do it around every 60 seconds. A fair number also seem to be checkpointing every 10 seconds for some reason, which is helping slow things down and might partially explain some of the maximum time exceeded errors.
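
(For reference, the usual application-side pattern is to ask the BOINC client whether a checkpoint is due and only write state then; a minimal sketch follows, where advance_system() and write_checkpoint_file() are hypothetical helpers and the boinc_* calls are the standard BOINC API.)

/* Minimal sketch of the standard BOINC checkpoint cadence; not the actual
 * milkywayathome_client code. advance_system() and write_checkpoint_file()
 * are hypothetical helpers. */
#include "boinc_api.h"

void advance_system(double dt);           /* hypothetical: one integration step      */
int  write_checkpoint_file(double tnow);  /* hypothetical: dump the simulation state */

void run_simulation(double tstop, double dt)
{
    double tnow = 0.0;

    while (tnow < tstop)
    {
        advance_system(dt);
        tnow += dt;

        /* The client decides when a checkpoint is due, based on the user's
         * "write to disk at most every N seconds" preference. */
        if (boinc_time_to_checkpoint())
        {
            write_checkpoint_file(tnow);
            boinc_checkpoint_completed();  /* tell the client the checkpoint finished */
        }
    }
}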

I've also found logs of ones where it looks like the simulation finishes and removes the old checkpoint, but is interrupted before the program actually ends, so then it ends up starting over. It doesn't look like that's what's happening in your log though. Do you have the number for the workunit?

EDIT: And the credits given do not seem to correspond to the runtimes or claimed credits.

This is one of the problems I'm working on. The estimates of how long a workunit will take are really wrong. It's kind of hard to predict the runtimes of the workunits right now, so I'm trying to come up with something better. The parameters of the simulation can cause a wide variation in how long it takes. For example, over the range of masses being fit, the run times vary by a factor of 10. This also means that on lots of systems the workunits exceed the maximum allowed time and then get killed by BOINC before finishing. Another effect is that the credits are most likely wrong.
(retired account)
Joined: 17 Oct 08
Posts: 36
Credit: 411,744
RAC: 0
Message 41672 - Posted: 22 Aug 2010, 17:42:25 UTC - in response to Message 41670.  


The parameters of the simulation can cause a wide variation in how long it takes. For example, over the range of masses being fit, the run times vary by a factor of 10. This also means on lots of systems, the workunits exceed the maximum allowed time and then get killed by BOINC before finishing.


Hmm... is a variation in run time by a factor of 10 really that much? I remember that at RCN the workunit run times can vary between a few seconds and hundreds of hours, which is a factor of 1,000,000 (however, if not changed by the user, they will stop at 24 hours even if no result is found and are reissued later as a less complex calculation, if I remember correctly).

Maybe a dumb question, but can't you simply extend the maximum allowed time by a factor of 10? Does it really matter if some workunits run, for example, 10 hours instead of one as intended? Well, I'm sure I'm missing something; otherwise you guys would already have solved the problem. ;-)
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 41682 - Posted: 23 Aug 2010, 1:16:25 UTC - in response to Message 41672.  

Hmm... is a variation in run time by a factor of 10 really that much? I remember that at RCN the workunit run times can vary between a few seconds and hundreds of hours, which is a factor of 1,000,000 (however, if not changed by the user, they will stop at 24 hours even if no result is found and are reissued later as a less complex calculation, if I remember correctly).

I don't know the details of what they were doing, but it's quite possible, and likely, that even with workunits that varied greatly they were able to predict ahead of time which ones would run long. BOINC requires an estimate of how many floating point operations a workunit will take, ahead of time, to prevent things from getting stuck. I gave Travis an estimate formula before sending these out, but it's apparently quite wrong. I mostly just used fudge factors for the different pieces at around 100,000 bodies. I've found a few surprising things since then. I was expecting the times to vary only slightly, with the simulation tending to run faster as it progresses. Things are varying much more than I expected.
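
(Just to illustrate the kind of estimate BOINC wants, a back-of-the-envelope sketch is below. The constants and the n log n cost model are illustrative guesses, not the formula actually given to Travis.)

/* Rough FLOP estimate for a tree-code N-body workunit. All constants here
 * are illustrative guesses, not the project's actual estimate formula. */
#include <math.h>
#include <stdio.h>

static double estimate_fpops(double n_bodies, double n_steps)
{
    const double flops_per_interaction = 30.0;                 /* guess per body-node force */
    double interactions_per_step = n_bodies * log2(n_bodies);  /* Barnes-Hut-style cost     */
    return n_steps * interactions_per_step * flops_per_interaction;
}

int main(void)
{
    /* ~100,000 bodies and 1,000 steps -> on the order of 5e10 flops. If the fitted
     * parameters change the number of steps by 10x, the estimate moves 10x too. */
    printf("%g\n", estimate_fpops(1.0e5, 1.0e3));
    return 0;
}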

Maybe a dumb question, but can't you simply extend the maximum allowed time by a factor of 10? Does it really matter if some workunits run, for example, 10 hours instead of one as intended? Well, I'm sure I'm missing something; otherwise you guys would already have solved the problem. ;-)

I think Travis said he was already multiplying by 10 over the terrible estimate formula I gave him. The estimate also has something to do with the credits given, although for that part it looks like there's a way to report how many operations actually happen. I don't think that helps with the initial estimate / BOINC killing it before it's finished problem.
Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 41683 - Posted: 23 Aug 2010, 1:29:34 UTC - in response to Message 41670.  


I've also found logs of ones where it looks like the simulation finishes and removes the old checkpoint, but is interrupted before the program actually ends, so then it ends up starting over. It doesn't look like that's what's happening in your log though. Do you have the number for the workunit?


Workunit: 141497232
Task: 179870328

I stopped BOINC, which should send the client a signal to stop in a defined state.
From the log, I would rather think it stopped in the middle of writing the checkpoint, leading to it not being able to resume when I started BOINC again.
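
(If that is what happened, one common way to make the write itself interruption-safe is to write the checkpoint to a temporary file and rename it over the old one only once it is complete; a sketch assuming POSIX fsync()/rename() semantics, with illustrative names, is below.)

/* Sketch of crash-safe checkpoint writing: a partially written temporary file
 * never replaces a good checkpoint. Names are illustrative, not the actual
 * milkywayathome_client code. Assumes POSIX fsync()/rename() semantics. */
#include <stdio.h>
#include <unistd.h>     /* fsync, fileno */

int save_checkpoint_atomically(const void* state, size_t size)
{
    const char* tmp_path   = "nbody_checkpoint.tmp";
    const char* final_path = "nbody_checkpoint";

    FILE* f = fopen(tmp_path, "wb");
    if (!f)
        return -1;

    if (fwrite(state, 1, size, f) != size)
    {
        fclose(f);
        return -1;
    }

    fflush(f);
    fsync(fileno(f));   /* make sure the bytes are on disk before the rename */
    fclose(f);

    /* rename() is atomic on POSIX: a resume sees either the old checkpoint
     * or the complete new one, never a half-written file. */
    return rename(tmp_path, final_path);
}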
Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0
Message 41687 - Posted: 23 Aug 2010, 2:37:16 UTC

Matt,

Was stopping the flow of the regular GPU workunits caused by the testing of the n-body workunits? Can the server handle dishing out both kinds of workunits at the same time? Will the regular GPU workunits be available again sometime in the near future?
Matt Arsenault
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 8 May 10
Posts: 576
Credit: 15,979,383
RAC: 0
Message 41689 - Posted: 23 Aug 2010, 4:32:11 UTC - in response to Message 41687.  

Matt,

Was stopping the flow of the regular GPU workunits caused by the testing of the n-body workunits? Can the server handle dishing out both kinds of workunits at the same time? Will the regular GPU workunits be available again sometime in the near future?

I don't know that much about the server but it should be fine.
Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0
Message 41691 - Posted: 23 Aug 2010, 5:18:09 UTC - in response to Message 41689.  

Matt,

Was stopping the flow of the regular GPU workunits caused by the testing of the n-body workunits? Can the server handle dishing out both kinds of workunits at the same time? Will the regular GPU workunits be available again sometime in the near future?

I don't know that much about the server but it should be fine.


Well, it hasn't been fine for most of the weekend. The only workunits that were available were the N-Body workunits. ALL work on the regular workunits came to a stop.

Maybe someone should look into this, and why the validator quits working.
Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0
Message 41707 - Posted: 23 Aug 2010, 23:12:04 UTC

Since the N-Body workunits started being issued over the weekend, I have only been able to get one regular workunit at a time. I have 12 cores and 4 ATI GPUs and should be able to build up a cache of 72 workunits, but I can't get enough workunits to keep even 2 GPUs busy part of the time. Here is a printout of the sched_op_debug output showing that I am requesting over 600,000 seconds of work and only get 1 workunit in response.

I have checked my debt, and it was OK. But, just in case, I reset it to zero, with no effect. I have tried running with and without an app_info file; no help.

Any help you can give in debugging this would be a great help.

8/23/2010 6:54:46 PM Milkyway@home Reporting 1 completed tasks, requesting new tasks for GPU
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] ATI GPU work request: 6863331.47 seconds; 3.00 GPUs
8/23/2010 6:54:48 PM Milkyway@home Scheduler request completed: got 1 new tasks
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Server version 611
8/23/2010 6:54:48 PM Milkyway@home Project requested delay of 61 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] estimated total CPU job duration: 0 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] estimated total ATI GPU job duration: 470 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] handle_scheduler_reply(): got ack for result de_16_3s_2_147968_1282603700_0
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Deferring communication for 1 min 1 sec
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Reason: requested by project

Profile Gary Roberts

Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,499
RAC: 0
Message 41711 - Posted: 24 Aug 2010, 0:08:32 UTC - in response to Message 41707.  

You shouldn't post the same question in multiple threads. For a possible solution, check the response posted in the News thread. I don't think it has anything to do with nbody tasks, as they are CPU-only at this stage.

Cheers,
Gary.
Profile mdhittle*
Joined: 25 Jun 10
Posts: 284
Credit: 260,490,091
RAC: 0
Message 41712 - Posted: 24 Aug 2010, 0:19:22 UTC - in response to Message 41711.  

You shouldn't post the same question in multiple threads. For a possible solution, check the response posted in the News thread. I don't think it has anything to do with nbody tasks as they are CPU only at this stage.


I think it has to do with the server and the N-Body workunits. The problem didn't start until this weekend when the N-Body workunits were released.
Profile Paul D. Buck

Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 41736 - Posted: 25 Aug 2010, 8:40:04 UTC - in response to Message 41670.  

Checkpoint: tnow = 0.550334. time since last = 686193s
Checkpoint: tnow = 0.429749. time since last = 686747s

These don't mean anything. It's just from subtracting an arbitrary time from 0.0 when the first checkpoint happens (which I added here: http://github.com/Milkyway-at-home/milkywayathome_client/commit/b8ea7ee37035eb2e69403cc8c4767f7a58111c54). It's just debug printing, since it seems like on some systems the checkpointing is happening way too often. The BOINC default interval is supposedly 300 seconds, but most systems seem to do it around every 60 seconds. A fair number also seem to be checkpointing every 10 seconds for some reason, which is helping slow things down and might partially explain some of the maximum time exceeded errors.

They have made changes to the checkpointing... I forget which version... somewhere in the 3 or 4 series the number of CPUs was taken into account, because people were setting it to a value and, because of the multiple CPUs, the setting was effectively divided by that count... so if you set 4 minutes on an 8-CPU system, the effective checkpointing interval was 30 seconds...

Recently (and I don't recall how recently) the multiplier was removed, so we are back to much more rapid checkpoints than most expect... especially on GPU-equipped systems (which add processing elements and tasks in work)...
