Message boards :
Number crunching :
nbody
Author | Message |
---|---|
Send message Joined: 26 Jan 09 Posts: 589 Credit: 497,834,261 RAC: 0 |
|
Send message Joined: 19 Feb 08 Posts: 350 Credit: 141,284,369 RAC: 0 |
Hello Verstapp, This problem can be solved with an app_info.xml, posted by Benzini here http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1880&nowrap=true#41580 It works fine, but most nbody tasks failed, so I've turned it off. I'm waiting for a reworked app. Regards, Alexander |
Send message Joined: 13 Sep 08 Posts: 12 Credit: 131,420,119 RAC: 0 |
|
Send message Joined: 19 Feb 08 Posts: 350 Credit: 141,284,369 RAC: 0 |
When will an ATI application for those be available?

Hi, here http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1505&nowrap=true#41596 is your answer.

Alexander |
Send message Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
When will an ATI application for those be available?

No, that's not for the nbody. The GPU nbody will come later. |
Send message Joined: 19 Feb 08 Posts: 350 Credit: 141,284,369 RAC: 0 |
Hi Matt, so that means the CUDA app will be replaced by an OpenCL one (for the Fermi cards)? BTW, could you kick the validator please? Alexander |
Send message Joined: 31 Jan 09 Posts: 31 Credit: 69,908,565 RAC: 0 |
Is there any special incantation required to get nbody wus flowing [assuming the assimilator is allowing any wus out at all, of course], or should they just flow freely, intermingled with normal ones?

I had to detach from Milkyway and reattach, and then the work units started flowing, CPU only. |
Send message Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
so that means the CUDA app will be replaced by an OpenCL one (for the Fermi cards)?

Hopefully, assuming it ends up better, which I'm quite hopeful it will. I don't have actual hardware available right now to benchmark it.
I'm not sure I can do that. I seem to not have any permissions on the server. Running the N-body on the GPU is more complicated and I haven't quite decided how to do it best yet, but it should work eventually. |
Send message Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
Thought I'd give the nbody simulation a try and ran a few of them. All validated, but ... While the nbody test was running I stopped BOINC, started it again, and a little later I found this:

<stderr_txt>
Starting fresh nbody run
Starting nbody system
<plummer_r> -12.634466571333 7.8542996334764 13.79073292416 </plummer_r>
<plummer_v> 165.9741794224 159.98384005138 -122.37229274443 </plummer_v>
Checkpoint: tnow = 0.550334. time since last = 686193s
Checkpoint: tnow = 1.06401. time since last = 65.6174s
Checkpoint: tnow = 1.62975. time since last = 65.601s
Checkpoint exists. Attempting to resume from it.
Thawing state
Didn't find header for checkpoint file.
Number of bodies in checkpoint file does not match number expected by context.
Got checkpoint file for wrong type. Expected sizeof(real) = 8, got 0
Trying to read interrupted checkpoint file
Failed to find end marker in checkpoint file.
Failed to resume checkpoint
Removing checkpoint file 'nbody_checkpoint'
Starting fresh nbody run
Starting nbody system
<plummer_r> -12.634466571333 7.8542996334764 13.79073292416 </plummer_r>
<plummer_v> 165.9741794224 159.98384005138 -122.37229274443 </plummer_v>
Checkpoint: tnow = 0.429749. time since last = 686747s
Checkpoint: tnow = 0.827625. time since last = 65.6608s
Checkpoint: tnow = 1.24675. time since last = 65.5823s
Checkpoint: tnow = 1.71687. time since last = 65.6785s
Checkpoint: tnow = 2.33148. time since last = 65.6073s
Checkpoint: tnow = 2.9769. time since last = 65.5902s
Checkpoint: tnow = 3.8518. time since last = 65.6181s
Simulation complete
<search_likelihood>-1096.6110382107113</search_likelihood>
<search_application>milkywayathome nbody 0.1 Windows x86 double</search_application>
Removing checkpoint file 'nbody_checkpoint'
</stderr_txt>

So the checkpoint/resume procedure needs a little more work?

EDIT: And the credits given do not seem to correspond to the runtimes or claimed credits. |
Send message Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
Checkpoint: tnow = 0.550334. time since last = 686193s

These don't mean anything. It's just from subtracting an arbitrary time from 0.0 when the first checkpoint happens (which I added here http://github.com/Milkyway-at-home/milkywayathome_client/commit/b8ea7ee37035eb2e69403cc8c4767f7a58111c54). It's just debug printing, since it seems like on some systems the checkpointing is happening way too often. The BOINC default interval is supposedly 300 seconds, but most systems seem to checkpoint around every 60 seconds. A fair number also seem to be checkpointing every 10 seconds for some reason, which is helping slow things down and might partially explain some of the maximum-time-exceeded errors. I've also found logs where it looks like the simulation finishes and removes the old checkpoint, but is interrupted before the program actually ends, so it ends up starting over. It doesn't look like that's what's happening in your log, though. Do you have the number for the workunit?

EDIT: And the credits given do not seem to correspond to the runtimes or claimed credits.

This is one of the problems I'm working on. The estimates of how long a workunit will take are really wrong. It's hard to predict the runtimes right now, so I'm trying to come up with something better. The parameters of the simulation cause a wide variation in how long it takes; for example, over the range of masses being fit, the run times vary by a factor of 10. This also means that on lots of systems the workunits exceed the maximum allowed time and get killed by BOINC before finishing. Another effect is that the credits are most likely wrong. |
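For anyone curious about the resume failures in the log above ("Didn't find header...", "Failed to find end marker..."): those messages imply the checkpoint file carries a header, a body count, the size of the real type, and an end marker, each verified on thaw. Here is a minimal sketch of that style of validation in C. The constants, struct layout, and function names are invented for illustration; the real format lives in the milkywayathome_client repository.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout, mirroring the checks the log messages imply:
   [magic][nbody][sizeof(real)][body data ...][end marker]           */
#define CKPT_MAGIC 0x4e424459u  /* "NBDY" */
#define CKPT_END   0x454e4421u  /* "END!" */

typedef struct { double pos[3], vel[3]; } Body;

static int write_checkpoint(const char *path, const Body *b, uint32_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    uint32_t magic = CKPT_MAGIC, realSize = sizeof(double), end = CKPT_END;
    fwrite(&magic, sizeof magic, 1, f);
    fwrite(&n, sizeof n, 1, f);
    fwrite(&realSize, sizeof realSize, 1, f);
    fwrite(b, sizeof(Body), n, f);
    fwrite(&end, sizeof end, 1, f);
    return fclose(f);
}

/* Returns 0 if every check passes, nonzero otherwise. */
static int validate_checkpoint(const char *path, uint32_t expectedN)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 1;
    uint32_t magic = 0, n = 0, realSize = 0, end = 0;
    int rc = 1;
    if (fread(&magic, sizeof magic, 1, f) != 1 || magic != CKPT_MAGIC)
        fprintf(stderr, "Didn't find header for checkpoint file.\n");
    else if (fread(&n, sizeof n, 1, f) != 1 || n != expectedN)
        fprintf(stderr, "Number of bodies does not match context.\n");
    else if (fread(&realSize, sizeof realSize, 1, f) != 1 || realSize != sizeof(double))
        fprintf(stderr, "Expected sizeof(real) = %u, got %u\n",
                (unsigned) sizeof(double), realSize);
    else if (fseek(f, (long)(n * sizeof(Body)), SEEK_CUR) != 0
             || fread(&end, sizeof end, 1, f) != 1 || end != CKPT_END)
        fprintf(stderr, "Failed to find end marker in checkpoint file.\n");
    else
        rc = 0;  /* complete, consistent checkpoint */
    fclose(f);
    return rc;
}
```

A file truncated mid-write fails the end-marker check, which is exactly the symptom in the log above.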
Send message Joined: 17 Oct 08 Posts: 36 Credit: 411,744 RAC: 0 |
Hmm.. is a variation in run time by a factor of 10 really that much? I remember that at RCN the workunit run times can vary between a few seconds and hundreds of hours, which is a factor of 1,000,000 (however, if not changed by the user, they will stop at 24 hours even if no result is found and are reissued later in a less complex calculation, if I remember correctly). Maybe a dumb question, but can't you simply extend the maximum allowed time by a factor of 10? Does it really matter if some workunits run for example 10 hours instead of one as intended? Well, I'm sure I'm missing something; otherwise you guys would already have solved the problem. ;-) |
Send message Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
Hmm.. is a variation in run time by a factor of 10 really that much? I remember that at RCN the workunit run times can vary between a few seconds and hundreds of hours, which is a factor of 1,000,000 ...

I don't know the details of what they were doing, but it's quite possible that even with workunits that varied greatly, they were able to predict ahead of time which ones would run long. BOINC requires an up-front estimate of how many floating point operations a workunit will take, to prevent things from getting stuck. I gave Travis an estimate formula before sending these out, but it's apparently quite wrong; I mostly just used fudge factors for the different pieces at around 100,000 bodies. I've found a few surprising things since then. I was expecting the times to vary slightly, with the simulation tending to run faster as it progresses, but things are varying much more than I expected.

Maybe a dumb question, but can't you simply extend the maximum allowed time by a factor of 10? Does it really matter if some workunits run for example 10 hours instead of one as intended?

I think Travis said he was already multiplying the terrible estimate formula I gave him by 10. The estimate also affects the credits given, although for that part it looks like there's a way to report how many operations actually happened. I don't think that helps with the initial estimate / BOINC killing workunits before they finish. |
Send message Joined: 8 Feb 08 Posts: 261 Credit: 104,050,322 RAC: 0 |
Workunit: 141497232 Task: 179870328 I stopped BOINC, which should send the client a signal to stop in a defined state. From the log I would rather think it stopped in the middle of writing the checkpoint, leading to the problem of not being able to resume when I started BOINC again. |
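If the client really was killed mid-write, a common defense (a generic technique, not necessarily what the milkyway client does) is to write the new checkpoint to a temporary file and then rename() it over the old one. On POSIX, rename() replaces the target atomically, so an interruption at any point leaves either the old complete checkpoint or the new one, never a truncated file. File names here are illustrative.

```c
#include <stdio.h>

/* Write the checkpoint to "<path>.tmp" first, then atomically rename it
   over the previous checkpoint. A kill mid-write only ever loses the
   temporary file, not the last good checkpoint.                        */
static int checkpoint_atomically(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    if (fwrite(data, 1, len, f) != len || fclose(f) != 0) {
        remove(tmp);       /* discard the partial temporary file */
        return -1;
    }
    return rename(tmp, path);  /* atomic replace on POSIX */
}
```

One caveat: on Windows, C rename() fails if the destination already exists, so a client there would need something like the Win32 ReplaceFile call, or a remove-then-rename with a recovery marker.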
Send message Joined: 25 Jun 10 Posts: 284 Credit: 260,490,091 RAC: 0 |
Matt, Was stopping the flow of the regular GPU workunits caused by the testing of the n-body workunits? Can the server handle dishing out both kinds of workunits at the same time? Will the regular GPU workunits be available again sometime in the near future? |
Send message Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0 |
Matt, I don't know that much about the server but it should be fine. |
Send message Joined: 25 Jun 10 Posts: 284 Credit: 260,490,091 RAC: 0 |
Matt, Well, it hasn't been fine for most of the weekend. The only workunits that were available were the N-Body workunits. ALL work on the regular workunits came to a stop. Maybe someone should look into this, and why the validator quits working. |
Send message Joined: 25 Jun 10 Posts: 284 Credit: 260,490,091 RAC: 0 |
Since the N-Body workunits started being issued over the weekend, I have only been able to get 1 regular workunit at a time. I have 12 cores and 4 ATI GPUs and should be able to build up a cache of 72 workunits, but I can't get enough workunits to keep 2 GPUs busy part time. Here is a printout of the sched_op_debug log showing that I am requesting over 600,000 seconds of work and only get 1 workunit in response. I have checked my debt, and it was OK, but just in case I reset it to zero, with no effect. I have tried running with and without an app_info file, no help. Any help you can give in debugging this would be a great help.

8/23/2010 6:54:46 PM Milkyway@home Reporting 1 completed tasks, requesting new tasks for GPU
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] CPU work request: 0.00 seconds; 0.00 CPUs
8/23/2010 6:54:46 PM Milkyway@home [sched_op_debug] ATI GPU work request: 6863331.47 seconds; 3.00 GPUs
8/23/2010 6:54:48 PM Milkyway@home Scheduler request completed: got 1 new tasks
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Server version 611
8/23/2010 6:54:48 PM Milkyway@home Project requested delay of 61 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] estimated total CPU job duration: 0 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] estimated total ATI GPU job duration: 470 seconds
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] handle_scheduler_reply(): got ack for result de_16_3s_2_147968_1282603700_0
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Deferring communication for 1 min 1 sec
8/23/2010 6:54:48 PM Milkyway@home [sched_op_debug] Reason: requested by project |
Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
You shouldn't post the same question in multiple threads. For a possible solution, check the response posted in the News thread. I don't think it has anything to do with nbody tasks as they are CPU only at this stage. Cheers, Gary. |
Send message Joined: 25 Jun 10 Posts: 284 Credit: 260,490,091 RAC: 0 |
You shouldn't post the same question in multiple threads. For a possible solution, check the response posted in the News thread. I don't think it has anything to do with nbody tasks as they are CPU only at this stage.

I think it has to do with the server and the N-Body workunits. The problem didn't start until this weekend when the N-Body workunits were released. |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
Checkpoint: tnow = 0.550334. time since last = 686193s

They have made changes in the checkpointing... I forget which version... somewhere in the 3 or 4 series the number of CPUs was taken into account: people would set an interval, and with multiple CPUs the setting was effectively divided by the CPU count, so if you set 4 minutes on an 8-CPU system the effective checkpointing interval was 30 seconds... Recently (and I don't recall how recently) the multiplier was removed, so we are back to much more rapid checkpoints than most expect... especially on GPU-equipped systems (which add processing elements and tasks in work)... |
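The division described above is simple enough to state as code; the numbers below are just the example from the post (a 4-minute setting on an 8-CPU host), and the function name is made up.

```c
/* Effective checkpoint interval when the user's setting is divided by
   the number of CPUs, as described in the post above.                */
static double effective_interval_s(double user_setting_s, int ncpus)
{
    return user_setting_s / (double) ncpus;
}
```

So a 240-second preference on an 8-core machine checkpoints every 30 seconds, which lines up with the ~60-second intervals reported earlier in the thread for typical multi-core hosts.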
©2024 Astroinformatics Group