New Nbody version 1.48

Knut Petter
Joined: 29 Apr 09
Posts: 4
Credit: 74,209,059
RAC: 0

Message 63222 - Posted: 12 Mar 2015, 18:44:44 UTC

That would probably explain it. Though for the project, I would imagine the computation power of the 6950 would be more valuable than my poor old Opterons.

bowlinra
Joined: 16 Feb 15
Posts: 9
Credit: 13,905,328
RAC: 0

Message 63223 - Posted: 12 Mar 2015, 19:18:34 UTC

Another stalled 1.48 task. The usual runtime is about 17 minutes; this one is at 3 hr 32 min.

Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_709308_1 03:32:22 (03:52:51) 100.0 100.0 - 10d,10:40:41 16C Running 26.99 MB fah23 - Amd 6176se 2.713Ghz

____________
SMP: 4x AMD 61xx@3.0Ghz - 4x AMD 6176SE@2.71Ghz - 4x AMD 6172@2.41Ghz - 2x AMD 62xx@3.3Ghz

"Unless someone like you cares a whole awful lot, nothing is going to get better. It's not." Dr. Seuss, The Lorax

Xray Quadrant
Joined: 17 Jul 10
Posts: 4
Credit: 63,470,421
RAC: 64,244

Message 63225 - Posted: 13 Mar 2015, 23:25:58 UTC

I keep having tasks from the new Nbody version 1.48 continue to run after they show 100% completion. Is there a way to keep from downloading these files? I do not want to suspend the whole Milkyway program, I just want to stop this from happening. Here is an example: the file de_nbody_2_13_orphan_sim_1_1422013803_566832_0 showed a total run time of about 8 minutes and has now run for 8:58:56 with a remaining time of ---.

bowlinra
Joined: 16 Feb 15
Posts: 9
Credit: 13,905,328
RAC: 0

Message 63226 - Posted: 14 Mar 2015, 0:01:35 UTC - in response to Message 63225.

I keep having tasks from the new Nbody version 1.48 continue to run after they show 100% completion. Is there a way to keep from downloading these files? I do not want to suspend the whole Milkyway program, I just want to stop this from happening. Here is an example: the file de_nbody_2_13_orphan_sim_1_1422013803_566832_0 showed a total run time of about 8 minutes and has now run for 8:58:56 with a remaining time of ---.

Log in to your MilkyWay@Home account. Edit MilkyWay@Home preferences -> Run only the selected applications
Uncheck: MilkyWay@Home N-Body Simulation

I'm only seeing the stalling on the 16C versions.

Xray Quadrant
Joined: 17 Jul 10
Posts: 4
Credit: 63,470,421
RAC: 64,244

Message 63227 - Posted: 14 Mar 2015, 1:16:23 UTC

I keep having tasks from the new Nbody version 1.48 continue to run after they show 100% completion. Is there a way to keep from downloading these files? I do not want to suspend the whole Milkyway program, I just want to stop this from happening. Here is an example: the file de_nbody_2_13_orphan_sim_1_1422013803_566832_0 showed a total run time of about 8 minutes and has now run for 8:58:56 with a remaining time of ---.

Log in to your MilkyWay@Home account. Edit MilkyWay@Home preferences -> Run only the selected applications
Uncheck: MilkyWay@Home N-Body Simulation

I'm only seeing the stalling on the 16C versions.


Thanks

europa
Joined: 29 Oct 10
Posts: 89
Credit: 39,246,947
RAC: 0

Message 63229 - Posted: 14 Mar 2015, 14:05:07 UTC

My N-body WUs start off really cranking but gradually slow down as more and more of the WU is completed.

Finally, after hours and hours more, they're at 99.999% and become asymptotic WUs: approaching infinitely close to completion but never actually completing!

Right now, they're just wasting a lot of computing power and electricity.

Regards,
Steve

Jacob Klein
Joined: 22 Jun 11
Posts: 32
Credit: 2,437,000
RAC: 6,907

Message 63230 - Posted: 14 Mar 2015, 14:08:11 UTC - in response to Message 63229.
Last modified: 14 Mar 2015, 14:09:35 UTC

If you think of it as a process that is incapable of determining its own length/progress (which some BOINC tasks are) .... then what should it do?

Approach 100%, but never hit it?
Sit at 100%?
Does it matter?

My point is that it's actually probably NOT a waste of electricity, if it's still running. It's probably just incapable of determining its own progress correctly. That number, is just a number. Even when it "started off really cranking", that number was just a number.
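To make the "approach 100%, but never hit it" case concrete, here is a minimal sketch of one way a progress number can behave like that. The formula and the function name `estimated_fraction_done` are purely illustrative assumptions, not taken from BOINC or the N-body application: when a task cannot measure its own progress, a client could fall back to elapsed time divided by elapsed time plus the original estimate.

```python
def estimated_fraction_done(elapsed_s, estimated_total_s):
    """Hypothetical fallback estimate: elapsed / (elapsed + original estimate).

    The value rises quickly at first, then flattens out, and stays below 1.0
    for any realistic elapsed time. It reflects the clock, not the real work
    completed.
    """
    return elapsed_s / (elapsed_s + estimated_total_s)


# An 8-minute estimate, sampled at increasing elapsed times (in seconds).
for elapsed in (60, 480, 3600, 32336):
    print(elapsed, round(estimated_fraction_done(elapsed, 480.0), 4))
```

Under this assumed formula the number "cranks" early and crawls later regardless of what the task is actually doing, which is why the displayed percentage alone says little about whether the task is still making progress.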

Pascal
Joined: 17 Oct 14
Posts: 3
Credit: 842,084
RAC: 0

Message 63258 - Posted: 25 Mar 2015, 4:35:23 UTC

The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.

Jacob Klein
Joined: 22 Jun 11
Posts: 32
Credit: 2,437,000
RAC: 6,907

Message 63259 - Posted: 25 Mar 2015, 5:03:12 UTC
Last modified: 25 Mar 2015, 5:10:38 UTC

The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.


That setting "Switch between apps every x minutes" is more sophisticated than you think. I was under the impression that it was supposed to switch away from an app only when it had checkpointed, or when it was being pre-empted by a task in deadline jeopardy. So... the user's setting is basically just a suggestion for BOINC to try to switch if it can, but only when a checkpoint occurs.

So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?

Pascal
Joined: 17 Oct 14
Posts: 3
Credit: 842,084
RAC: 0

Message 63261 - Posted: 25 Mar 2015, 9:05:41 UTC - in response to Message 63259.

We had multithreaded the assignment of radii and velocities to bodies. Both were done through rejection sampling, using random numbers. However, when that code ran with multiple threads, the assigned radii and velocities differed between runs even with the same random number seed and parameters. This was because the order in which the threads ran was indeterminate, meaning which body was assigned which radius and velocity was unpredictable. This was a very nasty bug, made nastier because it did not present itself. Runs would complete normally. However, because of the indeterminate nature of the algorithm, a poorer likelihood was reported than would be expected with a set of parameters, even if they were close. Therefore, overall, it led to poor convergence.

It sounds like you were trying to use a single PRNG for all threads. One solution is to use multiple independent PRNGs. If body distribution to threads is deterministic then the original thread can pass each new thread a seed from its PRNG they can use to seed their own PRNG. If not, then every time a thread picks up a new body it will need to reseed its PRNG using something specific to the body. erand48(), nrand48(), or jrand48() may work for you.
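A minimal sketch of the second option (reseeding from something specific to the body) is shown here in Python rather than with the `*rand48()` functions. Everything in it is an assumption for illustration: the helper names `sample_radius`, `init_body` and `init_bodies`, the r**2 target density, and the way the run seed and body index are combined. It is not the project's actual initialization code.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_radius(rng, r_max=10.0):
    """Rejection-sample a radius with density proportional to r**2 on [0, r_max].

    The r**2 density is a stand-in; the real N-body profile is not shown here.
    """
    while True:
        r = rng.uniform(0.0, r_max)
        if rng.random() <= (r / r_max) ** 2:
            return r


def init_body(base_seed, body_index):
    # Each body gets a private generator seeded from (run seed, body index),
    # so its random stream does not depend on which thread handles it or when.
    # This simple combination is unique only while body_index < 1_000_003.
    rng = random.Random(base_seed * 1_000_003 + body_index)
    return sample_radius(rng)


def init_bodies(base_seed, n_bodies, n_threads):
    # pool.map returns results in input order, whatever the execution order.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda i: init_body(base_seed, i), range(n_bodies)))
```

With this arrangement `init_bodies(42, 200, 1)` and `init_bodies(42, 200, 8)` return the same list, because no generator is shared between bodies. The sketch only demonstrates reproducibility; Python threads would not make the sampling itself faster.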

So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?

I ran into the problem on BOINC 7.2.42 on Linux version 3.10.0-123.13.2.el7.x86_64. That is the current "Recommended version" for that OS. No, leave_apps_in_memory was unset. It may be possible to easily reproduce by setting cpu_scheduling_period_minutes to 1. I'll think about it.

Richard Haselgrove
Joined: 4 Sep 12
Posts: 218
Credit: 448,778
RAC: 0

Message 63267 - Posted: 25 Mar 2015, 22:41:47 UTC - in response to Message 63259.

The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.

That setting "Switch between apps every x minutes" is more sophisticated than you think. I was under the impression that it was supposed to switch away from an app only when it had checkpointed, or when it was being pre-empted by a task in deadline jeopardy. So... the user's setting is basically just a suggestion for BOINC to try to switch if it can, but only when a checkpoint occurs.

So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?

There are well-known weaknesses in MT scheduling. More commonly in the opposite direction:

If you are running a multiplicity of other CPU projects, and BOINC decides to schedule an MT task, it will wait until one of the other project tasks is preemptible - checkpointed, task exit, or something like that. But as soon as the MT task starts, all the other tasks (well, enough to meet the MT thread count) will be preempted immediately, ready or not.

I'd not heard of the reverse problem (an MT task being preempted before it checkpoints) before, but it's worth investigating, to find what is triggering the switch. Something in 'even higher priority', perhaps? And as Jacob says, if the MT task is kept in memory when suspended, it shouldn't restart from square zero.

Pascal
Joined: 17 Oct 14
Posts: 3
Credit: 842,084
RAC: 0

Message 63269 - Posted: 26 Mar 2015, 1:02:16 UTC - in response to Message 63267.

I'd not heard of the reverse problem (an MT task being preempted before it checkpoints) before, but it's worth investigating, to find what is triggering the switch. Something in 'even higher priority', perhaps? And as Jacob says, if the MT task is kept in memory when suspended, it shouldn't restart from square zero.

Not worth investigating. I was wrong. BOINC appears to be behaving correctly. The problem, however, was actually worse than I thought.

All my ssh sessions are automatically logged. I just found this reviewing yesterday's logs:

8) -----------
name: de_nbody_2_13_orphan_sim_2_1422013803_804445_1
WU name: de_nbody_2_13_orphan_sim_2_1422013803_804445
project URL: http://milkyway.cs.rpi.edu/milkyway/
report deadline: Thu Mar 26 19:38:57 2015
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 148
checkpoint CPU time: 0.000000
current CPU time: 734050.600000
fraction done: 1.000000
swap size: 22847488.000000
working set size: 3469312.000000
estimated CPU time remaining: -735547.873428
9) -----------

That's just under 8.5 days straight Nbody spent on that task without checkpointing, apparently uninterrupted by BOINC. The CPU is an Intel Xeon E3-1271 v3 @ 3.60GHz. Not the latest & greatest, but certainly no slouch. Other tasks on the box use about half a core.
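For anyone who wants to spot such tasks without reading the dump by eye, here is a rough sketch that scans text in the format shown above (it appears to be the output of `boinccmd --get_tasks`, which is an assumption) and flags tasks that look stalled. The helper names `parse_tasks` and `is_stalled` and the 24-hour threshold are hypothetical choices, not part of BOINC.

```python
def parse_tasks(text):
    """Split a task dump into one dict of 'key: value' fields per task."""
    tasks, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("-----------"):  # separator such as "8) -----------"
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    # Drop empty entries created by a trailing separator.
    return [t for t in tasks if t]


def is_stalled(task, min_cpu_hours=24.0):
    """Flag a running task at 100% that has never checkpointed (assumed rule)."""
    return (
        task.get("active_task_state") == "EXECUTING"
        and float(task.get("fraction done", 0)) >= 1.0
        and float(task.get("checkpoint CPU time", 0)) == 0.0
        and float(task.get("current CPU time", 0)) >= min_cpu_hours * 3600
    )
```

Fed the entry above, `is_stalled` returns `True`: the task is executing, reports a fraction done of 1.0, has a checkpoint CPU time of zero, and has a current CPU time of 734050.6 s, which is 734050.6 / 86400 ≈ 8.5 days.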

I first started looking at the box yesterday after I noticed BOINC was only using 1 of the 8 cores. It looks like my troubleshooting was what caused that task to start over several times yesterday, not BOINC. Eventually I aborted all the Nbody tasks and set nomorework for Milkyway.

Wisesooth
Joined: 2 Oct 14
Posts: 33
Credit: 19,784,058
RAC: 29,270

Message 63329 - Posted: 7 Apr 2015, 17:32:38 UTC - in response to Message 63230.

I had to abort n-body tasks after they had sat at 100% for anywhere from 25 hours to three days. 100% is not just a number; it is also a clue.

If the task is in a long loop while using multiple-core threads, the cause may be a persistent timing issue. This anomaly occurs when a task uses multiple processor threads to calculate a heuristic solution. BOINC uses machines of all kinds of capabilities. Most of them are not designed for intense scientific computation. Statisticians talk about "degrees of freedom." Grid computing by its very nature introduces potential instability in multi-core tasks.

I currently am having no problems with n-body 1 tasks. I am having problems with n-body 2 tasks.

Ed
Joined: 3 Dec 13
Posts: 1
Credit: 158,288,347
RAC: 408,604

Message 63359 - Posted: 12 Apr 2015, 2:35:08 UTC - in response to Message 63137.

I am having a lot of problems with the multi-thread work units. The work units reserve all of the threads on my Intel i5 and i7 machines (4 and 8 threads respectively), but the CPU consumption goes anywhere from 4% to a max of about 80% on each thread. This is very wasteful of the processing power, and I have started aborting all mt work units that I catch. Would you please look into either fixing the mt problem or consider eliminating it completely? The BOINC manager does a great job of ensuring the processor threads are fully utilized. Relying on the BOINC manager seems like a good alternative to doing the work management internally in the Milkyway code.



Copyright © 2017 AstroInformatics Group