New Nbody version 1.48

Author	Message
Knut Petter Joined: 29 Apr 09 Posts: 4 Credit: 74,209,059 RAC: 0	Message 63222 - Posted: 12 Mar 2015, 18:44:44 UTC
That would probably explain it. Though for the project, I would imagine the computation power of the 6950 would be more valuable than my poor old Opterons.

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63223 - Posted: 12 Mar 2015, 19:18:34 UTC
Another stalled 1.48 task. The usual runtime is 17 minutes; this one has run for 3 hr 32 min.
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_709308_1 03:32:22 (03:52:51) 100.0 100.0 - 10d,10:40:41 16C Running 26.99 MB fah23 - Amd 6176se 2.713Ghz
SMP: 4x AMD 61xx@3.0Ghz - 4x AMD 6176SE@2.71Ghz - 4x AMD 6172@2.41Ghz - 2x AMD 62xx@3.3Ghz
"Unless someone like you cares a whole awful lot, nothing is going to get better. It's not." Dr. Seuss, The Lorax

Xray Quadrant Joined: 17 Jul 10 Posts: 4 Credit: 316,889,147 RAC: 0	Message 63225 - Posted: 13 Mar 2015, 23:25:58 UTC
I keep having the New Nbody version 1.48 continue to run after it shows 100% completion. Is there a way to keep from downloading these files? I do not want to suspend the whole Milkyway program; I just want to stop this from happening. Here is an example: This file de_nbody_2_13_orphan_sim_1_1422013803_566832_0 showed a total run time of about 8 minutes and has now run for 8:58:56 with a remaining time of ---.

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63226 - Posted: 14 Mar 2015, 0:01:35 UTC - in response to Message 63225.
> I keep having the New Nbody version 1.48 continue to run after it shows 100% completion. Is there a way to keep from downloading these files? I do not want to suspend the whole Milkyway program; I just want to stop this from happening. Here is an example: This file de_nbody_2_13_orphan_sim_1_1422013803_566832_0 showed a total run time of about 8 minutes and has now run for 8:58:56 with a remaining time of ---.
Log in to your MilkyWay@Home account, go to Edit MilkyWay@Home preferences -> Run only the selected applications, and uncheck MilkyWay@Home N-Body Simulation.
I'm only seeing the stalling on the 16C versions.

Xray Quadrant Joined: 17 Jul 10 Posts: 4 Credit: 316,889,147 RAC: 0	Message 63227 - Posted: 14 Mar 2015, 1:16:23 UTC
>> I keep having the New Nbody version 1.48 continue to run after it shows 100% completion. Is there a way to keep from downloading these files? I do not want to suspend the whole Milkyway program; I just want to stop this from happening. Here is an example: This file de_nbody_2_13_orphan_sim_1_1422013803_566832_0 showed a total run time of about 8 minutes and has now run for 8:58:56 with a remaining time of ---.
> Log in to your MilkyWay@Home account, go to Edit MilkyWay@Home preferences -> Run only the selected applications, and uncheck MilkyWay@Home N-Body Simulation.
> I'm only seeing the stalling on the 16C versions.
Thanks

europa Joined: 29 Oct 10 Posts: 89 Credit: 39,246,947 RAC: 0	Message 63229 - Posted: 14 Mar 2015, 14:05:07 UTC
My N-body WUs start off really cranking but gradually slow down as more and more of the WU is completed. Finally, after hours and hours more, they're at 99.999% and become asymptotic WUs... approaching infinitely close but never actually completing! Right now, they're just wasting a lot of computing power and electricity.
Regards, Steve

Jacob Klein Joined: 22 Jun 11 Posts: 32 Credit: 41,852,496 RAC: 0	Message 63230 - Posted: 14 Mar 2015, 14:08:11 UTC - in response to Message 63229. Last modified: 14 Mar 2015, 14:09:35 UTC
If you think of it as a process that is incapable of determining its own length/progress (which some BOINC tasks are)... then what should it do? Approach 100%, but never hit it? Sit at 100%? Does it matter?
My point is that it's actually probably NOT a waste of electricity, if it's still running. It's probably just incapable of determining its own progress correctly. That number is just a number. Even when it "started off really cranking", that number was just a number.

Pascal Joined: 17 Oct 14 Posts: 3 Credit: 842,084 RAC: 0	Message 63258 - Posted: 25 Mar 2015, 4:35:23 UTC
The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.
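
The scenario described in this post can be illustrated with a toy model. This is only a sketch under the post's own assumptions (two projects alternating in fixed time slices, and an initialization phase that never checkpoints and loses all progress when switched out); it does not reproduce BOINC's actual scheduler, and `simulate_restarts` and its parameters are hypothetical names.

```python
def simulate_restarts(init_minutes, slice_minutes, total_minutes):
    """Toy model: two projects alternate in fixed time slices; the first one
    never checkpoints and loses all progress whenever it is switched out.

    Returns (wasted_minutes, init_completed).
    """
    wasted = 0
    clock = 0
    running_nbody = True  # the non-checkpointing task gets the first slice
    while clock < total_minutes:
        step = min(slice_minutes, total_minutes - clock)
        if running_nbody:
            if step >= init_minutes:
                # Initialization fits inside a single slice, so it completes.
                return wasted, True
            # The slice ended first; everything done in it is discarded.
            wasted += step
        clock += step
        running_nbody = not running_nbody
    return wasted, False


# One day of wall time, 60-minute slices, 61 minutes of initialization.
wasted, done = simulate_restarts(init_minutes=61, slice_minutes=60,
                                 total_minutes=24 * 60)
print(f"wasted {wasted} of {24 * 60} minutes, initialization finished: {done}")
```

Under these assumptions, any initialization longer than the time slice never completes, and half of the wall time is discarded.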

Jacob Klein Joined: 22 Jun 11 Posts: 32 Credit: 41,852,496 RAC: 0	Message 63259 - Posted: 25 Mar 2015, 5:03:12 UTC Last modified: 25 Mar 2015, 5:10:38 UTC
> The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.
That setting "Switch between apps every x minutes" is more sophisticated than you think. I was under the impression that it was supposed to switch away from an app only when it had checkpointed, or was being pre-empted by a task that was in deadline jeopardy. So... the user's setting is basically just a suggestion for BOINC to try to switch if it can, but only when a checkpoint occurs.
So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?

Pascal Joined: 17 Oct 14 Posts: 3 Credit: 842,084 RAC: 0	Message 63261 - Posted: 25 Mar 2015, 9:05:41 UTC - in response to Message 63259.
> We had multithreaded the assignments of radii and velocities to bodies. Both of these were done through rejection sampling, using random numbers. However, when that code ran with multiple threads the assignment of radii and velocity were different between runs even with the same random number seed and parameters. This was because which thread ran in which order was indeterminate, meaning which body was assigned what radii and velocity was unpredictable. This was a very nasty bug, made nastier because it did not present itself. Runs would complete normally. However, because of the indeterminate nature of the algorithm, a poorer likelihood was reported than would be expected with a set of parameters, even if they were close. Therefore, overall, it led to poor convergence.
It sounds like you were trying to use a single PRNG for all threads. One solution is to use multiple independent PRNGs. If body distribution to threads is deterministic, then the original thread can pass each new thread a seed from its PRNG that the new thread can use to seed its own PRNG. If not, then every time a thread picks up a new body it will need to reseed its PRNG using something specific to the body. erand48(), nrand48(), or jrand48() may work for you.
> So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?
I ran into the problem on BOINC 7.2.42 on Linux version 3.10.0-123.13.2.el7.x86_64. That is the current "Recommended version" for that OS. No, leave_apps_in_memory was unset. It may be possible to easily reproduce by setting cpu_scheduling_period_minutes to 1. I'll think about it.
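
The per-body reseeding idea from this post can be sketched as follows. This is a minimal illustration, not the project's code: Python's `random.Random` stands in for the `erand48()` family, and the density function, `R_MAX`, `F_MAX`, `sample_radius`, and `assign_radii` are all assumptions made for the example. The point is only that each body's generator is seeded from the run seed plus the body index, so the result does not depend on which thread handles which body.

```python
import random
from concurrent.futures import ThreadPoolExecutor

R_MAX = 10.0   # assumed cutoff radius (illustrative units)
F_MAX = 0.19   # upper bound on density() below; its true maximum is about 0.186


def density(r):
    """Illustrative Plummer-like radial density, r^2 * (1 + r^2)^(-5/2)."""
    return r * r * (1.0 + r * r) ** -2.5


def sample_radius(base_seed, body_index):
    """Rejection-sample one radius using a PRNG private to this body."""
    # Seed depends only on the run seed and the body, never on the thread.
    # Seeds stay distinct across bodies as long as body_index < 1_000_003.
    rng = random.Random(base_seed * 1_000_003 + body_index)
    while True:
        r = rng.uniform(0.0, R_MAX)
        if rng.uniform(0.0, F_MAX) < density(r):
            return r


def assign_radii(base_seed, n_bodies, n_threads):
    """Assign radii to all bodies in parallel; output order follows body index."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda i: sample_radius(base_seed, i),
                             range(n_bodies)))


# The same seed gives the same assignment for any thread count.
print(assign_radii(42, 100, 1) == assign_radii(42, 100, 8))
```

Seeding per body costs one PRNG construction per body, which is usually small next to the rejection loop itself, and it removes any dependence on thread scheduling.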

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 63267 - Posted: 25 Mar 2015, 22:41:47 UTC - in response to Message 63259.
>> The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.
> That setting "Switch between apps every x minutes" is more sophisticated than you think. I was under the impression that it was supposed to switch away from an app only when it had checkpointed, or was being pre-empted by a task that was in deadline jeopardy. So... the user's setting is basically just a suggestion for BOINC to try to switch if it can, but only when a checkpoint occurs.
> So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?
There are well-known weaknesses in MT scheduling, more commonly in the opposite direction. If you are running a multiplicity of other CPU projects, and BOINC decides to schedule an MT task, it will wait until one of the other project tasks is preemptible - checkpointed, task exit, or something like that. But as soon as the MT task starts, all the other tasks (well, enough to meet the MT thread count) will be preempted immediately, ready or not.
I'd not heard of the reverse problem, MT preempting before checkpoint, before - but it's worth investigating, to find what is triggering the switch. Something in 'even higher priority', perhaps?
And as Jacob says, if the MT task is kept in memory when suspended, it shouldn't restart from square zero.

Pascal Joined: 17 Oct 14 Posts: 3 Credit: 842,084 RAC: 0	Message 63269 - Posted: 26 Mar 2015, 1:02:16 UTC - in response to Message 63267.
> I'd not heard of the reverse problem, MT preempting before checkpoint, before - but it's worth investigating, to find what is triggering the switch. Something in 'even higher priority', perhaps?
> And as Jacob says, if the MT task is kept in memory when suspended, it shouldn't restart from square zero.
Not worth investigating. I was wrong. BOINC appears to be behaving correctly. The problem, however, was actually worse than I thought. All my ssh sessions are automatically logged. I just found this reviewing yesterday's logs:
8) -----------
   name: de_nbody_2_13_orphan_sim_2_1422013803_804445_1
   WU name: de_nbody_2_13_orphan_sim_2_1422013803_804445
   project URL: http://milkyway.cs.rpi.edu/milkyway/
   report deadline: Thu Mar 26 19:38:57 2015
   ready to report: no
   got server ack: no
   final CPU time: 0.000000
   state: downloaded
   scheduler state: scheduled
   exit_status: 0
   signal: 0
   suspended via GUI: no
   active_task_state: EXECUTING
   app version num: 148
   checkpoint CPU time: 0.000000
   current CPU time: 734050.600000
   fraction done: 1.000000
   swap size: 22847488.000000
   working set size: 3469312.000000
   estimated CPU time remaining: -735547.873428
9) -----------
That's just under 8.5 days straight that Nbody spent on that task without checkpointing, apparently uninterrupted by BOINC. The CPU is an Intel Xeon E3-1271 v3 @ 3.60GHz. Not the latest & greatest, but certainly no slouch. Other tasks on the box use about half a core.
I first started looking at the box yesterday after I noticed BOINC was only using 1 of the 8 cores. It looks like my troubleshooting was what caused that task to start over several times yesterday, not BOINC. Eventually I aborted all the Nbody tasks and set nomorework for Milkyway.
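
The check Pascal did by eye on this log can be sketched in a few lines. The field names ("fraction done", "checkpoint CPU time", "current CPU time") are taken from the dump in the post; `parse_task`, `looks_stalled`, and the one-day threshold are hypothetical choices for illustration, not part of any BOINC tool.

```python
def parse_task(lines):
    """Turn 'key: value' lines from a task dump into a dict of strings."""
    fields = {}
    for line in lines:
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stalled(fields, min_cpu_seconds=86400.0):
    """Flag a task that reports 100% done, has never checkpointed,
    and has already used more CPU time than the (assumed) threshold."""
    return (float(fields["fraction done"]) >= 1.0
            and float(fields["checkpoint CPU time"]) == 0.0
            and float(fields["current CPU time"]) > min_cpu_seconds)


# A few fields copied from the dump above.
sample = [
    "   name: de_nbody_2_13_orphan_sim_2_1422013803_804445_1",
    "   project URL: http://milkyway.cs.rpi.edu/milkyway/",
    "   checkpoint CPU time: 0.000000",
    "   current CPU time: 734050.600000",
    "   fraction done: 1.000000",
]

task = parse_task(sample)
cpu_days = float(task["current CPU time"]) / 86400.0
print(task["name"], looks_stalled(task), round(cpu_days, 2))
```

Dividing the reported CPU time by 86,400 seconds per day is what gives the "just under 8.5 days" figure quoted in the post.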

Wisesooth Joined: 2 Oct 14 Posts: 43 Credit: 55,124,908 RAC: 1,467	Message 63329 - Posted: 7 Apr 2015, 17:32:38 UTC - in response to Message 63230.
I had to abort n-body tasks after they ran anywhere from 25 hours to three days at 100%. 100% is not just a number; it is also a clue. If the task is in a long loop while using multiple-core threads, the cause may be a persistent timing issue. This anomaly occurs when a task uses multiple processor threads to calculate a heuristic solution. BOINC uses machines of all kinds of capabilities. Most of them are not designed for intense scientific computation. Statisticians talk about "degrees of freedom." Grid computing by its very nature introduces potential instability in multi-core tasks.
I am currently having no problems with n-body 1 tasks. I am having problems with n-body 2 tasks.

Ed Joined: 3 Dec 13 Posts: 1 Credit: 947,030,513 RAC: 5,456	Message 63359 - Posted: 12 Apr 2015, 2:35:08 UTC - in response to Message 63137.
I am having a lot of problems with the multi-thread work units. The work units reserve all of the threads on my Intel i5 and i7 machines (4 and 8 threads respectively), but the CPU consumption goes anywhere from 4% to a max of about 80% on each thread. This is very wasteful of the processing power, and I have started aborting all mt work units that I catch. Would you please look into either fixing the mt problem or consider eliminating it completely? The BOINC manager does a great job of ensuring the processor threads are fully utilized. Relying on the BOINC manager seems like a good alternative to doing the work management internally in the Milkyway code.