New Nbody version 1.48

Author	Message
Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 63170 - Posted: 20 Feb 2015, 18:50:46 UTC - in response to Message 63165.
Reporting on the split-run experiment. I ran six tasks with one thread to the first checkpoint, and multi-threaded thereafter. It shows in the stderr_txt:
    <stderr_txt>
    <search_application> milkyway_nbody 1.48 Windows x86_64 double OpenMP, Crlibm </search_application>
    Using OpenMP 1 max threads on a system with 4 processors
    Using OpenMP 3 max threads on a system with 4 processors
    Poor likelihood. Returning worst case.
    <search_likelihood>-9999999.900000000400000</search_likelihood>
    21:46:43 (8552): called boinc_finish
    </stderr_txt>
So far, three WUs have validated against the extended quorum of three completed tasks, and three are still waiting for 64-bit clients to finish them off. No errors reported.
Wingmate 520623 was interesting:
    Using OpenMP 16 max threads on a system with 24 processors
He's using an older BOINC v7.0.64 client, so he can't be using the same facility as me to reduce the thread count - maybe the server has been set to a maximum of 16 threads. But he's getting nothing like a 16:1 CPU time to runtime ratio for nbody tasks.
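
The thread counts quoted from stderr_txt above can be pulled out mechanically when comparing many tasks. This is a minimal sketch, assuming the banner line keeps exactly the wording shown in the post; the function name `openmp_thread_settings` is hypothetical and is not part of BOINC or the N-body application.

```python
import re

# Assumed banner format, copied from the stderr_txt quoted in the post.
THREADS_RE = re.compile(
    r"Using OpenMP (\d+) max threads on a system with (\d+) processors"
)

def openmp_thread_settings(stderr_text):
    """Return a (max_threads, processors) pair for each banner line, in order."""
    return [(int(t), int(p)) for t, p in THREADS_RE.findall(stderr_text)]

sample = """
Using OpenMP 1 max threads on a system with 4 processors
Using OpenMP 3 max threads on a system with 4 processors
Poor likelihood. Returning worst case.
"""

settings = openmp_thread_settings(sample)
for threads, processors in settings:
    print(f"{threads} of {processors} processors")
```

Two banner lines in one stderr_txt, as in the split-run experiment, indicate that the task was restarted with a different thread limit.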

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63179 - Posted: 24 Feb 2015, 5:19:22 UTC Last modified: 24 Feb 2015, 5:28:10 UTC
I've been running BOINC for about 10 days now and ran into your 1.46 WUs. Pulled out most of my hair, because I thought I was doing something wrong. I've cleared them out and have found two hung 1.48 WUs on different boxes. I'm not sure how best to report or show this issue in a way that could help someone solve it. So here goes..
Server is 48 cores - Amd 61xxes 2.8Ghz 877 Watts [H] v9.1
Client: 7.0.65 on Ubuntu 12.04
Expected run should be about 1hr 30mins
App: ps_nbody_12_20_orphan_sim_3_1422013803_339990_3
Elapsed time: 03d,04:31:57 (03d,04:33:46) 100.0 100.0 - Deadline: 08d,17:27:28 16C Suspended by user
Server is 32 cores - Amd 6272es 3.1Ghz 495 Watts [H] v9.1
Client: 7.0.65 on Ubuntu 12.04
Expected run should be about 45mins
App: de_nbody_2_13_orphan_sim_1_1422013803_601749_1
Elapsed time: 01d,08:40:36 (01d,08:40:45) 100.0 100.0 - Deadline: 10d,02:05:59 16C Suspended by user
Also the WU doesn't appear to be running with all 16C.. here is a top screen shot with 2 1.48 16C WUs running between 26% & 46% completed.
     PID USER   PR  NI  VIRT   RES   SHR  S %CPU %MEM     TIME+ COMMAND
    7191 boinc  39  19  154m   14m  1860  R 1056  0.1 136:32.74 milkyway_nbody_
    7320 boinc  39  19 25260  7440  1684  R  100  0.1   3:39.90 milkyway_nbody_
    1206 root   20   0  214m   24m  7640  S    0  0.2   4:50.99 Xorg
Please let me know if there is a better way to report things like this in the future.
SMP: 4x AMD 61xx@3.0Ghz - 4x AMD 6176SE@2.71Ghz - 4x AMD 6172@2.41Ghz - 2x AMD 62xx@3.3Ghz "Unless someone like you cares a whole awful lot, nothing is going to get better. It's not." Dr. Seuss, The Lorax

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63182 - Posted: 26 Feb 2015, 2:49:28 UTC - in response to Message 63179.
Had another 4-6 stalled units; most appear to be going through OK.

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63184 - Posted: 28 Feb 2015, 7:42:22 UTC - in response to Message 63182.
I haven't heard what to do with all these stalled 100% done WUs, except abort them. Am I on the wrong message board to post these problems?

Sidd Project developer Project tester Project scientist Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0	Message 63191 - Posted: 2 Mar 2015, 20:27:51 UTC - in response to Message 63184.
Hey, it seems some of the runs you had were v1.46. If these are stalled you can abort them.

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63192 - Posted: 3 Mar 2015, 5:39:24 UTC - in response to Message 63191. Last modified: 3 Mar 2015, 5:42:43 UTC
I cleared out all the 1.46.. I'm still getting some rare 1.48 tasks that stall at 100%. Here are all the ones I can easily find from my BoincTasks History: 3 Ubuntu 32/48 core servers on BOINC 7.0.65 and 1 i7-920 PC on BOINC 7.4.36. I'd be glad to give you more information if you need it. Here are some:
Project, Application, Name, Elapsed Time, Completed, Reported, Use, CPU%, Status, Computer, Virtual, Memory
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_601749_1 01d,08:43:36 (01d,08:43:47) 25-02-2015 01:35 AM 25-02-2015 11:09 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_643065_2 00:03:56 (00:03:56) 25-02-2015 01:35 AM 25-02-2015 11:09 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_475467_1 00:01:19 (00:01:19) 25-02-2015 01:35 AM 25-02-2015 11:09 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_2_1422013803_808241_1 00:05:00 (00:05:00) 03-03-2015 12:21 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz 24.54 MB 6.82 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_782698_1 10:10:25 (10:10:26) 02-03-2015 07:29 AM 02-03-2015 09:53 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz 26.34 MB 8.85 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) ps_nbody_12_20_orphan_sim_3_1422013803_339990_3 14:13:21 (14:13:36) 26-02-2015 01:40 AM 26-02-2015 05:25 AM 16C 100.0 Aborted (203) fah21 - Amd 61xxes 3.0Ghz 22.35 MB 4.68 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_12_20_orphan_sim_3_1413455402_1688519_7 07:39:32 (07:39:39) 26-02-2015 06:31 PM 26-02-2015 09:08 PM 16C 100.0 Aborted (203) fah21 - Amd 61xxes 3.0Ghz 21.19 MB 3.65 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_730744_0 04:02:34 (04:02:36) 28-02-2015 02:35 AM 28-02-2015 03:14 AM 16C 100.0 Aborted (203) fah21 - Amd 61xxes 3.0Ghz 26.73 MB 9.07 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_505170_3 03:11:49 (03:12:03) 25-02-2015 01:33 AM 25-02-2015 01:36 AM 16C 100.0 Aborted (203) fah18 - Amd 6176se 2.713Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_563877_0 00:51:54 (00:52:08) 25-02-2015 01:33 AM 25-02-2015 01:36 AM 16C 100.0 Aborted (203) fah18 - Amd 6176se 2.713Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_2_1422013803_652564_0 03:45:25 (21:35:00) 24-02-2015 06:34 PM 24-02-2015 07:34 PM 8C 100.0 Reported: Computation error (4,) Christy-i7-920 33.69 MB 34.88 MB
* This one has an odd time stamp; I just started crunching on 2/14/15.
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_2_1422013803_760581_0 05:20:30 (05:20:10) 01-03-2015 04:28 AM 01-03-2015 04:32 AM 8C 99.9 Aborted (203) Christy-i7-920 5.42 MB 6.52 MB

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63193 - Posted: 3 Mar 2015, 5:51:47 UTC - in response to Message 63192.
Another observation: the 1.48 16C units are not fully utilizing all the cores. I would expect close to 1600 on the %CPU. I have a 48 core AMD 6100ES, currently running 3x 1.48 16C, and it is only showing about 55% usage:
      PID USER   PR  NI  VIRT   RES   SHR  S %CPU %MEM     TIME+ COMMAND
    57743 boinc  39  19  164m   25m  1788  R  936  0.1  21:28.01 milkyway_nbody_
    57169 boinc  39  19  157m   16m  1824  R  897  0.1   1885:26 milkyway_nbody_
    57711 boinc  39  19  154m   14m  1792  R  813  0.0  64:59.87 milkyway_nbody_
      267 root   20   0     0     0     0  S    0  0.0   0:16.82 kworker/28:1
     1503 root   20   0  203m   21m  7532  S    0  0.1  26:24.08 Xorg
Another server, a 32 core AMD 6272ES, currently running 2x 1.48 16C, is only showing about 70%:
      PID USER   PR  NI  VIRT   RES   SHR  S %CPU %MEM     TIME+ COMMAND
    32002 boinc  39  19  168m   29m  1848  R 1184  0.2  45:34.81 milkyway_nbody_
    31999 boinc  39  19  154m   14m  1856  R 1082  0.1 102:30.81 milkyway_nbody_
     1206 root   20   0  214m   21m  7460  S    0  0.2  29:40.73 Xorg
    32302 horde  20   0 17476  1408   944  R    0  0.0   0:00.18 top
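
The "about 55%" and "about 70%" figures follow directly from the %CPU column, because top reports 100 for each fully busy core. This is a small worked sketch using the numbers from the post; the helper name `utilization` is hypothetical.

```python
def utilization(cpu_percents, cores):
    """Fraction of total core capacity in use, given top's per-process %CPU values."""
    return sum(cpu_percents) / (100.0 * cores)

# %CPU values for the milkyway_nbody_ processes quoted in the post.
server_48 = utilization([936, 897, 813], cores=48)   # three 16C tasks
server_32 = utilization([1184, 1082], cores=32)      # two 16C tasks

# Per-task efficiency against the 1600 that a fully used 16C task would show.
per_task_48 = [pct / 1600 for pct in (936, 897, 813)]

print(f"48-core server: {server_48:.1%}")
print(f"32-core server: {server_32:.1%}")
```

On these figures each 16C task on the 48-core server is keeping between roughly half and 60% of its 16 threads busy.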

Jacob Klein Joined: 22 Jun 11 Posts: 32 Credit: 41,852,496 RAC: 0	Message 63194 - Posted: 3 Mar 2015, 8:18:11 UTC Last modified: 3 Mar 2015, 8:21:49 UTC
Unless I am mistaken, the question "Can 1.48 tasks get stuck in a loop at 100%?" remains unanswered. Can we get a solid answer on it? And, if yes, can we get instruction as to what to do? Simple questions, really. Trying to figure out what we can expect from the 1.48 mt app...

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63196 - Posted: 3 Mar 2015, 22:01:39 UTC - in response to Message 63192.
Another one today:
Project, Application, Name, Elapsed Time, Completed, Reported, Use, CPU%, Status, Computer, Virtual, Memory
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) ps_nbody_12_20_orphan_sim_3_1422013803_316785_5 09:23:50 (09:23:58) 100.0 100.0 - 11d,12:51:16 16C Running 22.22 MB fah21 - Amd 61xxes 3.0Ghz

Sidd Project developer Project tester Project scientist Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0	Message 63203 - Posted: 5 Mar 2015, 21:28:54 UTC - in response to Message 63194.
Hey, the initialization procedure, unfortunately, is not currently worked into the time estimate. We are working on this. Since some parameters will take a bit longer to run than others, in some work units the estimate will say 100% while the simulation is still running. However, if the simulation takes an egregious amount of time (on the order of a large fraction of the simulation time thus far) to finish after reaching 100%, you can go ahead and cancel it. There should not be any infinite loops occurring in this version.
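
The rule of thumb in the post above can be written down as a simple check. This is only a sketch: the post gives no number for "a large fraction", so the default of 0.5 is an assumption, the function name `overrun_is_excessive` is hypothetical, and the example times are illustrative rather than taken from any task in this thread.

```python
def overrun_is_excessive(hours_to_reach_100, hours_elapsed_now, fraction=0.5):
    """True if the time spent after hitting 100% exceeds `fraction` of the
    time it took to reach 100%. The 0.5 default is an assumed threshold."""
    overrun = hours_elapsed_now - hours_to_reach_100
    return overrun > fraction * hours_to_reach_100

# Hypothetical task: reached 100% after 1.5 h and is still running at 3.0 h.
print(overrun_is_excessive(1.5, 3.0))
# Hypothetical task: reached 100% after 1.5 h and is now at 1.6 h.
print(overrun_is_excessive(1.5, 1.6))
```

A stricter or looser `fraction` can be passed in; the check only formalizes the comparison, not the decision to abort.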

Odysseus Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0	Message 63205 - Posted: 8 Mar 2015, 0:43:05 UTC
I’ve got a troublesome 1.48 task on my MacBook Pro. (In case it’s relevant, BOINC is “throttled” to 25% usage on this machine, which has cooling issues including a bad fan.) Around 0h local time on Friday it showed 99.75% done after ~3 h “Running (2 CPUs)”, 30 s to go. The next morning it had reached 99.998% in ~5.7 h; the remaining estimate was down to 0 s.
Here’s where it gets weird. Late that afternoon the log shows Einstein@Home starting a task. I observed at about 18h that the progress was showing 100% complete, ~7.5 h time, 0 s to go - still claiming to use both CPUs despite obviously being down to one. It continued without change (aside from incrementing CPU time) until late this afternoon, when I noticed that the “00:00:00” in the time-remaining column had changed to “---”, after about 13 h of CPU time. (Meanwhile the E@h task seems to be progressing normally.)
I’m considering aborting this one if it doesn’t finish pretty soon.

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 63206 - Posted: 8 Mar 2015, 1:04:36 UTC - in response to Message 63205.
Try to use independent tools - Windows would be Task Manager and Process Explorer, I don't know what the Mac OS X equivalents would be - to distinguish between 'what BOINC says is going on' and 'what is really happening'. Recent BOINCs give users a simulated 'pseudo %age' display to avoid anxieties. But it sounds as if this task is making no real progress at all, and all the time estimates are simulations.
You mention thermal problems and BOINC being 'throttled'. Is that a restriction on the number of cores used, or the proportion of time BOINC is allowed to run? There is a particular problem with the nbody tasks not checkpointing during the initialisation phase: if BOINC interrupts computations during this time (especially if applications are not kept in memory while suspended), you might be re-winding to the very beginning at every interruption.

Odysseus Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0	Message 63207 - Posted: 8 Mar 2015, 4:43:59 UTC - in response to Message 63206.
What I see in Apple’s Activity Monitor, which I gather is just a GUI for Unix’s top command, is consistent with one app running on each core, but it doesn’t explicitly tell me what threads are running where. I’ve always kept applications in memory. I had been running this machine at 50% on a single core until recently; after a v1.46 WU (now removed from the database) held up other projects until I aborted it, I changed the setting to 25% on both cores. Since I allow BOINC to run while the computer’s in use, the interruptions are frequent but always brief.

Odysseus Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0	Message 63208 - Posted: 8 Mar 2015, 19:26:16 UTC - in response to Message 63207.
Update: The E@H task is now complete, uploaded, and validated. Nothing seems to have changed with the MW@h task (except that it’s showing 18 h CPU time). No other project’s task has been resumed on the other CPU, but I don’t think MW@h is fully using both of them either (despite still saying so), because my “CPU Proximity” temperature reads about 5 C° below normal.

Odysseus Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0	Message 63209 - Posted: 9 Mar 2015, 6:16:42 UTC - in response to Message 63208.
Well, I gave up (after 20:45 CPU time) to give my cached task a chance of getting done before deadline (and my other projects some time as well). We’ll see how this one goes …

bowlinra Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0	Message 63215 - Posted: 12 Mar 2015, 4:03:26 UTC
Another stalled task, 19+ hrs; other WUs in this series (ps_nbody_12_20_orphan_sim_314) run in under 18 mins.
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) ps_nbody_12_20_orphan_sim_3_1413455402_1997284_5 19:48:43 (19:49:19) 100.0 100.0 - 11d,00:16:35 16C Suspended by user fah24 - Amd 6272es 3.1Ghz

Odysseus Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0	Message 63216 - Posted: 12 Mar 2015, 7:15:41 UTC
My next WU seemed to run fine, but can’t validate - looks like the bugs in 1.46 are still biting …

Richard Haselgrove Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0	Message 63218 - Posted: 12 Mar 2015, 8:55:24 UTC - in response to Message 63216.
    "My next WU seemed to run fine, but can’t validate - looks like the bugs in 1.46 are still biting …"
Well, as I pointed out three weeks ago, v1.46 is still being issued to 32-bit hosts, and still failing - the v1.48 deployment was incomplete. Has anybody seen Sid(d)? Please tell him...

Knut Petter Joined: 29 Apr 09 Posts: 4 Credit: 74,209,059 RAC: 0	Message 63220 - Posted: 12 Mar 2015, 18:18:59 UTC Last modified: 12 Mar 2015, 18:24:39 UTC
Not sure if this is because of the new N-Body, but basically one of my GPUs is not processing anything because of CPU limitations.
    12/03/2015 19:17:11 | Milkyway@Home | [cpu_sched_debug] all CPUs used (8.38 >= 8), skipping de_80_DR8_Rev_8_5_00004_1422013802_26039449_0
I never had this issue before. Computing preferences are set to no restriction, though the "Use at most" setting always jumps back to 100% from 0.
EDIT: And earlier I was able to process CPU jobs on all eight cores, while running a GPU job on each of my GPUs.
    12/03/2015 19:10:09 | | CUDA: NVIDIA GPU 0: GeForce GTX 460 (driver version 347.09, CUDA version 7.0, compute capability 2.1, 1024MB, 685MB available, 1075 GFLOPS peak)
    12/03/2015 19:10:09 | | CAL: ATI GPU 0: AMD Radeon HD 6900 series (Cayman) (CAL version 1.4.1720, 2048MB, 2016MB available, 5914 GFLOPS peak)
    12/03/2015 19:10:09 | | OpenCL: NVIDIA GPU 0: GeForce GTX 460 (driver version 347.09, device version OpenCL 1.1 CUDA, 1024MB, 685MB available, 1075 GFLOPS peak)
    12/03/2015 19:10:09 | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 6900 series (Cayman) (driver version CAL 1.4.1720 (VM), device version OpenCL 1.2 AMD-APP (923.1), 2048MB, 2016MB available, 5914 GFLOPS peak)
    12/03/2015 19:10:09 | | OpenCL CPU: Quad-Core AMD Opteron(tm) Processor 2384 (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 2.0 (sse2), device version OpenCL 1.2 AMD-APP (923.1))
EDIT2( :D ): Suspended the N-Body task, it started a new N-Body simulation AND an opencl_amd_ati computation.
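
When scanning a long event log for this condition, the cpu_sched_debug message can be picked apart with a regular expression. This is a minimal sketch, assuming the message wording is exactly as quoted in the post; `parse_skip` is a hypothetical helper, and reading the two numbers as "CPUs in use" versus "CPUs available" is an interpretation of the log text, not something confirmed in this thread.

```python
import re

# Pattern assumed from the single log line quoted in the post.
SKIP_RE = re.compile(
    r"all CPUs used \((\d+(?:\.\d+)?) >= (\d+(?:\.\d+)?)\), skipping (\S+)"
)

def parse_skip(log_line):
    """Return (cpus_used, cpu_limit, task_name), or None if the line does not match."""
    match = SKIP_RE.search(log_line)
    if match is None:
        return None
    used, limit, task = match.groups()
    return float(used), float(limit), task

line = ("12/03/2015 19:17:11 | Milkyway@Home | [cpu_sched_debug] all CPUs used "
        "(8.38 >= 8), skipping de_80_DR8_Rev_8_5_00004_1422013802_26039449_0")

result = parse_skip(line)
if result is not None:
    used, limit, task = result
    print(f"{task} skipped: {used} used, limit {limit}")
```

The skipped task name here starts with de_80_DR8, which is the job that was left waiting while the N-body task was running.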

Jacob Klein Joined: 22 Jun 11 Posts: 32 Credit: 41,852,496 RAC: 0	Message 63221 - Posted: 12 Mar 2015, 18:26:23 UTC - in response to Message 63220.
Is it possible that your CPU task(s) are in danger of missing their deadline? If BOINC thinks any of them are, they correctly get prioritized ahead of GPU tasks, and GPUs can be left idle. It happens sometimes to me, especially on projects that have tasks with tight (5 days or less) deadlines.