New Nbody version 1.48
Author | Message |
---|---|
Send message Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
Reporting on the split-run experiment. I ran six tasks with one thread to the first checkpoint, and multi-threaded thereafter. It shows in the stderr_txt: <stderr_txt> So far, three WUs have validated against the extended quorum of three completed tasks, and three are still waiting for 64-bit clients to finish them off. No errors reported. Wingmate 520623 was interesting: "Using OpenMP 16 max threads on a system with 24 processors". He's using an older BOINC v7.0.64 client, so he can't be using the same facility as me to reduce the thread count; maybe the server has been set to a maximum of 16 threads. But he's getting nothing like a 16:1 CPU time to runtime ratio for nbody tasks. |
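For anyone wanting to repeat the comparison in the post above, here is a minimal sketch of the ratio being discussed: CPU time divided by run time gives the average number of threads a task actually kept busy. The function name and the sample figures are hypothetical; only the 16-thread limit comes from the post.

```python
def effective_threads(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Average number of busy threads = CPU time / wall-clock run time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return cpu_seconds / elapsed_seconds


# Hypothetical figures, not taken from any task in this thread:
# a 16-thread task that ran for 1000 s and accumulated 9000 s of CPU time.
ratio = effective_threads(cpu_seconds=9000, elapsed_seconds=1000)
print(f"{ratio:.1f} busy threads out of 16 ({ratio / 16:.0%} of the limit)")
```

A ratio well below the configured thread count suggests the threads spend much of their time idle or waiting, which is what the wingmate's numbers appear to show.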
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
I've been running BOINC for about 10 days now and ran into your 1.46 WU. I pulled out most of my hair, because I thought I was doing something wrong. I've cleared them out and have found two hung 1.48 WUs on different boxes. I'm not sure how best to report or show this issue in a way that could help someone solve it. So here goes...

Server is 48 cores - Amd 61xxes 2.8Ghz 877 Watts [H] v9.1
Client: 7.0.65 on Ubuntu 12.04
Expected run should be about 1hr 30mins
App: ps_nbody_12_20_orphan_sim_3_1422013803_339990_3
Elapsed time: 03d,04:31:57 (03d,04:33:46) 100.0 100.0 - Deadline: 08d,17:27:28 16C Suspended by user

Server is 32 cores - Amd 6272es 3.1Ghz 495 Watts [H] v9.1
Client: 7.0.65 on Ubuntu 12.04
Expected run should be about 45mins
App: de_nbody_2_13_orphan_sim_1_1422013803_601749_1
Elapsed time: 01d,08:40:36 (01d,08:40:45) 100.0 100.0 - Deadline: 10d,02:05:59 16C Suspended by user

Also, the WU doesn't appear to be running on all 16 cores. Here is a top screenshot with two 1.48 16C WUs running, between 26% and 46% completed:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7191 boinc 39 19 154m 14m 1860 R 1056 0.1 136:32.74 milkyway_nbody_
7320 boinc 39 19 25260 7440 1684 R 100 0.1 3:39.90 milkyway_nbody_
1206 root 20 0 214m 24m 7640 S 0 0.2 4:50.99 Xorg

Please let me know if there is a better way to report things like this in the future.
SMP: 4x AMD 61xx@3.0Ghz - 4x AMD 6176SE@2.71Ghz - 4x AMD 6172@2.41Ghz - 2x AMD 62xx@3.3Ghz "Unless someone like you cares a whole awful lot, nothing is going to get better. It's not." Dr. Seuss, The Lorax |
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
Had another 4-6 stalled units; most appear to be going through OK. |
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
I haven't heard what to do with all these stalled 100%-done WUs, other than abort them. Am I on the wrong message board to post these problems? |
Send message Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
Hey, it seems some of the runs you had were v1.46. If these are stalled, you can abort them. |
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
I cleared out all the 1.46 tasks. I'm still getting some rare 1.48 tasks that stall at 100%. Here are all the ones I can easily find in my BoincTasks history, from 3 Ubuntu 32/48-core servers on BOINC 7.0.65 and 1 i7-920 PC on BOINC 7.4.36. I'd be glad to give you more information if you need it. Here are some:

Project, Application, Name, Elapsed Time, Completed, Reported, Use, CPU%, Status, Computer, Virtual, Memory
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_601749_1 01d,08:43:36 (01d,08:43:47) 25-02-2015 01:35 AM 25-02-2015 11:09 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_643065_2 00:03:56 (00:03:56) 25-02-2015 01:35 AM 25-02-2015 11:09 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_475467_1 00:01:19 (00:01:19) 25-02-2015 01:35 AM 25-02-2015 11:09 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_2_1422013803_808241_1 00:05:00 (00:05:00) 03-03-2015 12:21 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz 24.54 MB 6.82 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_782698_1 10:10:25 (10:10:26) 02-03-2015 07:29 AM 02-03-2015 09:53 AM 16C 100.0 Aborted (203) fah20 - Amd 6272es 3.1Ghz 26.34 MB 8.85 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) ps_nbody_12_20_orphan_sim_3_1422013803_339990_3 14:13:21 (14:13:36) 26-02-2015 01:40 AM 26-02-2015 05:25 AM 16C 100.0 Aborted (203) fah21 - Amd 61xxes 3.0Ghz 22.35 MB 4.68 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_12_20_orphan_sim_3_1413455402_1688519_7 07:39:32 (07:39:39) 26-02-2015 06:31 PM 26-02-2015 09:08 PM 16C 100.0 Aborted (203) fah21 - Amd 61xxes 3.0Ghz 21.19 MB 3.65 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_730744_0 04:02:34 (04:02:36) 28-02-2015 02:35 AM 28-02-2015 03:14 AM 16C 100.0 Aborted (203) fah21 - Amd 61xxes 3.0Ghz 26.73 MB 9.07 MB
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_505170_3 03:11:49 (03:12:03) 25-02-2015 01:33 AM 25-02-2015 01:36 AM 16C 100.0 Aborted (203) fah18 - Amd 6176se 2.713Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_1_1422013803_563877_0 00:51:54 (00:52:08) 25-02-2015 01:33 AM 25-02-2015 01:36 AM 16C 100.0 Aborted (203) fah18 - Amd 6176se 2.713Ghz
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_2_1422013803_652564_0 03:45:25 (21:35:00) 24-02-2015 06:34 PM 24-02-2015 07:34 PM 8C 100.0 Reported: Computation error (4,) Christy-i7-920 33.69 MB 34.88 MB
* This one has an odd time stamp; I only started crunching on 2/14/15.
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) de_nbody_2_13_orphan_sim_2_1422013803_760581_0 05:20:30 (05:20:10) 01-03-2015 04:28 AM 01-03-2015 04:32 AM 8C 99.9 Aborted (203) Christy-i7-920 5.42 MB 6.52 MB |
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
Another observation: the 1.48 16C units are not fully utilizing all the cores. I would expect each to show close to 1600 in the %CPU column. I have a 48-core AMD 6100ES server currently running 3x 1.48 16C, and it is only showing about 55% usage:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
57743 boinc 39 19 164m 25m 1788 R 936 0.1 21:28.01 milkyway_nbody_
57169 boinc 39 19 157m 16m 1824 R 897 0.1 1885:26 milkyway_nbody_
57711 boinc 39 19 154m 14m 1792 R 813 0.0 64:59.87 milkyway_nbody_
267 root 20 0 0 0 0 S 0 0.0 0:16.82 kworker/28:1
1503 root 20 0 203m 21m 7532 S 0 0.1 26:24.08 Xorg

Another server, a 32-core AMD 6272ES currently running 2x 1.48 16C, is only showing about 70%:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32002 boinc 39 19 168m 29m 1848 R 1184 0.2 45:34.81 milkyway_nbody_
31999 boinc 39 19 154m 14m 1856 R 1082 0.1 102:30.81 milkyway_nbody_
1206 root 20 0 214m 21m 7460 S 0 0.2 29:40.73 Xorg
32302 horde 20 0 17476 1408 944 R 0 0.0 0:00.18 top |
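The "about 55%" and "about 70%" figures in the post above can be reproduced from the top rows themselves. This is a rough sketch, assuming the default top column order (%CPU in the ninth column, COMMAND in the twelfth); the function name `host_utilisation` is made up for illustration.

```python
def host_utilisation(top_rows, total_cores, command_prefix="milkyway_nbody"):
    """Return (summed %CPU, fraction of host capacity) for matching processes.

    Assumes the default top layout:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    cpu_sum = 0.0
    for row in top_rows:
        fields = row.split()
        # Skip short rows and processes that are not the N-body application.
        if len(fields) < 12 or not fields[11].startswith(command_prefix):
            continue
        cpu_sum += float(fields[8])  # top reports 100 per fully busy core
    return cpu_sum, cpu_sum / (100.0 * total_cores)


# Rows copied from the two top listings in the post.
rows_48_core = [
    "57743 boinc 39 19 164m 25m 1788 R 936 0.1 21:28.01 milkyway_nbody_",
    "57169 boinc 39 19 157m 16m 1824 R 897 0.1 1885:26 milkyway_nbody_",
    "57711 boinc 39 19 154m 14m 1792 R 813 0.0 64:59.87 milkyway_nbody_",
    "267 root 20 0 0 0 0 S 0 0.0 0:16.82 kworker/28:1",
]
rows_32_core = [
    "32002 boinc 39 19 168m 29m 1848 R 1184 0.2 45:34.81 milkyway_nbody_",
    "31999 boinc 39 19 154m 14m 1856 R 1082 0.1 102:30.81 milkyway_nbody_",
    "1206 root 20 0 214m 21m 7460 S 0 0.2 29:40.73 Xorg",
]

for label, rows, cores in (("48-core", rows_48_core, 48), ("32-core", rows_32_core, 32)):
    total, fraction = host_utilisation(rows, cores)
    print(f"{label}: {total:.0f} %CPU summed, {fraction:.1%} of host capacity")
# 48-core: 2646 %CPU summed, 55.1% of host capacity
# 32-core: 2266 %CPU summed, 70.8% of host capacity
```

Dividing a single row's %CPU by 100 gives the average number of busy threads for that task, for example 9.36 of a possible 16 for PID 57743.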
Send message Joined: 22 Jun 11 Posts: 32 Credit: 41,852,496 RAC: 0 |
Unless I am mistaken, the question "Can 1.48 tasks get stuck in a loop at 100%?" remains unanswered. Can we get a solid answer on it? And, if yes, can we get instruction as to what to do? Simple questions, really. Trying to figure out what we can expect from the 1.48 mt app... |
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
Another one today:

Project, Application, Name, Elapsed Time, Completed, Reported, Use, CPU%, Status, Computer, Virtual, Memory
Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) ps_nbody_12_20_orphan_sim_3_1422013803_316785_5 09:23:50 (09:23:58) 100.0 100.0 - 11d,12:51:16 16C Running 22.22 MB fah21 - Amd 61xxes 3.0Ghz |
Send message Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
Hey, the initialization procedure is unfortunately not yet worked into the time estimate; we are working on this. Because some parameters take longer to run than others, for some work units the estimate will say 100% while the simulation is still running. If the simulation takes an egregious amount of time to finish after reaching 100% (on the order of a large fraction of the simulation time thus far), you can go ahead and cancel it. However, there should not be any infinite loops occurring in this version. |
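The rule of thumb in the reply above can be written down as a small check. This is only a sketch: the post says "a large fraction" without giving a number, so the 0.5 default below is an assumption, and the function name is hypothetical.

```python
def looks_stalled(seconds_to_reach_100: float,
                  seconds_since_100: float,
                  large_fraction: float = 0.5) -> bool:
    """True if the time spent at 100% is a large fraction of the run so far.

    large_fraction is an assumed threshold, not a project-defined value.
    """
    if seconds_to_reach_100 <= 0:
        raise ValueError("seconds_to_reach_100 must be positive")
    return seconds_since_100 >= large_fraction * seconds_to_reach_100


# Hypothetical task: reached 100% after 1 h, then sat there for 10 min or 50 min.
print(looks_stalled(3600, 600))   # False: 10 min is well under half of 1 h
print(looks_stalled(3600, 3000))  # True: 50 min exceeds half of 1 h
```

Before aborting, it is still worth confirming with an independent tool such as top that the process is really making no progress.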
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
I’ve got a troublesome 1.48 task on my MacBook Pro. (In case it’s relevant, BOINC is “throttled” to 25% usage on this machine, which has cooling issues including a bad fan.) Around 0h local time on Friday it showed 99.75% done after ~3 h “Running (2 CPUs)”, 30 s to go. The next morning it had reached 99.998% in ~5.7 h; the remaining estimate was down to 0 s. Here’s where it gets weird. Late that afternoon the log shows Einstein@Home starting a task. I observed at about 18h that the progress was showing 100% complete, ~7.5 h time, 0 s to go, still claiming to use both CPUs despite obviously being down to one. It continued without change (aside from incrementing CPU time) until late this afternoon, when I noticed that the “00:00:00” in the time-remaining column had changed to “---”, after about 13 h of CPU time. (Meanwhile the E@h task seems to be progressing normally.) I’m considering aborting this one if it doesn’t finish pretty soon. |
Send message Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
Try to use independent tools (on Windows that would be Task Manager and Process Explorer; I don't know what the Mac OS X equivalents would be) to distinguish between 'what BOINC says is going on' and 'what is really happening'. Recent BOINCs give users a simulated 'pseudo %age' display to avoid anxieties. But it sounds as if this task is making no real progress at all, and all the time estimates are simulations. You mention thermal problems and BOINC being 'throttled'. Is that a restriction on the number of cores used, or the proportion of time BOINC is allowed to run? There is a particular problem with the nbody tasks not checkpointing during the initialisation phase: if BOINC interrupts computations during this time (especially if applications are not kept in memory while suspended), you might be rewinding to the very beginning at every interruption. |
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
What I see in Apple’s Activity Monitor, which I gather is just a GUI for Unix’s top command, is consistent with one app running on each core, but it doesn’t explicitly tell me what threads are running where. I’ve always kept applications in memory. I had been running this machine at 50% on a single core until recently; after a v1.46 WU (now removed from the database) held up other projects until I aborted it, I changed the setting to 25% on both cores. Since I allow BOINC to run while the computer’s in use, the interruptions are frequent but always brief. |
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
Update: The E@H task is now complete, uploaded, and validated. Nothing seems to have changed with the MW@h task (except that it’s showing 18 h CPU time). No other project’s task has been resumed on the other CPU, but I don’t think MW@h is fully using both of them either (despite still saying so), because my “CPU Proximity” temperature reads about 5 °C below normal. |
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
Well, I gave up (after 20:45 CPU time) to give my cached task a chance of getting done before deadline (and my other projects some time as well). We’ll see how this one goes … |
Send message Joined: 16 Feb 15 Posts: 9 Credit: 13,905,328 RAC: 0 |
Another stalled task at 19+ hours; other WUs in this series (ps_nbody_12_20_orphan_sim_314*) run in under 18 mins:

Milkyway@Home 1.48 MilkyWay@Home N-Body Simulation (mt) ps_nbody_12_20_orphan_sim_3_1413455402_1997284_5 19:48:43 (19:49:19) 100.0 100.0 - 11d,00:16:35 16C Suspended by user fah24 - Amd 6272es 3.1Ghz |
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
My next WU seemed to run fine, but can’t validate; looks like the bugs in 1.46 are still biting … |
Send message Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
"My next WU seemed to run fine, but can’t validate; looks like the bugs in 1.46 are still biting …"

Well, as I pointed out three weeks ago, v1.46 is still being issued to 32-bit hosts, and still failing - the v1.48 deployment was incomplete. Has anybody seen Sid(d)? Please tell him... |
Send message Joined: 29 Apr 09 Posts: 4 Credit: 74,209,059 RAC: 0 |
Not sure if this is because of the new N-Body, but basically one of my GPUs is not processing anything because of CPU limitations.

12/03/2015 19:17:11 | Milkyway@Home | [cpu_sched_debug] all CPUs used (8.38 >= 8), skipping de_80_DR8_Rev_8_5_00004_1422013802_26039449_0

I never had this issue before. Computing preferences are set to no restriction, though the "Use at most" always jumps back to 100% from 0.

EDIT: And earlier I was able to process CPU jobs on all eight cores, while running a GPU job on each of my GPUs.

12/03/2015 19:10:09 | | CUDA: NVIDIA GPU 0: GeForce GTX 460 (driver version 347.09, CUDA version 7.0, compute capability 2.1, 1024MB, 685MB available, 1075 GFLOPS peak)
12/03/2015 19:10:09 | | CAL: ATI GPU 0: AMD Radeon HD 6900 series (Cayman) (CAL version 1.4.1720, 2048MB, 2016MB available, 5914 GFLOPS peak)
12/03/2015 19:10:09 | | OpenCL: NVIDIA GPU 0: GeForce GTX 460 (driver version 347.09, device version OpenCL 1.1 CUDA, 1024MB, 685MB available, 1075 GFLOPS peak)
12/03/2015 19:10:09 | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 6900 series (Cayman) (driver version CAL 1.4.1720 (VM), device version OpenCL 1.2 AMD-APP (923.1), 2048MB, 2016MB available, 5914 GFLOPS peak)
12/03/2015 19:10:09 | | OpenCL CPU: Quad-Core AMD Opteron(tm) Processor 2384 (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 2.0 (sse2), device version OpenCL 1.2 AMD-APP (923.1))

EDIT2( :D ): Suspended the N-Body task; it started a new N-Body simulation AND an opencl_amd_ati computation. |
Send message Joined: 22 Jun 11 Posts: 32 Credit: 41,852,496 RAC: 0 |
Is it possible that your CPU task(s) are in jeopardy of missing their deadline? If BOINC thinks any of them are, they correctly get prioritized ahead of GPU tasks, and GPUs can be left idle. It happens to me sometimes, especially on projects that have tasks with tight (5 days or less) deadlines. |
©2024 Astroinformatics Group