Message boards :
News :
New Nbody version 1.46
Author | Message |
---|---|
Send message Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
Hey Everyone, There is now a new version of Nbody, 1.46. This version calculates the initial dwarf galaxy using a different method than before, which should resolve many of the work unit stalling issues some users were having. If you see any errors, let us know. The runs currently up are: ps_nbody_12_19_orphan_sim_1 and de_nbody_12_19_orphan_sim_1. |
Send message Joined: 16 Dec 07 Posts: 37 Credit: 25,768,067 RAC: 4,871 |
Just returned de_nbody_12_19_orphan_sim_1_1413455402_1432766_0 due to a computation error while running 1.46. <core_client_version>7.4.27</core_client_version> <![CDATA[ <message> The system cannot find the drive specified. (0xf) - exit code 15 (0xf) </message> <stderr_txt> <search_application> milkyway_nbody 1.46 Windows x86_64 double , Crlibm </search_application> Error reading histogram line 37: 1 -48.5294117647 0.0439655511 0.0013148967 21:05:57 (4564): called boinc_finish </stderr_txt> ]]> ps_nbody_08_05_orphan_sim_0_1413455402_1036953_3 appears to have run successfully with 1.46 (just awaiting validation). |
Send message Joined: 19 Aug 08 Posts: 5 Credit: 17,837,653 RAC: 0 |
ps_nbody_08_05_orphan_sim_0_1413455402_437700_4 has been running for over 4 hours. It shows 44% complete and 1:06:46 time remaining. The time remaining and the percent complete are both increasing, very slowly. Looks like it will never finish. I'm aborting it now. This is on all 16 threads of an i7-5960x overclocked to 4.0 GHz. /Dave |
Send message Joined: 19 Aug 08 Posts: 5 Credit: 17,837,653 RAC: 0 |
Just looked at my account. Recent statistics for NBody: 65 validated 15 error while computing 14 invalid 97 validation inconclusive Nearly all of these are 1.46, with a few 1.44. Other types of Milky Way work are completing without problem. This is on an i7-5960x The nbody workunits each use all 16 threads. /Dave |
Send message Joined: 4 Mar 10 Posts: 65 Credit: 639,958,626 RAC: 0 |
All (156) · In progress (48) · Validation pending (0) · Validation inconclusive (51) · Valid (54) · Invalid (1) · Error (2), all in half a day... :-( |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,850,746 RAC: 3,406 |
ps_nbody_12_19_orphan_sim_0_1413455402_1435063 - Completed, can't validate: "Too many total results". EDIT: And it looks like de_nbody_12_20_orphan_sim_2_1413455402_1448056 may be headed that way. EDIT2: Just noticed that another unit, currently in progress, has been running for over 45 minutes and is over 40% complete but has not yet checkpointed. |
Send message Joined: 9 May 08 Posts: 2 Credit: 784,856 RAC: 0 |
Not having any success with these tasks on an Intel iMac. http://milkyway.cs.rpi.edu/milkyway/results.php?userid=5032&offset=0&show_names=0&state=0&appid=7 |
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
I have a task on my MacBook Pro that was causing me concern, de_nbody_12_19_orphan_sim_1_1413455402_1431667, which was paused (superseded) showing more than 7000 hours to go (and a very small % complete) after 4.5 h of computation. But now that it’s resumed, I see the progress has reached 70% done with just 2 h to go from 5.2 h of CPU time. So it may be that there’s an initial period during which little progress is registered, causing BOINC to overestimate the time that will be required. Hoping so, anyway … |
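The overestimate described above is what a simple linear extrapolation would produce. The following is a minimal sketch under that assumption only: the thread does not show how BOINC actually computes its estimate, and `linear_remaining_hours` is a hypothetical helper, not a BOINC function. The 0.064% early reading is also made up for illustration, chosen to land near the "more than 7000 hours" figure.

```python
def linear_remaining_hours(elapsed_hours, fraction_done):
    """Naive estimate: assume the rest of the task runs at the average rate so far.

    This is an illustrative assumption, not BOINC's actual estimator.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


# Figures from the post: 5.2 h of CPU time at 70% done.
late = linear_remaining_hours(5.2, 0.70)

# Hypothetical early reading: 4.5 h elapsed with only 0.064% reported.
early = linear_remaining_hours(4.5, 0.00064)

print(f"late estimate:  {late:.1f} h")
print(f"early estimate: {early:.0f} h")
```

If progress is barely reported during a long first phase, the early estimate explodes, then collapses to something sensible once the reported fraction catches up.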
Send message Joined: 10 Nov 07 Posts: 96 Credit: 29,931,027 RAC: 0 |
Well, it finished pretty quickly … but with a computation error: “15 (0xf) Unknown error number”. I have two other tasks on board that were downloaded at the same time; the first of them has been progressing normally so far. |
Send message Joined: 13 Mar 10 Posts: 8 Credit: 4,126,109 RAC: 140 |
de_nbody_12_20_orphan_sim_2_1413455402_1485127_0 is now up to 51+ hours, apparently most of that after reaching 100% complete (I haven't checked every day, unfortunately). Similar jobs (de_nbody with the same 10-digit serial number) are listed as requiring ~1 hour to complete. I will abort this job and hope the problem is fixed in the near future. I have had similar problems before, but hoped the new version 1.46 had fixed them. BTW, it would be easier to identify the specific failing job if one could copy and paste from the BOINC Tasks listing, which I was unable to do. Best regards Pete k5gm@amsat.org Pete K5GM |
Send message Joined: 25 Dec 14 Posts: 8 Credit: 1,007,280 RAC: 0 |
Although I have a lot of experience with BOINC, I am quite new to MilkyWay@Home. A couple of days ago I downloaded the latest version of the software for a first-time installation on an old Pentium laptop (for testing purposes) as well as a fairly good duo core with a 2.67 GHz P9600 chip. On the old Pentium, I do have two results which have completed successfully but are awaiting validation. However, the third task which started has been stuck on 5.3% completed for over 19 hrs, while the first two tasks suggest a maximum run time of about 10 hrs. This is the task of interest: de_nbody_12_20_orphan_sim_2_1413455402_1484793 .... it seems to be stuck in some sort of infinite loop, since the elapsed time just keeps increasing while the percentage completed is stuck at 5.267%. I have paused the task and started another one, and I will resume the former once the latter completes. Of the fourteen tasks completed in the past 15 hrs or so, four have been credited whereas the other ten are awaiting validation and confirmation. A weird ratio to say the least, since for some of the completed tasks my result sits alongside two other results with no quorum yet. |
Send message Joined: 18 Nov 10 Posts: 19 Credit: 180,955,042 RAC: 16,713 |
ps_nbody_12_20_orphan_sim_2_1413455402_1450477_2 I have one too, and it appears to be stuck in a loop whose exit is based on a floating point compare. It has been running nbody 1.46mt at 100% completion for several hours on Ubuntu Linux. Only one of the 8 CPUs is active. perf top indicates that execution is stuck in the pow_rn function on that single running CPU. The only functions measuring non-zero execution time (using perf top) are: 88% pow_rn 11.75% 0x000....9cf72 0.03% pow_exact_rn 0.01% dsfmt_gen_rand_all http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=676782111 I ran perf record -a -- sleep 5 to capture what the machine was doing. perf report shows .... 2.4% of the total 88% time above is being spent in the "subsd" instruction, underlined in the objdump below. The remaining instructions down to the "jne" loop exit are at about 1% of execution time. 497716: 66 45 0f 28 cd movapd %xmm13,%xmm9 49771b: f2 45 0f 5c cf subsd %xmm15,%xmm9 497720: 66 45 0f 28 f9 movapd %xmm9,%xmm15 497725: f2 44 0f 10 4c 24 c8 movsd -0x38(%rsp),%xmm9 49772c: f2 45 0f 59 ce mulsd %xmm14,%xmm9 497731: f2 44 0f 59 74 24 e8 mulsd -0x18(%rsp),%xmm14 497738: f2 45 0f 5c cd subsd %xmm13,%xmm9 49773d: f2 44 0f 10 6c 24 f0 movsd -0x10(%rsp),%xmm13 497744: f2 44 0f 59 6c 24 c8 mulsd -0x38(%rsp),%xmm13 49774b: f2 45 0f 58 ce addsd %xmm14,%xmm9 497750: f2 45 0f 58 cd addsd %xmm13,%xmm9 497755: f2 44 0f 10 6c 24 f0 movsd -0x10(%rsp),%xmm13 49775c: f2 44 0f 59 6c 24 e8 mulsd -0x18(%rsp),%xmm13 497763: f2 45 0f 58 cd addsd %xmm13,%xmm9 497768: f2 45 0f 58 cc addsd %xmm12,%xmm9 49776d: f2 45 0f 58 f9 addsd %xmm9,%xmm15 497772: f2 44 0f 10 0d 85 00 movsd 0x30085(%rip),%xmm9 # 4c7800 <scs_sixinv+0x9180> 497779: 03 00 49777b: f2 45 0f 59 cf mulsd %xmm15,%xmm9 497780: f2 44 0f 58 0d 7f 00 addsd 0x3007f(%rip),%xmm9 # 4c7808 <scs_sixinv+0x9188> 497787: 03 00 497789: f2 45 0f 59 cf mulsd %xmm15,%xmm9 49778e: f2 44 0f 58 0d 71 1b addsd 0x21b71(%rip),%xmm9 # 4b9308 <p_n+0x2e28> 497795: 02 00 497797: f2 45 0f 59 cf mulsd %xmm15,%xmm9 49779c: f2 45 0f 59 d1 mulsd %xmm9,%xmm10 4977a1: f2 45 0f 58 d1 addsd %xmm9,%xmm10 4977a6: f2 44 0f 58 54 24 d8 addsd -0x28(%rsp),%xmm10 4977ad: f2 44 0f 59 54 24 d0 mulsd -0x30(%rsp),%xmm10 4977b4: f2 45 0f 58 da addsd %xmm10,%xmm11 4977b9: 66 45 0f 28 cb movapd %xmm11,%xmm9 4977be: f2 44 0f 5c 4c 24 d0 subsd -0x30(%rsp),%xmm9 4977c5: f2 45 0f 5c d1 subsd %xmm9,%xmm10 4977ca: 66 45 0f 28 cb movapd %xmm11,%xmm9 4977cf: f2 44 0f 58 54 24 e0 addsd -0x20(%rsp),%xmm10 4977d6: f2 45 0f 58 ca addsd %xmm10,%xmm9 4977db: 66 45 0f 28 e1 movapd %xmm9,%xmm12 4977e0: f2 45 0f 5c e3 subsd %xmm11,%xmm12 4977e5: f2 45 0f 5c d4 subsd %xmm12,%xmm10 4977ea: 0f 8c f8 00 00 00 jl 4978e8 <pow_rn+0x9c8> 4977f0: f2 44 0f 59 15 67 9c mulsd 0x29c67(%rip),%xmm10 # 4c1460 <scs_sixinv+0x2de0> 4977f7: 02 00 4977f9: f2 45 0f 58 d1 addsd %xmm9,%xmm10 4977fe: 66 45 0f 2e ca ucomisd %xmm10,%xmm9 497803: 0f 85 cf 00 00 00 jne 4978d8 <pow_rn+0x9b8> 497809: 0f 8a c9 00 00 00 jp 4978d8 <pow_rn+0x9b8> 49780f: 81 fe fe 03 00 00 cmp $0x3fe,%esi |
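The diagnosis above (a loop whose exit depends on a floating point compare) is the poster's reading of the profile and is not confirmed in this thread. As a general illustration of that hazard, and not a reconstruction of the pow_rn code, here is a minimal sketch of an iteration that cannot spin forever: it exits on a relative tolerance, treats NaN as an error (NaN never compares equal to anything, which is the case the `jne`/`jp` pair after `ucomisd` both route to the same target), and caps the iteration count. The helper name `converge` and its default parameters are hypothetical.

```python
import math


def converge(step, x0, rel_tol=1e-12, max_iter=1000):
    """Iterate x = step(x) until successive values agree within rel_tol.

    Returns (value, iterations). Raises ArithmeticError instead of looping
    forever when the sequence produces NaN or does not settle in max_iter steps.
    """
    x = x0
    for i in range(1, max_iter + 1):
        x_next = step(x)
        if math.isnan(x_next):
            # NaN != NaN, so an equality-based exit test would never fire.
            raise ArithmeticError(f"NaN after {i} iterations")
        if math.isclose(x_next, x, rel_tol=rel_tol, abs_tol=0.0):
            return x_next, i
        x = x_next
    raise ArithmeticError(f"no convergence within {max_iter} iterations")


# Newton's method for sqrt(2) settles in a handful of steps.
root, steps = converge(lambda x: 0.5 * (x + 2.0 / x), 1.0)
print(root, steps)
```

A sequence that oscillates (for example `lambda x: -x`) or returns NaN raises an error here, which a task could report and exit on rather than sitting at 100% on one core.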
Send message Joined: 16 Dec 07 Posts: 37 Credit: 25,768,067 RAC: 4,871 |
Returned ps_nbody_12_20_orphan_sim_2_1413455402_1482094 and it can't validate because of too many results. de_nbody_08_05_orphan_sim_0_1413455402_1236584 looks like it is going the same way. How can 6 or 4 (mt) apps not come up with a quorum between them, with my standard app the odd one out making up the numbers? Should 1.46 be taking 90 minutes to reach the first checkpoint on a single core of http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=496692 ? |
Send message Joined: 25 Dec 14 Posts: 8 Credit: 1,007,280 RAC: 0 |
I am losing a lot of tasks for the same reason: "too many results". Looking at the output logs from various users for a particular task, the reported results vary dramatically, with no two results the same. A negative result is often very informative too, and users should at least be credited rather than penalized with 0 credits. |
Send message Joined: 27 Oct 14 Posts: 1 Credit: 3,672,526 RAC: 0 |
Job name: de_nbody_12_20_orphan_sim_2_1413455402_1542933_0 System spec: the instance running this job is a KVM virtual machine running Gentoo Linux with 6 vCPUs attached to it. RAM is not an issue (more than half of it is still unused, no swap was touched at any point, and allocating more did not help). Physical CPU is an AMD Phenom(tm) II X6 1090T Processor. Issue: the job claims to be running on 6 CPUs, but never uses more than one. I ran the job for 52+ hours and it never showed any sign of giving an estimated time of completion. |
Send message Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
Hey All, It seems there has been an issue with the newly added multithreading routines in this version, and this appears to be the main error most people are having. I will work on finding a solution to this problem. This version's initialization algorithm requires significantly more time to complete, which would explain why some people see the percentage increase slowly at the start. We will try to change the time estimation routines to account for this properly. Cheers, Sidd |
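One way a progress report could account for a long initialization phase is to give that phase its own share of the overall fraction. The sketch below is only an illustration of that idea: the thread does not describe the planned change to the time estimation routines, `overall_fraction` is a hypothetical helper, and the 0.3 weight is a made-up value rather than anything measured from Nbody 1.46.

```python
def overall_fraction(init_done, sim_done, init_weight=0.3):
    """Blend two per-phase fractions into one overall fraction in [0, 1].

    init_weight is an assumed share of total runtime spent in initialization;
    the default is illustrative only.
    """
    for name, value in (("init_done", init_done),
                        ("sim_done", sim_done),
                        ("init_weight", init_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return init_weight * init_done + (1.0 - init_weight) * sim_done


# Halfway through initialization, simulation not yet started.
halfway_init = overall_fraction(0.5, 0.0)

# Initialization finished, simulation about to begin.
init_complete = overall_fraction(1.0, 0.0)

# Everything finished.
all_done = overall_fraction(1.0, 1.0)

print(halfway_init, init_complete, all_done)
```

With a weighting like this, the reported fraction moves steadily during initialization instead of sitting near zero, so an estimate extrapolated from it stays closer to the real remaining time.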
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
"It seems there has been an issue with the newly added multithreading routines in this version..." Is it just the MTs? I've seen the same kind of errors in the non-MTs. WU 690529199, for example. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,850,746 RAC: 3,406 |
I posted this message here a couple of weeks ago regarding "can't validate" 1.46 WUs, and I'm still getting a high percentage of them. I also posted that the program didn't appear to be checkpointing. Since then I've looked at a lot more "in progress" 1.46 tasks and still have not seen one that had checkpointed. |
Send message Joined: 8 Oct 07 Posts: 52 Credit: 5,850,746 RAC: 3,406 |
OK - I just looked at Task 945867764 and it has checkpointed. It has now run for over 7 hours and is around 30% complete. When I looked earlier (before it had checkpointed), it had run for well over an hour and was around 7% complete. |
©2024 Astroinformatics Group