Welcome to MilkyWay@home

New Nbody version 1.46

Sidd
Project developer
Project tester
Project scientist

Joined: 19 May 14
Posts: 73
Credit: 356,131
RAC: 0
Message 62863 - Posted: 19 Dec 2014, 19:24:07 UTC

Hey Everyone,

There is now a new version of Nbody, 1.46. This version calculates the initial dwarf galaxy using a different method than before. It should solve many of the work unit stalling issues some users were having. If there are any errors, let us know.

The runs currently up are:
ps_nbody_12_19_orphan_sim_1
de_nbody_12_19_orphan_sim_1

Cameron
Joined: 16 Dec 07
Posts: 37
Credit: 25,830,848
RAC: 4,767
Message 62865 - Posted: 20 Dec 2014, 11:02:35 UTC

Just returned de_nbody_12_19_orphan_sim_1_1413455402_1432766_0 due to a computation error running 1.46.

<core_client_version>7.4.27</core_client_version>
<![CDATA[
<message>
The system cannot find the drive specified.
(0xf) - exit code 15 (0xf)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.46 Windows x86_64 double , Crlibm </search_application>
Error reading histogram line 37: 1 -48.5294117647 0.0439655511 0.0013148967
21:05:57 (4564): called boinc_finish

</stderr_txt>
]]>

It appears to have run ps_nbody_08_05_orphan_sim_0_1413455402_1036953_3 successfully with 1.46 (just awaiting validation).

Dave
Joined: 19 Aug 08
Posts: 5
Credit: 17,837,653
RAC: 0
Message 62870 - Posted: 21 Dec 2014, 14:58:46 UTC

ps_nbody_08_05_orphan_sim_0_1413455402_437700_4 has been running for over 4 hours. It shows 44% complete and 1:06:46 time remaining. The time remaining and the percent complete are both increasing, very slowly. Looks like it will never finish. I'm aborting it now. This is on all 16 threads of an i7-5960x overclocked to 4.0 GHz.

/Dave

Dave
Joined: 19 Aug 08
Posts: 5
Credit: 17,837,653
RAC: 0
Message 62871 - Posted: 21 Dec 2014, 15:29:14 UTC
Last modified: 21 Dec 2014, 15:31:31 UTC

Just looked at my account. Recent statistics for NBody:

65 validated
15 error while computing
14 invalid
97 validation inconclusive

Nearly all of these are 1.46, with a few 1.44.

Other types of Milky Way work are completing without problem.

This is on an i7-5960x. The nbody workunits each use all 16 threads.

/Dave

Dave
Joined: 19 Aug 08
Posts: 5
Credit: 17,837,653
RAC: 0
Message 62872 - Posted: 21 Dec 2014, 15:30:39 UTC - in response to Message 62871.  
Last modified: 21 Dec 2014, 15:33:28 UTC

null

Jozef J
Joined: 4 Mar 10
Posts: 65
Credit: 639,958,626
RAC: 0
Message 62874 - Posted: 22 Dec 2014, 15:13:56 UTC

All (156) · In progress (48) · Validation pending (0) · Validation inconclusive (51) · Valid (54) · Invalid (1) · Error (2)
All this in half a day... :-(

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,923,986
RAC: 5,331
Message 62882 - Posted: 24 Dec 2014, 2:43:43 UTC
Last modified: 24 Dec 2014, 3:18:17 UTC

ps_nbody_12_19_orphan_sim_0_1413455402_1435063: "Completed, can't validate" and "Too many total results".

EDIT: And it looks like de_nbody_12_20_orphan_sim_2_1413455402_1448056 may be headed that way.

EDIT2: Just noticed that another, in progress, unit has been underway for over 45 minutes and is over 40% complete but it has not yet checkpointed.

Billy
Joined: 9 May 08
Posts: 2
Credit: 784,856
RAC: 0
Message 62884 - Posted: 25 Dec 2014, 13:54:04 UTC


Odysseus
Joined: 10 Nov 07
Posts: 96
Credit: 29,931,027
RAC: 0
Message 62886 - Posted: 26 Dec 2014, 0:21:56 UTC

I have a task on my MacBook Pro that was causing me concern, de_nbody_12_19_orphan_sim_1_1413455402_1431667, which was paused (superseded) showing more than 7000 hours to go (and a very small % complete) after 4.5 h of computation. But now that it’s resumed, I see the progress has reached 70% done with just 2 h to go from 5.2 h of CPU time. So it may be that there’s an initial period during which little progress is registered, causing BOINC to overestimate the time that will be required. Hoping so, anyway …
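The numbers in this post are what a naive linear extrapolation would give. The sketch below assumes, without confirmation from this thread, that remaining time is estimated as elapsed time scaled by the ratio of work left to work done. `estimate_remaining_hours` is a hypothetical helper, not a BOINC function, and the 0.064% fraction is an illustrative value chosen only to reproduce a figure near 7000 hours.

```python
def estimate_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Naive linear extrapolation: assumes progress accrues at a constant rate."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


# Figures from the post: 5.2 h of CPU time at 70% done gives roughly 2 h to go.
late = estimate_remaining_hours(5.2, 0.70)

# Hypothetical early state: 4.5 h elapsed but only 0.064% reported as done.
# A long initialization that reports almost no progress inflates the estimate.
early = estimate_remaining_hours(4.5, 0.00064)

print(f"late estimate:  {late:.2f} h")
print(f"early estimate: {early:.0f} h")
```

Under this assumption the estimate swings from thousands of hours to about two hours once the reported fraction starts moving, which matches the behaviour described.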

Odysseus
Joined: 10 Nov 07
Posts: 96
Credit: 29,931,027
RAC: 0
Message 62889 - Posted: 26 Dec 2014, 10:51:50 UTC - in response to Message 62886.  

Well, it finished pretty quickly … but with a computation error: “15 (0xf) Unknown error number”. I have two other tasks on board that were downloaded at the same time; the first of them has been progressing normally so far.

Peter
Joined: 13 Mar 10
Posts: 8
Credit: 4,126,109
RAC: 43
Message 62895 - Posted: 27 Dec 2014, 4:37:38 UTC

de_nbody_12_20_orphan_sim_2_1413455402_1485127_0
now up to 51+ hours, apparently most of that after reaching 100% complete (I haven't checked every day, unfortunately). Similar jobs (de_nbody with the same 10-digit serial number) are listed as requiring ~1 hour to complete.

I will abort this job, and hope the problem is fixed in the near future. I have had similar problems before, but hoped the new version 1.46 had fixed them.

BTW it would help to show the specific job failing if one could copy and paste from the BOINC Tasks listing, which I was unable to do.

Best regards
Pete
k5gm@amsat.org
Pete K5GM

Anthony Ayiomamitis
Joined: 25 Dec 14
Posts: 8
Credit: 1,007,280
RAC: 0
Message 62897 - Posted: 27 Dec 2014, 12:20:32 UTC

Although I have a lot of experience with BOINC, I am quite new to MilkyWay@Home.

A couple of days ago I downloaded the latest version of the software for a first-time installation on an old Pentium laptop (for testing purposes) as well as on a fairly good dual-core machine with a 2.67 GHz P9600 chip.

With the old Pentium, I do have two results which have completed successfully but are awaiting validation. However, the third task which started has been stuck at 5.3% complete for over 19 hrs, while the first two tasks suggest a maximum run time of about 10 hrs. This is the task of interest: de_nbody_12_20_orphan_sim_2_1413455402_1484793. It seems to be stuck in some sort of infinite loop, since the elapsed time just keeps increasing while the percentage completed is stuck at 5.267%. I have paused the task and started another one, and I will resume the former once the latter completes.

Of the fourteen tasks completed in the past 15 hrs or so, four have been credited, whereas the other ten are awaiting validation and confirmation. A weird ratio, to say the least, since for some of the completed tasks my result sits alongside two other results with no quorum yet.

rjs5
Joined: 18 Nov 10
Posts: 19
Credit: 181,063,873
RAC: 11,216
Message 62904 - Posted: 28 Dec 2014, 19:01:09 UTC - in response to Message 62895.  

ps_nbody_12_20_orphan_sim_2_1413455402_1450477_2

I have one too and it appears to be stuck in a loop where its exit is based on a floating point compare.

It has been running nbody 1.46mt at 100% completion for several hours on Ubuntu Linux.
Only one CPU is active of the 8 CPUs.
perf top indicates that execution is stuck in the pow_rn function for that single CPU running.

The only functions measuring non-zero execution time (using perf top) are:
88% pow_rn
11.75% 0x000....9cf72
0.03% pow_exact_rn
0.01% dsfmt_gen_rand_all


http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=676782111

I ran a perf record -a -- sleep 5 to capture what the machine was doing.
perf report shows ....

About 2.4% of the 88% above is being spent in the "subsd" instruction at 497738, set apart by blank lines in the objdump below. The remaining instructions down to the "jne" loop exit are at about 1% of execution time.


497716: 66 45 0f 28 cd movapd %xmm13,%xmm9
49771b: f2 45 0f 5c cf subsd %xmm15,%xmm9
497720: 66 45 0f 28 f9 movapd %xmm9,%xmm15
497725: f2 44 0f 10 4c 24 c8 movsd -0x38(%rsp),%xmm9
49772c: f2 45 0f 59 ce mulsd %xmm14,%xmm9
497731: f2 44 0f 59 74 24 e8 mulsd -0x18(%rsp),%xmm14

497738: f2 45 0f 5c cd subsd %xmm13,%xmm9

49773d: f2 44 0f 10 6c 24 f0 movsd -0x10(%rsp),%xmm13
497744: f2 44 0f 59 6c 24 c8 mulsd -0x38(%rsp),%xmm13
49774b: f2 45 0f 58 ce addsd %xmm14,%xmm9
497750: f2 45 0f 58 cd addsd %xmm13,%xmm9
497755: f2 44 0f 10 6c 24 f0 movsd -0x10(%rsp),%xmm13
49775c: f2 44 0f 59 6c 24 e8 mulsd -0x18(%rsp),%xmm13
497763: f2 45 0f 58 cd addsd %xmm13,%xmm9
497768: f2 45 0f 58 cc addsd %xmm12,%xmm9
49776d: f2 45 0f 58 f9 addsd %xmm9,%xmm15
497772: f2 44 0f 10 0d 85 00 movsd 0x30085(%rip),%xmm9 # 4c7800 <scs_sixinv+0x9180>
497779: 03 00
49777b: f2 45 0f 59 cf mulsd %xmm15,%xmm9
497780: f2 44 0f 58 0d 7f 00 addsd 0x3007f(%rip),%xmm9 # 4c7808 <scs_sixinv+0x9188>
497787: 03 00
497789: f2 45 0f 59 cf mulsd %xmm15,%xmm9
49778e: f2 44 0f 58 0d 71 1b addsd 0x21b71(%rip),%xmm9 # 4b9308 <p_n+0x2e28>
497795: 02 00
497797: f2 45 0f 59 cf mulsd %xmm15,%xmm9
49779c: f2 45 0f 59 d1 mulsd %xmm9,%xmm10
4977a1: f2 45 0f 58 d1 addsd %xmm9,%xmm10
4977a6: f2 44 0f 58 54 24 d8 addsd -0x28(%rsp),%xmm10
4977ad: f2 44 0f 59 54 24 d0 mulsd -0x30(%rsp),%xmm10
4977b4: f2 45 0f 58 da addsd %xmm10,%xmm11
4977b9: 66 45 0f 28 cb movapd %xmm11,%xmm9
4977be: f2 44 0f 5c 4c 24 d0 subsd -0x30(%rsp),%xmm9
4977c5: f2 45 0f 5c d1 subsd %xmm9,%xmm10
4977ca: 66 45 0f 28 cb movapd %xmm11,%xmm9
4977cf: f2 44 0f 58 54 24 e0 addsd -0x20(%rsp),%xmm10
4977d6: f2 45 0f 58 ca addsd %xmm10,%xmm9
4977db: 66 45 0f 28 e1 movapd %xmm9,%xmm12
4977e0: f2 45 0f 5c e3 subsd %xmm11,%xmm12
4977e5: f2 45 0f 5c d4 subsd %xmm12,%xmm10
4977ea: 0f 8c f8 00 00 00 jl 4978e8 <pow_rn+0x9c8>
4977f0: f2 44 0f 59 15 67 9c mulsd 0x29c67(%rip),%xmm10 # 4c1460 <scs_sixinv+0x2de0>
4977f7: 02 00
4977f9: f2 45 0f 58 d1 addsd %xmm9,%xmm10
4977fe: 66 45 0f 2e ca ucomisd %xmm10,%xmm9
497803: 0f 85 cf 00 00 00 jne 4978d8 <pow_rn+0x9b8>
497809: 0f 8a c9 00 00 00 jp 4978d8 <pow_rn+0x9b8>

49780f: 81 fe fe 03 00 00 cmp $0x3fe,%esi
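To illustrate why a loop whose exit depends on a floating point compare can fail to terminate, here is a generic sketch. It is not the crlibm `pow_rn` code and does not claim to reproduce this hang; `steps_exact` and `steps_tolerant` are hypothetical names, and `max_iter` is a safety cap added so the example finishes.

```python
import math


def steps_exact(step: float, target: float, max_iter: int) -> int:
    """Count iterations of a loop that exits only on exact equality."""
    x, n = 0.0, 0
    while x != target and n < max_iter:  # max_iter stops a runaway loop
        x += step
        n += 1
    return n


def steps_tolerant(step: float, target: float, max_iter: int, tol: float = 1e-9) -> int:
    """Same loop, but it exits once x is within tol of the target or past it."""
    x, n = 0.0, 0
    while x < target - tol and n < max_iter:
        x += step
        n += 1
    return n


# Ten additions of 0.1 give 0.9999999999999999, never exactly 1.0,
# so the exact-equality loop runs until the cap.
exact = steps_exact(0.1, 1.0, 1000)
tolerant = steps_tolerant(0.1, 1.0, 1000)

# NaN compares unequal to everything, including itself, so an equality-based
# exit can also never fire once a NaN enters the computation.
nan_is_unequal = math.nan != math.nan

print(exact, tolerant, nan_is_unequal)
```

Comparing against a tolerance, or bounding the iteration count, are the usual guards against this class of hang.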

Cameron
Joined: 16 Dec 07
Posts: 37
Credit: 25,830,848
RAC: 4,767
Message 62908 - Posted: 29 Dec 2014, 21:48:14 UTC

Returned ps_nbody_12_20_orphan_sim_2_1413455402_1482094 and it can't validate because of too many results.

de_nbody_08_05_orphan_sim_0_1413455402_1236584 looks like it is going the same way.

How can 6 or 4 (mt) apps not come up with a quorum between them, with my standard app the odd one out making up the numbers?

Should 1.46 be taking 90 minutes to reach the first checkpoint on a single core of
http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=496692

Anthony Ayiomamitis
Joined: 25 Dec 14
Posts: 8
Credit: 1,007,280
RAC: 0
Message 62909 - Posted: 29 Dec 2014, 22:38:20 UTC

I am losing a lot of tasks for the same reason: "too many results". Looking at the output logs from various users for a particular task, the reported results vary dramatically, with no two results being the same.
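As a rough illustration of why widely scattered results never validate, the sketch below checks whether enough results agree within a relative tolerance. It is an assumption about how quorum checking might work in general, not the actual BOINC or MilkyWay@home validator; `find_quorum`, `min_quorum`, and `rel_tol` are hypothetical names, and the sample values are made up.

```python
import math
from typing import List, Optional


def find_quorum(values: List[float], min_quorum: int,
                rel_tol: float = 1e-9) -> Optional[List[float]]:
    """Return the first group of at least min_quorum mutually close values, else None."""
    for candidate in values:
        # Collect every result that agrees with this candidate within rel_tol.
        group = [v for v in values if math.isclose(v, candidate, rel_tol=rel_tol)]
        if len(group) >= min_quorum:
            return group
    return None


# Two hosts agree, one is an outlier: a quorum of 2 is reached.
agreeing = find_quorum([-48.52, -48.52, -51.30], min_quorum=2)

# No two results match: no quorum, however many results are returned.
scattered = find_quorum([-48.52, -51.30, -60.10, -47.90], min_quorum=2)

print(agreeing, scattered)
```

If every host returns a different value, adding more results never produces a matching pair, so the work unit eventually reaches its result limit without validating.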

Many times a negative result is also very informative and users should at least be credited rather than being penalized with 0 credits.

root
Joined: 27 Oct 14
Posts: 1
Credit: 3,672,526
RAC: 0
Message 62956 - Posted: 5 Jan 2015, 3:39:13 UTC

Job name: de_nbody_12_20_orphan_sim_2_1413455402_1542933_0
System spec: the instance running this job is a KVM virtual machine running Gentoo Linux with 6 vCPUs attached to it. RAM is not an issue (more than half of it is still unused, no swap is touched at any point, and allocating more did not help). The physical CPU is an AMD Phenom(tm) II X6 1090T Processor.
Issue: the job claims to be running on 6 CPUs, but never uses more than one. I ran the job for 52+ hours and it never showed any sign of giving an estimated time of completion.

Sidd
Project developer
Project tester
Project scientist
Joined: 19 May 14
Posts: 73
Credit: 356,131
RAC: 0
Message 62961 - Posted: 5 Jan 2015, 18:29:29 UTC

Hey All,

It seems there has been an issue with the newly added multithreading routines in this version, and this appears to be the main error most people are having. I will work on finding a solution to this problem.

This version's initialization algorithm requires significantly more time to complete, which would explain why some people see only a small percentage increase at the start. We will try to change the time estimation routines to account for this properly.
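One way a progress report could account for a long initialization is to give that phase its own share of the reported fraction. The sketch below is only an illustration of that idea, not the project's actual estimation code; `fraction_done` and `init_weight` are hypothetical, and the 25% weight is an arbitrary example value.

```python
def fraction_done(init_progress: float, step: int, n_steps: int,
                  init_weight: float = 0.25) -> float:
    """Blend initialization progress and timestep progress into one fraction.

    init_progress: share of the initialization finished, in [0, 1].
    step, n_steps: current and total simulation timesteps.
    init_weight:   assumed share of total run time spent initializing.
    """
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    sim_progress = step / n_steps
    return init_weight * init_progress + (1.0 - init_weight) * sim_progress


# Halfway through initialization, no timesteps yet: 12.5% instead of 0%.
during_init = fraction_done(0.5, 0, 1000)

# Initialization finished, half of the timesteps done.
midway = fraction_done(1.0, 500, 1000)

print(during_init, midway)
```

With a weight close to the real cost of initialization, the reported fraction would rise steadily from the start instead of sitting near zero for hours.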


Cheers,
Sidd

ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 62977 - Posted: 8 Jan 2015, 1:16:43 UTC - in response to Message 62961.  

It seems there has been an issue with the newly added multithreading routines in this version...

Is it just the MTs? I've seen the same kind of errors in the non-MTs. WU 690529199, for example.

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,923,986
RAC: 5,331
Message 62991 - Posted: 11 Jan 2015, 19:24:49 UTC

I posted this message here a couple of weeks ago regarding "can't validate" 1.46 WUs, and I'm still getting a high percentage of them. I also posted that the program didn't appear to be checkpointing. Since then I've looked at a lot more "in progress" 1.46 tasks and still have not seen one that had checkpointed.

Stick
Joined: 8 Oct 07
Posts: 52
Credit: 5,923,986
RAC: 5,331
Message 62993 - Posted: 12 Jan 2015, 2:49:46 UTC - in response to Message 62991.  

OK - I just looked at Task 945867764 and it has checkpointed. It has now run for over 7 hours and is around 30% complete. When I looked earlier (before it had checkpointed), it had run for well over an hour and was around 7% complete.

©2024 Astroinformatics Group