Welcome to MilkyWay@home

issues with workunits crashing might be fixed now and nbody work generation information


Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 54933 - Posted: 29 Jun 2012, 4:24:40 UTC

While debugging the nbody work generation I think I might have found out why the workunits were still crashing. I think the server might have been sending out the wrong parameter file (i.e., a different one than the one I specified), so the workunits were still using the old file and crashing. With these updates, this issue might really be fixed.

Also, the nbody work generation code dynamically estimates the fpops on a per-workunit basis, so I'm hoping this will result in more accurate runtime estimates and generated credit. Let me know how that goes.
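
For readers wondering why a per-workunit fpops estimate matters: a BOINC client derives its runtime estimate for a task roughly from the workunit's estimated floating-point operation count (the `rsc_fpops_est` field) divided by the speed of the host. The sketch below illustrates only that relationship. The scaling formula in `estimate_fpops`, the constant `OPS_PER_INTERACTION`, and the sample numbers are hypothetical and are not taken from the MilkyWay@home work generator.

```python
import math

# Hypothetical cost per interaction; an assumption for illustration,
# not a value from the MilkyWay@home work generator.
OPS_PER_INTERACTION = 50.0


def estimate_fpops(n_bodies: int, n_steps: int) -> float:
    """Rough operation count, assuming n * log2(n) interactions per step."""
    return OPS_PER_INTERACTION * n_bodies * math.log2(n_bodies) * n_steps


def estimate_runtime_seconds(fpops_est: float, host_flops: float) -> float:
    """Runtime estimate in seconds: estimated operations divided by host speed."""
    return fpops_est / host_flops


if __name__ == "__main__":
    fpops = estimate_fpops(n_bodies=100_000, n_steps=1_000)
    seconds = estimate_runtime_seconds(fpops, host_flops=2.0e9)
    print(f"{fpops:.3e} fpops, about {seconds:.1f} s at 2 GFLOPS")
```

If the estimate is too low for a given workunit, the predicted runtime is too short by the same factor, which is why a fixed estimate shared by workunits of very different sizes gives poor predictions.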

--Travis

boroda3
Joined: 11 Mar 12
Posts: 3
Credit: 86,944,865
RAC: 0
Message 54938 - Posted: 29 Jun 2012, 5:53:30 UTC - in response to Message 54933.  

> I think the server might have been sending out the wrong parameter file...
> Let me know how that goes.

Tasks 244932752, 244932709 and 244932708 crashed:

<stderr_txt>
Argument parsing error: 3141327896: number too large or too small
<search_application> milkyway_nbody_0.80_windows_x86_64__mt.exe 0.80 Windows x86 double OpenMP, Crlibm </search_application>
12:07:44 (3364): called boinc_finish

</stderr_txt>

The numbers are different, but the reason is the same.

The other 6 tasks finished OK.

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 54939 - Posted: 29 Jun 2012, 5:59:07 UTC - in response to Message 54938.  

> > I think the server might have been sending out the wrong parameter file...
> > Let me know how that goes.
>
> Tasks 244932752, 244932709 and 244932708 crashed:
>
> Argument parsing error: 3141327896: number too large or too small
> milkyway_nbody_0.80_windows_x86_64__mt.exe 0.80 Windows x86 double OpenMP, Crlibm
> 12:07:44 (3364): called boinc_finish
>
> The numbers are different, but the reason is the same.
>
> The other 6 tasks finished OK.


That should be fixed (see the other thread). The workunits being generated now shouldn't generate seeds that are too large.
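
The failing value in the log above, 3141327896, is larger than 2147483647, the maximum of a signed 32-bit integer, which is consistent with a seed that overflows a 32-bit parse. A minimal sketch of a range check, and of generating seeds that stay in range, follows. The function names and the choice of a signed 32-bit limit are assumptions for illustration, not the project's actual generator code.

```python
import random

INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit value


def fits_signed_32(value: int) -> bool:
    """Return True if value is representable as a signed 32-bit integer."""
    return -(2**31) <= value <= INT32_MAX


def make_seed(rng: random.Random) -> int:
    """Draw a non-negative seed that always fits in a signed 32-bit integer."""
    return rng.randrange(0, INT32_MAX + 1)


if __name__ == "__main__":
    print(fits_signed_32(3141327896))  # False: exceeds INT32_MAX
    print(fits_signed_32(make_seed(random.Random(0))))  # True
```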


Llewellyn Weaver
Joined: 24 Aug 11
Posts: 2
Credit: 8,408,125
RAC: 0
Message 54959 - Posted: 1 Jul 2012, 13:37:05 UTC - in response to Message 54933.  

"Also, the nbody work generation code dynamically estimates the fpops on a per-workunit basis, so I'm hoping this will result in more accurate runtime estimates and generated credit. Let me know how that goes."

Since June 29, 2012, my workunit credits appear to be going backwards, almost like 1 step forward and 3 steps back! What gives?

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 54961 - Posted: 1 Jul 2012, 19:22:15 UTC - in response to Message 54959.

> "Also, the nbody work generation code dynamically estimates the fpops on a per-workunit basis, so I'm hoping this will result in more accurate runtime estimates and generated credit. Let me know how that goes."
>
> Since June 29, 2012, my workunit credits appear to be going backwards, almost like 1 step forward and 3 steps back! What gives?


Your credit is going backwards? That doesn't make any sense. All the credit you've been granted has been positive...

TOTEM
Joined: 22 Jun 12
Posts: 2
Credit: 377,594
RAC: 0
Message 54979 - Posted: 3 Jul 2012, 20:34:55 UTC - in response to Message 54933.  

July 4th. Still crashing.

Jimmy Gondek
Joined: 28 Sep 11
Posts: 60
Credit: 22,764,173
RAC: 0
Message 54991 - Posted: 4 Jul 2012, 1:57:46 UTC
Last modified: 4 Jul 2012, 2:25:50 UTC

...July 3rd, just had a slew of "Computation Errors" go back to you...

EDIT: MilkyWay@home was just running a string of six NBody 0.84 tasks and all were coming up with the same "Computation Error", so I decided to exit MilkyWay@home/BOINC Manager after the third one just to see if the software was acting up. The effort yielded the same result: the remaining three 0.84 tasks errored out as well, but a 1.01 task finished afterwards with no problem.

Typical stderr for the NBodys...

Task 248919949:
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=248919949

ps_plum_slice_EMD_NEW_100K_1341346802_6694_1

<core_client_version>6.12.43</core_client_version>
<![CDATA[
<message>
process exited with code 15 (0xf, -241)
</message>
<stderr_txt>
<search_application> milkyway_nbody 0.84 Darwin x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 8 max threads on a system with 8 processors
Error reading histogram line 37: massPerParticle = 0.000100
21:55:47 (80181): called boinc_finish

</stderr_txt>
]]>
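
The three numbers in the "process exited with code" line appear to be the same status shown three ways: decimal, hexadecimal, and a negative value equal to the status minus 256. The sketch below reproduces that pattern for the codes reported in this thread. The function name is hypothetical, and the interpretation is inferred from the logs rather than from BOINC documentation.

```python
def describe_exit_status(status: int) -> str:
    """Format an 8-bit exit status as decimal, hex, and status minus 256."""
    if not 0 <= status <= 255:
        raise ValueError("expected an exit status in the range 0..255")
    return f"{status} (0x{status:x}, {status - 256})"


if __name__ == "__main__":
    print(describe_exit_status(15))   # 15 (0xf, -241)
    print(describe_exit_status(193))  # 193 (0xc1, -63)
```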

rjs5
Joined: 18 Nov 10
Posts: 17
Credit: 76,158,375
RAC: 0
Message 55010 - Posted: 4 Jul 2012, 16:46:45 UTC - in response to Message 54991.  

I am having similar failures running on 64-bit Linux. I am running an old version of BOINC, which is the one that is easiest to get running on CentOS.

I saw similar problems on Einstein and was able to clear the compute errors by running "ldd" to see which libraries Einstein could not find. For Einstein, I had to install 32-bit versions of some libraries (GLUT,...).

MilkyWay is statically linked and stripped of symbols, so missing libraries are not the problem for MilkyWay.
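
The "ldd" check described above can be automated. The following sketch scans ldd-style output for libraries reported as "not found"; the sample text and library names are made up for illustration. For a statically linked binary, ldd typically reports that the file is not a dynamic executable, so the list comes back empty.

```python
def missing_libraries(ldd_output: str) -> list[str]:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


# Made-up sample output; real output comes from running `ldd <binary>`.
SAMPLE = """\
\tlinux-gate.so.1 (0xf7f0a000)
\tlibglut.so.3 => not found
\tlibc.so.6 => /lib/libc.so.6 (0xf7d00000)
"""

if __name__ == "__main__":
    print(missing_libraries(SAMPLE))  # ['libglut.so.3']
```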


rod


Task 249098203

Stderr output

<core_client_version>6.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 15 (0xf, -241)
</message>
<stderr_txt>
<search_application> milkyway_nbody 0.88 Linux x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 4 max threads on a system with 4 processors
Error reading histogram line 37: massPerParticle = 0.000100
23:07:55 (8730): called boinc_finish

</stderr_txt>
]]>

gyx
Joined: 29 May 12
Posts: 1
Credit: 110,369
RAC: 0
Message 55021 - Posted: 5 Jul 2012, 2:15:15 UTC - in response to Message 54991.  

I have similar symptoms:
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=249667580

I also tried building from the goodNBodyCompar2 branch (a6b8b34); no luck yet:
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=245911387

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
<search_application> milkyway_nbody 0.93 Linux x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 2 processors
Warning: not applying timestep correction for workunit with min version 0.80
<search_likelihood>-nan</search_likelihood>
Failed to calculate likelihood
*** glibc detected *** ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody: corrupted double-linked list: 0x000000000248b940 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x75b46)[0x7f2c97987b46]
/lib/x86_64-linux-gnu/libc.so.6(+0x77547)[0x7f2c97989547]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f2c9798c87c]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(destroyNBodyState+0xe9)[0x4248e9]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(nbMain+0x27a)[0x41d9ea]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(main+0x2e7)[0x41baf7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f2c97930ead]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody[0x41bd5d]
======= Memory map: ========
00400000-004c5000 r-xp 00000000 08:09 1194356 /usr/lib/boinc-app-milkyway-local/milkyway_nbody
006c5000-006c7000 rw-p 000c5000 08:09 1194356 /usr/lib/boinc-app-milkyway-local/milkyway_nbody
006c7000-006f6000 rw-p 00000000 00:00 0
0103e000-024ac000 rw-p 00000000 00:00 0 [heap]
7f2c90000000-7f2c90021000 rw-p 00000000 00:00 0
7f2c90021000-7f2c94000000 ---p 00000000 00:00 0
7f2c9661c000-7f2c96631000 r-xp 00000000 08:09 1586269 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f2c96631000-7f2c96831000 ---p 00015000 08:09 1586269 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f2c96831000-7f2c96832000 rw-p 00015000 08:09 1586269 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f2c96846000-7f2c96847000 rw-p 00000000 00:00 0
7f2c96847000-7f2c96cdc000 rw-s 00000000 08:09 81906 /var/lib/boinc-client/slots/0/boinc_milkyway_nbody_0
7f2c96cdc000-7f2c97912000 rw-p 00000000 00:00 0
7f2c97912000-7f2c97a8f000 r-xp 00000000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so
7f2c97a8f000-7f2c97c8f000 ---p 0017d000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so
7f2c97c8f000-7f2c97c93000 r--p 0017d000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so
7f2c97c93000-7f2c97c94000 rw-p 00181000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so
7f2c97c94000-7f2c97c99000 rw-p 00000000 00:00 0
7f2c97c99000-7f2c97cb0000 r-xp 00000000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f2c97cb0000-7f2c97eaf000 ---p 00017000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f2c97eaf000-7f2c97eb0000 r--p 00016000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f2c97eb0000-7f2c97eb1000 rw-p 00017000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so
7f2c97eb1000-7f2c97eb5000 rw-p 00000000 00:00 0
7f2c97eb5000-7f2c97ec3000 r-xp 00000000 08:09 1226439 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f2c97ec3000-7f2c980c2000 ---p 0000e000 08:09 1226439 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f2c980c2000-7f2c980c3000 rw-p 0000d000 08:09 1226439 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f2c980c3000-7f2c98144000 r-xp 00000000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so
7f2c98144000-7f2c98343000 ---p 00081000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so
7f2c98343000-7f2c98344000 r--p 00080000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so
7f2c98344000-7f2c98345000 rw-p 00081000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so
7f2c98345000-7f2c9834c000 r-xp 00000000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so
7f2c9834c000-7f2c9854b000 ---p 00007000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so
7f2c9854b000-7f2c9854c000 r--p 00006000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so
7f2c9854c000-7f2c9854d000 rw-p 00007000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so
7f2c9854d000-7f2c9856d000 r-xp 00000000 08:09 957384 /lib/x86_64-linux-gnu/ld-2.13.so
7f2c98752000-7f2c98756000 rw-p 00000000 00:00 0
7f2c98764000-7f2c98765000 ---p 00000000 00:00 0
7f2c98765000-7f2c98768000 rw-p 00000000 00:00 0
7f2c98768000-7f2c9876a000 rw-s 00000000 00:04 6782993 /SYSV01093f70 (deleted)
7f2c9876a000-7f2c9876c000 rw-p 00000000 00:00 0
7f2c9876c000-7f2c9876d000 r--p 0001f000 08:09 957384 /lib/x86_64-linux-gnu/ld-2.13.so
7f2c9876d000-7f2c9876e000 rw-p 00020000 08:09 957384 /lib/x86_64-linux-gnu/ld-2.13.so
7f2c9876e000-7f2c9876f000 rw-p 00000000 00:00 0
7fff518ae000-7fff518cf000 rw-p 00000000 00:00 0 [stack]
7fff51993000-7fff51994000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
SIGABRT: abort called
Stack trace (13 frames):
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(boinc_catch_signal+0xf7)[0x4766d3]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f2c97ca8030]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f2c97944475]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7f2c979476f0]
/lib/x86_64-linux-gnu/libc.so.6(+0x6c2fb)[0x7f2c9797e2fb]
/lib/x86_64-linux-gnu/libc.so.6(+0x75b46)[0x7f2c97987b46]
/lib/x86_64-linux-gnu/libc.so.6(+0x77547)[0x7f2c97989547]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f2c9798c87c]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(destroyNBodyState+0xe9)[0x4248e9]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(nbMain+0x27a)[0x41d9ea]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(main+0x2e7)[0x41baf7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f2c97930ead]
../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody[0x41bd5d]

Exiting...

</stderr_txt>
]]>

rjs5
Joined: 18 Nov 10
Posts: 17
Credit: 76,158,375
RAC: 0
Message 55022 - Posted: 5 Jul 2012, 3:40:18 UTC - in response to Message 55021.  

It is hard to tell what is happening with this stripped, statically linked program. I don't know how the program manages the different system call interfaces (various versions of Linux) with a single static link. It has been a long time since I have seen anyone statically link anything.

Intel has released a new beta version of their compiler that performs dynamic runtime pointer checking, which might help locate the bug, but there is nothing MilkyWay@Home users can do to help other than to say "still failing".
http://software.intel.com/en-us/articles/beta-tech-talks/

Intel has both a Fortran and a C compiler that should help clean up bogus pointers. The developers could run a test on the application and locate the corrupted pointer.

An objdump of the application shows that there are AVX instructions in what I guess is the OpenMP code. I have Sandy Bridge and Ivy Bridge systems and ONE Nehalem system. The Nehalem system seems to work. The Sandy Bridge and Ivy Bridge systems, which use AVX (via OpenMP), are failing.
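
One way to compare hosts on this point is to check whether the CPU advertises AVX. On Linux, /proc/cpuinfo lists a "flags" line per core; the sketch below parses such text and reports whether "avx" is present. The sample strings are abbreviated, made-up flag lists, and the check says nothing about whether AVX is actually the cause of the failures.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if any 'flags' line in cpuinfo-style text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Compare whole tokens so that "avx2" does not count as "avx".
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


# Abbreviated, made-up flag lists for illustration only.
WITH_AVX = "model name\t: example CPU\nflags\t\t: fpu sse sse2 sse4_2 avx\n"
WITHOUT_AVX = "model name\t: example CPU\nflags\t\t: fpu sse sse2 sse4_2\n"

if __name__ == "__main__":
    print(has_cpu_flag(WITH_AVX, "avx"))     # True
    print(has_cpu_flag(WITHOUT_AVX, "avx"))  # False
    # On a Linux host: has_cpu_flag(open("/proc/cpuinfo").read(), "avx")
```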



If you get a computation error, the choices are to (1) turn off work or (2) let the compute errors pile up.

The workloads fail pretty rapidly so I am going to let the compute errors filter back to the system and it will be clear when they have fixed the bug.


Penguin
Joined: 4 Mar 12
Posts: 42
Credit: 23,505,794
RAC: 0
Message 55031 - Posted: 5 Jul 2012, 23:13:04 UTC - in response to Message 55022.  

Yep, still getting failed WUs here too.

©2019 Astroinformatics Group