issues with workunits crashing might be fixed now and nbody work generation information
Author | Message |
---|---|
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
While debugging the nbody work generation I think I might have found out why the workunits were still crashing. I think the server might have been sending out the wrong parameter file (i.e., a different one than I specified), so they were still using the old one and crashing. So with these updates this issue might really be fixed. Also, the nbody work generation code dynamically estimates the fpops on a per-workunit basis, so I'm hoping this will result in more accurate runtime estimates and generated credit. Let me know how that goes. --Travis |
Joined: 11 Mar 12 Posts: 3 Credit: 86,944,865 RAC: 0 |
"I think the server might have been sending out the wrong parameter file..." Tasks 244932752, 244932709 and 244932708 crashed: <stderr_txt> Argument parsing error: 3141327896: number too large or too small <search_application> milkyway_nbody_0.80_windows_x86_64__mt.exe 0.80 Windows x86 double OpenMP, Crlibm </search_application> 12:07:44 (3364): called boinc_finish </stderr_txt> The numbers differ, but the cause is the same. The other 6 tasks finished OK. |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I think the server might have been sending out the wrong parameter file... That should be fixed (see the other thread). The workunits being generated now shouldn't generate seeds that are too large. |
Joined: 24 Aug 11 Posts: 2 Credit: 8,408,125 RAC: 0 |
"Also, the nbody work generation code dynamically estimates the fpops on a per-workunit basis, so I'm hoping this will result in more accurate runtime estimates and generated credit. Let me know how that goes." Since June 29, 2012 my work unit credits appear to be going backwards. Almost like 1 step forward and 3 steps back! What gives? |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"Also, the nbody work generation code dynamically estimates the fpops on a per-workunit basis, so I'm hoping this will result in more accurate runtime estimates and generated credit. Let me know how that goes." Your credit is going backwards? That doesn't make any sense. All the credit you've been granted has been positive... |
Joined: 22 Jun 12 Posts: 2 Credit: 377,594 RAC: 0 |
July 4th. Still crashing. |
Joined: 28 Sep 11 Posts: 60 Credit: 22,764,173 RAC: 0 |
...July 3rd, just had a slew of "Computation Errors" go back to you... EDIT:...MWAH was just performing a string of 6 NBody 0.84s and all were coming up the same "Computation Error" so I decided to exit MWAH/BOINC Manager after the 3rd one just to see if the software was acting up...the effort yielded the same result, the remaining 3 0.84s errored out, as well...but a 1.01 finished subsequently with no problem... Typical stderr for the NBodys... Task 248919949: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=248919949 ps_plum_slice_EMD_NEW_100K_1341346802_6694_1 <core_client_version>6.12.43</core_client_version> <![CDATA[ <message> process exited with code 15 (0xf, -241) </message> <stderr_txt> <search_application> milkyway_nbody 0.84 Darwin x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 8 max threads on a system with 8 processors Error reading histogram line 37: massPerParticle = 0.000100 21:55:47 (80181): called boinc_finish </stderr_txt> ]]> |
Joined: 18 Nov 10 Posts: 19 Credit: 180,131,652 RAC: 24,307 |
I am having similar failures running on 64-bit Linux. I am running an old version of BOINC, which is the one that is easiest to get running on CentOS. I saw similar problems on Einstein and was able to clear compute errors by running "ldd" to see which libraries Einstein could not find. For Einstein, I had to install some 32-bit versions of libraries (GLUT,...). Milkyway is statically linked and stripped of symbols, so missing libraries are not the problem for Milkyway. rod Task 249098203 Stderr output <core_client_version>6.10.45</core_client_version> <![CDATA[ <message> process exited with code 15 (0xf, -241) </message> <stderr_txt> <search_application> milkyway_nbody 0.88 Linux x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 4 max threads on a system with 4 processors Error reading histogram line 37: massPerParticle = 0.000100 23:07:55 (8730): called boinc_finish </stderr_txt> ]]> |
Joined: 29 May 12 Posts: 1 Credit: 110,369 RAC: 0 |
I have similar symptoms: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=249667580 I also tried to build from goodNBodyCompar2 branch (a6b8b34), no luck yet: http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=245911387 <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> <search_application> milkyway_nbody 0.93 Linux x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 1 max threads on a system with 2 processors Warning: not applying timestep correction for workunit with min version 0.80 <search_likelihood>-nan</search_likelihood> Failed to calculate likelihood *** glibc detected *** ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody: corrupted double-linked list: 0x000000000248b940 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x75b46)[0x7f2c97987b46] /lib/x86_64-linux-gnu/libc.so.6(+0x77547)[0x7f2c97989547] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f2c9798c87c] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(destroyNBodyState+0xe9)[0x4248e9] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(nbMain+0x27a)[0x41d9ea] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(main+0x2e7)[0x41baf7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f2c97930ead] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody[0x41bd5d] ======= Memory map: ======== 00400000-004c5000 r-xp 00000000 08:09 1194356 /usr/lib/boinc-app-milkyway-local/milkyway_nbody 006c5000-006c7000 rw-p 000c5000 08:09 1194356 /usr/lib/boinc-app-milkyway-local/milkyway_nbody 006c7000-006f6000 rw-p 00000000 00:00 0 0103e000-024ac000 rw-p 00000000 00:00 0 [heap] 7f2c90000000-7f2c90021000 rw-p 00000000 00:00 0 7f2c90021000-7f2c94000000 ---p 00000000 00:00 0 7f2c9661c000-7f2c96631000 r-xp 00000000 08:09 1586269 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f2c96631000-7f2c96831000 ---p 00015000 08:09 1586269 /lib/x86_64-linux-gnu/libgcc_s.so.1 
7f2c96831000-7f2c96832000 rw-p 00015000 08:09 1586269 /lib/x86_64-linux-gnu/libgcc_s.so.1 7f2c96846000-7f2c96847000 rw-p 00000000 00:00 0 7f2c96847000-7f2c96cdc000 rw-s 00000000 08:09 81906 /var/lib/boinc-client/slots/0/boinc_milkyway_nbody_0 7f2c96cdc000-7f2c97912000 rw-p 00000000 00:00 0 7f2c97912000-7f2c97a8f000 r-xp 00000000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so 7f2c97a8f000-7f2c97c8f000 ---p 0017d000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so 7f2c97c8f000-7f2c97c93000 r--p 0017d000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so 7f2c97c93000-7f2c97c94000 rw-p 00181000 08:09 957362 /lib/x86_64-linux-gnu/libc-2.13.so 7f2c97c94000-7f2c97c99000 rw-p 00000000 00:00 0 7f2c97c99000-7f2c97cb0000 r-xp 00000000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so 7f2c97cb0000-7f2c97eaf000 ---p 00017000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so 7f2c97eaf000-7f2c97eb0000 r--p 00016000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so 7f2c97eb0000-7f2c97eb1000 rw-p 00017000 08:09 957380 /lib/x86_64-linux-gnu/libpthread-2.13.so 7f2c97eb1000-7f2c97eb5000 rw-p 00000000 00:00 0 7f2c97eb5000-7f2c97ec3000 r-xp 00000000 08:09 1226439 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0 7f2c97ec3000-7f2c980c2000 ---p 0000e000 08:09 1226439 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0 7f2c980c2000-7f2c980c3000 rw-p 0000d000 08:09 1226439 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0 7f2c980c3000-7f2c98144000 r-xp 00000000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so 7f2c98144000-7f2c98343000 ---p 00081000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so 7f2c98343000-7f2c98344000 r--p 00080000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so 7f2c98344000-7f2c98345000 rw-p 00081000 08:09 957369 /lib/x86_64-linux-gnu/libm-2.13.so 7f2c98345000-7f2c9834c000 r-xp 00000000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so 7f2c9834c000-7f2c9854b000 ---p 00007000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so 7f2c9854b000-7f2c9854c000 r--p 00006000 08:09 957382 
/lib/x86_64-linux-gnu/librt-2.13.so 7f2c9854c000-7f2c9854d000 rw-p 00007000 08:09 957382 /lib/x86_64-linux-gnu/librt-2.13.so 7f2c9854d000-7f2c9856d000 r-xp 00000000 08:09 957384 /lib/x86_64-linux-gnu/ld-2.13.so 7f2c98752000-7f2c98756000 rw-p 00000000 00:00 0 7f2c98764000-7f2c98765000 ---p 00000000 00:00 0 7f2c98765000-7f2c98768000 rw-p 00000000 00:00 0 7f2c98768000-7f2c9876a000 rw-s 00000000 00:04 6782993 /SYSV01093f70 (deleted) 7f2c9876a000-7f2c9876c000 rw-p 00000000 00:00 0 7f2c9876c000-7f2c9876d000 r--p 0001f000 08:09 957384 /lib/x86_64-linux-gnu/ld-2.13.so 7f2c9876d000-7f2c9876e000 rw-p 00020000 08:09 957384 /lib/x86_64-linux-gnu/ld-2.13.so 7f2c9876e000-7f2c9876f000 rw-p 00000000 00:00 0 7fff518ae000-7fff518cf000 rw-p 00000000 00:00 0 [stack] 7fff51993000-7fff51994000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] SIGABRT: abort called Stack trace (13 frames): ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(boinc_catch_signal+0xf7)[0x4766d3] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f2c97ca8030] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f2c97944475] /lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7f2c979476f0] /lib/x86_64-linux-gnu/libc.so.6(+0x6c2fb)[0x7f2c9797e2fb] /lib/x86_64-linux-gnu/libc.so.6(+0x75b46)[0x7f2c97987b46] /lib/x86_64-linux-gnu/libc.so.6(+0x77547)[0x7f2c97989547] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f2c9798c87c] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(destroyNBodyState+0xe9)[0x4248e9] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(nbMain+0x27a)[0x41d9ea] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody(main+0x2e7)[0x41baf7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f2c97930ead] ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody[0x41bd5d] Exiting... </stderr_txt> ]]> |
Joined: 18 Nov 10 Posts: 19 Credit: 180,131,652 RAC: 24,307 |
It is hard to tell what is happening with this stripped, statically linked program. I don't know how the program manages the different system call interfaces (various versions of Linux) with a single static link. It has been a long time since I have seen anyone statically link anything. Intel has released a new beta version of their compiler that performs dynamic, runtime pointer checking, which might help locate the bug, but there is nothing MilkyWay@Home users can do to help other than to say ... still failing. http://software.intel.com/en-us/articles/beta-tech-talks/ They have both a Fortran and a C compiler that should help clean up bogus pointers. The developers can run a test on the application and locate the corrupted pointer. An objdump of the application shows that there are AVX instructions in what I guess is the OpenMP code. I have Sandy Bridge and Ivy Bridge systems and ONE Nehalem system. The Nehalem system seems to work. The Sandy/Ivy Bridge systems using AVX (via OpenMP) are failing. If you get a computation error, the choices are to (1) turn off work or (2) let the compute errors pile up. The workunits fail pretty rapidly, so I am going to let the compute errors filter back to the system and it will be clear when they have fixed the bug. |
Joined: 4 Mar 12 Posts: 45 Credit: 460,132,234 RAC: 940 |
Yep, still getting failed WUs here too. |
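
The seed error quoted earlier in the thread (`Argument parsing error: 3141327896: number too large or too small`) is consistent with the seed being parsed into a signed 32-bit integer, whose maximum is 2,147,483,647. That is an assumption; the thread does not state the application's actual seed type. A minimal Python sketch of the range check a work generator could apply, with `fits_int32` as a hypothetical helper name:

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2147483647


def fits_int32(value: int) -> bool:
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


# Seed taken from the error message quoted in the thread.
seed = 3141327896
print(fits_int32(seed))    # False: larger than 2147483647
print(seed <= 2**32 - 1)   # True: it would still fit an unsigned 32-bit integer
```

If the assumption holds, generating seeds in the signed range (or masking them into it) would avoid this particular argument parsing error.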
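
The two exit messages quoted in the thread, `code 15 (0xf, -241)` and `code 193 (0xc1, -63)`, show the same status three ways. In both, the third number equals the status minus 256, which is what results from setting every bit above the low byte. This reading is inferred only from the two quoted values, not from BOINC documentation, and `describe_exit_status` is a hypothetical name:

```python
def describe_exit_status(status: int) -> str:
    """Format a status as decimal, hex, and the low byte with all higher bits set."""
    # For 0 <= status <= 255 this equals status - 256.
    signed = (-1 << 8) | status
    return f"code {status} ({hex(status)}, {signed})"


print(describe_exit_status(15))   # code 15 (0xf, -241)
print(describe_exit_status(193))  # code 193 (0xc1, -63)
```

So the negative numbers in these messages carry no extra information; 15 and 193 are the statuses to compare across hosts.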
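
The observation earlier in the thread that Sandy Bridge and Ivy Bridge hosts fail while a Nehalem host works can be checked per machine on Linux, where `/proc/cpuinfo` lists CPU feature flags such as `avx`. A minimal sketch follows; `has_cpu_flag` is a hypothetical helper, and a positive result only shows that the CPU supports AVX, not that an AVX code path causes the crash:

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists the flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Compare whole tokens so that "avx2" does not count as "avx".
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX supported:", has_cpu_flag(f.read(), "avx"))
    except OSError:
        print("/proc/cpuinfo not available on this system")
```

Reporting this flag alongside the failing task IDs would let the developers see whether failures really track AVX-capable hosts.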
©2024 Astroinformatics Group