Welcome to MilkyWay@home

Signal 11 on x86_64 for milkyway_separation_1.02

Message boards : Application Code Discussion : Signal 11 on x86_64 for milkyway_separation_1.02

christophe

Joined: 24 Dec 11
Posts: 2
Credit: 9,650,503
RAC: 0
Message 60626 - Posted: 19 Dec 2013, 18:15:42 UTC

Hi,

milkyway_separation_1.02_x86_64-pc-linux-gnu__opencl_amd_ati always fails on my machine, even though other applications such as milkyway_separation__modified_fit_1.28_x86_64-pc-linux-gnu__opencl_amd_ati run fine.

I run other GPU boinc projects without any problems.

I was running on 13.4 drivers and just updated to 13.11 Beta V9.4, but nothing changed.

I have been investigating this with gdb.

Here is the error:
gdb ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation_1.02_x86_64-pc-linux-gnu__opencl_amd_ati
> r -a astronomy_parameters.txt

.....

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff290b700 (LWP 5586)]
0x00007ffff496a7f5 in gpu::VirtualGPU::setActiveKernelDesc(amd::NDRangeContainer const&, gpu::Kernel const*) () from /usr/lib/libamdocl64.so
(gdb) bt
#0 0x00007ffff496a7f5 in gpu::VirtualGPU::setActiveKernelDesc(amd::NDRangeContainer const&, gpu::Kernel const*) () from /usr/lib/libamdocl64.so
#1 0x00007ffff496aabd in gpu::VirtualGPU::submitKernelInternal(amd::NDRangeContainer const&, amd::Kernel const&, unsigned char const*, bool) () from /usr/lib/libamdocl64.so
#2 0x00007ffff496f059 in gpu::VirtualGPU::submitKernel(amd::NDRangeKernelCommand&) () from /usr/lib/libamdocl64.so
#3 0x00007ffff4900810 in amd::CommandQueue::loop(device::VirtualDevice*) () from /usr/lib/libamdocl64.so
#4 0x00007ffff4901085 in amd::CommandQueue::Thread::run(void*) () from /usr/lib/libamdocl64.so
#5 0x00007ffff4917321 in amd::Thread::main() () from /usr/lib/libamdocl64.so
#6 0x00007ffff491487c in amd::Thread::entry(amd::Thread*) () from /usr/lib/libamdocl64.so
#7 0x00007ffff74bae0e in start_thread (arg=0x7ffff290b700) at pthread_create.c:311
#8 0x00007ffff71ef9ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


The crash clearly occurs inside the AMD OpenCL library (libamdocl64.so). Disassembling around the faulting instruction gives:
0x00007ffff496a7ea <+26>: sub $0x68,%rsp
0x00007ffff496a7ee <+30>: mov 0x188(%rdx),%rax
=> 0x00007ffff496a7f5 <+37>: mov 0x10(%rax),%rbx

(gdb) p $rax
$1 = 0


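-> The pointer loaded from 0x188(%rdx) is NULL, so the next instruction faults when it reads 0x10(%rax). Something NULL is being passed into the OpenCL program/execution environment, very probably a pointer to a struct.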
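
To make the reading of those two instructions concrete, here is a minimal C sketch of the access pattern they suggest. Only the offsets 0x188 and 0x10 come from the disassembly; the struct names, field names, and padding are hypothetical and say nothing about AMD's real data layout. If %rdx still holds the function's third integer argument at that point, then under the System V AMD64 calling convention (this in %rdi, first explicit argument in %rsi, second in %rdx) it would be the gpu::Kernel const* parameter, but that is an inference, not something the trace proves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: only the offsets 0x188 and 0x10 are taken from the
 * disassembly. All names and the padding are invented for illustration. */
struct inner_desc {
    uint8_t  pad[0x10];
    uint64_t field_at_0x10;     /* read by: mov 0x10(%rax),%rbx */
};

struct outer_obj {
    uint8_t            pad[0x188];
    struct inner_desc *desc;    /* loaded by: mov 0x188(%rdx),%rax */
};

/* Check that the illustrative structs really reproduce the two offsets. */
static_assert(offsetof(struct outer_obj, desc) == 0x188, "outer offset");
static_assert(offsetof(struct inner_desc, field_at_0x10) == 0x10, "inner offset");

/* Mirrors the two loads with no check: if obj->desc is NULL, the second load
 * reads address 0x10 and the process receives SIGSEGV, as in the trace. */
uint64_t read_unchecked(const struct outer_obj *obj)
{
    return obj->desc->field_at_0x10;
}

/* Defensive variant: returns 0 and stores the value in *out on success,
 * or -1 if either pointer is NULL. */
int read_checked(const struct outer_obj *obj, uint64_t *out)
{
    if (obj == NULL || obj->desc == NULL)
        return -1;
    *out = obj->desc->field_at_0x10;
    return 0;
}
```

Calling read_unchecked() on an object whose desc member is NULL would reproduce the same kind of fault, while read_checked() reports the problem instead of crashing. Whether the driver could or should guard this particular load is not something the backtrace can tell us.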

The call stack on the MilkyWay side (thread 1) is as follows:
(gdb) info threads
Id Target Id Frame
* 3 Thread 0x7ffff290b700 (LWP 5586) "milkyway_separa" 0x00007ffff496aabd in gpu::VirtualGPU::submitKernelInternal(amd::NDRangeContainer const&, amd::Kernel const&, unsigned char const*, bool) () from /usr/lib/libamdocl64.so
2 Thread 0x7ffff7ff6700 (LWP 5585) "milkyway_separa" 0x00007ffff71c049d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
1 Thread 0x7ffff7fd5700 (LWP 5582) "milkyway_separa" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85


(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd5700 (LWP 5582))]

(gdb) bt
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x00007ffff49167c0 in amd::Semaphore::wait() () from /usr/lib/libamdocl64.so
#2 0x00007ffff4912b1f in amd::Monitor::wait() () from /usr/lib/libamdocl64.so
#3 0x00007ffff48ff750 in amd::Event::awaitCompletion() () from /usr/lib/libamdocl64.so
#4 0x00007ffff49001bb in amd::CommandQueue::finish() () from /usr/lib/libamdocl64.so
#5 0x00007ffff48d87c7 in clFinish () from /usr/lib/libamdocl64.so
#6 0x000000000044c419 in ?? ()
#7 0x000000000044c6ed in ?? ()
#8 0x000000000044cbda in integrateCL ()
#9 0x0000000000445c95 in evaluate ()
#10 0x00000000004437d6 in main ()

Since the symbols for the frames above integrateCL() and evaluate() (#6 and #7) are stripped, I can't pinpoint the cause of the crash.

Am I the only x86_64 Linux user to have this error?

I'm running Debian stable with an HD6900.
mikey

Joined: 8 May 09
Posts: 3319
Credit: 520,263,582
RAC: 20,155
Message 60631 - Posted: 20 Dec 2013, 12:11:42 UTC - in response to Message 60626.  

Hi,
Am I the only x86_64 Linux user to have this error?

I'm running debian stable with an HD6900.


My suggestion would be to post this in the News section; they talk a lot about the Separation runs there. Just make sure you pick the right thread, or you may get few responses.
christophe

Joined: 24 Dec 11
Posts: 2
Credit: 9,650,503
RAC: 0
Message 60638 - Posted: 21 Dec 2013, 6:20:29 UTC

Thanks.
I realized this is the same problem as the one reported in the thread "All Milkyway@Home 1.02 tasks ending in computation error on HD6950."
