Segfault on Linux, AMD Radeon, open source Mesa drivers

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67860 - Posted: 1 Nov 2018, 22:57:34 UTC

Hello. The app crashes shortly after initialization.

Kernel options related to VSYSCALL:

CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_LEGACY_VSYSCALL_EMULATE is not set
CONFIG_LEGACY_VSYSCALL_NONE=y


A bit of clinfo:

Number of platforms        1
Platform Name              Clover
Platform Vendor            Mesa
Platform Version           OpenCL 1.1 Mesa 18.2.3
Platform Extensions        cl_khr_icd
Device Name                Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.18.16-arch1-1-ARCH, LLVM 7.0.0)
Device Version             OpenCL 1.1 Mesa 18.2.3
Driver Version             18.2.3
Device OpenCL C Version    OpenCL C 1.1
ICD loader Vendor          OCL Icd free software
ICD loader Version         2.2.12
ICD loader Profile         OpenCL 2.2


Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4 '
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
Using AVX path
Found 1 platform
Platform 0 information:
  Name: Clover
  Version: OpenCL 1.1 Mesa 18.2.3
  Vendor: Mesa
  Extensions: cl_khr_icd
  Profile: FULL_PROFILE
Didn't find preferred platform
Using device 0 on platform 0
Found 2 CL devices
Device 'Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.18.16-arch1-1-ARCH, LLVM 7.0.0)' (AMD:0x1002) (CL_DEVICE_TYPE_GPU)
Board:
Driver version: 18.2.3
Version: OpenCL 1.1 Mesa 18.2.3
Compute capability: 0.0
Max compute units: 16
Clock frequency: 1300 Mhz
Global mem size: 3221225472
Local mem size: 32768
Max const buf size: 2147483647
Double extension: cl_khr_fp64
SIGSEGV: segmentation violation
Exiting...


I will provide details on request.
____________

Keith Myers
Joined: 24 Jan 11
Posts: 159
Credit: 104,120,262
RAC: 25,396

Message 67861 - Posted: 1 Nov 2018, 23:32:09 UTC

I'll post the same instructions I found for someone else using an RX 580 card for OpenCL tasks in Linux at Einstein. Neither the Mesa drivers nor the AMDGPU or ROCm drivers install the OpenCL component necessary to run OpenCL apps.


Other than pointing you at the instructions, I can provide no other help: I run Nvidia exclusively and know nothing about ATI/AMD.

https://einsteinathome.org/content/quick-guide-how-install-opencl-amd-gpus-linux-kubuntu-1804-and-similar-distro
____________

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67862 - Posted: 2 Nov 2018, 8:16:37 UTC

Keith, I am confused by your reply.
I already have OpenCL drivers, including the OpenCL headers, and can run tasks from Amicable, Einstein, Primegrid and Collatz with no problem. The instructions talk about installing the OpenCL part of amdgpu. Do you think some file is missing from what I have already installed? Even in that case, the app should not segfault; it should print an error.
I did try to install the amdgpu OpenCL part the way my distribution recommends, but it breaks all OpenCL because it requires a deprecated version of libdrm. I have not tried to install it directly, but I will, to see whether a file was missing.

Keith Myers
Joined: 24 Jan 11
Posts: 159
Credit: 104,120,262
RAC: 25,396

Message 67863 - Posted: 2 Nov 2018, 16:29:26 UTC - in response to Message 67862.

As I stated, I know nothing about ATI. All I was suggesting is that, as happens with the Nvidia drivers Microsoft distributes, it is common for the latest drivers to ship without OpenCL support in the driver package.

I was thinking the same thing could be happening with your ATI card. From that post at Einstein and comments in that thread, it seems that discrete OpenCL support has to be installed independently from the normal ATI driver package.
____________

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67865 - Posted: 2 Nov 2018, 18:50:23 UTC

Thanks. I definitely have OpenCL support installed. I tried to install the proprietary OpenCL component but ran into a dependency issue. I could have continued, but instead I took a different path.

I recompiled the Milkyway Separation app from source with debugging symbols enabled and static linking disabled. First I just plugged my custom app into BOINC (via app_info.xml), but that crashed the same way!

Then I pulled a random Separation WU from BOINC and ran the app in a debugger. Unsurprisingly it crashed again, but this time I got a backtrace. The crash is in the `gelf_getshdr` function from libelf.so, reached through clBuildProgram in libOpenCL.so. This means one of two things:

A) the OpenCL driver/compiler has a bug and crashes while trying to load the code

B) the Milkyway OpenCL code and/or CL flags are problematic

Trace follows:

Thread 1 "milkyway_separa" received signal SIGSEGV, Segmentation fault.
0x00007ffff7fb47b4 in gelf_getshdr () from /usr/lib/libelf.so.1
(gdb) bt
#0 0x00007ffff7fb47b4 in gelf_getshdr () from /usr/lib/libelf.so.1
#1 0x00007ffff7b3e8ab in ?? () from /usr/lib/libMesaOpenCL.so.1
#2 0x00007ffff7b39ed4 in ?? () from /usr/lib/libMesaOpenCL.so.1
#3 0x00007ffff7ae9c68 in ?? () from /usr/lib/libMesaOpenCL.so.1
#4 0x00007ffff7acce1b in ?? () from /usr/lib/libMesaOpenCL.so.1
#5 0x00007ffff7de5d9b in clBuildProgram () from /usr/lib/libOpenCL.so.1
#6 0x0000555555668f68 in mwBuildProgram (program=0x55555616dbe8, device=0x555555831078, options=0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 ") at /home/tomas/downloads/milkywayathome_client/milkyway/src/milkyway_cl_program.c:99
#7 0x00005555556693af in mwCreateProgramFromSrc (ci=0x7fffffffd3a0, srcCount=1, src=0x7fffffffd2c0, lengths=0x7fffffffd2c8, compileDefs=0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 ") at /home/tomas/downloads/milkywayathome_client/milkyway/src/milkyway_cl_program.c:223
#8 0x00005555555e90f0 in setupSeparationCL (ci=0x7fffffffd3a0, ap=0x7fffffffdc80, ias=0x5555557e3280, clr=0x7fffffffdc20) at /home/tomas/downloads/milkywayathome_client/separation/src/setup_cl.c:600
#9 0x00005555555deca6 in evaluate (results=0x5555557df930, ap=0x7fffffffdc80, ias=0x5555557e3280, streams=0x7fffffffdbc0, sc=0x5555557e0180, likelihoodToText=0, starPointsFile=0x5555557d52b0 "stars.txt", clr=0x7fffffffdc20, do_separation=0, ignoreCheckpoint=0x7fffffffdb9c, separation_outfile=0x0) at /home/tomas/downloads/milkywayathome_client/separation/src/evaluation.c:249
#10 0x00005555555de2c4 in worker (sf=0x7fffffffdd90) at /home/tomas/downloads/milkywayathome_client/separation/src/separation_main.c:688
#11 0x00005555555de572 in main (argc=3, argv=0x7fffffffe038) at /home/tomas/downloads/milkywayathome_client/separation/src/separation_main.c:784
(gdb) frame 8
(gdb) p compileFlags
$6 = 0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 "

____________

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67866 - Posted: 2 Nov 2018, 19:08:34 UTC

The trace above was taken with the normal stable mesa and opencl-mesa releases.

Here is a backtrace with a development debug build of Mesa (OpenCL 1.1 Mesa 18.3.0-devel (git-9007c0ed26)) installed. This is almost definitely an error in the driver, and it may even be a different issue:

Thread 24 "milkyway_s:sh0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff937fe700 (LWP 10873)]
0x00007ffff7fa07b4 in gelf_getshdr () from /usr/lib/libelf.so.1
(gdb) bt
#0 0x00007ffff7fa07b4 in gelf_getshdr () from /usr/lib/libelf.so.1
#1 0x00007fffefae3d87 in ac_elf_read (elf_data=<optimized out>, elf_size=<optimized out>, binary=binary@entry=0x55555611e118) at common/ac_binary.c:135
#2 0x00007fffefaebd97 in ac_compile_module_to_binary (p=p@entry=0x555555cb92d0, module=module@entry=0x7fff84005de0, binary=binary@entry=0x55555611e118) at /usr/include/llvm/ADT/StringRef.h:138
#3 0x00007fffefaab48d in si_llvm_compile (M=M@entry=0x7fff84005de0, binary=binary@entry=0x55555611e118, compiler=compiler@entry=0x555555cdea68, debug=debug@entry=0x55555611e018, less_optimized=less_optimized@entry=false) at si_shader_tgsi_setup.c:103
#4 0x00007fffefaa1137 in si_compile_llvm (sscreen=sscreen@entry=0x555555cde350, binary=binary@entry=0x55555611e118, conf=conf@entry=0x55555611e168, compiler=compiler@entry=0x555555cdea68, mod=0x7fff84005de0, debug=debug@entry=0x55555611e018, processor=5, name=0x7fffefb312d1 "Compute Shader", less_optimized=false) at si_shader.c:5599
#5 0x00007fffefaa2937 in si_compile_tgsi_shader (sscreen=0x555555cde350, compiler=0x555555cdea68, shader=0x55555611e058, debug=0x55555611e018) at si_shader.c:6734
#6 0x00007fffefaa3755 in si_shader_create (sscreen=sscreen@entry=0x555555cde350, compiler=compiler@entry=0x555555cdea68, shader=shader@entry=0x55555611e058, debug=debug@entry=0x55555611e018) at si_shader.c:8045
#7 0x00007fffefa7d125 in si_create_compute_state_async (job=job@entry=0x55555611dff0, thread_index=thread_index@entry=0) at si_compute.c:152
#8 0x00007fffefa43c79 in util_queue_thread_func (input=input@entry=0x555555cdd600) at u_queue.c:286
#9 0x00007fffefa43937 in impl_thrd_routine (p=<optimized out>) at ../../include/c11/threads_posix.h:87
#10 0x00007ffff7dbba9d in start_thread () from /usr/lib/libpthread.so.0
#11 0x00007ffff7cebb23 in clone () from /usr/lib/libc.so.6
(gdb) frame 1
#1 0x00007fffefae3d87 in ac_elf_read (elf_data=<optimized out>, elf_size=<optimized out>, binary=binary@entry=0x55555611e118) at common/ac_binary.c:135
135 if (gelf_getshdr(section, &section_header) != &section_header) {
(gdb) p section
$1 = (Elf_Scn *) 0x7fff84043da8
(gdb) p section_header
$2 = {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 68719476736, sh_offset = 140735667996976, sh_size = 140737323335464, sh_link = 0, sh_info = 0, sh_addralign = 93825000247984, sh_entsize = 0}

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67867 - Posted: 2 Nov 2018, 20:07:21 UTC - in response to Message 67866.

I have now created/copied a simple program that just compiles, loads and executes the OpenCL kernel with the same flags that Milkyway uses. And that did not crash!

Keith Myers
Joined: 24 Jan 11
Posts: 159
Credit: 104,120,262
RAC: 25,396

Message 67868 - Posted: 3 Nov 2018, 1:18:33 UTC - in response to Message 67867.

You definitely have the skill set to diagnose errors. It seems you are going to need to report the errors to Mesa and AMD.

Similar to what the Einstein users are having to do with Nvidia Turing cards.

Either the applications are performing a function that crashes the driver, or the drivers have trouble performing a valid function that only the MW and Einstein apps exercise.

It is doubtful that either project has the developer resources to change the failing applications quickly. The solution will probably have to come from the driver developers.
____________

TomasBrod
Send message
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67869 - Posted: 3 Nov 2018, 7:51:34 UTC - in response to Message 67868.

Similar to what the Einstein users are having to do with Nvidia Turing cards.

Either the applications are performing a function that crashes the driver, or the drivers have trouble performing a valid function that only the MW and Einstein apps exercise.


On my system, Einstein apps work just fine. I do not have a Turing card, however.

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67871 - Posted: 3 Nov 2018, 11:35:04 UTC

I think I fixed it. I will submit a PR once I finalize it.

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67872 - Posted: 3 Nov 2018, 11:59:21 UTC - in response to Message 67871.

I think I fixed it. I will submit a PR once I finalize it.


Done in https://github.com/Milkyway-at-home/milkywayathome_client/pull/62!

The problem was that the size of the inline kernel source was declared with a different type than the one used in its definition. As a result the size came out on the order of terabytes, which crashed the compiler.

I also had to make these changes to build non-static, debug-enabled binaries: https://github.com/gridcoin-community/milkywayathome_client/pull/1. I did not submit that one to MW.
____________

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67873 - Posted: 3 Nov 2018, 12:35:17 UTC - in response to Message 67872.

That allowed me to run Separation jobs on Polaris. N-Body fails to load/compile on both Polaris and Tahiti, and Separation fails with "Failed to calculate integral 0" on Tahiti.

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67874 - Posted: 3 Nov 2018, 18:55:39 UTC - in response to Message 67873.

I got it to work. At least Separation. N-Body still does not work because it is stupid.

Keith Myers
Joined: 24 Jan 11
Posts: 159
Credit: 104,120,262
RAC: 25,396

Message 67875 - Posted: 3 Nov 2018, 20:35:57 UTC

Thank you Tomas for submitting changes to the MW codebase to improve current support under Linux. We really appreciate the volunteer developers since realistically, they are the ones that have the time compared to project scientists.
____________

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67879 - Posted: 6 Nov 2018, 18:37:34 UTC

N-Body wasn't even supposed to work. That's why I could not get it to work.

TomasBrod
Joined: 7 Dec 15
Posts: 17
Credit: 2,367,530
RAC: 46,703

Message 67880 - Posted: 6 Nov 2018, 19:53:42 UTC

I doubt this will be useful, but I put my custom milkyway separation binary here http://www.tbrada.eu/up/363f27ebac1a27b6715609c245555881 and app info here http://www.tbrada.eu/up/6826f16895d61bf7e069dcd4acf33c21.xml.

xdarma
Joined: 28 May 10
Posts: 5
Credit: 250,577,336
RAC: 28,216

Message 67925 - Posted: 4 Dec 2018, 13:50:27 UTC

The application you provided works on my old HD 7970 GPU with the Mesa OpenCL libs.

Found 1 CL device
Device 'AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.19.6-gentoo.s1, LLVM 6.0.1)' (AMD:0x1002) (CL_DEVICE_TYPE_GPU)
Board:
Driver version: 18.3.0-rc5
Version: OpenCL 1.1 Mesa 18.3.0-rc5

WUs are validated correctly and the execution time for a single WU is around 58 seconds, so I think there is no big penalty with the Mesa OpenCL libraries.

Thanks for your work.

Maybe Jack or Eric are interested in updating the official application.
Hope they take a look at the forums. ;-)



Copyright © 2018 AstroInformatics Group