Segfault on Linux, AMD Radeon, open source Mesa drivers
Author | Message |
---|---|
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
Hello. The app crashes shortly after initialization.

Kernel options related to VSYSCALL:

    CONFIG_GENERIC_TIME_VSYSCALL=y
    CONFIG_X86_VSYSCALL_EMULATION=y
    # CONFIG_LEGACY_VSYSCALL_EMULATE is not set
    CONFIG_LEGACY_VSYSCALL_NONE=y

A bit of clinfo:

    Number of platforms 1
    Platform Name Clover
    Platform Vendor Mesa
    Platform Version OpenCL 1.1 Mesa 18.2.3
    Platform Extensions cl_khr_icd
    Device Name Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.18.16-arch1-1-ARCH, LLVM 7.0.0)
    Device Version OpenCL 1.1 Mesa 18.2.3
    Driver Version 18.2.3
    Device OpenCL C Version OpenCL C 1.1
    ICD loader Vendor OCL Icd free software
    ICD loader Version 2.2.12
    ICD loader Profile OpenCL 2.2

App output:

    Reading preferences ended prematurely
    BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
    Setting process priority to 0 (13): Permission denied
    Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4 '
    Switching to Parameter File 'astronomy_parameters.txt'
    <number_WUs> 4 </number_WUs>
    <number_params_per_WU> 26 </number_params_per_WU>
    Using AVX path
    Found 1 platform
    Platform 0 information:
    Name: Clover
    Version: OpenCL 1.1 Mesa 18.2.3
    Vendor: Mesa
    Extensions: cl_khr_icd
    Profile: FULL_PROFILE
    Didn't find preferred platform
    Using device 0 on platform 0
    Found 2 CL devices
    Device 'Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.18.16-arch1-1-ARCH, LLVM 7.0.0)' (AMD:0x1002) (CL_DEVICE_TYPE_GPU)
    Board:
    Driver version: 18.2.3
    Version: OpenCL 1.1 Mesa 18.2.3
    Compute capability: 0.0
    Max compute units: 16
    Clock frequency: 1300 Mhz
    Global mem size: 3221225472
    Local mem size: 32768
    Max const buf size: 2147483647
    Double extension: cl_khr_fp64
    SIGSEGV: segmentation violation
    Exiting...

I will provide details on request. |
Joined: 24 Jan 11 Posts: 708 Credit: 543,124,445 RAC: 143,222 |
I'll post the same instructions I found for someone else using an RX580 card for OpenCL tasks in Linux at Einstein. Neither the Mesa drivers nor the AMD GPU or ROCm drivers install the OpenCL component necessary to run OpenCL apps. Other than pointing you at the instructions, I can provide no other help, since I know nothing about ATI/AMD and run Nvidia exclusively. https://einsteinathome.org/content/quick-guide-how-install-opencl-amd-gpus-linux-kubuntu-1804-and-similar-distro |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
Keith, I am confused by your reply. I already have OpenCL drivers, including the OpenCL header, and can run tasks from Amicable, Einstein, Primegrid and Collatz with no problem. The instructions talk about installing the OpenCL part of amdgpu. Do you think there is some file missing from the stuff that I have already installed? Even in that case, the app should not segfault, but print an error. I did try to install the amdgpu OpenCL part using my distribution's recommended way, but it crashes all OpenCL because it requires a deprecated version of libdrm. I did not try to install it directly, but I will, to see if there was a file missing. |
Joined: 24 Jan 11 Posts: 708 Credit: 543,124,445 RAC: 143,222 |
As I stated, I know nothing about ATI. All I was suggesting is that like Microsoft with Nvidia drivers, it is common for them to ship the latest drivers without OpenCL support in the driver package. I was thinking the same thing could be happening with your ATI card. From that post at Einstein and comments in that thread, it seems that discrete OpenCL support has to be installed independently from the normal ATI driver package. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
Thanks. I definitely have OpenCL support installed. I tried to install the proprietary OpenCL component but ran into a dependency issue. I could continue, but instead I took a different path. I recompiled the milkyway separation app from source with debugging symbols enabled and static linking disabled. First I just plugged my custom app into boinc (via app_info.xml), but that crashed the same way! Then I pulled a random separation WU from boinc and tried to run the app in a debugger. It unsurprisingly crashed again, but this time I got the backtrace. It appears the crash is in the `gelf_getshdr` function from libelf.so, called by clBuildProgram from the libOpenCL.so library. This means either A) the opencl driver/compiler has a bug and crashes trying to load the code, or B) the milkyway opencl code and/or cl flags are problematic.

Trace follows:

    Thread 1 "milkyway_separa" received signal SIGSEGV, Segmentation fault.
    0x00007ffff7fb47b4 in gelf_getshdr () from /usr/lib/libelf.so.1
    (gdb) bt
    #0 0x00007ffff7fb47b4 in gelf_getshdr () from /usr/lib/libelf.so.1
    #1 0x00007ffff7b3e8ab in ?? () from /usr/lib/libMesaOpenCL.so.1
    #2 0x00007ffff7b39ed4 in ?? () from /usr/lib/libMesaOpenCL.so.1
    #3 0x00007ffff7ae9c68 in ?? () from /usr/lib/libMesaOpenCL.so.1
    #4 0x00007ffff7acce1b in ?? () from /usr/lib/libMesaOpenCL.so.1
    #5 0x00007ffff7de5d9b in clBuildProgram () from /usr/lib/libOpenCL.so.1
    #6 0x0000555555668f68 in mwBuildProgram (program=0x55555616dbe8, device=0x555555831078, options=0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 ") at /home/tomas/downloads/milkywayathome_client/milkyway/src/milkyway_cl_program.c:99
    #7 0x00005555556693af in mwCreateProgramFromSrc (ci=0x7fffffffd3a0, srcCount=1, src=0x7fffffffd2c0, lengths=0x7fffffffd2c8, compileDefs=0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 ") at /home/tomas/downloads/milkywayathome_client/milkyway/src/milkyway_cl_program.c:223
    #8 0x00005555555e90f0 in setupSeparationCL (ci=0x7fffffffd3a0, ap=0x7fffffffdc80, ias=0x5555557e3280, clr=0x7fffffffdc20) at /home/tomas/downloads/milkywayathome_client/separation/src/setup_cl.c:600
    #9 0x00005555555deca6 in evaluate (results=0x5555557df930, ap=0x7fffffffdc80, ias=0x5555557e3280, streams=0x7fffffffdbc0, sc=0x5555557e0180, likelihoodToText=0, starPointsFile=0x5555557d52b0 "stars.txt", clr=0x7fffffffdc20, do_separation=0, ignoreCheckpoint=0x7fffffffdb9c, separation_outfile=0x0) at /home/tomas/downloads/milkywayathome_client/separation/src/evaluation.c:249
    #10 0x00005555555de2c4 in worker (sf=0x7fffffffdd90) at /home/tomas/downloads/milkywayathome_client/separation/src/separation_main.c:688
    #11 0x00005555555de572 in main (argc=3, argv=0x7fffffffe038) at /home/tomas/downloads/milkywayathome_client/separation/src/separation_main.c:784
    (gdb) frame 8
    (gdb) p compileFlags
    $6 = 0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 "

|
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
I had the normal stable mesa and opencl-mesa release. Adding a backtrace with a development debug version of Mesa (OpenCL 1.1 Mesa 18.3.0-devel (git-9007c0ed26)) installed. This is almost definitely an error in the driver, and may even be a different issue:

    Thread 24 "milkyway_s:sh0" received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fff937fe700 (LWP 10873)]
    0x00007ffff7fa07b4 in gelf_getshdr () from /usr/lib/libelf.so.1
    (gdb) bt
    #0 0x00007ffff7fa07b4 in gelf_getshdr () from /usr/lib/libelf.so.1
    #1 0x00007fffefae3d87 in ac_elf_read (elf_data=<optimized out>, elf_size=<optimized out>, binary=binary@entry=0x55555611e118) at common/ac_binary.c:135
    #2 0x00007fffefaebd97 in ac_compile_module_to_binary (p=p@entry=0x555555cb92d0, module=module@entry=0x7fff84005de0, binary=binary@entry=0x55555611e118) at /usr/include/llvm/ADT/StringRef.h:138
    #3 0x00007fffefaab48d in si_llvm_compile (M=M@entry=0x7fff84005de0, binary=binary@entry=0x55555611e118, compiler=compiler@entry=0x555555cdea68, debug=debug@entry=0x55555611e018, less_optimized=less_optimized@entry=false) at si_shader_tgsi_setup.c:103
    #4 0x00007fffefaa1137 in si_compile_llvm (sscreen=sscreen@entry=0x555555cde350, binary=binary@entry=0x55555611e118, conf=conf@entry=0x55555611e168, compiler=compiler@entry=0x555555cdea68, mod=0x7fff84005de0, debug=debug@entry=0x55555611e018, processor=5, name=0x7fffefb312d1 "Compute Shader", less_optimized=false) at si_shader.c:5599
    #5 0x00007fffefaa2937 in si_compile_tgsi_shader (sscreen=0x555555cde350, compiler=0x555555cdea68, shader=0x55555611e058, debug=0x55555611e018) at si_shader.c:6734
    #6 0x00007fffefaa3755 in si_shader_create (sscreen=sscreen@entry=0x555555cde350, compiler=compiler@entry=0x555555cdea68, shader=shader@entry=0x55555611e058, debug=debug@entry=0x55555611e018) at si_shader.c:8045
    #7 0x00007fffefa7d125 in si_create_compute_state_async (job=job@entry=0x55555611dff0, thread_index=thread_index@entry=0) at si_compute.c:152
    #8 0x00007fffefa43c79 in util_queue_thread_func (input=input@entry=0x555555cdd600) at u_queue.c:286
    #9 0x00007fffefa43937 in impl_thrd_routine (p=<optimized out>) at ../../include/c11/threads_posix.h:87
    #10 0x00007ffff7dbba9d in start_thread () from /usr/lib/libpthread.so.0
    #11 0x00007ffff7cebb23 in clone () from /usr/lib/libc.so.6
    (gdb) frame 1
    #1 0x00007fffefae3d87 in ac_elf_read (elf_data=<optimized out>, elf_size=<optimized out>, binary=binary@entry=0x55555611e118) at common/ac_binary.c:135
    135 if (gelf_getshdr(section, &section_header) != &section_header) {
    (gdb) p section
    $1 = (Elf_Scn *) 0x7fff84043da8
    (gdb) p section_header
    $2 = {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 68719476736, sh_offset = 140735667996976, sh_size = 140737323335464, sh_link = 0, sh_info = 0, sh_addralign = 93825000247984, sh_entsize = 0}

|
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
I have now created/copied a simple program that just compiles, loads and executes the OpenCL kernel with the same flags that milkyway used. And that did not crash! |
Joined: 24 Jan 11 Posts: 708 Credit: 543,124,445 RAC: 143,222 |
You definitely have the skill set to diagnose errors. Seems you are going to need to report the errors to Mesa and AMD. Similar to what the Einstein users are having to do with Nvidia Turing cards. Either the applications are performing a function that crashes the driver, or the drivers are having issues performing a valid function that only the MW and Einstein apps are exposing. It is doubtful that either project has the developer resources to quickly effect changes to the failing applications. The solution will probably have to come from the driver developers. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
> Similar to what the Einstein users are having to do with Nvidia Turing cards.

On my system, Einstein apps work just fine. I do not have a Turing card, however. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
I think I fixed it. Will submit a PR once I finalize it. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
> I think I fixed it. Will submit a PR once I finalize it.

Done in https://github.com/Milkyway-at-home/milkywayathome_client/pull/62! The problem was that the size of the inline kernel source was declared with a different type than the one used in its definition. This resulted in a size on the order of terabytes, which crashed the compiler. I also had to make these changes to build non-static, debug-enabled binaries: https://github.com/gridcoin-community/milkywayathome_client/pull/1. I did not submit that one to MW. |
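To illustrate how a type mismatch like the one described in this post can turn a small length into a terabyte-sized one, here is a minimal C sketch. It is not the Milkyway@home code (the actual fix is in the pull request linked above); `struct blob`, `misread_length` and `length_is_plausible` are hypothetical names made up for this illustration. In a real program, declaring a variable with one type and defining it with another is undefined behaviour, so the sketch simulates the effect with `memcpy` instead. It assumes a little-endian machine and a struct layout without padding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory layout: a 32-bit length followed directly by
 * unrelated data, roughly what can happen when a 32-bit variable is
 * defined in one file and other objects are placed right after it. */
struct blob {
    uint32_t length;     /* the real, 32-bit value */
    uint32_t neighbour;  /* unrelated bytes that happen to sit next to it */
};

/* Reads the length the way code would if a mismatched declaration had told
 * it that the variable is 64 bits wide: it picks up the 4 real bytes plus
 * 4 bytes that belong to the neighbour. */
uint64_t misread_length(uint32_t real_length, uint32_t neighbour)
{
    struct blob b = { real_length, neighbour };
    uint64_t wide;

    /* The sketch relies on the struct being exactly 8 bytes (no padding). */
    assert(sizeof b == sizeof wide);
    memcpy(&wide, &b, sizeof wide);
    return wide;
}

/* Hypothetical plausibility check a caller could apply before handing a
 * source length to a compiler or driver. The 64 MiB limit is arbitrary. */
int length_is_plausible(uint64_t length)
{
    const uint64_t limit = 64u * 1024u * 1024u;
    return length <= limit;
}
```

On a little-endian machine, `misread_length(12345u, 0x300u)` returns `0x0000030000003039`, which is about 3.3 × 10^12: the low 32 bits still hold the real length, but the neighbouring bytes end up in the high half and push the value into the terabyte range. `length_is_plausible` rejects that value while accepting the real length of 12345.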
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
That allowed me to run Separation jobs on Polaris. NBody fails to load/compile on both Polaris and Tahiti, and Separation fails with "Failed to calculate integral 0" on Tahiti. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
I got it to work. At least Separation. N-Body still does not work because it is stupid. |
Joined: 24 Jan 11 Posts: 708 Credit: 543,124,445 RAC: 143,222 |
Thank you Tomas for submitting changes to the MW codebase to improve current support under Linux. We really appreciate the volunteer developers since realistically, they are the ones that have the time compared to project scientists. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
N-Body wasn't even supposed to work. That's why I could not get it to work. |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
I doubt this will be useful, but I put my custom milkyway separation binary here http://www.tbrada.eu/up/363f27ebac1a27b6715609c245555881 and app info here http://www.tbrada.eu/up/6826f16895d61bf7e069dcd4acf33c21.xml. |
Joined: 28 May 10 Posts: 5 Credit: 264,702,311 RAC: 0 |
The application you provided works on my old HD 7970 GPU with the Mesa OpenCL libs.

    Found 1 CL device
    Device 'AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.19.6-gentoo.s1, LLVM 6.0.1)' (AMD:0x1002) (CL_DEVICE_TYPE_GPU)
    Board:
    Driver version: 18.3.0-rc5
    Version: OpenCL 1.1 Mesa 18.3.0-rc5

WUs are correctly validated and the execution time for a single WU is around 58 seconds, so I think there is no big penalty with the Mesa OpenCL libraries. Thanks for your work. Maybe Jack or Eric are interested in updating the official application. Hope they take a look at the forums. ;-) |
Joined: 18 Oct 12 Posts: 2 Credit: 10,078,939 RAC: 0 |
I hope so too. I'm running some GPU tasks after a very long time :) Thanks |
Joined: 7 Dec 15 Posts: 20 Credit: 8,805,665 RAC: 0 |
> Maybe Jack or Eric are interested in updating the official application.

The code has been merged into the official repository, but I do not know whether it has been deployed to the BOINC server. |
©2024 Astroinformatics Group