Welcome to MilkyWay@home

Segfault on Linux, AMD Radeon, open source Mesa drivers

Message boards : Number crunching : Segfault on Linux, AMD Radeon, open source Mesa drivers

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67860 - Posted: 1 Nov 2018, 22:57:34 UTC

Hello. The app crashes shortly after initialization.

Kernel options related to VSYSCALL:

CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_LEGACY_VSYSCALL_EMULATE is not set
CONFIG_LEGACY_VSYSCALL_NONE=y


A bit of clinfo:

Number of platforms                               1
  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 18.2.3
  Platform Extensions                             cl_khr_icd
  Device Name                                     Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.18.16-arch1-1-ARCH, LLVM 7.0.0)
  Device Version                                  OpenCL 1.1 Mesa 18.2.3
  Driver Version                                  18.2.3
  Device OpenCL C Version                         OpenCL C 1.1 
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2


Reading preferences ended prematurely
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4
' 
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
Using AVX path
Found 1 platform
Platform 0 information:
  Name:       Clover
  Version:    OpenCL 1.1 Mesa 18.2.3
  Vendor:     Mesa
  Extensions: cl_khr_icd
  Profile:    FULL_PROFILE
Didn't find preferred platform
Using device 0 on platform 0
Found 2 CL devices
Device 'Radeon RX 560 Series (POLARIS11, DRM 3.26.0, 4.18.16-arch1-1-ARCH, LLVM 7.0.0)' (AMD:0x1002) (CL_DEVICE_TYPE_GPU)
Board: 
Driver version:      18.2.3
Version:             OpenCL 1.1 Mesa 18.2.3
Compute capability:  0.0
Max compute units:   16
Clock frequency:     1300 Mhz
Global mem size:     3221225472
Local mem size:      32768
Max const buf size:  2147483647
Double extension:    cl_khr_fp64
SIGSEGV: segmentation violation

Exiting...


I will provide details on request.

Keith Myers

Joined: 24 Jan 11
Posts: 708
Credit: 542,930,296
RAC: 151,922
Message 67861 - Posted: 1 Nov 2018, 23:32:09 UTC

I'll post the same instructions I found for someone else using an RX 580 card for OpenCL tasks on Linux at Einstein. Neither the Mesa drivers nor the AMDGPU or ROCm drivers install the OpenCL component needed to run OpenCL apps.

Other than pointing you at the instructions, I can provide no further help: I know nothing about ATI/AMD, since I run Nvidia exclusively.

https://einsteinathome.org/content/quick-guide-how-install-opencl-amd-gpus-linux-kubuntu-1804-and-similar-distro

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67862 - Posted: 2 Nov 2018, 8:16:37 UTC

Keith, I am confused by your reply.
I already have OpenCL drivers installed, including the OpenCL header, and can run tasks from Amicable, Einstein, PrimeGrid and Collatz without problems. The instructions talk about installing the OpenCL part of amdgpu. Do you think some file is missing from what I have already installed? Even in that case, the app should print an error rather than segfault.
I did try to install the amdgpu OpenCL part the way my distribution recommends, but that breaks all OpenCL because it requires a deprecated version of libdrm. I have not tried installing it directly, but I will, to see whether a file was missing.

Keith Myers

Joined: 24 Jan 11
Posts: 708
Credit: 542,930,296
RAC: 151,922
Message 67863 - Posted: 2 Nov 2018, 16:29:26 UTC - in response to Message 67862.  

As I stated, I know nothing about ATI. All I was suggesting is that, as with Microsoft and Nvidia drivers, it is common for the latest drivers to ship without OpenCL support in the driver package.

I was thinking the same thing could be happening with your ATI card. From that post at Einstein and comments in that thread, it seems that discrete OpenCL support has to be installed independently from the normal ATI driver package.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67865 - Posted: 2 Nov 2018, 18:50:23 UTC

Thanks. I definitely have OpenCL support installed. I tried to install the proprietary OpenCL component but ran into a dependency issue. I could have continued, but I took a different path instead.

I recompiled the MilkyWay separation app from source with debugging symbols enabled and static linking disabled. First I just plugged my custom app into BOINC (via app_info.xml), but that crashed the same way!

Then I pulled a random separation WU from BOINC and ran the app in a debugger. Unsurprisingly it crashed again, but this time I got a backtrace. The crash appears to be in the `gelf_getshdr` function from libelf.so, reached through clBuildProgram in libOpenCL.so. This means that either

A) the OpenCL driver/compiler has a bug and crashes while trying to load the code, or

B) the MilkyWay OpenCL code and/or CL flags are problematic.

Trace follows:
Thread 1 "milkyway_separa" received signal SIGSEGV, Segmentation fault.
0x00007ffff7fb47b4 in gelf_getshdr () from /usr/lib/libelf.so.1
(gdb) bt
#0  0x00007ffff7fb47b4 in gelf_getshdr () from /usr/lib/libelf.so.1
#1  0x00007ffff7b3e8ab in ?? () from /usr/lib/libMesaOpenCL.so.1
#2  0x00007ffff7b39ed4 in ?? () from /usr/lib/libMesaOpenCL.so.1
#3  0x00007ffff7ae9c68 in ?? () from /usr/lib/libMesaOpenCL.so.1
#4  0x00007ffff7acce1b in ?? () from /usr/lib/libMesaOpenCL.so.1
#5  0x00007ffff7de5d9b in clBuildProgram () from /usr/lib/libOpenCL.so.1
#6  0x0000555555668f68 in mwBuildProgram (program=0x55555616dbe8, device=0x555555831078, 
    options=0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 ") at /home/tomas/downloads/milkywayathome_client/milkyway/src/milkyway_cl_program.c:99
#7  0x00005555556693af in mwCreateProgramFromSrc (ci=0x7fffffffd3a0, srcCount=1, src=0x7fffffffd2c0, 
    lengths=0x7fffffffd2c8, 
    compileDefs=0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 ") at /home/tomas/downloads/milkywayathome_client/milkyway/src/milkyway_cl_program.c:223
#8  0x00005555555e90f0 in setupSeparationCL (ci=0x7fffffffd3a0, ap=0x7fffffffdc80, ias=0x5555557e3280, 
    clr=0x7fffffffdc20) at /home/tomas/downloads/milkywayathome_client/separation/src/setup_cl.c:600
#9  0x00005555555deca6 in evaluate (results=0x5555557df930, ap=0x7fffffffdc80, ias=0x5555557e3280, 
    streams=0x7fffffffdbc0, sc=0x5555557e0180, likelihoodToText=0, starPointsFile=0x5555557d52b0 "stars.txt", 
    clr=0x7fffffffdc20, do_separation=0, ignoreCheckpoint=0x7fffffffdb9c, separation_outfile=0x0)
    at /home/tomas/downloads/milkywayathome_client/separation/src/evaluation.c:249
#10 0x00005555555de2c4 in worker (sf=0x7fffffffdd90)
    at /home/tomas/downloads/milkywayathome_client/separation/src/separation_main.c:688
#11 0x00005555555de572 in main (argc=3, argv=0x7fffffffe038)
    at /home/tomas/downloads/milkywayathome_client/separation/src/separation_main.c:784
(gdb) frame 8
(gdb) p compileFlags
$6 = 0x555556178310 "-D DOUBLEPREC=1 -cl-mad-enable -cl-no-signed-zeros -cl-finite-math-only -D BACKGROUND_PROFILE=1 -D AUX_BG_PROFILE=0 -D NSTREAM=4 -D CONVOLVE=120 -D R0=12 -D SUN_R0=8.5 -D Q_INV_SQR=3.69822485207101 -D BG_A=0 -D BG_B=0 -D BG_C=0 -D BACKGROUND_WEIGHT=0.99 -D THICK_DISK_WEIGHT=0.01 -D INNERPOWER=1 -D OUTERPOWER=1 -D ALPHA_DELTA_3=3 "


Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67866 - Posted: 2 Nov 2018, 19:08:34 UTC

I had the normal stable mesa and opencl-mesa releases installed.

Here is a backtrace with a development debug build of Mesa (OpenCL 1.1 Mesa 18.3.0-devel (git-9007c0ed26)) installed. This is almost definitely an error in the driver, and it may even be a different issue:

Thread 24 "milkyway_s:sh0" received signal SIGSEGV, Segmentation fault.                                             
[Switching to Thread 0x7fff937fe700 (LWP 10873)]                                                                    
0x00007ffff7fa07b4 in gelf_getshdr () from /usr/lib/libelf.so.1                                                     
(gdb) bt                                                                                                            
#0  0x00007ffff7fa07b4 in gelf_getshdr () from /usr/lib/libelf.so.1                                                 
#1  0x00007fffefae3d87 in ac_elf_read (elf_data=<optimized out>, elf_size=<optimized out>,                          
    binary=binary@entry=0x55555611e118) at common/ac_binary.c:135                                                   
#2  0x00007fffefaebd97 in ac_compile_module_to_binary (p=p@entry=0x555555cb92d0,                                    
    module=module@entry=0x7fff84005de0, binary=binary@entry=0x55555611e118)                                         
    at /usr/include/llvm/ADT/StringRef.h:138                                                                        
#3  0x00007fffefaab48d in si_llvm_compile (M=M@entry=0x7fff84005de0, binary=binary@entry=0x55555611e118,            
    compiler=compiler@entry=0x555555cdea68, debug=debug@entry=0x55555611e018,                                       
    less_optimized=less_optimized@entry=false) at si_shader_tgsi_setup.c:103                                        
#4  0x00007fffefaa1137 in si_compile_llvm (sscreen=sscreen@entry=0x555555cde350,         
    binary=binary@entry=0x55555611e118, conf=conf@entry=0x55555611e168, compiler=compiler@entry=0x555555cdea68,     
    mod=0x7fff84005de0, debug=debug@entry=0x55555611e018, processor=5, name=0x7fffefb312d1 "Compute Shader",        
    less_optimized=false) at si_shader.c:5599                                                                       
#5  0x00007fffefaa2937 in si_compile_tgsi_shader (sscreen=0x555555cde350, compiler=0x555555cdea68,            
    shader=0x55555611e058, debug=0x55555611e018) at si_shader.c:6734                                                
#6  0x00007fffefaa3755 in si_shader_create (sscreen=sscreen@entry=0x555555cde350,                                
    compiler=compiler@entry=0x555555cdea68, shader=shader@entry=0x55555611e058, debug=debug@entry=0x55555611e018)   
    at si_shader.c:8045                                                                        
#7  0x00007fffefa7d125 in si_create_compute_state_async (job=job@entry=0x55555611dff0,                      
    thread_index=thread_index@entry=0) at si_compute.c:152                                                         
#8  0x00007fffefa43c79 in util_queue_thread_func (input=input@entry=0x555555cdd600) at u_queue.c:286                
#9  0x00007fffefa43937 in impl_thrd_routine (p=<optimized out>) at ../../include/c11/threads_posix.h:87             
#10 0x00007ffff7dbba9d in start_thread () from /usr/lib/libpthread.so.0                                        
#11 0x00007ffff7cebb23 in clone () from /usr/lib/libc.so.6
(gdb) frame 1
#1  0x00007fffefae3d87 in ac_elf_read (elf_data=<optimized out>, elf_size=<optimized out>,
    binary=binary@entry=0x55555611e118) at common/ac_binary.c:135
135                     if (gelf_getshdr(section, &section_header) != &section_header) {                          
(gdb) p section                                                                                                     
$1 = (Elf_Scn *) 0x7fff84043da8                                                                                    
(gdb) p section_header                                                                                             
$2 = {sh_name = 0, sh_type = 0, sh_flags = 0, sh_addr = 68719476736, sh_offset = 140735667996976,                   
  sh_size = 140737323335464, sh_link = 0, sh_info = 0, sh_addralign = 93825000247984, sh_entsize = 0}
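
A note on reading the `p section_header` output: the large decimal values are easier to interpret in hexadecimal. The sketch below, in C, converts the four non-zero fields. The helper names and the address-range heuristic are assumptions made for illustration; they are not part of Mesa or libelf.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The four non-zero fields, copied from the `p section_header` output. */
struct field {
    const char *name;
    uint64_t value;
};

const struct field section_header_fields[] = {
    { "sh_addr",      68719476736ULL },
    { "sh_offset",    140735667996976ULL },
    { "sh_size",      140737323335464ULL },
    { "sh_addralign", 93825000247984ULL },
};
const size_t section_header_field_count =
    sizeof section_header_fields / sizeof section_header_fields[0];

/* Heuristic, and an assumption: on x86-64 Linux, 48-bit user-space
 * addresses often start with 0x55 (executable and heap) or 0x7f
 * (stacks and shared libraries), as the pointers in the backtrace do. */
int looks_like_address(uint64_t v)
{
    uint64_t top_byte = (v >> 40) & 0xff;
    return (v >> 48) == 0 && (top_byte == 0x55 || top_byte == 0x7f);
}

/* Print each field in hex and flag the ones that resemble pointers. */
void print_fields(FILE *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < section_header_field_count; i++) {
        const struct field *f = &section_header_fields[i];
        fprintf(out, "%-12s = 0x%" PRIx64 "%s\n", f->name, f->value,
                looks_like_address(f->value) ? "  (looks like a pointer)" : "");
    }
}
```

In hex, sh_offset is 0x7fff937f8930, sh_size is 0x7ffff629ff28 and sh_addralign is 0x555555cf92b0, all in the same ranges as the pointers elsewhere in the backtrace. That suggests, without proving it, that section_header still held leftover stack contents, i.e. that gelf_getshdr crashed before filling it in.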

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67867 - Posted: 2 Nov 2018, 20:07:21 UTC - in response to Message 67866.  

I have now put together (mostly copied) a simple program that just compiles, loads and executes the OpenCL kernel with the same flags MilkyWay uses. And that did not crash!

Keith Myers

Joined: 24 Jan 11
Posts: 708
Credit: 542,930,296
RAC: 151,922
Message 67868 - Posted: 3 Nov 2018, 1:18:33 UTC - in response to Message 67867.  

You definitely have the skill set to diagnose errors. It seems you are going to need to report the errors to Mesa and AMD.

Similar to what the Einstein users are having to do with Nvidia Turing cards.

Either the applications are performing a function that crashes the driver, or the drivers are having issues performing a valid function that only the MW and Einstein apps are exposing.

It is doubtful that either project has the developer resources to quickly make changes to the failing applications. The solution will probably have to come from the driver developers.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67869 - Posted: 3 Nov 2018, 7:51:34 UTC - in response to Message 67868.  

Similar to what the Einstein users are having to do with Nvidia Turing cards.

Either the applications are performing a function that crashes the driver, or the drivers are having issues performing a valid function that only the MW and Einstein apps are exposing.


On my system, Einstein apps work just fine. I do not have a Turing card, however.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67871 - Posted: 3 Nov 2018, 11:35:04 UTC

I think I fixed it. I will submit a PR once I finalize it.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67872 - Posted: 3 Nov 2018, 11:59:21 UTC - in response to Message 67871.  

I think I fixed it. I will submit a PR once I finalize it.


Done in https://github.com/Milkyway-at-home/milkywayathome_client/pull/62!

The problem was that the size of the inline kernel source was declared with a different type than the one used in its definition. As a result, the size was read as a value on the order of terabytes, which crashed the compiler.
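
To illustrate how a mismatched declaration can produce a length that large, here is a minimal sketch in C. It is a model under stated assumptions, not the actual MilkyWay code: the post does not say which types were involved, so the 32-bit definition, the 64-bit declaration, the struct and the function names are all hypothetical. The struct stands in for the memory around the symbol, and the wide read mimics what a consumer does when its declaration claims the symbol is wider than it really is.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the memory around the length symbol (hypothetical layout):
 * a 4-byte length as the defining file emits it, followed by whatever
 * unrelated data happens to be placed next to it. */
struct data_section {
    uint32_t src_len;    /* what the definition actually provides */
    uint32_t neighbour;  /* unrelated adjacent bytes */
};

/* What a consumer effectively does when its declaration says the symbol
 * is 64 bits wide: it reads 8 bytes starting at src_len and so pulls the
 * neighbouring bytes into the value. */
uint64_t read_len_as_declared(const struct data_section *s)
{
    uint64_t wide;
    assert(sizeof *s >= sizeof wide);
    memcpy(&wide, s, sizeof wide);
    return wide;
}

/* The read that matches the definition: only the 4 bytes of src_len. */
uint64_t read_len_as_defined(const struct data_section *s)
{
    return s->src_len;
}
```

With src_len set to 5000 and a neighbouring word of 0x300, read_len_as_defined returns 5000, while read_len_as_declared returns a value above 2^40, which is in the terabyte range when treated as a byte count. One common way to avoid this class of bug is to keep a single declaration in a header that both the defining and the using file include, so the compiler can check the types against each other.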

I also had to make these changes to build non-static, debug-enabled binaries: https://github.com/gridcoin-community/milkywayathome_client/pull/1. I did not submit that one to MW.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67873 - Posted: 3 Nov 2018, 12:35:17 UTC - in response to Message 67872.  

That allowed me to run Separation jobs on Polaris. NBody fails to load/compile on both Polaris and Tahiti, and Separation fails with "Failed to calculate integral 0" on Tahiti.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67874 - Posted: 3 Nov 2018, 18:55:39 UTC - in response to Message 67873.  

I got it to work. At least Separation. N-Body still does not work because it is stupid.

Keith Myers

Joined: 24 Jan 11
Posts: 708
Credit: 542,930,296
RAC: 151,922
Message 67875 - Posted: 3 Nov 2018, 20:35:57 UTC

Thank you, Tomas, for submitting changes to the MW codebase to improve current support under Linux. We really appreciate the volunteer developers since, realistically, they are the ones who have the time, compared to the project scientists.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67879 - Posted: 6 Nov 2018, 18:37:34 UTC

N-Body wasn't even supposed to work. That's why I could not get it to work.

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67880 - Posted: 6 Nov 2018, 19:53:42 UTC

I doubt this will be useful, but I put my custom milkyway separation binary here http://www.tbrada.eu/up/363f27ebac1a27b6715609c245555881 and app info here http://www.tbrada.eu/up/6826f16895d61bf7e069dcd4acf33c21.xml.

xdarma

Joined: 28 May 10
Posts: 5
Credit: 264,702,311
RAC: 0
Message 67925 - Posted: 4 Dec 2018, 13:50:27 UTC

The application you provided works on my old HD 7970 GPU with the Mesa OpenCL libraries.
Found 1 CL device
Device 'AMD Radeon HD 7900 Series (TAHITI, DRM 3.27.0, 4.19.6-gentoo.s1, LLVM 6.0.1)' (AMD:0x1002) (CL_DEVICE_TYPE_GPU)
Board: 
Driver version:      18.3.0-rc5
Version:             OpenCL 1.1 Mesa 18.3.0-rc5

WUs are validated correctly and the execution time for a single WU is around 58 seconds, so I think there is no big penalty with the Mesa OpenCL libraries.

Thanks for your work.

Maybe Jack or Eric are interested in updating the official application.
Hope they take a look at the forums. ;-)

sorcrosc

Joined: 18 Oct 12
Posts: 2
Credit: 10,078,939
RAC: 0
Message 67989 - Posted: 5 Jan 2019, 23:55:00 UTC - in response to Message 67925.  



Maybe Jack or Eric are interested in updating the official application.
Hope they take a look at the forums. ;-)


I hope so too. I'm running some GPU tasks after a very long time :)

Thanks

Tomas Brod

Joined: 7 Dec 15
Posts: 20
Credit: 8,805,665
RAC: 0
Message 67995 - Posted: 9 Jan 2019, 14:22:58 UTC - in response to Message 67925.  

Maybe Jack or Eric are interested in updating the official application.
Hope they take a look at the forums. ;-)

The code has been merged into the official repository, but I do not know whether it has been deployed to the BOINC server.

©2024 Astroinformatics Group