Welcome to MilkyWay@home

Nvidia Quadro K6000 Low Utilization - Slow Processing

Samyah93
Joined: 24 Nov 19
Posts: 6
Credit: 21,148
RAC: 0
Message 69293 - Posted: 24 Nov 2019, 8:29:18 UTC

Hello,

I have a workstation equipped with two Nvidia cards: a Quadro K6000 and a Tesla K40. I've noticed that the K6000 processes jobs much more slowly than the K40, even though the two should be comparable in performance (same chip). The K6000 takes around 220 seconds to complete a job, while the K40 finishes one in ~45 seconds. When I check GPU utilization with nvidia-smi, the K6000 never exceeds 25%, while the K40 runs at 90-100%. In the workunit output, the only significant difference I notice is the number of blocks/chunk and the number of chunks. (For example, the K40 shows 73 blocks/chunk with "Num chunks: 1", while the K6000 shows 5 blocks/chunk with "Num chunks: 15".) Is this the source of the problem? And is there a way to speed up the calculations on the K6000? Sample outputs are included below. Thanks!

Quadro K6000:
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' 
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
Using AVX path
Found 1 platform
Platform 0 information:
  Name:       NVIDIA CUDA
  Version:    OpenCL 1.2 CUDA 10.1.236
  Vendor:     NVIDIA Corporation
  Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Profile:    FULL_PROFILE
Using device 1 on platform 0
Found 2 CL devices
Device 'Quadro K6000' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)
Board: 
Driver version:      418.87.01
Version:             OpenCL 1.2 CUDA
Compute capability:  3.5
Max compute units:   15
Clock frequency:     901 Mhz
Global mem size:     11989549056
Local mem size:      49152
Max const buf size:  65536
Double extension:    cl_khr_fp64
Build log:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Build log:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Estimated Nvidia GPU GFLOP/s: 865 SP GFLOP/s, 108 DP FLOP/s
Using a target frequency of 60.0
Using a block size of 7680 with 5 blocks/chunk
Using clWaitForEvents() for polling with initial wait of 12 ms (mode 0)
Range:          { nu_steps = 320, mu_steps = 800, r_steps = 700 }
Iteration area: 560000
Chunk estimate: 13
Num chunks:     15
Chunk size:     38400
Added area:     16000
Effective area: 576000
Initial wait:   12 ms
Integration time: 53.356348 s. Average time per iteration = 166.738586 ms
Integral 0 time = 53.476008 s
Running likelihood with 34614 stars
Likelihood time = 0.447050 s
<background_integral> 0.000059483224699 </background_integral>
<stream_integral>  161.051564875388237  19.295289056122250  17.373218361365936  0.453606010282199 </stream_integral>
<background_likelihood> -3.370756367290975 </background_likelihood>
<stream_only_likelihood>  -3.414761260341190  -4.680083632057806  -4.476524432162782  -62.665198829490095 </stream_only_likelihood>
<search_likelihood> -2.788954249416149 </search_likelihood>


Tesla K40:
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<stderr_txt>
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' 
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 4 </number_WUs>
<number_params_per_WU> 26 </number_params_per_WU>
Using AVX path
Found 1 platform
Platform 0 information:
  Name:       NVIDIA CUDA
  Version:    OpenCL 1.2 CUDA 10.1.236
  Vendor:     NVIDIA Corporation
  Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Profile:    FULL_PROFILE
Using device 0 on platform 0
Found 2 CL devices
Device 'Tesla K40c' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)
Board: 
Driver version:      418.87.01
Version:             OpenCL 1.2 CUDA
Compute capability:  3.5
Max compute units:   15
Clock frequency:     745 Mhz
Global mem size:     11996954624
Local mem size:      49152
Max const buf size:  65536
Double extension:    cl_khr_fp64
Build log:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Build log:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Estimated Nvidia GPU GFLOP/s: 715 SP GFLOP/s, 358 DP FLOP/s
Using a target frequency of 60.0
Using a block size of 7680 with 73 blocks/chunk
Using clWaitForEvents() for polling with initial wait of 12 ms (mode 0)
Range:          { nu_steps = 320, mu_steps = 800, r_steps = 700 }
Iteration area: 560000
Chunk estimate: 4
Num chunks:     1
Chunk size:     560640
Added area:     640
Effective area: 560640
Initial wait:   12 ms
Integration time: 9.972695 s. Average time per iteration = 31.164673 ms
Integral 0 time = 10.090335 s
Running likelihood with 38073 stars
Likelihood time = 0.683837 s
<background_integral> 0.000059045116913 </background_integral>
<stream_integral>  107.805354630900979  47.098348005207185  0.477909869639480  1.440862238706881 </stream_integral>
<background_likelihood> -3.395313980381030 </background_likelihood>
<stream_only_likelihood>  -4.154406400186403  -3.352796560986827  -61.060047563529722  -101.940935169837473 </stream_only_likelihood>
<search_likelihood> -2.797258234549717 </search_likelihood>
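The chunk figures in both logs fit a simple relationship. The sketch below reconstructs it from the printed numbers alone; it is an assumption about how milkyway_separation sizes its work, not code taken from the application. Chunk size is block size × blocks/chunk, the number of chunks is the iteration area (mu_steps × r_steps = 800 × 700 = 560000) divided by the chunk size and rounded up, and "Added area" is the padding left by that rounding. Dividing each integration time by its average time per iteration gives 320 on both cards, matching nu_steps, so both cards run the same number of iterations; the K6000 just splits each one into 15 chunks instead of 1. The helper `total_chunks` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChunkPlan:
    chunk_size: int       # work items per chunk
    num_chunks: int       # chunks needed to cover one iteration
    effective_area: int   # iteration area rounded up to whole chunks
    added_area: int       # padding introduced by that rounding


def plan_chunks(mu_steps: int, r_steps: int, block_size: int, blocks_per_chunk: int) -> ChunkPlan:
    """Reproduce the 'Chunk size' / 'Num chunks' / 'Added area' lines of the stderr logs.

    Assumed relationship, inferred from the printed numbers only.
    """
    iteration_area = mu_steps * r_steps                              # 800 * 700 = 560000 here
    chunk_size = block_size * blocks_per_chunk
    num_chunks = (iteration_area + chunk_size - 1) // chunk_size     # ceiling division
    effective_area = num_chunks * chunk_size
    return ChunkPlan(chunk_size, num_chunks, effective_area, effective_area - iteration_area)


def total_chunks(nu_steps: int, num_chunks: int) -> int:
    """Hypothetical helper: chunks processed per integral, assuming one iteration per nu step."""
    return nu_steps * num_chunks


if __name__ == "__main__":
    logs = [
        ("Quadro K6000", 7680, 5),
        ("Tesla K40c", 7680, 73),
        ("Quadro K6000 (after swap)", 3840, 14),
    ]
    for name, block_size, blocks_per_chunk in logs:
        p = plan_chunks(800, 700, block_size, blocks_per_chunk)
        print(f"{name}: chunk size {p.chunk_size}, {p.num_chunks} chunks, "
              f"added area {p.added_area}, {total_chunks(320, p.num_chunks)} chunks per integral")
    # Integration time / average time per iteration, both copied from the logs
    print(round(53.356348 / 0.166738586), round(9.972695 / 0.031164673))
```

Per iteration the K6000 spends about 166.7 ms over 15 chunks and the K40 about 31.2 ms over a single chunk, which is consistent with the lower utilization seen in nvidia-smi but does not by itself say why the application picked 5 blocks/chunk for the K6000.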
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69294 - Posted: 24 Nov 2019, 12:08:28 UTC - in response to Message 69293.  

Something is wrong. What does nvidia-smi show for utilization on each card?
Samyah93
Joined: 24 Nov 19
Posts: 6
Credit: 21,148
RAC: 0
Message 69295 - Posted: 24 Nov 2019, 16:03:27 UTC - in response to Message 69294.  

nvidia-smi shows around 15-25% utilization for the K6000 and 90-100% for the K40.
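A single nvidia-smi snapshot can mislead, since utilization fluctuates from second to second. A minimal sketch for averaging it over time, assuming nvidia-smi is on the PATH and accepts the `--query-gpu` fields `index`, `name`, `utilization.gpu` and `temperature.gpu` (listed by `nvidia-smi --help-query-gpu`); the function names are illustrative only:

```python
import subprocess
import time
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

QUERY = "index,name,utilization.gpu,temperature.gpu"


def parse_rows(csv_text: str) -> List[Tuple[int, str, float, float]]:
    """Parse 'csv,noheader,nounits' output: index, name, utilization %, temperature C."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, name, util, temp = (field.strip() for field in line.split(","))
        rows.append((int(idx), name, float(util), float(temp)))
    return rows


def sample(seconds: float = 30.0, interval: float = 1.0) -> Dict[int, Tuple[str, float, float]]:
    """Poll nvidia-smi; return {gpu index: (name, mean utilization %, max temperature C)}."""
    util: Dict[int, List[float]] = defaultdict(list)
    temps: Dict[int, List[float]] = defaultdict(list)
    names: Dict[int, str] = {}
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        for idx, name, u, t in parse_rows(out):
            names[idx] = name
            util[idx].append(u)
            temps[idx].append(t)
        time.sleep(interval)
    return {idx: (names[idx], mean(util[idx]), max(temps[idx])) for idx in names}
```

Calling `sample(60)` while both cards are crunching gives one averaged figure per card, which is easier to compare than individual snapshots.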
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69296 - Posted: 24 Nov 2019, 16:30:55 UTC - in response to Message 69295.  
Last modified: 24 Nov 2019, 16:38:27 UTC

Sorry, I need more info, especially temperatures.

Post the complete output of nvidia-smi.

From your BOINC event log, post the top lines down through "Memory:".

If the above does not show any obvious problems, then do the following:

Add the following line to your programdata\boinc\projects\milkyway.cs.rpi.edu_milkyway\app_config.xml file:

<cmdline>--verbose</cmdline>

and re-post the info as you did in your first post. There must be some important difference.

I have no experience with those two boards. All I can do is look for obvious differences in the above outputs. I compared your two boards at TechPowerUp, and the important FP64 specs look identical. Maybe someone here has one of those boards and can better figure out the problem.
Samyah93
Joined: 24 Nov 19
Posts: 6
Credit: 21,148
RAC: 0
Message 69301 - Posted: 25 Nov 2019, 8:58:29 UTC - in response to Message 69296.  

Sorry about that.

Here is the output from nvidia-smi while running jobs:
Sun Nov 24 21:27:57 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 00000000:02:00.0 Off |                    0 |
| 44%   75C    P0   162W / 235W |    125MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K6000        On   | 00000000:03:00.0 Off |                    0 |
| 34%   61C    P0    79W / 225W |    210MiB / 11434MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10982      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   114MiB |
|    1      3724      G   /usr/lib/xorg/Xorg                            19MiB |
|    1     10881      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   177MiB |
+-----------------------------------------------------------------------------+


Here is the event log through memory:
Mon 25 Nov 2019 12:44:56 AM PST |  | Starting BOINC client version 7.9.3 for x86_64-pc-linux-gnu
Mon 25 Nov 2019 12:44:56 AM PST |  | log flags: file_xfer, sched_ops, task
Mon 25 Nov 2019 12:44:56 AM PST |  | Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Mon 25 Nov 2019 12:44:56 AM PST |  | Data directory: /var/lib/boinc-client
Mon 25 Nov 2019 12:44:56 AM PST |  | CUDA: NVIDIA GPU 0: Tesla K40c (driver version 418.87, CUDA version 10.1, compute capability 3.5, 4096MB, 4007MB available, 4291 GFLOPS peak)
Mon 25 Nov 2019 12:44:56 AM PST |  | CUDA: NVIDIA GPU 1: Quadro K6000 (driver version 418.87, CUDA version 10.1, compute capability 3.5, 4096MB, 4007MB available, 5193 GFLOPS peak)
Mon 25 Nov 2019 12:44:56 AM PST |  | OpenCL: NVIDIA GPU 0: Tesla K40c (driver version 418.87.01, device version OpenCL 1.2 CUDA, 11441MB, 4007MB available, 4291 GFLOPS peak)
Mon 25 Nov 2019 12:44:56 AM PST |  | OpenCL: NVIDIA GPU 1: Quadro K6000 (driver version 418.87.01, device version OpenCL 1.2 CUDA, 11434MB, 4007MB available, 5193 GFLOPS peak)
Mon 25 Nov 2019 12:44:57 AM PST |  | [libc detection] gathered: 2.27, Ubuntu GLIBC 2.27-3ubuntu1
Mon 25 Nov 2019 12:44:57 AM PST |  | Host name: albus
Mon 25 Nov 2019 12:44:57 AM PST |  | Processor: 32 GenuineIntel Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz [Family 6 Model 63 Stepping 2]
Mon 25 Nov 2019 12:44:57 AM PST |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
Mon 25 Nov 2019 12:44:57 AM PST |  | OS: Linux Ubuntu: Ubuntu 18.04.3 LTS [4.15.0-70-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
Mon 25 Nov 2019 12:44:57 AM PST |  | Memory: 62.82 GB physical, 980.00 MB virtual


I tried adding the line <cmdline>--verbose</cmdline> to an app_config.xml, but the content of the stderr file does not seem to change. (I looked for the stderr output under boinc/slots/X/ and on the MilkyWay@home website. Is there somewhere else I should look?) My app_config.xml looks like the following:
<app_config>
<app>
<name>milkyway</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
</app_config>
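For reference, the BOINC client configuration documentation places `<cmdline>` inside an `<app_version>` element (identified by `<app_name>` and, where needed, `<plan_class>`), not inside `<app>`; the file above contains no `<cmdline>` at all. A minimal sketch that writes such a file with Python's standard library, assuming the app name `milkyway` from the file above and a plan class of `opencl_nvidia_101` inferred from the task names in the nvidia-smi process list (both worth checking against client_state.xml). Whether the application honours `--verbose` is the earlier reply's suggestion, not something confirmed here:

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_app_config(app_name: str, plan_class: str, gpu_usage: float = 1.0,
                     cpu_usage: float = 1.0, cmdline: Optional[str] = None) -> str:
    """Return app_config.xml text; <cmdline> goes in <app_version>, not <app>."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)

    if cmdline:
        version = ET.SubElement(root, "app_version")
        ET.SubElement(version, "app_name").text = app_name
        ET.SubElement(version, "plan_class").text = plan_class  # assumed value, see above
        ET.SubElement(version, "cmdline").text = cmdline

    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config("milkyway", "opencl_nvidia_101", cmdline="--verbose"))
```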


Thanks!
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 69302 - Posted: 25 Nov 2019, 13:54:06 UTC

I think I wasted your time; your analysis was spot on. The block size difference means more chunks per iteration (15 vs. 1) and less parallel processing. Everything else looks good. Maybe there is a way to specify a different block size on the command line. The biggest improvement would be to use 0.25 for gpu_usage in app_config.xml and to set cc_config.xml to exclude the slower board.

Out of curiosity, have you tried the SETI special Linux app? I am curious whether it has the same problem.
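The two changes suggested above map onto two files. Running several tasks per card is `<gpu_usage>0.25</gpu_usage>` inside `<gpu_versions>` in app_config.xml (up to four tasks per GPU); excluding one card from a project is an `<exclude_gpu>` entry under `<options>` in cc_config.xml in the BOINC data directory (/var/lib/boinc-client on this host, per the event log). A minimal sketch that writes the latter, assuming the project URL `http://milkyway.cs.rpi.edu/milkyway/` (check the exact master URL your client uses) and BOINC's device numbering from the event log, where the Quadro K6000 was GPU 1 before the cards were swapped. If a cc_config.xml already exists, merge the entry rather than overwriting the file; a client restart may be needed for the exclusion to take effect:

```python
import xml.etree.ElementTree as ET


def build_cc_config(project_url: str, device_num: int, gpu_type: str = "NVIDIA") -> str:
    """Return cc_config.xml text that stops one project from using one GPU."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url              # assumed project URL
    ET.SubElement(exclude, "device_num").text = str(device_num)   # BOINC's GPU index
    ET.SubElement(exclude, "type").text = gpu_type
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # GPU 1 was the Quadro K6000 in the event log posted before the swap.
    print(build_cc_config("http://milkyway.cs.rpi.edu/milkyway/", 1))
```

Device numbers follow BOINC's own GPU ordering, which changed when the cards were swapped, so re-check the event log before relying on a particular index.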
Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 539,997,909
RAC: 86,867
Message 69303 - Posted: 25 Nov 2019, 16:11:55 UTC

Well, your first piece of misinformation is that the two cards use the same GPU silicon. They don't. Look at the cards in the TechPowerUp GPU database:
https://www.techpowerup.com/gpu-specs/tesla-k40c.c2505
https://www.techpowerup.com/gpu-specs/quadro-k6000.c2426

Second, both your Event Log startup entries and your nvidia-smi output show differences in GFLOPS rating.
The K6000 should be faster. It is not. So the card is probably being starved of CPU support, PCIe bus speed, or lane width.
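The GFLOPS comparison can be checked with the figures already posted. In the event log, BOINC's peak rating for the K6000 (5193 GFLOPS) exceeds the K40c's (4291 GFLOPS) by about the same ratio as their clocks in the stderr logs (901 vs. 745 MHz), so on that measure the K6000 should indeed be faster. The application's own estimates in the stderr logs differ for double precision: its DP figure for the K6000 is about 1/8 of its SP figure, versus about 1/2 for the K40. The arithmetic, as a short sketch; nothing in the logs shows whether that estimate feeds into the blocks/chunk choice:

```python
# Figures copied from the event log (peak GFLOPS) and the two stderr logs (clock, SP/DP estimates).
cards = {
    "Tesla K40c":   {"boinc_peak": 4291, "clock_mhz": 745, "app_sp": 715, "app_dp": 358},
    "Quadro K6000": {"boinc_peak": 5193, "clock_mhz": 901, "app_sp": 865, "app_dp": 108},
}

k40, k6000 = cards["Tesla K40c"], cards["Quadro K6000"]

peak_ratio = k6000["boinc_peak"] / k40["boinc_peak"]    # BOINC peak rating, K6000 / K40
clock_ratio = k6000["clock_mhz"] / k40["clock_mhz"]     # reported clock, K6000 / K40
# The application's DP estimate as a fraction of its SP estimate, per card
dp_share = {name: c["app_dp"] / c["app_sp"] for name, c in cards.items()}

print(f"BOINC peak ratio K6000/K40: {peak_ratio:.3f}")
print(f"Clock ratio K6000/K40:      {clock_ratio:.3f}")
for name, share in dp_share.items():
    print(f"{name}: app DP estimate = {share:.3f} x SP estimate (about 1/{round(1 / share)})")
```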
Samyah93
Joined: 24 Nov 19
Posts: 6
Credit: 21,148
RAC: 0
Message 69304 - Posted: 25 Nov 2019, 17:46:36 UTC - in response to Message 69302.  

Thanks! I will get back to you on whether SETI has this problem, since I have not tried it yet.
Samyah93
Joined: 24 Nov 19
Posts: 6
Credit: 21,148
RAC: 0
Message 69305 - Posted: 25 Nov 2019, 17:49:15 UTC - in response to Message 69303.  
Last modified: 25 Nov 2019, 18:10:37 UTC

Hello,

Thanks for the information! It is my understanding that the GK110 ended up being used in the Tesla K40 at launch. GK180 was also just a GK110 with some minor tweaks.

Both cards are on PCIe x16 slots, and the cards work correctly for non-BOINC (quantum chemical) calculation purposes. Is there a way to check if it is being starved of resources?

As was pointed out, it also seems to have significant differences in the number of blocks processed simultaneously.
Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 539,997,909
RAC: 86,867
Message 69311 - Posted: 26 Nov 2019, 5:58:16 UTC - in response to Message 69305.  
Last modified: 26 Nov 2019, 6:00:25 UTC

Well, the first thing I would do is pull the K40 card out of the system and put the K6000 card in the slot the K40 was in. It is probably a PCIe lane width issue. Just because the slots are physically x16 doesn't mean the CPU or chipset is delivering the full 16 lanes to the slot. If the utilization goes up, you know that the slot the K6000 was in is not running at the same lane width, or possibly the same bus speed. The typical GPU monitoring programs should be able to tell you the negotiated bus speed of the cards along with their lane width. See if there is a difference between the two slots. NVIDIA X Server Settings can show you that. It is installed with the drivers and should be in your Show Applications list.
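Besides NVIDIA X Server Settings, nvidia-smi can report the negotiated link directly. A minimal sketch, assuming the `--query-gpu` fields `pcie.link.gen.current`, `pcie.link.gen.max`, `pcie.link.width.current` and `pcie.link.width.max` (listed by `nvidia-smi --help-query-gpu`). The current generation and width may be reduced while a GPU is idle, so query while tasks are running; the function names are illustrative:

```python
import subprocess
from typing import Dict, List, Tuple

FIELDS = ("index,name,pcie.link.gen.current,pcie.link.gen.max,"
          "pcie.link.width.current,pcie.link.width.max")

Link = Tuple[str, int, int, int, int]  # name, gen current, gen max, width current, width max


def parse_links(csv_text: str) -> Dict[int, Link]:
    """Parse 'csv,noheader,nounits' output into {gpu index: link info}."""
    links = {}
    for line in csv_text.strip().splitlines():
        idx, name, gen_cur, gen_max, width_cur, width_max = (f.strip() for f in line.split(","))
        links[int(idx)] = (name, int(gen_cur), int(gen_max), int(width_cur), int(width_max))
    return links


def below_max(links: Dict[int, Link]) -> List[int]:
    """GPU indices whose negotiated generation or width is below the reported maximum."""
    return [idx for idx, (_, gc, gm, wc, wm) in links.items() if gc < gm or wc < wm]


def query_links() -> Dict[int, Link]:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_links(out)
```

An empty list from `below_max(query_links())` means every card negotiated its full generation and width at the time of the query.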
Samyah93
Joined: 24 Nov 19
Posts: 6
Credit: 21,148
RAC: 0
Message 69313 - Posted: 26 Nov 2019, 22:11:20 UTC - in response to Message 69311.  

Hello,

I've tried swapping the two cards in my workstation, and it is still the K6000 that shows much lower utilization. The stderr file for the K6000 also still shows fewer blocks/chunk (14 vs. 146 on the Tesla K40) and more chunks (11 vs. 1 on the Tesla K40):
<search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application>
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Switching to Parameter File 'astronomy_parameters.txt'
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 20 </number_params_per_WU>
Using AVX path
Found 1 platform
Platform 0 information:
  Name:       NVIDIA CUDA
  Version:    OpenCL 1.2 CUDA 10.2.95
  Vendor:     NVIDIA Corporation
  Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addr$
  Profile:    FULL_PROFILE
Using device 0 on platform 0
Found 2 CL devices
Device 'Quadro K6000' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU)
Board:
Driver version:      440.33.01
Version:             OpenCL 1.2 CUDA
Compute capability:  3.5
Max compute units:   15
Clock frequency:     901 Mhz
Global mem size:     11988566016
Local mem size:      49152
Max const buf size:  65536
Double extension:    cl_khr_fp64
Build log:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Build log:
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Estimated Nvidia GPU GFLOP/s: 865 SP GFLOP/s, 108 DP FLOP/s
Using a target frequency of 60.0
Using a block size of 3840 with 14 blocks/chunk
Using clWaitForEvents() for polling with initial wait of 13 ms (mode 0)
Range:          { nu_steps = 320, mu_steps = 800, r_steps = 700 }
Iteration area: 560000
Chunk estimate: 10
Num chunks:     11
Chunk size:     53760
Added area:     31360
Effective area: 591360
Initial wait:   13 ms
Integration time: 43.993598 s. Average time per iteration = 137.479993 ms
Integral 0 time = 44.114030 s
Running likelihood with 10 stars
Likelihood time = 0.000139 s
<background_integral> 0.000023991637849 </background_integral>
<stream_integral>  0.013916549538950  0.081085356418659  0.551135322056047 </stream_integral>
<background_likelihood> -3.108318245877400 </background_likelihood>
<stream_only_likelihood>  -217.020095058481104  -169.073239934746482  -169.106244918774621 </stream_only_likelihood>
<search_likelihood> -1.633680829801995 </search_likelihood>



I also tried updating the drivers. The updated output for nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K6000        On   | 00000000:02:00.0 Off |                    0 |
| 35%   58C    P0    74W / 225W |    279MiB / 11433MiB |     15%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          On   | 00000000:03:00.0 Off |                    0 |
| 42%   71C    P0   168W / 235W |    180MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1905      G   /usr/lib/xorg/Xorg                            24MiB |
|    0      5520      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   241MiB |
|    1      5683      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   169MiB |
+-----------------------------------------------------------------------------+


I also used lspci to get information about the bus speed, and both cards are at PCIe 3.0 x16. The output also shows that the Tesla K40 is in fact a GK110 (derivative) card:

02:00.0 VGA compatible controller: NVIDIA Corporation GK110GL [Quadro K6000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GK110GL [Quadro K6000]
	Physical Slot: 2
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 32
	NUMA node: 0
	Region 0: Memory at f3000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at 1000 [size=128]
	[virtual] Expansion ROM at f4080000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee16000  Data: 4022
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 1024 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend+
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] #19
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

03:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)
	Subsystem: Hewlett-Packard Company GK110BGL [Tesla K40c]
	Physical Slot: 5
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 33
	NUMA node: 0
	Region 0: Memory at f2000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee35000  Data: 4021
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 1024 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend+
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] #19
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
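The LnkCap and LnkSta lines above are the ones to compare: LnkCap is what the device supports and LnkSta is what the link actually negotiated. A minimal sketch that extracts them from one device's `lspci -vv` text and converts the negotiated link to a nominal one-direction bandwidth, assuming the standard PCIe line encodings (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 and 16 GT/s). lspci generally needs root to print these capability lines; the function names are illustrative:

```python
import re
from typing import Dict, Tuple

# Line-code efficiency per transfer rate (GT/s): 8b/10b for Gen1/2, 128b/130b for Gen3/4.
ENCODING = {2.5: 8 / 10, 5.0: 8 / 10, 8.0: 128 / 130, 16.0: 128 / 130}

# Matches e.g. "LnkCap:\tPort #0, Speed 8GT/s, Width x16" and "LnkSta:\tSpeed 8GT/s, Width x16"
LINK_RE = re.compile(r"\b(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def parse_link(lspci_text: str) -> Dict[str, Tuple[float, int]]:
    """Return {'LnkCap': (GT/s, lanes), 'LnkSta': (GT/s, lanes)} for a single device's text."""
    return {m.group(1): (float(m.group(2)), int(m.group(3)))
            for m in LINK_RE.finditer(lspci_text)}


def bandwidth_gb_s(speed_gt_s: float, lanes: int) -> float:
    """Nominal one-direction bandwidth in GB/s after line encoding."""
    return speed_gt_s * ENCODING[speed_gt_s] * lanes / 8


def is_degraded(link: Dict[str, Tuple[float, int]]) -> bool:
    """True if the negotiated speed or width is below the device capability."""
    cap_speed, cap_width = link["LnkCap"]
    sta_speed, sta_width = link["LnkSta"]
    return sta_speed < cap_speed or sta_width < cap_width
```

Both cards above report 8 GT/s x16 in LnkCap and LnkSta, which works out to about 15.75 GB/s each way for either card, so the negotiated link alone does not distinguish them.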


I tried running SETI before I swapped the cards, and it showed near 100% utilization on both cards:
Mon Nov 25 16:53:34 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 00000000:02:00.0 Off |                    0 |
| 37%   73C    P0   154W / 235W |    270MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K6000        On   | 00000000:03:00.0 Off |                    0 |
| 44%   73C    P0   148W / 225W |    296MiB / 11433MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2788      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_sah   258MiB |
|    1      2036      G   /usr/lib/xorg/Xorg                            24MiB |
|    1      2773      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_sah   258MiB |
+-----------------------------------------------------------------------------+
Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 539,997,909
RAC: 86,867
Message 69316 - Posted: 28 Nov 2019, 21:29:14 UTC

I haven't got a clue. Something is strange about that card; maybe it is defective.


©2024 Astroinformatics Group