Nvidia Quadro K6000 Low Utilization

Samyah93 Joined: 24 Nov 19 Posts: 6 Credit: 21,148 RAC: 0	Message 69293 - Posted: 24 Nov 2019, 8:29:18 UTC Hello, I have a workstation equipped with two Nvidia cards: a Quadro K6000 and a Tesla K40. I've noticed that the K6000 processes jobs much more slowly than the K40, even though the two should be comparable in performance (same chip). The K6000 takes around 220 seconds to complete a job, while the K40 finishes one in ~45 seconds. When I check GPU utilization with nvidia-smi, the K6000 never exceeds 25%, but the K40 runs at 90-100%. When I compare the workunit output, the only significant difference I notice is in the "blocks/chunk" value and the number of chunks. (For example, the K40 has 73 blocks/chunk with "num chunks: 1", while the K6000 shows 5 blocks/chunk with "num chunks: 15".) Is this the source of the problem? And is there a way to speed up the calculations on the K6000? Sample outputs are included below. Thanks! Quadro K6000: <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> <search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application> BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation' Setting process priority to 0 (13): Permission denied Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' Switching to Parameter File 'astronomy_parameters.txt' <number_WUs> 4 </number_WUs> <number_params_per_WU> 26 </number_params_per_WU> Using AVX path Found 1 platform Platform 0 information: Name: NVIDIA CUDA Version: OpenCL 1.2 CUDA 10.1.236 Vendor: NVIDIA Corporation Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer Profile: 
FULL_PROFILE Using device 1 on platform 0 Found 2 CL devices Device 'Quadro K6000' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU) Board: Driver version: 418.87.01 Version: OpenCL 1.2 CUDA Compute capability: 3.5 Max compute units: 15 Clock frequency: 901 Mhz Global mem size: 11989549056 Local mem size: 49152 Max const buf size: 65536 Double extension: cl_khr_fp64 Build log: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Build log: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Estimated Nvidia GPU GFLOP/s: 865 SP GFLOP/s, 108 DP FLOP/s Using a target frequency of 60.0 Using a block size of 7680 with 5 blocks/chunk Using clWaitForEvents() for polling with initial wait of 12 ms (mode 0) Range: { nu_steps = 320, mu_steps = 800, r_steps = 700 } Iteration area: 560000 Chunk estimate: 13 Num chunks: 15 Chunk size: 38400 Added area: 16000 Effective area: 576000 Initial wait: 12 ms Integration time: 53.356348 s. 
Average time per iteration = 166.738586 ms Integral 0 time = 53.476008 s Running likelihood with 34614 stars Likelihood time = 0.447050 s <background_integral> 0.000059483224699 </background_integral> <stream_integral> 161.051564875388237 19.295289056122250 17.373218361365936 0.453606010282199 </stream_integral> <background_likelihood> -3.370756367290975 </background_likelihood> <stream_only_likelihood> -3.414761260341190 -4.680083632057806 -4.476524432162782 -62.665198829490095 </stream_only_likelihood> <search_likelihood> -2.788954249416149 </search_likelihood> Tesla K40: <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> <search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application> BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation' Setting process priority to 0 (13): Permission denied Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' Switching to Parameter File 'astronomy_parameters.txt' <number_WUs> 4 </number_WUs> <number_params_per_WU> 26 </number_params_per_WU> Using AVX path Found 1 platform Platform 0 information: Name: NVIDIA CUDA Version: OpenCL 1.2 CUDA 10.1.236 Vendor: NVIDIA Corporation Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer Profile: FULL_PROFILE Using device 0 on platform 0 Found 2 CL devices Device 'Tesla K40c' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU) Board: Driver version: 418.87.01 Version: OpenCL 1.2 CUDA Compute capability: 3.5 Max compute units: 15 Clock frequency: 745 Mhz Global mem size: 11996954624 Local mem size: 49152 Max const buf size: 65536 Double extension: cl_khr_fp64 Build log: 
-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Build log: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Estimated Nvidia GPU GFLOP/s: 715 SP GFLOP/s, 358 DP FLOP/s Using a target frequency of 60.0 Using a block size of 7680 with 73 blocks/chunk Using clWaitForEvents() for polling with initial wait of 12 ms (mode 0) Range: { nu_steps = 320, mu_steps = 800, r_steps = 700 } Iteration area: 560000 Chunk estimate: 4 Num chunks: 1 Chunk size: 560640 Added area: 640 Effective area: 560640 Initial wait: 12 ms Integration time: 9.972695 s. Average time per iteration = 31.164673 ms Integral 0 time = 10.090335 s Running likelihood with 38073 stars Likelihood time = 0.683837 s <background_integral> 0.000059045116913 </background_integral> <stream_integral> 107.805354630900979 47.098348005207185 0.477909869639480 1.440862238706881 </stream_integral> <background_likelihood> -3.395313980381030 </background_likelihood> <stream_only_likelihood> -4.154406400186403 -3.352796560986827 -61.060047563529722 -101.940935169837473 </stream_only_likelihood> <search_likelihood> -2.797258234549717 </search_likelihood> ID: 69293 · Rating: 0 · rate: / Reply Quote
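The chunk figures in the two logs above can be reproduced with simple arithmetic. The sketch below is inferred from the logged numbers only, not from the milkyway_separation source: it assumes the iteration area is mu_steps times r_steps, the chunk size is block size times blocks/chunk, and the number of chunks is the iteration area divided by the chunk size, rounded up. The function and field names are hypothetical.

```python
from typing import NamedTuple


class ChunkLayout(NamedTuple):
    chunk_size: int      # work items handled per chunk
    num_chunks: int      # chunks needed to cover one iteration
    effective_area: int  # area actually processed (padded to whole chunks)
    added_area: int      # padding beyond the real iteration area


def chunk_layout(mu_steps: int, r_steps: int,
                 block_size: int, blocks_per_chunk: int) -> ChunkLayout:
    """Recompute the 'Chunk size / Num chunks / Added area' log lines.

    Assumption: relationships are inferred from the logged values.
    """
    iteration_area = mu_steps * r_steps
    chunk_size = block_size * blocks_per_chunk
    num_chunks = -(-iteration_area // chunk_size)  # ceiling division
    effective_area = num_chunks * chunk_size
    return ChunkLayout(chunk_size, num_chunks, effective_area,
                       effective_area - iteration_area)


def integration_seconds(nu_steps: int, ms_per_iteration: float) -> float:
    """Total integration time if there is one iteration per nu step."""
    return nu_steps * ms_per_iteration / 1000.0


if __name__ == "__main__":
    # Values copied from the two stderr logs above.
    cards = (("Quadro K6000", 5, 166.738586), ("Tesla K40c", 73, 31.164673))
    for name, blocks_per_chunk, ms in cards:
        layout = chunk_layout(800, 700, 7680, blocks_per_chunk)
        print(name, layout, f"{integration_seconds(320, ms):.3f} s")
```

With these inputs the K6000 needs 15 chunks per iteration against 1 for the K40, which matches the logs. The logs also report different estimated double-precision rates for the two cards (108 vs. 358); whether the application derives blocks/chunk from that estimate is a guess and is not confirmed anywhere in this thread.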

Joseph Stateson Joined: 18 Nov 08 Posts: 291 Credit: 2,463,985,753 RAC: 25	Message 69294 - Posted: 24 Nov 2019, 12:08:28 UTC - in response to Message 69293. Something is wrong. What does nvidia-smi show for utilization on each card?

Samyah93 Joined: 24 Nov 19 Posts: 6 Credit: 21,148 RAC: 0	Message 69295 - Posted: 24 Nov 2019, 16:03:27 UTC - in response to Message 69294. nvidia-smi shows around 15-25% utilization for the K6000 and 90-100% for the K40.

Joseph Stateson Joined: 18 Nov 08 Posts: 291 Credit: 2,463,985,753 RAC: 25	Message 69296 - Posted: 24 Nov 2019, 16:30:55 UTC - in response to Message 69295. Last modified: 24 Nov 2019, 16:38:27 UTC Sorry, I need more info, especially temperatures. Post the complete output of nvidia-smi. From your BOINC event message log, post the top lines down through "Memory:". If neither of those shows an obvious problem, add the following line to your programdata\boinc\projects\milkyway.cs.rpi.edu_milkyway\app_config.xml file: <cmdline>--verbose</cmdline> and re-post the output as you did in your first post. There must be some important difference. I have no experience with those two boards, so all I can do is look for obvious differences in the outputs above. I did compare your two boards at TechPowerUp, and the important FP64 specs look identical. Maybe someone here has one of those boards and can better figure out the problem.
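In BOINC's app_config.xml format, a cmdline element is read from an app_version block (alongside app_name and plan_class), not from the app block. The sketch below builds such a file with Python's standard library. The app name "milkyway" and the plan class "opencl_nvidia_101" are assumptions for illustration and should be checked against your own client's task list. Whether milkyway_separation actually acts on --verbose is not confirmed in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, cmdline: str,
                     gpu_usage: float = 1.0, cpu_usage: float = 1.0) -> str:
    """Return app_config.xml text with a per-version command line.

    Element names follow the BOINC client configuration format;
    the values passed in are the caller's assumptions.
    """
    root = ET.Element("app_config")

    # <app>: resource usage per task for this application.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{gpu_usage:g}"
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:g}"

    # <app_version>: this is where <cmdline> belongs.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = f"{cpu_usage:g}"
    ET.SubElement(version, "ngpus").text = f"{gpu_usage:g}"
    ET.SubElement(version, "cmdline").text = cmdline

    if hasattr(ET, "indent"):  # pretty-print on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values; adjust to match your installation.
    print(build_app_config("milkyway", "opencl_nvidia_101", "--verbose"))
```

The printed text would go into the project's directory under the BOINC data directory, and the client needs to re-read its config files (or be restarted) before the change applies to newly started tasks.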

Samyah93 Send message Joined: 24 Nov 19 Posts: 6 Credit: 21,148 RAC: 0	Message 69301 - Posted: 25 Nov 2019, 8:58:29 UTC - in response to Message 69296. Sorry about that. Here is the output from nvidia-smi while running jobs: Sun Nov 24 21:27:57 2019 +-----------------------------------------------------------------------------+ \| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 \| \|-------------------------------+----------------------+----------------------+ \| GPU Name Persistence-M\| Bus-Id Disp.A \| Volatile Uncorr. ECC \| \| Fan Temp Perf Pwr:Usage/Cap\| Memory-Usage \| GPU-Util Compute M. \| \|===============================+======================+======================\| \| 0 Tesla K40c On \| 00000000:02:00.0 Off \| 0 \| \| 44% 75C P0 162W / 235W \| 125MiB / 11441MiB \| 100% Default \| +-------------------------------+----------------------+----------------------+ \| 1 Quadro K6000 On \| 00000000:03:00.0 Off \| 0 \| \| 34% 61C P0 79W / 225W \| 210MiB / 11434MiB \| 17% Default \| +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ \| Processes: GPU Memory \| \| GPU PID Type Process name Usage \| \|=============================================================================\| \| 0 10982 C ..._x86_64-pc-linux-gnu__opencl_nvidia_101 114MiB \| \| 1 3724 G /usr/lib/xorg/Xorg 19MiB \| \| 1 10881 C ..._x86_64-pc-linux-gnu__opencl_nvidia_101 177MiB \| +-----------------------------------------------------------------------------+ Here is the event log through memory: Mon 25 Nov 2019 12:44:56 AM PST \| \| Starting BOINC client version 7.9.3 for x86_64-pc-linux-gnu Mon 25 Nov 2019 12:44:56 AM PST \| \| log flags: file_xfer, sched_ops, task Mon 25 Nov 2019 12:44:56 AM PST \| \| Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3 Mon 25 Nov 2019 12:44:56 AM PST \| \| Data 
directory: /var/lib/boinc-client Mon 25 Nov 2019 12:44:56 AM PST \| \| CUDA: NVIDIA GPU 0: Tesla K40c (driver version 418.87, CUDA version 10.1, compute capability 3.5, 4096MB, 4007MB available, 4291 GFLOPS peak) Mon 25 Nov 2019 12:44:56 AM PST \| \| CUDA: NVIDIA GPU 1: Quadro K6000 (driver version 418.87, CUDA version 10.1, compute capability 3.5, 4096MB, 4007MB available, 5193 GFLOPS peak) Mon 25 Nov 2019 12:44:56 AM PST \| \| OpenCL: NVIDIA GPU 0: Tesla K40c (driver version 418.87.01, device version OpenCL 1.2 CUDA, 11441MB, 4007MB available, 4291 GFLOPS peak) Mon 25 Nov 2019 12:44:56 AM PST \| \| OpenCL: NVIDIA GPU 1: Quadro K6000 (driver version 418.87.01, device version OpenCL 1.2 CUDA, 11434MB, 4007MB available, 5193 GFLOPS peak) Mon 25 Nov 2019 12:44:57 AM PST \| \| [libc detection] gathered: 2.27, Ubuntu GLIBC 2.27-3ubuntu1 Mon 25 Nov 2019 12:44:57 AM PST \| \| Host name: albus Mon 25 Nov 2019 12:44:57 AM PST \| \| Processor: 32 GenuineIntel Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz [Family 6 Model 63 Stepping 2] Mon 25 Nov 2019 12:44:57 AM PST \| \| Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d Mon 25 Nov 2019 12:44:57 AM PST \| \| OS: Linux Ubuntu: Ubuntu 18.04.3 LTS [4.15.0-70-generic\|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Mon 25 Nov 2019 12:44:57 AM PST \| \| Memory: 62.82 GB physical, 980.00 MB virtual I tried adding the line 
<cmdline>--verbose</cmdline> as follows to an app_config.xml, but the content of the stderr file does not seem to change... (I found these under boinc/slots/X/ and on the milkyway@home website. Is there somewhere else I should look?) My app_config.xml looks like the following: <app_config> <app> <name>milkyway</name> <gpu_versions> <gpu_usage>1</gpu_usage> <cpu_usage>1</cpu_usage> </gpu_versions> </app> </app_config> Thanks! ID: 69301 · Rating: 0 · rate: / Reply Quote

Joseph Stateson Joined: 18 Nov 08 Posts: 291 Credit: 2,463,985,753 RAC: 25	Message 69302 - Posted: 25 Nov 2019, 13:54:06 UTC I think I wasted your time; your analysis was spot on. The block size difference means more iterations and less parallel processing. Everything else looks good. Maybe there is a way to specify a different block size on the command line. The biggest improvement would be to use .25 for the GPU in app_config, and to set cc_config to exclude the slower board. Out of curiosity, have you tried the SETI special Linux app? I am curious whether it has the same problem.
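The two suggestions above map onto two settings: a gpu_usage of 0.25 in app_config.xml lets four tasks share one GPU, and an exclude_gpu entry in cc_config.xml keeps a project off a given device. A minimal sketch of both follows. The project URL is an assumption inferred from the project directory name, and the device number must match what the BOINC event log reports for the K6000 on your machine.

```python
import math
import xml.etree.ElementTree as ET


def tasks_per_gpu(gpu_usage: float) -> int:
    """Number of tasks BOINC can co-schedule on one GPU for a given gpu_usage."""
    if not 0 < gpu_usage <= 1:
        raise ValueError("gpu_usage must be in (0, 1]")
    return math.floor(1 / gpu_usage + 1e-9)


def build_exclude_gpu(project_url: str, device_num: int,
                      gpu_type: str = "NVIDIA") -> str:
    """Return cc_config.xml text excluding one GPU for one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.SubElement(exclude, "type").text = gpu_type
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tasks_per_gpu(0.25), "tasks per GPU")
    # Hypothetical URL and device number; check both before use.
    print(build_exclude_gpu("http://milkyway.cs.rpi.edu/milkyway/", 1))
```

Running several tasks per GPU may raise utilization on the slow card, but it does not address why the application picks a small blocks/chunk value for it.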

Keith Myers Joined: 24 Jan 11 Posts: 733 Credit: 564,749,661 RAC: 12,038	Message 69303 - Posted: 25 Nov 2019, 16:11:55 UTC Well, your first piece of misinformation is that the two cards use the same GPU silicon. They don't. Look at the cards at the GPU database site. https://www.techpowerup.com/gpu-specs/tesla-k40c.c2505 https://www.techpowerup.com/gpu-specs/quadro-k6000.c2426 Second, both your Event Log startup entries and your nvidia-smi output show differences in GFLOPS rating. The K6000 should be faster. It is not. So the card is probably being starved of both CPU support and PCIe bus speed or lane width.

Samyah93 Joined: 24 Nov 19 Posts: 6 Credit: 21,148 RAC: 0	Message 69304 - Posted: 25 Nov 2019, 17:46:36 UTC - in response to Message 69302. Thanks! I will get back to you on whether SETI has this problem, since I have not tried it yet.

Samyah93 Joined: 24 Nov 19 Posts: 6 Credit: 21,148 RAC: 0	Message 69305 - Posted: 25 Nov 2019, 17:49:15 UTC - in response to Message 69303. Last modified: 25 Nov 2019, 18:10:37 UTC Hello, Thanks for the information! It is my understanding that the GK110 ended up being used in the Tesla K40 at launch. GK180 was also just a GK110 with some minor tweaks. Both cards are on PCIe x16 slots, and the cards work correctly for non-BOINC (quantum chemical) calculation purposes. Is there a way to check if it is being starved of resources? As was pointed out, it also seems to have significant differences in the number of blocks processed simultaneously.

Keith Myers Joined: 24 Jan 11 Posts: 733 Credit: 564,749,661 RAC: 12,038	Message 69311 - Posted: 26 Nov 2019, 5:58:16 UTC - in response to Message 69305. Last modified: 26 Nov 2019, 6:00:25 UTC Well, the first thing I would do is pull the K40 card out of the system and put the K6000 card in the slot the K40 was in. It is probably a PCIe lane width issue. Just because the slots are physically x16 doesn't mean the CPU or chipset is delivering the full 16 lanes to the slot. If the utilization goes up, you will know that the slot the K6000 was in is not running at the same lane width or possibly bus speed. I'm sure that the typical GPU monitoring programs can tell you the negotiated bus speed of the cards along with their lane width. See if there is a difference between the two slots. Nvidia X Server Settings can show you that. It is installed with the drivers and should be in your Show Applications list.
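One way to check the negotiated link from the command line is to compare the LnkCap (what the device supports) and LnkSta (what was actually negotiated) lines in the output of lspci -vv, which usually has to be run as root to show the capability blocks. The sketch below parses those two lines from captured text. The helper names are hypothetical, and the sample strings are illustrative rather than taken from a real degraded system.

```python
import re
from typing import Dict, Tuple

# Matches e.g. "LnkCap: Port #0, Speed 8GT/s, Width x16" and
# "LnkSta: Speed 8GT/s, Width x16". "LnkSta2:" is deliberately not matched.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_links(lspci_text: str) -> Dict[str, Tuple[float, int]]:
    """Extract (speed in GT/s, lane width) for LnkCap and LnkSta of one device."""
    links = {}
    for kind, speed, width in LINK_RE.findall(lspci_text):
        links[kind] = (float(speed), int(width))
    return links


def is_degraded(links: Dict[str, Tuple[float, int]]) -> bool:
    """True if the negotiated link is slower or narrower than the capability."""
    cap_speed, cap_width = links["LnkCap"]
    sta_speed, sta_width = links["LnkSta"]
    return sta_speed < cap_speed or sta_width < cap_width


if __name__ == "__main__":
    full = ("LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported\n"
            "LnkSta: Speed 8GT/s, Width x16, TrErr- Train-\n")
    narrow = ("LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported\n"
              "LnkSta: Speed 8GT/s, Width x8, TrErr- Train-\n")
    print("full link degraded?  ", is_degraded(parse_links(full)))
    print("narrow link degraded?", is_degraded(parse_links(narrow)))
```

Some GPUs drop the link speed when idle to save power, so the check is most meaningful while the card is under load.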

Samyah93 Send message Joined: 24 Nov 19 Posts: 6 Credit: 21,148 RAC: 0	Message 69313 - Posted: 26 Nov 2019, 22:11:20 UTC - in response to Message 69311. Hello, I've tried swapping the two cards in my workstation, and it is still the K6000 that shows much lower utilization. The stderr file for the K6000 also still shows fewer blocks/chunk (14 vs. 146 on Tesla K40) and more chunks (10 vs. 1 on Tesla K40): <search_application> milkyway_separation 1.46 Linux x86_64 double OpenCL </search_application> BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation' Setting process priority to 0 (13): Permission denied Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4' Switching to Parameter File 'astronomy_parameters.txt' <number_WUs> 5 </number_WUs> <number_params_per_WU> 20 </number_params_per_WU> Using AVX path Found 1 platform Platform 0 information: Name: NVIDIA CUDA Version: OpenCL 1.2 CUDA 10.2.95 Vendor: NVIDIA Corporation Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addr$ Profile: FULL_PROFILE Using device 0 on platform 0 Found 2 CL devices Device 'Quadro K6000' (NVIDIA Corporation:0x10de) (CL_DEVICE_TYPE_GPU) Board: Driver version: 440.33.01 Version: OpenCL 1.2 CUDA Compute capability: 3.5 Max compute units: 15 Clock frequency: 901 Mhz Global mem size: 11988566016 Local mem size: 49152 Max const buf size: 65536 Double extension: cl_khr_fp64 Build log: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Build log: -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Estimated Nvidia GPU GFLOP/s: 865 SP GFLOP/s, 108 DP FLOP/s Using a target frequency of 
60.0 Using a block size of 3840 with 14 blocks/chunk Using clWaitForEvents() for polling with initial wait of 13 ms (mode 0) Range: { nu_steps = 320, mu_steps = 800, r_steps = 700 } Iteration area: 560000 Chunk estimate: 10 Num chunks: 11 Chunk size: 53760 Added area: 31360 Effective area: 591360 Initial wait: 13 ms Integration time: 43.993598 s. Average time per iteration = 137.479993 ms Integral 0 time = 44.114030 s Running likelihood with 10 stars Likelihood time = 0.000139 s <background_integral> 0.000023991637849 </background_integral> <stream_integral> 0.013916549538950 0.081085356418659 0.551135322056047 </stream_integral> <background_likelihood> -3.108318245877400 </background_likelihood> <stream_only_likelihood> -217.020095058481104 -169.073239934746482 -169.106244918774621 </stream_only_likelihood> <search_likelihood> -1.633680829801995 </search_likelihood> I also tried updating the drivers. The updated output for nvidia-smi is: +-----------------------------------------------------------------------------+ \| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 \| \|-------------------------------+----------------------+----------------------+ \| GPU Name Persistence-M\| Bus-Id Disp.A \| Volatile Uncorr. ECC \| \| Fan Temp Perf Pwr:Usage/Cap\| Memory-Usage \| GPU-Util Compute M. 
\| \|===============================+======================+======================\| \| 0 Quadro K6000 On \| 00000000:02:00.0 Off \| 0 \| \| 35% 58C P0 74W / 225W \| 279MiB / 11433MiB \| 15% Default \| +-------------------------------+----------------------+----------------------+ \| 1 Tesla K40c On \| 00000000:03:00.0 Off \| 0 \| \| 42% 71C P0 168W / 235W \| 180MiB / 11441MiB \| 100% Default \| +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ \| Processes: GPU Memory \| \| GPU PID Type Process name Usage \| \|=============================================================================\| \| 0 1905 G /usr/lib/xorg/Xorg 24MiB \| \| 0 5520 C ..._x86_64-pc-linux-gnu__opencl_nvidia_101 241MiB \| \| 1 5683 C ..._x86_64-pc-linux-gnu__opencl_nvidia_101 169MiB \| +-----------------------------------------------------------------------------+ I also used lspci to get information about the bus speed, and both cards are at PCIe 3.0 x16. 
The output also shows that the Tesla K40 is in fact a GK110 (derivative) card: 02:00.0 VGA compatible controller: NVIDIA Corporation GK110GL [Quadro K6000] (rev a1) (prog-if 00 [VGA controller]) Subsystem: NVIDIA Corporation GK110GL [Quadro K6000] Physical Slot: 2 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 32 NUMA node: 0 Region 0: Memory at f3000000 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M] Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M] Region 5: I/O ports at 1000 [size=128] [virtual] Expansion ROM at f4080000 [disabled] [size=512K] Capabilities: [60] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee16000 Data: 4022 Capabilities: [78] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- MaxPayload 256 bytes, MaxReadReq 1024 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend+ LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: 
Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [100 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff Status: NegoPending- InProgress- Capabilities: [128 v1] Power Budgeting <?> Capabilities: [420 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900 v1] #19 Kernel driver in use: nvidia Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia 03:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1) Subsystem: Hewlett-Packard Company GK110BGL [Tesla K40c] Physical Slot: 5 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 33 NUMA node: 0 Region 0: Memory at f2000000 (32-bit, non-prefetchable) [size=16M] Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M] Region 3: Memory at f0000000 (64-bit, 
prefetchable) [size=32M] Capabilities: [60] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee35000 Data: 4021 Capabilities: [78] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- MaxPayload 256 bytes, MaxReadReq 1024 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend+ LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [100 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff Status: NegoPending- InProgress- Capabilities: [128 v1] Power Budgeting <?> Capabilities: [420 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?> Capabilities: [900 v1] #19 Kernel driver in use: nvidia Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia I tried running SETI before I swapped the cards, and it showed near 100% utilization on both cards: Mon Nov 25 16:53:34 2019 +-----------------------------------------------------------------------------+ \| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 \| \|-------------------------------+----------------------+----------------------+ \| GPU Name Persistence-M\| Bus-Id Disp.A \| Volatile Uncorr. ECC \| \| Fan Temp Perf Pwr:Usage/Cap\| Memory-Usage \| GPU-Util Compute M. 
\| \|===============================+======================+======================\| \| 0 Tesla K40c On \| 00000000:02:00.0 Off \| 0 \| \| 37% 73C P0 154W / 235W \| 270MiB / 11441MiB \| 97% Default \| +-------------------------------+----------------------+----------------------+ \| 1 Quadro K6000 On \| 00000000:03:00.0 Off \| 0 \| \| 44% 73C P0 148W / 225W \| 296MiB / 11433MiB \| 92% Default \| +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ \| Processes: GPU Memory \| \| GPU PID Type Process name Usage \| \|=============================================================================\| \| 0 2788 C ..._x86_64-pc-linux-gnu__opencl_nvidia_sah 258MiB \| \| 1 2036 G /usr/lib/xorg/Xorg 24MiB \| \| 1 2773 C ..._x86_64-pc-linux-gnu__opencl_nvidia_sah 258MiB \| +-----------------------------------------------------------------------------+ ID: 69313 · Rating: 0 · rate: / Reply Quote

Keith Myers Joined: 24 Jan 11 Posts: 733 Credit: 564,749,661 RAC: 12,038	Message 69316 - Posted: 28 Nov 2019, 21:29:14 UTC Haven't got a clue. Something strange about that card, maybe defective.
