Welcome to MilkyWay@home

MW stopped using my nvidia GPU


Questions and Answers : Unix/Linux : MW stopped using my nvidia GPU
Cat22
Joined: 26 May 20
Posts: 12
Credit: 242,499,033
RAC: 594,435
Message 69884 - Posted: 3 Jun 2020, 2:34:19 UTC

openSUSE Linux (Tumbleweed) x86_64, NVIDIA driver 440.59

This was going along fine, but at least a day ago all the NVIDIA tasks went to "Waiting to run" status and have not run since. The Activity menu options are all set to "Always" and nothing is suspended. There is no app_info.xml, app_config.xml, cc_config.xml, etc. - just raw MW and BOINC. Computing preferences allow for 1 CPU per GPU (I have 2 NVIDIA cards), so that should be OK. The NVIDIA driver hasn't changed. I just double-checked that I am still part of the video group, so that's not it.
How can I determine what is preventing the GPU from running?
Also, as an aside, how does one view a specific computer on the website? I can't seem to sort by computer ID, so searching for a particular computer in my set of computers is like searching for a needle in a haystack.

Example:
Application: Milkyway@home Separation 1.46 (opencl_nvidia_101)
Name: de_modfit_86_bundle4_4s_south4s_bgset_2_1588605902_20177642
State: Waiting to run
Received: Sun 31 May 2020 09:17:52 PM PDT
Report deadline: Fri 12 Jun 2020 09:17:51 PM PDT
Resources: 0.965 CPUs + 1 NVIDIA GPU
Estimated computation size: 42,135 GFLOPs
CPU time: ---
CPU time since checkpoint: ---
Elapsed time: ---
Estimated time remaining: 00:03:05
Fraction done: 0.000%
Virtual memory size: 9.97 GB
Working set size: 499.16 MB
Directory: slots/2
Executable: milkyway_1.46_x86_64-pc-linux-gnu__opencl_nvidia_101

Here is a BOINC restart I just did to see if that would help (it didn't):
[---] Starting BOINC client version 7.17.0 for x86_64-pc-linux-gnu
[---] This a development version of BOINC and may not function properly
[---] log flags: file_xfer, sched_ops, task
[---] Libraries: libcurl/7.70.0 OpenSSL/1.1.1g-fips zlib/1.2.11 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0
[---] Data directory: /home/erbenton/boinc
[---] CUDA: NVIDIA GPU 0: GeForce RTX 2060 (driver version 440.59, CUDA version 10.2, compute capability 7.5, 4096MB, 3970MB available, 6739 GFLOPS peak)
[---] CUDA: NVIDIA GPU 1: GeForce GTX 1660 Ti (driver version 440.59, CUDA version 10.2, compute capability 7.5, 4096MB, 3972MB available, 5668 GFLOPS peak)
[---] OpenCL: NVIDIA GPU 0: GeForce RTX 2060 (driver version 440.59, device version OpenCL 1.2 CUDA, 5932MB, 3970MB available, 6739 GFLOPS peak)
[---] OpenCL: NVIDIA GPU 1: GeForce GTX 1660 Ti (driver version 440.59, device version OpenCL 1.2 CUDA, 5945MB, 3972MB available, 5668 GFLOPS peak)
[---] OpenCL CPU: pthread-Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz (OpenCL driver vendor: The pocl project, driver version 1.4, device version OpenCL 1.2 pocl HSTR: pthread-x86_64-unknown-linux-gnu-sandybridge)
[SETI@home] Found app_info.xml; using anonymous platform
[---] libc: GNU libc version 2.31
[---] Host name: erb1
[---] Processor: 12 GenuineIntel Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz [Family 6 Model 45 Stepping 7]
[---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
[---] OS: Linux openSUSE: openSUSE Tumbleweed [5.5.4-cstm|libc 2.31 (GNU libc)]
[---] Memory: 31.29 GB physical, 2.00 GB virtual
[---] Disk: 80.10 GB total, 6.77 GB free
[---] Local time is UTC -7 hours
[---] VirtualBox version: 6.1.2r135662
[---] Config: use all coprocessors
[Milkyway@Home] General prefs: from Milkyway@Home (last modified 02-Jun-2020 19:14:24)
[Milkyway@Home] Computer location: home
[Milkyway@Home] General prefs: no separate prefs for home; using your defaults
[---] Reading preferences override file
[---] Preferences:
[---]    max memory usage when active: 32041.71 MB
[---]    max memory usage when idle: 32041.71 MB
[---]    max disk usage: 4.40 GB
[---]    max CPUs used: 10
[---]    (to change preferences, visit a project web site or select Preferences in the Manager)
[---] Setting up project and slot directories
[---] Checking active tasks
[Milkyway@Home] URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 852607; resource share 100
[SETI@home] URL http://setiathome.berkeley.edu/; Computer ID 8730567; resource share 100
[---] Setting up GUI RPC socket
[---] Checking presence of 31 project files
Initialization completed
[SETI@home] Sending scheduler request: To fetch work.
[SETI@home] Requesting new tasks for NVIDIA GPU
[SETI@home] Scheduler request completed: got 0 new tasks
[SETI@home] Project has no tasks available
[SETI@home] Project requested delay of 87264 seconds
[Milkyway@Home] Sending scheduler request: To fetch work.
[Milkyway@Home] Requesting new tasks for NVIDIA GPU
[Milkyway@Home] Scheduler request completed: got 0 new tasks
[Milkyway@Home] Not sending work - last request too recent: 77 sec
[Milkyway@Home] Project requested delay of 91 seconds

mikey
Joined: 8 May 09
Posts: 2541
Credit: 462,666,679
RAC: 142
Message 69886 - Posted: 3 Jun 2020, 10:25:59 UTC - in response to Message 69884.  

openSUSE Linux (Tumbleweed) x86_64, NVIDIA driver 440.59

This was going along fine, but at least a day ago all the NVIDIA tasks went to "Waiting to run" status and have not run since. The Activity menu options are all set to "Always" and nothing is suspended. There is no app_info.xml, app_config.xml, cc_config.xml, etc. - just raw MW and BOINC. Computing preferences allow for 1 CPU per GPU (I have 2 NVIDIA cards), so that should be OK. The NVIDIA driver hasn't changed. I just double-checked that I am still part of the video group, so that's not it.
How can I determine what is preventing the GPU from running?
Also, as an aside, how does one view a specific computer on the website? I can't seem to sort by computer ID, so searching for a particular computer in my set of computers is like searching for a needle in a haystack.

[Milkyway@Home] Requesting new tasks for NVIDIA GPU
[Milkyway@Home] Scheduler request completed: got 0 new tasks
[Milkyway@Home] Not sending work - last request too recent: 77 sec
[Milkyway@Home] Project requested delay of 91 seconds


The last part is the key: "[Milkyway@Home] Not sending work - last request too recent: 77 sec". MilkyWay REQUIRES 10 minutes without a request for new work before it will send you more GPU work. Set up a zero-resource-share project and run a couple of its workunits until MilkyWay refills the cache. CPU workunits do not have this problem, just GPU workunits.
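
To spot this backoff without reading the whole Event Log by eye, a small script can pull out the two relevant messages. This is only a sketch: the patterns are copied from the log lines quoted in this thread, the function name `find_backoffs` is made up for illustration, and other client versions may word these messages differently.

```python
import re

# Patterns based only on the log lines quoted in this thread (an assumption:
# other BOINC client versions may phrase these messages differently).
TOO_RECENT = re.compile(
    r"\[(?P<project>[^\]]+)\] Not sending work - last request too recent: (?P<age>\d+) sec"
)
DELAY = re.compile(
    r"\[(?P<project>[^\]]+)\] Project requested delay of (?P<delay>\d+) seconds"
)


def find_backoffs(log_text):
    """Return {project: {...}} with the refusal age and requested delay, if present."""
    found = {}
    for line in log_text.splitlines():
        match = TOO_RECENT.search(line)
        if match:
            found.setdefault(match["project"], {})["last_request_age"] = int(match["age"])
            continue
        match = DELAY.search(line)
        if match:
            found.setdefault(match["project"], {})["requested_delay"] = int(match["delay"])
    return found


SAMPLE = """\
[SETI@home] Project has no tasks available
[SETI@home] Project requested delay of 87264 seconds
[Milkyway@Home] Scheduler request completed: got 0 new tasks
[Milkyway@Home] Not sending work - last request too recent: 77 sec
[Milkyway@Home] Project requested delay of 91 seconds
"""

if __name__ == "__main__":
    for project, info in find_backoffs(SAMPLE).items():
        print(project, info)
```

Feeding a saved copy of the Event Log to `find_backoffs` shows which projects refused a request as too recent and how long a delay each one asked for.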

Keith Myers
Joined: 24 Jan 11
Posts: 451
Credit: 345,860,717
RAC: 423,615
Message 69887 - Posted: 3 Jun 2020, 18:54:22 UTC - in response to Message 69884.  

Also, as an aside, how does one view a specific computer on the website? I can't seem to sort by computer ID, so searching for a particular computer in my set of computers is like searching for a needle in a haystack.

I don't understand this at all. Log in to MW, go to your account's main page, and click the computers link on that page: https://milkyway.cs.rpi.edu/milkyway/hosts_user.php
Voila! All your computers are listed, even with their assigned network names. It's easy to figure out which computer is which.
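
If you already know a computer's ID (the client prints it in the startup log, e.g. "Computer ID 852607" in the first post), a direct link can be built from it. The sketch below assumes the usual BOINC web page `show_host_detail.php` with a `hostid` parameter is available on this project; the helper name `host_detail_url` is hypothetical.

```python
from urllib.parse import urlencode, urljoin


def host_detail_url(project_url, host_id):
    """Build a link to one computer's detail page.

    Assumes the project serves the standard BOINC page
    show_host_detail.php?hostid=<id>; check it against your project.
    """
    page = urljoin(project_url, "show_host_detail.php")
    return page + "?" + urlencode({"hostid": host_id})


if __name__ == "__main__":
    # Project URL from this thread, computer ID from the startup log in the first post.
    print(host_detail_url("https://milkyway.cs.rpi.edu/milkyway/", 852607))
```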

If you are constantly running out of work and the 10-minute backoff bugs you too much, you can always run JStateson's modified client, which removes that aggravation.

Cat22
Joined: 26 May 20
Posts: 12
Credit: 242,499,033
RAC: 594,435
Message 69888 - Posted: 4 Jun 2020, 6:03:43 UTC - in response to Message 69887.  

Hi,
thanks for the info. I checked, and the last NVIDIA task was sent in on June 1; all the others are waiting.
I have plenty of tasks, some NVIDIA, some CPU. But why is BOINC ignoring the NVIDIA tasks?
All my NVIDIA tasks are in state "Waiting to run". I have 1 nbody simulation 1.76 task running (12 CPUs)
and that's it. In fact, as I write this, it finished and another similar nbody task started.

Cat22
Joined: 26 May 20
Posts: 12
Credit: 242,499,033
RAC: 594,435
Message 69889 - Posted: 4 Jun 2020, 6:20:43 UTC - in response to Message 69888.  

Well, I gave up and did a 'project reset' and lo and behold the nvidia apps are running now :-) yaaaa

Keith Myers
Joined: 24 Jan 11
Posts: 451
Credit: 345,860,717
RAC: 423,615
Message 69890 - Posted: 4 Jun 2020, 7:07:34 UTC

You starved the GPUs by running the nbody tasks without any limit, which took away all of their CPU support. A GPU task needs at least part of a CPU to feed it data. If all your CPU threads were busy with nbody, the GPU tasks were forced into "Waiting to run". No mystery here; BOINC did exactly what it was supposed to. If you want to run both types of work, you need to keep the nbody tasks from taking all the CPU threads. Read the documentation on configuring nbody mt tasks.
https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
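
As a rough illustration of the arithmetic (not of BOINC's actual scheduler): with 12 CPU threads and two GPU tasks at 0.965 CPUs each, the figures from the first post, reserving two threads leaves 10 for nbody. The sketch below computes that limit and renders an app_config.xml for it. The app name `milkyway_nbody`, the `mt` plan class, and the `--nthreads` option are assumptions to check against your own client_state.xml, and the helper names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Assumed names: app "milkyway_nbody", plan class "mt", option "--nthreads".
# Verify them in your client_state.xml before using this file.
APP_CONFIG_TEMPLATE = """\
<app_config>
  <app_version>
    <app_name>milkyway_nbody</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>{threads}</avg_ncpus>
    <cmdline>--nthreads {threads}</cmdline>
  </app_version>
</app_config>
"""


def nbody_thread_limit(usable_cpus, gpu_tasks, cpus_per_gpu_task):
    """Threads left for nbody after reserving whole threads for the GPU tasks."""
    reserved = math.ceil(gpu_tasks * cpus_per_gpu_task)
    return max(1, usable_cpus - reserved)


def render_app_config(threads):
    """Fill in the template and confirm that the result is well-formed XML."""
    xml_text = APP_CONFIG_TEMPLATE.format(threads=threads)
    ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    return xml_text


if __name__ == "__main__":
    threads = nbody_thread_limit(usable_cpus=12, gpu_tasks=2, cpus_per_gpu_task=0.965)
    print(render_app_config(threads))
```

The rendered file would go in the MilkyWay project directory under the BOINC data directory, after which the client needs to re-read its config files or be restarted.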

Drago75
Joined: 12 Nov 20
Posts: 4
Credit: 1,267,669
RAC: 2,508
Message 70763 - Posted: 4 May 2021, 10:13:27 UTC

Hi everyone. I set up an i7-3770K on 64-bit Ubuntu Linux 20.04, equipped with a GTX 1050 (driver 450) and 16 GB RAM. I don't receive any MilkyWay WUs for the GPU, although all settings should be correct, even after waiting 10 minutes as suggested with no work pending for the GPU. It gets WUs from PrimeGrid, SRBase and Moo! Wrapper, but none from MilkyWay, Collatz, WCG or Einstein. That's why I selected the older 450 driver instead of the 460, but it didn't help. Does anyone have an idea what the problem may be? The 1050 is included in the list of supported GPUs under Linux for MW.

Drago75
Joined: 12 Nov 20
Posts: 4
Credit: 1,267,669
RAC: 2,508
Message 70764 - Posted: 4 May 2021, 10:20:40 UTC

I also have a problem now on my Ryzen 9 3900X (16 GB, GTX 1660, 64-bit Win10). The MilkyWay WUs for the GPU all error out after 2 seconds. MW and MLC are the only projects that seem to be affected by this. A while ago everything worked fine.

mikey
Joined: 8 May 09
Posts: 2541
Credit: 462,666,679
RAC: 142
Message 70765 - Posted: 4 May 2021, 11:20:54 UTC - in response to Message 70763.  

Hi everyone. I set up an i7-3770K on 64-bit Ubuntu Linux 20.04, equipped with a GTX 1050 (driver 450) and 16 GB RAM. I don't receive any MilkyWay WUs for the GPU, although all settings should be correct, even after waiting 10 minutes as suggested with no work pending for the GPU. It gets WUs from PrimeGrid, SRBase and Moo! Wrapper, but none from MilkyWay, Collatz, WCG or Einstein. That's why I selected the older 450 driver instead of the 460, but it didn't help. Does anyone have an idea what the problem may be? The 1050 is included in the list of supported GPUs under Linux for MW.


I think Nvidia's 450 version was problematic; try upgrading a little bit. Another thought is to stop fetching tasks from, or suspend the tasks of, any other GPU project while trying to get MilkyWay tasks.

Drago75
Joined: 12 Nov 20
Posts: 4
Credit: 1,267,669
RAC: 2,508
Message 70766 - Posted: 4 May 2021, 13:01:58 UTC - in response to Message 70765.  

I don't have any suspended tasks, but my Linux PC still doesn't receive any GPU units from MW. Before I reverted to driver version 450 I had 460 installed and the system ran without problems, so I think it is unlikely that the problem is driver related. But to make sure, I will update the driver tonight once the work cache is empty. My Windows PC has the latest drivers, but the MW GPU WUs still fail instantly.

Was something perhaps changed in the GPU WUs recently?

Keith Myers
Joined: 24 Jan 11
Posts: 451
Credit: 345,860,717
RAC: 423,615
Message 70767 - Posted: 4 May 2021, 16:04:38 UTC - in response to Message 70766.  

With Windows drivers, always reinstall the drivers directly downloaded from Nvidia.

Always use clinfo to check that the OpenCL component of the drivers is installed. It is available for both Linux and Windows.

Drago75
Joined: 12 Nov 20
Posts: 4
Credit: 1,267,669
RAC: 2,508
Message 70772 - Posted: 5 May 2021, 8:37:43 UTC - in response to Message 70767.  

On the Linux system I installed the newest driver, but there is still no change. According to clinfo, OpenCL 1.2 is installed.

Platform Name                       NVIDIA CUDA
Number of devices                   1
Device Name                         GeForce GTX 1050
Device Vendor                       NVIDIA Corporation
Device Vendor ID                    0x10de
Device Version                      OpenCL 1.2 CUDA
Driver Version                      460.73.01
Device OpenCL C Version             OpenCL C 1.2
Device Type                         GPU
Device Topology (NV)                PCI-E, 01:00.0
Device Profile                      FULL_PROFILE
Device Available                    Yes
Compiler Available                  Yes
Linker Available                    Yes
Max compute units                   5
Max clock frequency                 1468MHz
Compute Capability (NV)             6.1
Device Partition                    (core)
Max number of sub-devices           1
Supported partition types           None
Supported affinity domains          (n/a)
Max work item dimensions            3
Max work item sizes                 1024x1024x64
Max work group size                 1024
Preferred work group size multiple  32
Warp size (NV)                      32
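
A quick way to check such output mechanically is sketched below. It assumes clinfo pads each key and value apart with at least two spaces, and it only handles a single device, like the output in this post (with several devices, later entries would overwrite earlier ones). The function names are hypothetical.

```python
import re


def parse_clinfo(text):
    """Split each 'key   value' line at the first run of two or more spaces."""
    fields = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1]
    return fields


def has_opencl_gpu(fields):
    """True if the parsed device is an available GPU reporting an OpenCL version."""
    return (
        fields.get("Device Type") == "GPU"
        and fields.get("Device Available") == "Yes"
        and fields.get("Device Version", "").startswith("OpenCL")
    )


# A few lines from the output in this post, with column padding restored.
SAMPLE = """\
Platform Name              NVIDIA CUDA
Device Name                GeForce GTX 1050
Device Version             OpenCL 1.2 CUDA
Driver Version             460.73.01
Device Type                GPU
Device Available           Yes
"""

if __name__ == "__main__":
    info = parse_clinfo(SAMPLE)
    print(info["Device Name"], "usable for OpenCL:", has_opencl_gpu(info))
```

On a real system the text could come from running clinfo and capturing its standard output instead of using `SAMPLE`.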

mikey
Joined: 8 May 09
Posts: 2541
Credit: 462,666,679
RAC: 142
Message 70773 - Posted: 5 May 2021, 10:29:41 UTC - in response to Message 70772.  

Move up or down a couple of versions from 460.??? and see if it helps.

Keith Myers
Joined: 24 Jan 11
Posts: 451
Credit: 345,860,717
RAC: 423,615
Message 70776 - Posted: 5 May 2021, 15:48:12 UTC

Post the output from the Event Log when requesting work with the sched_op_debug flag set. How many seconds of GPU work are you requesting?
Also post the startup section of the Event Log after starting BOINC, to be sure the GPU is detected.
If nothing is self-evident once those are posted, go back and set work_fetch_debug to show why BOINC is not requesting MW work.
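
For anyone unsure where those flags go, here is a minimal sketch of a cc_config.xml that turns both on, wrapped in a few lines of Python that check the file is well-formed. BOINC's client configuration documentation spells the scheduler flag `sched_op_debug`; the helper name `enabled_log_flags` is hypothetical, and an existing cc_config.xml should be merged with this rather than overwritten.

```python
import xml.etree.ElementTree as ET

# Minimal cc_config.xml enabling the two debug flags discussed above.
# If you already have a cc_config.xml, add the flags to it instead.
CC_CONFIG = """\
<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>
"""


def enabled_log_flags(xml_text):
    """Return the sorted names of log flags that are set to 1."""
    root = ET.fromstring(xml_text)
    flags = root.find("log_flags")
    if flags is None:
        return []
    return sorted(flag.tag for flag in flags if (flag.text or "").strip() == "1")


if __name__ == "__main__":
    print(enabled_log_flags(CC_CONFIG))
```

The file belongs in the BOINC data directory (/home/erbenton/boinc in the first post's log); the client picks it up after re-reading its config files or restarting.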

©2021 Astroinformatics Group