Compute errors
Author | Message |
---|---|
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Correction: yours are erroring out in just a few seconds. That usually means a driver is missing, but I don't know which one. |
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
Log in and run that nvidia diagnostic. I suspect the board got hung up, possibly overheated.

    jstateson@dual-linux:~$ nvidia-smi
    Fri Feb  3 09:28:27 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA P102-100     Off  | 00000000:01:00.0 Off |                  N/A |
    |  0%   52C    P0   142W / 250W |   1804MiB /  5120MiB |    100%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
    | 62%   61C    P2    94W / 120W |    857MiB /  6144MiB |    100%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
    | 17%   57C    P2    63W / 151W |   1056MiB /  8192MiB |    100%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A     16919      C   ...-linux-gnu__cuda118_linux     1802MiB |
    |    1   N/A  N/A     16934      C   ...-linux-gnu__cuda118_linux      854MiB |
    |    2   N/A  N/A     16963      C   ...-linux-gnu__cuda118_linux     1054MiB |
    +-----------------------------------------------------------------------------+

[edit] If it is overheating, there is a "coolbits" setting that you can use to speed up the fan. |
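The temperature check suggested above can also be scripted. This is only a minimal sketch, assuming a driver whose `nvidia-smi` supports `--query-gpu` with CSV output. The helper name `flag_hot_gpus` and the 80 C default threshold are illustrative assumptions, not values from this thread.

```shell
#!/bin/sh
# flag_hot_gpus (hypothetical helper): reads CSV lines of the form
#   index, name, temperature, fan
# on stdin and prints one line per GPU at or above the threshold.
# The 80 C default is an assumed placeholder, not a vendor limit.
flag_hot_gpus() {
    limit="${1:-80}"
    awk -F', *' -v limit="$limit" '
        $3 + 0 >= limit { printf "GPU %s (%s): %s C, fan %s\n", $1, $2, $3, $4 }
    '
}

# On a live system, feed it from nvidia-smi:
#   nvidia-smi --query-gpu=index,name,temperature.gpu,fan.speed \
#              --format=csv,noheader,nounits | flag_hot_gpus 80
```

No output means every GPU is below the chosen threshold. With the readings posted above (52C to 61C) nothing would be flagged at 80.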
Joined: 18 Nov 22 Posts: 84 Credit: 640,530,847 RAC: 0 |
Your errors are related to OpenCL specifically. Try removing the OpenCL package (ocl-icd-libopencl1) and reinstalling it. |
Joined: 2 Jan 22 Posts: 8 Credit: 471,383 RAC: 0 |
I don't think there is an overheating GPU:

    nvidia-smi
    Wed Mar  1 13:50:17 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:07:00.0  On |                  N/A |
    | N/A   52C    P0    32W /  N/A |    795MiB /  8192MiB |     18%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      2121      G   /usr/lib/xorg/Xorg                 59MiB |
    |    0   N/A  N/A      2835      G   /usr/lib/xorg/Xorg                341MiB |
    |    0   N/A  N/A      2969      G   /usr/bin/gnome-shell               62MiB |
    |    0   N/A  N/A      4005      G   ...b/thunderbird/thunderbird      142MiB |
    |    0   N/A  N/A     52798      G   /usr/lib/firefox/firefox          139MiB |
    |    0   N/A  N/A    258390      G   ...2gtk-4.0/WebKitWebProcess       11MiB |
    |    0   N/A  N/A    272185      G   /usr/bin/python3                   20MiB |
    +-----------------------------------------------------------------------------+
|
Joined: 2 Jan 22 Posts: 8 Credit: 471,383 RAC: 0 |
@Ian&Steve C.: How did you come up with the idea of an OpenCL-related error? I haven't had any problems with OpenCL anywhere else. What points to that theory, and how can I check it myself? Cheers |
Joined: 18 Nov 22 Posts: 84 Credit: 640,530,847 RAC: 0 |
Quote: "@Ian&Steve C.: How do you come up with the idea of a OpenCl related error?"
Because that's the type of error message you are getting.
Your list of errors: https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=922489&offset=0&show_names=0&state=6&appid=
One of the errors as an example: https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=730665743
stderr.txt from that task: <number_WUs> 5 </number_WUs>
These are errors with OpenCL. I recommend you start by purging your current NVIDIA drivers, reinstalling the latest ones, and making sure you have the previously mentioned OpenCL package installed. |
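One way to check the OpenCL side yourself is to look at which vendor libraries are registered with the ICD loader. The sketch below is an assumption-laden illustration: `list_opencl_icds` is a hypothetical helper, and `/etc/OpenCL/vendors` is assumed as the default because it is the usual registration directory on Linux.

```shell
#!/bin/sh
# list_opencl_icds (hypothetical helper): prints the vendor library named in
# each .icd file under a vendors directory. /etc/OpenCL/vendors is assumed
# as the default location used by the ICD loader on Linux.
list_opencl_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    found=0
    for f in "$dir"/*.icd; do
        [ -f "$f" ] || continue
        found=1
        printf '%s -> %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
    done
    if [ "$found" -eq 0 ]; then
        echo "no ICD files found in $dir" >&2
        return 1
    fi
}

# On a live system:
#   list_opencl_icds                 # which vendor libraries are registered
#   dpkg -s ocl-icd-libopencl1       # is the loader package installed
```

If no NVIDIA entry is listed, or the loader package is missing, that would be consistent with tasks failing within seconds. If the `clinfo` tool is installed, it gives a fuller picture of the platforms the loader can actually open.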
Joined: 24 Jan 11 Posts: 715 Credit: 557,092,326 RAC: 43,514 |
Make sure the software updater didn't pull the NVIDIA drivers out from underneath the running task. There seems to be an urgent security update today, with the new NVIDIA driver 525.89.02 being pushed out to hosts. I have already caught two hosts with all the downloads in place and the updater just awaiting an acknowledgement to proceed. |
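A driver swapped out by the updater usually shows up as a mismatch between the kernel module that is still loaded and the one now installed on disk. The following is a rough sketch of that comparison. The helper names are hypothetical, and it assumes the NVIDIA driver exposes `/proc/driver/nvidia/version` and that `modinfo -F version nvidia` resolves on your distribution.

```shell
#!/bin/sh
# extract_driver_version (hypothetical helper): pulls the first dotted version
# number (e.g. 515.86.01) out of text on stdin.
extract_driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# versions_match (hypothetical helper): succeeds only when both arguments are
# non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# On a live system (paths and module name are assumptions, see above):
#   loaded=$(extract_driver_version < /proc/driver/nvidia/version)
#   on_disk=$(modinfo -F version nvidia)
#   versions_match "$loaded" "$on_disk" || echo "driver changed: reboot needed"
```

When the two versions differ, GPU tasks tend to keep failing until the machine is rebooted or the module is reloaded.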
Joined: 2 Jan 22 Posts: 8 Credit: 471,383 RAC: 0 |
Hey @Ian&Steve C.! Thanks for the advice. On Ubuntu 20.04 LTS this worked for me, as far as I can see. At least the units finished calculating; they are now being checked on the MilkyWay servers.

I stopped the BOINC client service:

    sudo systemctl stop boinc-client.service

Then I ran the following (for this I had to activate the source-code repositories in the software download section):

    sudo aptitude build-dep ocl-icd-libopencl1

Reinstalled the OpenCL package you mentioned:

    sudo aptitude reinstall ocl-icd-libopencl1

Restarted the BOINC client:

    sudo systemctl restart boinc-client.service

That did not work at first. One classic

    sudo reboot now

solved it in the end. ;-)

Cheers to the community knowledge base and your hint. It seems to have affected another project too, which I had not noticed before. That is solved as well.

Niko |
©2024 Astroinformatics Group