Welcome to MilkyWay@home

Compute errors

Questions and Answers : Unix/Linux : Compute errors
mikey
Joined: 8 May 09
Posts: 3319
Credit: 520,318,116
RAC: 20,723
Message 74998 - Posted: 3 Feb 2023, 11:00:07 UTC - in response to Message 74996.  

Correction

It still does not work


My client still shows "Berechnungsfehler" ("computation error"), but the log files still report the tasks as finished, as can be seen below.


Do 02 Feb 2023 09:05:26 | Milkyway@Home | Computation for task de_modfit_82_bundle5_3s_south_pt2_2_1674667491_7919183_2 finished
Do 02 Feb 2023 09:05:26 | Milkyway@Home | Starting task de_modfit_70_bundle5_3s_south_pt2_2_1674667491_9780881_0
Do 02 Feb 2023 09:05:28 | Milkyway@Home | Computation for task de_modfit_70_bundle5_3s_south_pt2_2_1674667491_9780881_0 finished
Do 02 Feb 2023 11:25:30 | Milkyway@Home | Computation for task de_nbody_01_19_2023_v176_temp__data__1_1674667492_77143_1 finished
Do 02 Feb 2023 11:25:30 | Milkyway@Home | Starting task de_nbody_01_19_2023_v176_temp__data__3_1674667492_100516_0

Any ideas left? Changing to NVIDIA driver 528.xxx is not possible, as there is no support yet for Ubuntu 20.04 LTS.


greets

Niko


Yours are erroring out in just a few seconds. That usually means a driver is missing, but I don't know which one.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 74999 - Posted: 3 Feb 2023, 15:29:40 UTC - in response to Message 74996.  
Last modified: 3 Feb 2023, 15:45:36 UTC

Log in and run the nvidia-smi diagnostic. I suspect the board got hung up, possibly overheated.


jstateson@dual-linux:~$ nvidia-smi
Fri Feb  3 09:28:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA P102-100     Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   52C    P0   142W / 250W |   1804MiB /  5120MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 62%   61C    P2    94W / 120W |    857MiB /  6144MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 17%   57C    P2    63W / 151W |   1056MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16919      C   ...-linux-gnu__cuda118_linux     1802MiB |
|    1   N/A  N/A     16934      C   ...-linux-gnu__cuda118_linux      854MiB |
|    2   N/A  N/A     16963      C   ...-linux-gnu__cuda118_linux     1054MiB |
+-----------------------------------------------------------------------------+


[edit] If it is overheating, there is a "Coolbits" setting that you can use to speed up the fan.
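
To check for overheating without reading the whole table, nvidia-smi can print just the fields of interest. The following is a minimal sketch: `--query-gpu` and `--format` are existing nvidia-smi options, but the helper name `flag_hot_gpus` and the 85 C threshold are assumptions chosen for illustration, not a vendor limit.

```shell
#!/bin/sh
# Print every GPU whose temperature is at or above a threshold.
# Reads CSV lines of the form "index, name, temperature" on stdin.
# The default threshold of 85 C is an assumed value for illustration only.
flag_hot_gpus() (
    limit="${1:-85}"
    awk -F', *' -v limit="$limit" \
        '$3 + 0 >= limit + 0 { print $1 ": " $2 " at " $3 " C" }'
)

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader | flag_hot_gpus 85
```

No output means that no GPU reached the threshold at the moment of the query; a card that only overheats under load has to be checked while tasks are running.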

Ian&Steve C.
Joined: 18 Nov 22
Posts: 81
Credit: 637,704,585
RAC: 44,500
Message 75002 - Posted: 3 Feb 2023, 16:15:34 UTC - in response to Message 74996.  

Your errors are related to OpenCL specifically. Try removing the OpenCL package (ocl-icd-libopencl1) and reinstalling it.

niko
Joined: 2 Jan 22
Posts: 8
Credit: 467,437
RAC: 24
Message 75089 - Posted: 1 Mar 2023, 12:53:23 UTC - in response to Message 74999.  

I don't think the GPU is overheating:

nvidia-smi
Wed Mar  1 13:50:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:07:00.0  On |                  N/A |
| N/A   52C    P0    32W /  N/A |    795MiB /  8192MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2121      G   /usr/lib/xorg/Xorg                 59MiB |
|    0   N/A  N/A      2835      G   /usr/lib/xorg/Xorg                341MiB |
|    0   N/A  N/A      2969      G   /usr/bin/gnome-shell               62MiB |
|    0   N/A  N/A      4005      G   ...b/thunderbird/thunderbird      142MiB |
|    0   N/A  N/A     52798      G   /usr/lib/firefox/firefox          139MiB |
|    0   N/A  N/A    258390      G   ...2gtk-4.0/WebKitWebProcess       11MiB |
|    0   N/A  N/A    272185      G   /usr/bin/python3                   20MiB |
+-----------------------------------------------------------------------------+

niko
Joined: 2 Jan 22
Posts: 8
Credit: 467,437
RAC: 24
Message 75090 - Posted: 1 Mar 2023, 12:56:06 UTC - in response to Message 75002.  

@Ian&Steve C.: How did you come up with the idea of an OpenCL-related error?


I haven't got any problems with OpenCL anywhere else.


What is your evidence for that theory? How can I check it myself?


cheers

Ian&Steve C.
Joined: 18 Nov 22
Posts: 81
Credit: 637,704,585
RAC: 44,500
Message 75092 - Posted: 1 Mar 2023, 14:03:27 UTC - in response to Message 75090.  
Last modified: 1 Mar 2023, 14:15:33 UTC

Because that's the type of error message you are getting.

Your list of errors: https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=922489&offset=0&show_names=0&state=6&appid=
One of the errors as an example: https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=730665743

stderr.txt from that task:
<number_WUs> 5 </number_WUs>
<number_params_per_WU> 20 </number_params_per_WU>
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
Using AVX path
Error getting number of platform (-1001): CL_PLATFORM_NOT_FOUND_KHR
Failed to get information about device
Error getting device and context (1): MW_CL_ERROR
Failed to calculate likelihood
10:32:19 (299923): called boinc_finish(1)

</stderr_txt>


These are OpenCL errors. I recommend that you start by purging your current NVIDIA drivers, reinstalling the latest ones, and making sure the previously mentioned OpenCL package (ocl-icd-libopencl1) is installed.
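
CL_PLATFORM_NOT_FOUND_KHR (-1001) is the status the OpenCL ICD loader returns when it finds no vendor platform registered. One way to check this yourself is sketched below, under two assumptions: that the vendor ICD files live in the conventional Linux location /etc/OpenCL/vendors, and that the task's stderr has been saved to a local file. The helper names `list_opencl_icds` and `count_platform_errors` are hypothetical.

```shell
#!/bin/sh
# List the OpenCL vendor ICD files in a directory and the library each one names.
# Prints "<file> -> <library>" per ICD file, or a notice if none are found.
# The default path is the conventional Linux location (an assumption here).
list_opencl_icds() (
    dir="${1:-/etc/OpenCL/vendors}"
    found=0
    for f in "$dir"/*.icd; do
        [ -f "$f" ] || continue
        found=1
        # The first line of an ICD file holds the vendor library name or path.
        printf '%s -> %s\n' "$f" "$(head -n 1 "$f")"
    done
    if [ "$found" -eq 0 ]; then
        echo "no ICD files in $dir"
        exit 1   # the function body is a subshell, so this ends only the function
    fi
)

# Count how many lines of a saved stderr.txt report the platform-not-found error.
count_platform_errors() {
    grep -c 'CL_PLATFORM_NOT_FOUND_KHR' "$1"
}
```

If `list_opencl_icds` reports no ICD files, or the library named in the file is missing after a driver change, the loader has nothing to hand to the application, which would match the error above. The `clinfo` tool, if installed, gives a fuller view of the platforms the loader can see.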

Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 543,239,159
RAC: 141,444
Message 75093 - Posted: 1 Mar 2023, 16:16:19 UTC

Make sure the software-updater didn't pull the Nvidia drivers out from underneath the running task.

There seems to be an urgent security update today, with the new Nvidia driver 525.89.02 being pushed out.

I have already caught two hosts with all the downloads in place and the updater just awaiting an acknowledgement to proceed.
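
If an update replaces the driver files while the old kernel module is still loaded, the loaded and the installed versions differ until the module is reloaded or the host is rebooted. A rough sketch for comparing the two follows. It assumes that /proc/driver/nvidia/version exists and that the kernel module is named `nvidia`; the helper names are hypothetical.

```shell
#!/bin/sh
# Extract the first dotted version number (e.g. 515.86.01) from text on stdin.
extract_driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Compare the loaded kernel module version with the version installed on disk.
# $1: loaded version, $2: installed version. Returns 1 on a mismatch.
report_driver_state() {
    if [ "$1" = "$2" ]; then
        echo "driver consistent: $1"
    else
        echo "mismatch: loaded $1, installed $2 (reboot or module reload needed)"
        return 1
    fi
}

# Usage on the affected host (module name "nvidia" is an assumption):
#   loaded=$(extract_driver_version < /proc/driver/nvidia/version)
#   installed=$(modinfo -F version nvidia)
#   report_driver_state "$loaded" "$installed"
```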

niko
Joined: 2 Jan 22
Posts: 8
Credit: 467,437
RAC: 24
Message 75154 - Posted: 14 Mar 2023, 10:14:48 UTC - in response to Message 75002.  

Hey @Ian&Steve C.!


Thanks for the advice. On Ubuntu 20.04 LTS this worked for me, as far as I can see. At least the work units finished calculating. They are now being validated on the MilkyWay servers.


I stopped the BOINC client service via

sudo systemctl stop boinc-client.service


then used

sudo aptitude build-dep ocl-icd-libopencl1 (for this I had to enable the source-code repositories in the software sources settings)

and reinstalled the OpenCL package you mentioned:

sudo aptitude reinstall ocl-icd-libopencl1


I restarted the BOINC client with

sudo systemctl restart boinc-client.service


but it did not work at first.


One classic

sudo reboot now

solved it in the end. ;-)


Cheers to the community knowledge base and your hint.

It seems to have affected another project too, which I had not noticed before. That is solved as well.


Niko
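
The steps from this post can be collected into a small script. This is only a sketch under stated assumptions: the systemd unit name `boinc-client` is assumed (check yours with `systemctl list-units | grep -i boinc`), the build-dependency step is left out, and the helpers `run` and `reinstall_opencl_loader` are hypothetical names. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Reinstall the OpenCL ICD loader around a BOINC client stop/start.
# With DRY_RUN=1 (the default) the commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reinstall_opencl_loader() {
    # The unit name boinc-client is an assumption; adjust it to your system.
    run sudo systemctl stop boinc-client
    run sudo aptitude reinstall ocl-icd-libopencl1
    run sudo systemctl start boinc-client
    echo "If tasks still fail, reboot so the driver and OpenCL libraries reload."
}

# Usage:
#   reinstall_opencl_loader              # print the commands only
#   DRY_RUN=0; reinstall_opencl_loader   # actually run them
```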

©2024 Astroinformatics Group