1)
Questions and Answers :
Unix/Linux :
Problem switching from Windows to Ubuntu: ROCm question
(Message 75178)
Posted 1 day ago
Post:
[quote]isn't this exactly what I said in the other thread?[/quote]
I did not see that (or rather, did not think about it). I got too involved in getting the Mi25 to work. It turned out that Windows was easier to work with. I am going to post in the GPUUG over at Einstein.
2)
Questions and Answers :
Unix/Linux :
Problem switching from Windows to Ubuntu: ROCm question
(Message 75175)
Posted 2 days ago
Post: I think I found the problem. The Mi25 is recognized in 22.04 with ROCm 5.x but is not supported:
[quote]AMD Instinct MI25 End of Life: ROCm release v4.5 is the final release to support AMD Instinct MI25[/quote]
When I tried
amdgpu-install --rocmrelease=4.5.2
I got the error "the package was not found". The 4.5.2 release was last used in 20.04, and I cannot find a repo that has the 4.5.2 ROCm release. Messing around with an updated repo from AMD, I ended up with broken packages that I cannot fix. I will try 20.04 and see if it can find a 4.5 ROCm version.
3)
Questions and Answers :
Unix/Linux :
CL_OUT_OF_HOST_MEMORY with AMD RX 6600 XT on Xubuntu 20.04
(Message 75173)
Posted 2 days ago
Post:
[quote]i don't necessarily think it will solve the problem, but you could look for and try setting "Above 4G decoding" to enabled.[/quote]
It is enabled. When I disabled that (mmio segment?) I got the BIOS warning "not enough PCIe resources to go around, remove some PCIe boards***" (I didn't take a picture of the exact wording). I pulled all the s9150 and s9100 boards and the only difference was a lot less fan noise. I got the same error message.

The h110 can set Gen3 for "pcie #2" (of 0..7); all the rest of the slots can be gen1, gen2 or auto. The slots are numbered 0,1,2,3,4,...,11 and the x16 slot is 3, so probably 2 and 3 are Gen3. The Mi25 started working (clinfo only) when Gen3 was enabled. All the other slots are x1 with risers.

I tried disabling the h110 internal video, thinking that might pull that mmio segment back in, but it was worse: 22.04 would not boot. I tried a cheap AMD video board in slot 2 and the Mi25 in slot 3, and vice versa, but the system hung.

I tried the following kernels. The one named "jyskern" took me over 7 hours to build. The AMD driver I used was for 22.04.2.
1     Advanced options for Ubuntu
1>0   Ubuntu, with Linux 5.19.0-35-generic
1>1   Ubuntu, with Linux 5.19.0-35-generic (recovery mode)
1>2   Ubuntu, with Linux 5.19.0-32-generic
1>3   Ubuntu, with Linux 5.19.0-32-generic (recovery mode)
1>4   Ubuntu, with Linux 5.13.0jyskern
1>5   Ubuntu, with Linux 5.13.0jyskern (recovery mode)

I read an article in which the user flashed the WX9100 BIOS onto the Mi25 and had it working in kernel 5.13. He got the video to work also; normally the Mi25 has no video. The guide I used to downgrade the kernel is https://youtu.be/IDYZ9Hm-p44 and the build took 7 hours on a dual Xeon with 24 threads. However, when I looked at the Mi25 and WX9100 BIOS sizes there were huge differences, so I do not plan to flash the WX9100 BIOS unless I get advice from the author. Plus it does not seem to work anyway.
-rw-rw-r-- 1 jstateson jstateson   39201 Mar 20 16:08 218718.rom
-rw-rw-r-- 1 jstateson jstateson  262144 Mar  8  2021 230670.rom
-rw-rw-r-- 1 jstateson jstateson  262144 Apr 25  2022 245174.rom
-rw-rw-r-- 1 jstateson jstateson 1048576 Feb  8  2019 AMD.RadeonVII.16384.190116.rom
-rwxrwxrwx 1 jstateson jstateson 1660176 May 20  2021 amdvbflash*
-rw-rw-r-- 1 jstateson jstateson  620696 Mar 20 10:37 amdvbflash_linux_4.71.zip
-rw-r--r-- 1 root      root      1048576 Mar 20 16:01 original_mi25

Another problem is that I do not know whether the ROM is locked. I have not figured out exactly what the following denotes:
jstateson@mi25-john:~/flash$ sudo ./amdvbflash -checklock 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.
SW protection fail (0x9C)

The following error shows up in Einstein slot 0:
[15:30:32][2285][INFO ] Application startup - thank you for supporting Einstein@Home!
[15:30:32][2285][INFO ] Starting data processing...
[15:30:32][2285][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[15:30:32][2285][INFO ] Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
[15:30:32][2285][ERROR] Couldn't create OpenCL command queue (error: -6)!
[15:30:32][2285][INFO ] OpenCL shutdown complete!
[15:30:32][2285][ERROR] Demodulation failed (error: 2013)!
[15:30:32][2285][WARN ] Sorry, at the moment your system doesn't have enough free CPU/GPU memory to run this task!

The system has 8 GB, which is not a problem for the AMD VII and 4 of the s91x0 types, so I suspect memory is not the problem.

****edit - That BIOS warning included a phrase saying that if I continue to boot there could be problems, but there was no option to continue. There was no "press F1 to continue" or anything like that, so I am not sure what the BIOS author meant to be done.
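The numeric code in the Einstein log above (error: -6) can be mapped to its OpenCL name. The following is a minimal sketch, not part of either project's tooling: the table is a small subset of the standard status codes from CL/cl.h, and the function name and the regular expression are made up for illustration. It only looks at lines that mention OpenCL, so the application-specific code 2013 is left alone.

```python
import re

# Subset of the standard OpenCL status codes defined in CL/cl.h.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Hypothetical pattern: matches "... OpenCL ... (error: N)" as seen in the log.
ERROR_PATTERN = re.compile(r"OpenCL.*\(error:\s*(-?\d+)\)")


def decode_opencl_errors(log_text):
    """Return (code, name) pairs for every OpenCL error line in the log text."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_PATTERN.search(line)
        if match:
            code = int(match.group(1))
            results.append((code, OPENCL_ERRORS.get(code, "UNKNOWN")))
    return results


sample_log = (
    "[15:30:32][2285][ERROR] Couldn't create OpenCL command queue (error: -6)!\n"
    "[15:30:32][2285][ERROR] Demodulation failed (error: 2013)!\n"
)
print(decode_opencl_errors(sample_log))
```

So the Einstein failure is the same CL_OUT_OF_HOST_MEMORY condition that Milkyway reports by name.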
4)
Questions and Answers :
Unix/Linux :
Problem switching from Windows to Ubuntu: ROCm question
(Message 75169)
Posted 2 days ago
Post: I have "almost" got this to work. I forced Gen3 on the x16 slot and now both the VII and the Mi25 boards work in that slot. Unfortunately, the VII will not work in any of the x1 slots and none of my s9150 boards work. The drivers for the Mi25 and VII seem to be missing what is needed for those older S boards.
5)
Questions and Answers :
Unix/Linux :
CL_OUT_OF_HOST_MEMORY with AMD RX 6600 XT on Xubuntu 20.04
(Message 75168)
Posted 2 days ago
Post: I have just run into this problem testing out an AMD Mi25 that jpmboy sent me.

OpenCL: AMD/ATI GPU 0: Radeon Instinct MI25 (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, 12288 GFLOPS peak)
OS: Linux Ubuntu: Ubuntu 22.04.2 LTS [5.19.0-35-generic|libc 2.35]
Task Ter5_4_cfbf00057_segment_13_dms_400_13200_243_250000_1 postponed for 900 seconds: Not enough free CPU/GPU memory available! Delaying next attempt for at least 15 minutes...

The above error is from an Einstein BREP7-opencl-ati task, so it is not just Milkyway:

Milkyway@home Separation v1.46 (opencl_ati_101)x86_64-pc-linux-gnu
Error getting device and context (-6): CL_OUT_OF_HOST_MEMORY

I am guessing the problem has to do with some memory segment that my H110-btc motherboard has enabled. I have little experience with UEFI BIOS and there are a lot of settings. I was only able to get the Mi25 board to work by forcing Gen3 on the PCIe x16 slot it is in. I also had to disable CSM.
6)
Questions and Answers :
Unix/Linux :
Problem switching from Windows to Ubuntu: ROCm question
(Message 75163)
Posted 4 days ago
Post: I have a system with an AMD VII and S91x0 boards that work fine in Windows 10, and I want to switch to Ubuntu. I installed 22.04 and the AMD drivers using "=rocr,legacy". I disabled CSM in the BIOS, and the motherboard BIOS is UEFI.

Only one slot is Gen3, and the AMD VII board is in that one. The S91x0 boards are all in Gen2 slots. I read somewhere long ago that Gen2 does not support atomics.

Only the AMD VII board works. clinfo reports that the other gfx9 boards (s9150, s9100) have no platform. "sudo lshw -c video" mis-identifies the s9150 and s9100 as w9100 and w8100. However, I have seen Windows do the same and they work anyway.

According to some ROCm doc that I cannot find at the moment:
[quote]GFX9 GPUs (such as Vega 10) no longer require PCIe atomics[/quote]
The Gen3 slot has atomics since it is Gen3. The Gen2 slots do not, but supposedly ROCm does not need them for gfx9 boards. I specified legacy as the s91x0 boards are legacy, but I might be mistaken.

Is anyone running ROCm in Gen2 slots?
7)
Message boards :
Cafe MilkyWay :
WCG Friends
(Message 75133)
Posted 13 days ago
Post: I switched to SiDock for my COVID support when I ran out of WCG work. Tasks usually take 2-3 days to complete, which is a bummer. Last month I contributed (as best as I can calculate) at least 2 of the 8 points the GPU Users group got for SiDock.
https://www.boincgames.com/sprint_details.php?id=11
SiDock has only CPU tasks. I have seen no indication they are working on OpenCL or CUDA applications.
8)
Message boards :
Cafe MilkyWay :
WCG Friends
(Message 75130)
Posted 14 days ago
Post: Latest update (around 16:30 UTC 8th March)
I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers but, who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
9)
Message boards :
Number crunching :
Some seconds in cuda crunching saved
(Message 75125)
Posted 15 days ago
Post:
[quote]Hy, the system variable[/quote]
Milkyway GPU apps are all OpenCL. CUDA is not listed.
10)
Message boards :
Number crunching :
Future of Milkyway@Home
(Message 75085)
Posted 24 days ago
Post: Thanks, I will try it. I spent half a day on a clean install of 20.04.2, but it had kernel 5.15. Currently I have 5.4.0-139-generic and will try 20.50, but I am not holding my breath.
11)
Message boards :
Number crunching :
Future of Milkyway@Home
(Message 75059)
Posted 16 Feb 2023
Post: It is. *ahem* will be.
I assume Linux + CUDA, which means only Nvidia cards can use the "mod".

Currently, I cannot run any of my (older) high-performance AMD cards under Ubuntu 20.04.5. Every month or two I take another stab at getting them to work. Under 18.04 my S9xx0 and HD-79xx cards worked fine, but they have not worked since a disk crash and the upgrade to 20.04.5.

Advice from the AMD forum was not helpful: use the exact release listed and no other. The release, dated 2021 Q2, is for Ubuntu 20.04.2:
https://www.amd.com/en/support/professional-graphics/firepro/firepro-s-series/firepro-s9000

The instructions come with a warning that if one upgrades to kernel 4.15 then the driver from the year 2018 needs to be used instead of the 2021 drivers. This does not make a lot of sense considering that the 20.04.5 kernel is way past 4.15:
OS: Linux Ubuntu: Ubuntu 20.04.5 LTS [5.4.0-139-generic|libc 2.31]

If I follow AMD's exact installation instructions for the FirePro S9000 card, only my RX-570 card works. I discovered this by plugging in the RX-570, powering the system back on, and doing nothing else. The 570 is significantly slower than the S9000 or S9050 cards.
OpenCL: AMD/ATI GPU 0: Radeon RX 570 Series (driver version 3224.4, device version OpenCL 1.2 AMD-APP (3224.4), 4082MB, 4082MB available, 5095 GFLOPS peak)
12)
Message boards :
Number crunching :
Future of Milkyway@Home
(Message 75048)
Posted 10 Feb 2023
Post: Something that has always bothered me is the lack of support for CUDA in this project. If you look back at the 2008 "Application Code Discussions" you will find that a developer, Travis, was working on a CUDA implementation. I assume that OpenCL worked better for this project than CUDA and/or Travis left. Unless I am mistaken, I remember that "Travis" was also developing CUDA code for, or contributing code to, other projects.
13)
Message boards :
Number crunching :
Run Multiple WU's on Your GPU
(Message 75019)
Posted 5 Feb 2023
Post:
[quote]i use afterburner for overclocking and gpu z and hwinfo64 for temperature monitoring[/quote]
Small fans are like small dogs. They make a lot of noise but do not do much. Larger fans push more air with less noise, but it is possible to get too big a fan.
14)
Questions and Answers :
Unix/Linux :
Compute errors
(Message 74999)
Posted 3 Feb 2023
Post: Log in and run that nvidia diagnostic. I suspect the board got hung up, possibly overheated.

jstateson@dual-linux:~$ nvidia-smi
Fri Feb  3 09:28:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA P102-100     Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   52C    P0   142W / 250W |   1804MiB /  5120MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 62%   61C    P2    94W / 120W |    857MiB /  6144MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 17%   57C    P2    63W / 151W |   1056MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16919      C   ...-linux-gnu__cuda118_linux     1802MiB |
|    1   N/A  N/A     16934      C   ...-linux-gnu__cuda118_linux      854MiB |
|    2   N/A  N/A     16963      C   ...-linux-gnu__cuda118_linux     1054MiB |
+-----------------------------------------------------------------------------+

[edit] If it is overheating, there is a "coolbits" setting that you can use to speed up the fan.
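For checking temperatures repeatedly, the same information can be read in machine-friendly form through nvidia-smi's --query-gpu option. The following is a rough sketch under stated assumptions: the function names and the 60 C threshold are hypothetical, and the sample text is hand-written from the table above (including the truncated "NVIDIA GeForce ..." names), not captured output.

```python
import csv
import io
import subprocess

QUERY_FIELDS = "index,name,temperature.gpu,fan.speed"


def query_gpus():
    """Run nvidia-smi and return its CSV output (needs the NVIDIA driver installed)."""
    completed = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def parse_gpus(csv_text):
    """Turn the CSV text into a list of dicts, one per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, temp, fan = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "name": name,
            # Some boards report a non-numeric placeholder instead of a value.
            "temp_c": int(temp) if temp.isdigit() else None,
            "fan_percent": int(fan) if fan.isdigit() else None,
        })
    return gpus


def hot_gpus(gpus, limit_c=60):
    """Return the indexes of GPUs at or above the (hypothetical) limit."""
    return [g["index"] for g in gpus if g["temp_c"] is not None and g["temp_c"] >= limit_c]


# Hand-written sample based on the table above, not real captured output.
sample = (
    "0, NVIDIA P102-100, 52, 0\n"
    "1, NVIDIA GeForce ..., 61, 62\n"
    "2, NVIDIA GeForce ..., 57, 17\n"
)
print(hot_gpus(parse_gpus(sample)))
```

On a live system, parse_gpus(query_gpus()) would replace the sample text.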
15)
Message boards :
Number crunching :
AMD VII: Occasional a task never finishes and is "hot spot" too high?
(Message 74924)
Posted 19 Jan 2023
Post: I finally got the AMD performance software to work. The problem was that after the driver install I had 5 "ghost" GPU cards. I had to edit the coproc_info.xml file, remove the extra 5 GPUs, and then mark the file read-only so that BOINC would not be able to add the "extra" GPUs back in. The driver I used was "win10-radeon-pro-software-enterprise-21.Q2.1". Possibly using a device driver cleaner and re-installing might fix the coproc_info.xml problem.

I got duplicate GPUs because BOINC (or clinfo) saw two drivers instead of one, so it marked my system as having two OpenCL platforms, and I had almost 500 errored-out tasks before I could suspend the project and fix the problem.

What driver(s) are you using? Does one card have one version and a different card have another?

Anyway, this is the display of 4 of the 5 GPUs. The 5th would not fit in the screen capture. Note the junction temp, the so-called "hot spot". Tuning is only available for the VII. I have not tried any tuning yet. Is it the same app you are using?
[screen capture not preserved]
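The read-only step described above for coproc_info.xml can also be done from a script. This is only a sketch: the helper names are made up, and it runs against a temporary stand-in file because the real file's location depends on where the BOINC data directory is on a given system.

```python
import os
import stat
import tempfile

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Clear the write bits; on Windows this sets the read-only attribute."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~WRITE_BITS)


def make_writable(path):
    """Restore the owner's write bit so the file can be edited or deleted again."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IWUSR)


def is_read_only(path):
    """True when the owner's write bit is cleared."""
    return not (os.stat(path).st_mode & stat.S_IWUSR)


# Demonstration on a temporary stand-in for the edited coproc_info.xml.
handle, demo_path = tempfile.mkstemp(suffix=".xml")
os.close(handle)
make_read_only(demo_path)
print(is_read_only(demo_path))
make_writable(demo_path)  # needed on Windows before the file can be removed
os.remove(demo_path)
```

The file has to be made writable again before a later driver change, otherwise BOINC cannot record the new device list.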
16)
Message boards :
Number crunching :
AMD VII: Occasional a task never finishes and is "hot spot" too high?
(Message 74921)
Posted 18 Jan 2023
Post:
[quote]When one of my AMD 7970's got to 104c one day it was dead the next week.[/quote]
Your HD-7970 is the "Tahiti" chip, the same as the S9000 and S9050 cards. None of the S9xxx cards have the so-called "hot spot" sensor capability. The temps shown by GPUz or MSI's Afterburner are all under 60c.

GPUz shows two temps for the VII "GPU". One is low, like 60-70, and is in the same location as on the S9xxx series cards. The other is the "Hot Spot", which jumps around wildly, anywhere from 70 to 107. The S9xxx cards do not have that sensor and are missing other features that GPUz can show for the VII.
17)
Message boards :
Number crunching :
AMD VII: Occasional a task never finishes and is "hot spot" too high?
(Message 74919)
Posted 18 Jan 2023
Post: I am running 4 work units per GPU on one AMD VII and several AMD S9xxx boards.

About once every 4-5 days a task hangs up on the VII. It is easily fixed by suspending and then resuming the task. The VII is the fastest board, 2x as fast as the S9150, and the problem occurs only on the VII. I used a BoincTasks "rule" to automatically suspend any MW task taking over 5 minutes, which allows the card to continue processing 4 tasks as normal instead of 3 plus a hung task.

1 - Has anyone seen a problem like this before?

2 - GPUz has a feature that shows the "hot spot". My RTX-2080Ti and the VII card are the only ones that report a "hot spot". My other Nvidia and AMD cards lack that feature, probably because they are too old. The VII has 3 fans and is in an open frame rack, and there is a box fan cooling the rack. Its "hot spot" runs 102-107c. The RTX-2080Ti is in a cramped "Area51" case. It shows 80c for its hot spot. Do these values seem OK?
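The 5-minute rule described above could be approximated outside BoincTasks. The sketch below is an illustration only and is not how the BoincTasks rule is implemented: the Task type, the function names, and the task names are hypothetical, and the project URL is an assumption that should be checked against the client's project list. It builds boinccmd command lines for a suspend followed by a resume but does not execute them.

```python
from dataclasses import dataclass

PROJECT_URL = "http://milkyway.cs.rpi.edu/milkyway/"  # assumption: verify on your client
LIMIT_SECONDS = 5 * 60  # the 5-minute limit from the post


@dataclass
class Task:
    name: str
    elapsed_seconds: float


def find_hung(tasks, limit=LIMIT_SECONDS):
    """Return the tasks that have run longer than the limit."""
    return [task for task in tasks if task.elapsed_seconds > limit]


def bounce_commands(task, project_url=PROJECT_URL):
    """Build the boinccmd invocations that suspend and then resume one task."""
    base = ["boinccmd", "--task", project_url, task.name]
    return [base + ["suspend"], base + ["resume"]]


# Hypothetical task names and run times, for illustration only.
running = [
    Task("example_task_a", 95.0),
    Task("example_task_b", 410.0),
    Task("example_task_c", 120.0),
]
for hung in find_hung(running):
    for command in bounce_commands(hung):
        print(" ".join(command))
```

Each printed command could be handed to subprocess.run once the elapsed times come from the real client.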
18)
Message boards :
Number crunching :
seems AMD VII cannot be used on a riser
(Message 74910)
Posted 14 Jan 2023
Post: For what it's worth, I tried one of those ribbon cable risers that have all the x1 wiring, not just the wiring in the USB3-type cable. Same problem: the VII slowed down terribly. It needs to be in an x16 slot, yet crypto miners run the VII on USB3 risers, so I am unsure what is happening. The eBay seller I bought the card from was a crypto miner and had no problem, as all the cards were in x1 USB3 risers. He suggested a bad riser, but I tried several.
19)
Questions and Answers :
Windows :
GPU in non-continuous operation.
(Message 74909)
Posted 14 Jan 2023
Post:
[quote]For what it is worth, I updated the Milkyway fix "mod" using BOINC version 7.20.2
Poor choice of words, the speed-up would have been a better choice I think
I will try and compile Joseph's Linux BOINC client and drop it into the Lunatics AIO version of BOINC on a spare PC for beta testing his MW "fix"[/quote]
I had resources set to 0 for the Milkyway project, as I run Einstein at 100% on my single Linux system. As Keith mentioned, the share needs to be set to 100% to trigger the problem and to be able to see whether the mod works properly. AFAICT the Linux version works. The executable can be downloaded or can be built. It does not do anything special over what the 7.15 mod does, but it does work with the 7.20 Berkeley manager AFAICT.
20)
Questions and Answers :
Windows :
GPU in non-continuous operation.
(Message 74904)
Posted 12 Jan 2023
Post: For what it is worth, I updated the Milkyway fix "mod" using BOINC version 7.20.2.

That Einstein mod is for Linux and AFAICT improves the performance of Linux applications. There was also a mod to BOINC 7.17 and 7.19. It was done by Petri and possibly others using Linux:
This Einstein@home App (v1.0 by petri33) was built at: Apr 28 2022 18:47:15

All I have done is modify the scheduling algorithm in BOINC to bypass that 91-second delay that the Milkyway server wants.
©2023 Astroinformatics Group