Welcome to MilkyWay@home

Posts by Joseph Stateson

1) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75178)
Posted 1 day ago by Joseph Stateson
Post:
isn't this exactly what I said in the other thread?

Vega only supported by ROCm 4.5
ROCm 4.5 only supported on 20.04 with 5.11 kernel or 18.04 with 5.4 kernel.

I wasn't so specific, but I said you might need an older OS and to check the install docs, and the above is what the install docs say.


I did not see that (or rather, did not think about it). I got too involved in getting the Mi25 to work.

It turned out that Windows was easier to work with. I am going to post in the GPUUG over at Einstein.
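
The support constraint quoted in this post can be written down as a small check. This is a minimal sketch: the table encodes only the two Ubuntu/kernel combinations stated above (it was not checked independently against AMD's documentation), and the function names are made up for illustration.

```python
import re
from typing import Tuple

# Combinations quoted in this thread for ROCm 4.5: (Ubuntu series, kernel major.minor).
# Assumption: this mirrors the post above, not an authoritative AMD support matrix.
ROCM45_LISTED = {("20.04", (5, 11)), ("18.04", (5, 4))}


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a kernel release such as '5.19.0-35-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def ubuntu_series(version: str) -> str:
    """Reduce '22.04.2 LTS' or '20.04.5' to its series, e.g. '22.04'."""
    match = re.match(r"(\d+\.\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Ubuntu version: {version!r}")
    return match.group(1)


def rocm45_combo_listed(ubuntu_version: str, kernel_release: str) -> bool:
    """True if the OS/kernel pair is one of the combinations quoted for ROCm 4.5."""
    pair = (ubuntu_series(ubuntu_version), kernel_major_minor(kernel_release))
    return pair in ROCM45_LISTED
```

Calling `rocm45_combo_listed("22.04.2 LTS", "5.19.0-35-generic")` returns `False`, which matches the experience described in this thread.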
2) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75175)
Posted 2 days ago by Joseph Stateson
Post:
I think I found the problem. The Mi25 is recognized in 22.04 with ROCm 5.x but is not supported:
AMD Instinct MI25 End of Life
ROCm release v4.5 is the final release to support AMD Instinct MI25


When I tried amdgpu-install --rocmrelease=4.5.2, I got the error "the package was not found".
The 4.5.2 release was last used in 20.04. I cannot find a repo that has the 4.5.2 ROCm release.
Messing around with an updated repo from AMD, I ended up with broken packages I cannot fix.
I will try 20.04 and see if it can find a 4.5 ROCm version.
3) Questions and Answers : Unix/Linux : CL_OUT_OF_HOST_MEMORY with AMD RX 6600 XT on Xubuntu 20.04 (Message 75173)
Posted 2 days ago by Joseph Stateson
Post:
I don't necessarily think it will solve the problem, but you could look for and try setting "Above 4G decoding" to enabled.


It is enabled. When I disabled that (mmio segment?) I got a BIOS warning: "not enough PCIe resources to go around, remove some PCIe boards***" (I didn't take a picture of the exact wording).

I pulled all the s9150 and s9100 boards and the only difference was a lot less fan noise. I got the same error message.

The h110 can set gen 3 for "pcie #2" (of 0..7); all the rest of the slots can be gen1, gen2 or auto.
The slots are numbered 0,1,2,3,4,....11
The x16 slot is 3, so probably 2 and 3 are gen3. The Mi25 started working (clinfo only) when gen3 was enabled. All the other slots are x1 with risers.
I tried disabling the h110 internal video, thinking that might pull that mmio segment back in, but it was worse: 22.04 would not boot. I tried a cheap AMD video board in slot 2 and the Mi25 in slot 3 and vice versa, but the system hung.
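
On Linux, the negotiated link speed of each slot can be read back from sysfs rather than inferred from BIOS settings. The following is a minimal sketch, assuming the kernel exposes the `current_link_speed` attribute under `/sys/bus/pci/devices/<address>/`; the function names and the rate-to-generation table are illustrative, and the exact string format varies between kernel versions.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Per-lane transfer rate in GT/s -> PCIe generation.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}


def pcie_generation(link_speed: str) -> Optional[int]:
    """Map a link-speed string such as '8.0 GT/s PCIe' or '5 GT/s' to a generation."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", link_speed)
    if match is None:
        return None  # e.g. 'Unknown'
    return GEN_BY_RATE.get(float(match.group(1)))


def link_generations(sysfs_root: str = "/sys/bus/pci/devices") -> Dict[str, Optional[int]]:
    """Return {PCI address: negotiated generation} for devices that report a link speed."""
    root = Path(sysfs_root)
    result: Dict[str, Optional[int]] = {}
    if not root.is_dir():
        return result  # not a Linux system, or sysfs not mounted
    for dev in sorted(root.iterdir()):
        try:
            text = (dev / "current_link_speed").read_text()
        except OSError:
            continue  # attribute missing or unreadable for this device
        result[dev.name] = pcie_generation(text)
    return result
```

Printing `link_generations()` and comparing the GPU addresses against `lspci` output would show which boards actually trained at Gen3 and which fell back to Gen1 or Gen2.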

I tried the following kernels. The one named "jyskern" took me over 7 hours to build.
The AMD driver I used was for 22.04.2
1   Advanced options for Ubuntu
1>0 Ubuntu, with Linux 5.19.0-35-generic
1>1 Ubuntu, with Linux 5.19.0-35-generic (recovery mode)
1>2 Ubuntu, with Linux 5.19.0-32-generic
1>3 Ubuntu, with Linux 5.19.0-32-generic (recovery mode)
1>4 Ubuntu, with Linux 5.13.0jyskern
1>5 Ubuntu, with Linux 5.13.0jyskern (recovery mode)


I read the following article where the user burned the wx9100 bios onto the Mi25 and got it working in kernel 5.13. He got the video to work also. Normally the Mi25 has no video.
The guide I used to downgrade the kernel is
https://youtu.be/IDYZ9Hm-p44
It took 7 hours to build on a dual Xeon with 24 threads.
However, when I looked at the Mi25 and wx9100 bios sizes there were huge differences, so I do not plan to flash the wx9100 bios unless I get advice from the author.
Plus it does not seem to work anyway.

-rw-rw-r--  1 jstateson jstateson   39201 Mar 20 16:08 218718.rom
-rw-rw-r--  1 jstateson jstateson  262144 Mar  8  2021 230670.rom
-rw-rw-r--  1 jstateson jstateson  262144 Apr 25  2022 245174.rom
-rw-rw-r--  1 jstateson jstateson 1048576 Feb  8  2019 AMD.RadeonVII.16384.190116.rom
-rwxrwxrwx  1 jstateson jstateson 1660176 May 20  2021 amdvbflash*
-rw-rw-r--  1 jstateson jstateson  620696 Mar 20 10:37 amdvbflash_linux_4.71.zip
-rw-r--r--  1 root      root      1048576 Mar 20 16:01 original_mi25
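
The size comparison mentioned above can be done mechanically from the `ls -l` output. This is a minimal sketch with hypothetical helper names; it only compares byte counts against a reference dump and says nothing about whether an image is actually compatible with the card.

```python
from typing import Dict, List


def sizes_from_ls(listing: str) -> Dict[str, int]:
    """Parse `ls -l` lines into {file name: size in bytes}."""
    sizes: Dict[str, int] = {}
    for line in listing.splitlines():
        # Fields: perms, links, owner, group, size, month, day, time-or-year, name
        fields = line.split(None, 8)
        if len(fields) == 9 and fields[4].isdigit():
            # ls marks executables with a trailing '*' when run with -F
            sizes[fields[8].rstrip("*")] = int(fields[4])
    return sizes


def is_power_of_two(n: int) -> bool:
    """Flash images are often padded to a power-of-two size; truncated files are not."""
    return n > 0 and n & (n - 1) == 0


def same_size_as(reference: str, sizes: Dict[str, int]) -> List[str]:
    """Names of the other files whose size equals that of the reference file."""
    target = sizes[reference]
    return sorted(name for name, size in sizes.items()
                  if name != reference and size == target)
```

Run against the listing above with `original_mi25` as the reference, only files of exactly 1048576 bytes are reported; the 39201-byte and 262144-byte files stand out as different.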


Another problem is that I do not know if the rom is locked. I have not figured out exactly what the following denotes.

jstateson@mi25-john:~/flash$ sudo ./amdvbflash -checklock 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.

SW protection fail (0x9C)


The following error shows up in Einstein slot 0:

[15:30:32][2285][INFO ] Application startup - thank you for supporting Einstein@Home!
[15:30:32][2285][INFO ] Starting data processing...
[15:30:32][2285][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[15:30:32][2285][INFO ] Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
[15:30:32][2285][ERROR] Couldn't create OpenCL command queue (error: -6)!
[15:30:32][2285][INFO ] OpenCL shutdown complete!
[15:30:32][2285][ERROR] Demodulation failed (error: 2013)!
[15:30:32][2285][WARN ] Sorry, at the moment your system doesn't have enough free CPU/GPU memory to run this task!


The system has 8 GB, which is not a problem for the AMD VII and 4 of the s91x0 types, so I doubt that memory is the problem.
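
The numeric code in the log can be decoded with a small lookup. This sketch lists only the first few runtime error codes from the Khronos OpenCL header (`CL/cl.h`); the function name is hypothetical, and application-level codes such as the 2013 in the "Demodulation failed" line are deliberately left alone because they are not OpenCL codes.

```python
import re
from typing import List, Tuple

# First runtime error codes defined in CL/cl.h.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

ERROR_FIELD = re.compile(r"\(error:\s*(-?\d+)\)")


def opencl_errors_in_log(log: str) -> List[Tuple[int, str]]:
    """Return (code, symbolic name) for log lines that mention OpenCL and carry '(error: N)'."""
    found = []
    for line in log.splitlines():
        if "OpenCL" not in line:
            continue  # skip application-level errors
        match = ERROR_FIELD.search(line)
        if match:
            code = int(match.group(1))
            found.append((code, OPENCL_ERRORS.get(code, "unknown OpenCL error")))
    return found
```

For the log above this yields `(-6, "CL_OUT_OF_HOST_MEMORY")`. That code refers to host-side allocation by the OpenCL runtime, so it does not by itself show that the machine's RAM is exhausted.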

****edit - That warning included the phrase that if I continue to boot there could be problems, but there was no option to continue. There was no "press F1 to continue" or anything like that, so I am not sure what the BIOS author meant to be done.
4) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75169)
Posted 2 days ago by Joseph Stateson
Post:
I have "almost" got this to work. I forced Gen3 on the x16 slot and now both the VII and the Mi25 boards work in that slot.
Unfortunately, the VII will not work in any of the x1 slots and none of my s9150 boards work. The drivers for the Mi25 and VII seem to be missing what is needed for those older S boards.
5) Questions and Answers : Unix/Linux : CL_OUT_OF_HOST_MEMORY with AMD RX 6600 XT on Xubuntu 20.04 (Message 75168)
Posted 2 days ago by Joseph Stateson
Post:

https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html

As reported by others, Einstein@Home GPU tasks work fine.
Best regards,

Samuel



I have just run into this problem testing out an AMD Mi25 that jpmboy sent me.

OpenCL: AMD/ATI GPU 0: Radeon Instinct MI25 (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, 12288 GFLOPS peak)		
OS: Linux Ubuntu: Ubuntu 22.04.2 LTS [5.19.0-35-generic|libc 2.35]	
Task Ter5_4_cfbf00057_segment_13_dms_400_13200_243_250000_1 postponed for 900 seconds: Not enough free CPU/GPU memory available! Delaying next attempt for at least 15 minutes...


The above error is from an Einstein BREP7-opencl-ati task so it is not just Milkyway

Milkyway@home Separation v1.46 (opencl_ati_101)x86_64-pc-linux-gnu
Error getting device and context (-6): CL_OUT_OF_HOST_MEMORY


I am guessing the problem has to do with some memory segment that my H110-btc motherboard has enabled. I have little experience with UEFI BIOS and there are a lot of settings. I was only able to get the Mi25 board to work by forcing Gen3 on the PCIe x16 slot it is in. I also had to disable CSM.
6) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75163)
Posted 4 days ago by Joseph Stateson
Post:
I have a system with AMD VII and S91x0 boards that work fine in Windows 10. I want to switch to Ubuntu.
I installed 22.04 and the AMD drivers using "=rocr,legacy". I disabled CSM in the BIOS, and the motherboard BIOS is UEFI.
Only one slot is Gen3. The AMD VII board is in that one. The S91x0 boards are all in Gen2 slots. Gen2 does not support atomics, as I read somewhere long ago.

Only the AMD VII board works
clinfo reports the other gfx9 boards (s9150, s9100) have no platform.
sudo lshw -c video mis-identifies the s9150 and s9100 as w9100 and w8100
However, I have seen windows do the same and they work anyway.

According to some ROCm doc that I cannot find at the moment:

GFX9 GPUs (such as Vega 10) no longer require PCIe atomics


That Gen3 slot has atomics since it is Gen3. The Gen2 slots do not, but supposedly ROCm does not need them for gfx9 boards. I specified legacy as the s91x0 boards are legacy, but I might be mistaken.

Is anyone running ROCm in Gen 2 slots?
7) Message boards : Cafe MilkyWay : WCG Friends (Message 75133)
Posted 13 days ago by Joseph Stateson
Post:
I switched to SiDock for my COVID support when I ran out of WCG. Tasks usually take 2-3 days to complete, which is a bummer.
Last month I contributed (as best as I can calculate) at least 2 of the 8 points the GPU Users group got for SiDock.
https://www.boincgames.com/sprint_details.php?id=11

SiDock has only CPU tasks. I have seen no indication they are working on OpenCL or CUDA applications.
8) Message boards : Cafe MilkyWay : WCG Friends (Message 75130)
Posted 14 days ago by Joseph Stateson
Post:
Latest update (around 16:30 UTC 8th March)

Update: As of this morning, the data center continues to work on booting the temporary replacement DSS 7000 storage system. They are attempting multiple alternative strategies to resolve current failures.
Sounds like there are a lot of cobwebs to blow out of that bit of kit! :-) -- As far as I can determine (from resellers' sites[1]) it uses a now-discontinued Xeon E5 processor model, so it's not exactly new tech.

Someone posted a picture of what appears to be a DSS 7000 unit and said "It's only 90 drives, what could go wrong?..." (although that's a fully populated 4-processor version; there's also a dual Xeon 45 drive version) -- if they manage to get this working, I wonder how long it will be before there's another failure; perhaps we should organize a collection for a brand new storage device?

And if anyone who reads this can actually confirm or correct the specification, feel free to do so :-)

Cheers - Al.

[1] I tried finding specs on Dell's site, but there wasn't anything immediately obvious and useful...


I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers but, who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
9) Message boards : Number crunching : Some seconds in cuda crunching saved (Message 75125)
Posted 15 days ago by Joseph Stateson
Post:
Hi, the system variable

CUDA_CACHE_MAXSIZE 4294967296

can improve CUDA crunching times; please test it yourselves.

Greetings from Germany


Milkyway GPU apps are all OpenCL. CUDA is not listed.
10) Message boards : Number crunching : Future of Milkyway@Home (Message 75085)
Posted 24 days ago by Joseph Stateson
Post:


The only, quasi-stable, config I have been able to do for linux for firepros (and Bristol Ridge R7):

Ubuntu 20.04 (either .1 or .2, I think).
Install and pull the internet during installation as even if you say don't upgrade: it will helpfully update you to kernel 5.15.XX which I haven't been able to get amdgpu to work on. period.
Uninstall snap. Disable unattended upgrades. Turn off crash reporting.
Never allow upgrading to go above kernel 5.4.0.XXX otherwise amdgpu 20.50 will break, at least in my experience.


I use amdgpu v20.50. Nothing else works. I only install --opencl=legacy. I don't bother with vega or later if the firepro is on the system.
Since I have dedicated crunchers, I will ssh into them as amdgpu tends to make video broken, but the firepro is seen and works. boinctui makes life bearable via cmdline.

Hope this helps.


Thanks, I will try it. I spent half a day with a clean install of 20.04.2, but it had kernel 5.15.
Currently I have 5.4.0-139-generic
and will try 20.50, but I am not holding my breath.
11) Message boards : Number crunching : Future of Milkyway@Home (Message 75059)
Posted 16 Feb 2023 by Joseph Stateson
Post:
It is. *ahem* will be.



I assume Linux + CUDA which means only Nvidia cards can use the "mod".

Currently, I cannot run any of my (older) high-performance AMD cards under Ubuntu 20.04.5. Every month or two I take another stab at getting them to work.
Under 18.04 my S9xx0 and HD-79xx cards worked fine, but they have not worked since a disk crash and the upgrade to 20.04.5.

Advice from AMD forum was not helpful: use the exact release listed and no other.

The release, dated 2021 Q2, is for Ubuntu 20.04.2
https://www.amd.com/en/support/professional-graphics/firepro/firepro-s-series/firepro-s9000

The instructions come with a warning that if one upgrades to kernel 4.15 then the driver from the year 2018 needs to be used instead of the 2021 drivers. This does not make a lot of sense considering that the 20.04.5 kernel is way past 4.15


OS: Linux Ubuntu: Ubuntu 20.04.5 LTS [5.4.0-139-generic|libc 2.31]	


If I follow AMD's exact installation instructions for the Firepro s9000 card, only my RX-570 card works. I discovered this by plugging in the RX-570, powering the system back on, and doing nothing else. The 570 is significantly slower than the s9000 or s9050 cards.

OpenCL: AMD/ATI GPU 0: Radeon RX 570 Series (driver version 3224.4, device version OpenCL 1.2 AMD-APP (3224.4), 4082MB, 4082MB available, 5095 GFLOPS peak)	
12) Message boards : Number crunching : Future of Milkyway@Home (Message 75048)
Posted 10 Feb 2023 by Joseph Stateson
Post:
Something that has always bothered me is the lack of support for CUDA in this project. If you look back at the 2008 "Application Code Discussions" you will find that a developer, Travis, was working on a CUDA implementation. I assume that OpenCL worked better for this project than CUDA and/or Travis left. Unless I am mistaken, I remember that "Travis" was also developing CUDA code for, or contributing code to, other projects.
13) Message boards : Number crunching : Run Multiple WU's on Your GPU (Message 75019)
Posted 5 Feb 2023 by Joseph Stateson
Post:
i use afterburner for overclocking and gpu z and hwinfo64 for temperature monitoring


Can you take the side off the pc? If so try that and see if it helps or not, if not you can always put a small floor fan blowing into the open side or as you said get more fans and leave the side on. If your current case doesn't have top fans in some cases it's pretty easy to swap to a new case, Dell and HP's being the notable exceptions.


Small fans are like small dogs. They make a lot of noise but do not do much. Larger fans push more air with less noise, but it is possible to get too big a fan.

14) Questions and Answers : Unix/Linux : Compute errors (Message 74999)
Posted 3 Feb 2023 by Joseph Stateson
Post:
Log in and run that nvidia diagnostic. I suspect the board got hung up. Possibly it overheated.


jstateson@dual-linux:~$ nvidia-smi
Fri Feb  3 09:28:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA P102-100     Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   52C    P0   142W / 250W |   1804MiB /  5120MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 62%   61C    P2    94W / 120W |    857MiB /  6144MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 17%   57C    P2    63W / 151W |   1056MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16919      C   ...-linux-gnu__cuda118_linux     1802MiB |
|    1   N/A  N/A     16934      C   ...-linux-gnu__cuda118_linux      854MiB |
|    2   N/A  N/A     16963      C   ...-linux-gnu__cuda118_linux     1054MiB |
+-----------------------------------------------------------------------------+


[edit] if it is overheating there is a "coolbits" setting that you can use to speed up the fan.
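
To check for overheating without reading the table by eye, the temperature rows can be pulled out of the `nvidia-smi` text. This is a rough sketch that assumes the table layout shown above (driver 515); other driver versions lay the table out differently, rows with "N/A" for fan speed are skipped, and the 80 C default limit is an arbitrary illustration, not a vendor figure.

```python
import re
from typing import List, NamedTuple, Optional


class GpuReading(NamedTuple):
    index: int
    fan_percent: int
    temp_c: int
    power_w: int
    power_cap_w: int


# First row of a GPU entry: '|   0  NVIDIA P102-100 ...'
INDEX_ROW = re.compile(r"^\|\s+(\d+)\s+\S")
# Second row: '|  0%   52C    P0   142W / 250W |'
STATS_ROW = re.compile(r"^\|\s*(\d+)%\s+(\d+)C\s+P\d+\s+(\d+)W\s*/\s*(\d+)W\s*\|")


def parse_nvidia_smi(table: str) -> List[GpuReading]:
    """Extract fan, temperature and power readings per GPU from nvidia-smi's default table."""
    readings: List[GpuReading] = []
    current: Optional[int] = None
    for line in table.splitlines():
        stats = STATS_ROW.match(line)
        if stats and current is not None:
            fan, temp, power, cap = map(int, stats.groups())
            readings.append(GpuReading(current, fan, temp, power, cap))
            current = None
            continue
        index = INDEX_ROW.match(line)
        if index:
            current = int(index.group(1))
    return readings


def hot_gpus(readings: List[GpuReading], limit_c: int = 80) -> List[int]:
    """Indices of GPUs at or above the given temperature limit."""
    return [r.index for r in readings if r.temp_c >= limit_c]
```

With the output pasted above, all three boards read between 52 C and 61 C, so `hot_gpus` returns an empty list at the default limit.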
15) Message boards : Number crunching : AMD VII: Occasional a task never finishes and is "hot spot" too high? (Message 74924)
Posted 19 Jan 2023 by Joseph Stateson
Post:
I finally got the AMD performance software to work. The problem was that after the driver install, I had 5 "ghost" GPU cards. I had to edit the coproc_info.xml file, remove the extra 5 GPUs and then mark the file read-only so that BOINC would not be able to add the "extra" GPUs back in.

The driver I used was "win10-radeon-pro-software-enterprise-21.Q2.1"
Possibly using a device driver cleaner and a re-install might fix the coproc_info.xml problem. I got duplicate GPUs due to BOINC (or clinfo) seeing two drivers instead of one, so it marked my system as having two OpenCL platforms, and I had almost 500 errored-out tasks before I could suspend the project and fix the problem.

What driver(s) are you using? Does one card have one version and a different card have another?

Anyway, this is the display of 4 of the 5 GPUs. The 5th would not fit in the screen capture. Note the junction temp, the so-called "hot spot".
Tuning is only available for the VII. I have not tried any tuning yet.
Is it the same app you are using?


16) Message boards : Number crunching : AMD VII: Occasional a task never finishes and is "hot spot" too high? (Message 74921)
Posted 18 Jan 2023 by Joseph Stateson
Post:
When one of my AMD 7970s got to 104c one day, it was dead the next week.
80c is still too hot for my liking, even short term.
I do everything I can to keep every bit of a hard-working GPU/card below 70c,
as far below as possible,
if you want them to last.


Your HD-7970 is the "Tahiti" chip, same as the S9000 and S9050 cards.
None of the S9xxx cards have the so-called "hot spot" sensor capability.
The temps shown by GPUz or MSI's Afterburner are all under 60c.

GPUz shows two temps for the VII "GPU".
One is low, like 60-70, and is in the same location as on the S9xxx series cards.
The other is the "Hot Spot", which jumps around wildly, anywhere from 70 to 107.
The S9xxx cards do not have that sensor and are missing other features that GPUz can show for the VII.
17) Message boards : Number crunching : AMD VII: Occasional a task never finishes and is "hot spot" too high? (Message 74919)
Posted 18 Jan 2023 by Joseph Stateson
Post:
I am running 4 work units per GPU: one AMD VII and several AMD S9xxx boards.

About once every 4-5 days a task hangs up on the VII. It is easily fixed by suspending and then resuming the task. The VII is the fastest board, 2x as fast as S9150 and the problem is only on the VII. I used a boinctask "rule" to automatically suspend any MW task taking over 5 minutes which allows the card to continue processing 4 tasks as is normal instead of 3 and a hung task.

1 - Has anyone seen a problem like this before?

2 - GPUz has a feature that shows the "hot spot". My RTX-2080Ti and the VII card are the only ones that report a "hot spot". My other Nvidia and AMD cards lack that feature, probably because they are too old. The VII has 3 fans and is in an open frame rack, and there is a box fan cooling the rack. Its "hot spot" runs 102-107c. The RTX-2080Ti is in a cramped "Area51" case. It shows 80c for its hot spot. Do these values seem ok?
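
The suspend rule described in this post boils down to a simple filter over running tasks. The sketch below shows only that selection logic; the `Task` fields, the project string and the function name are hypothetical and do not reflect how the rule is configured in the actual tool.

```python
from typing import Iterable, List, NamedTuple


class Task(NamedTuple):
    name: str         # hypothetical task identifier
    project: str      # project the task belongs to
    elapsed_s: float  # run time so far, in seconds


def tasks_to_suspend(tasks: Iterable[Task],
                     project: str = "Milkyway@home",
                     limit_s: float = 300.0) -> List[str]:
    """Names of tasks from the given project that have run longer than the limit (5 minutes here)."""
    return [t.name for t in tasks
            if t.project == project and t.elapsed_s > limit_s]
```

Suspending and then resuming whatever this returns mirrors the manual fix described above, so the card keeps processing 4 tasks instead of 3 plus a hung one.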
18) Message boards : Number crunching : seems AMD VII cannot be used on a riser (Message 74910)
Posted 14 Jan 2023 by Joseph Stateson
Post:
For what it's worth, I tried one of those ribbon cable risers that have all the x1 wiring, not just the wiring in the USB3 type cable.
Same problem: the VII slowed down terribly. It needs to be in an x16 slot, yet crypto miners run the VII on USB3 risers, so I am unsure what is happening.
The eBayer I bought the card from was a crypto miner and had no problem, as all the cards were in x1 USB3 risers. The seller suggested a bad riser, but I tried several.
19) Questions and Answers : Windows : GPU in non-continuous operation. (Message 74909)
Posted 14 Jan 2023 by Joseph Stateson
Post:
[quote]For what it is worth, I updated the Milkyway fix "mod" using BOINC version 7.20.2
https://github.com/JStateson/Milkyway-7-21
If you are running 7.15.0 there is no Milkyway advantage other than having 7.20.2 functionality instead of 7.14
Any problems post as an issue over at github.


Per chance does your 'fix' include the Lunatics 'fix' for Einstein as well, or is yours solely focused on Milkyway?

Not sure what is broken at Einstein that needs a "fix" Mikey.

Can you elucidate the problem?


Poor choice of words, the speed-up would have been a better choice I think

I will try and compile Joseph's Linux BOINC client and drop it into the Lunatics AIO version of BOINC on a spare PC for beta testing his MW "fix"[/quote]

WOO HOO!!

Only problem is all I have are Nvidia cards and he mentions they aren't fast enough to trigger the problem at MW.

But I used to run into the MW Scheduler problem all the time back when I was running MW at 100% resource share. Why I asked our dev to come up with the report_delay parameter for our Pandora client which he kindly put into the cs_scheduler.cpp for me even though I was the only team member running MW at 100% at that time.


Hmmmm that does present a problem



I had resources set to 0 for the Milkyway project as I run Einstein at 100% on my single Linux system. As Keith mentioned, the share needs to be set to 100% to trigger the problem and be able to see if the mod works properly. AFAICT the Linux version works. The executable can be downloaded or can be built. It does not do anything special over what the 7.15 mod does, but it does work with the 7.20 Berkeley manager AFAICT.
20) Questions and Answers : Windows : GPU in non-continuous operation. (Message 74904)
Posted 12 Jan 2023 by Joseph Stateson
Post:
For what it is worth, I updated the Milkyway fix "mod" using BOINC version 7.20.2
https://github.com/JStateson/Milkyway-7-21
If you are running 7.15.0 there is no Milkyway advantage other than having 7.20.2 functionality instead of 7.14
Any problems post as an issue over at github.


Per chance does your 'fix' include the Lunatics 'fix' for Einstein as well, or is yours solely focused on Milkyway?


That Einstein mod is in Linux and AFAICT makes an improvement in performance of Linux applications. There was also a mod to BOINC: 7.17 and 7.19

It was done by Petri and possibly others using Linux
This Einstein@home App (v1.0 by petri33) was built at: Apr 28 2022 18:47:15


All I have done is modify the scheduling algorithm in BOINC to bypass that 91 second delay that the Milkyway server wants.



©2023 Astroinformatics Group