Welcome to MilkyWay@home

Posts by Joseph Stateson

1) Message boards : Number crunching : New Benchmark Thread - times wanted for any hardware, CPU or GPU, old or new! (Message 75347)
Posted 27 Apr 2023 by Profile Joseph Stateson
Post:
This discussion actually got me experimenting and for me it is slightly more efficient to run 6 tasks at a time: 46/6≈7.7 seconds/task versus 32/4=8 seconds/task.
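
A quick sketch of that arithmetic, assuming the 46 and 32 second figures are the wall-clock time for one full set of concurrent tasks (that is my reading of the numbers). The function name is made up for illustration, and the 300-task total is the batch size mentioned in this thread:

```python
def seconds_per_task(batch_seconds: float, concurrent: int) -> float:
    """Effective wall-clock seconds per task when `concurrent` tasks finish together."""
    return batch_seconds / concurrent


six_at_a_time = seconds_per_task(46, 6)
four_at_a_time = seconds_per_task(32, 4)

print(f"6 at a time: {six_at_a_time:.2f} s/task")
print(f"4 at a time: {four_at_a_time:.2f} s/task")

# Rough time to clear a 300-task batch at each setting, in minutes.
batch = 300
print(f"{batch} tasks: {batch * six_at_a_time / 60:.1f} min vs "
      f"{batch * four_at_a_time / 60:.1f} min")
```

The difference is small, so run-to-run variation could easily hide it.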

How are you getting a consistent feed of GPU separation tasks? I just posted elsewhere that I'm having trouble getting sent them automatically and I have to keep hitting the update button:

"I built a whole machine just for Milkyway with a 5900X and a Radeon Pro VII (6.528 TFLOPS FP64) that finishes the stack of 300 separation GPU tasks in around 40 minutes doing 6 concurrent tasks. After that, even with Milkyway as the only enabled project, it just runs CPU tasks, even after the 10 minute runoff time that I see in "properties" of the project in BOINC Manager that I don't understand. The only way for me to get more tasks is to hit update on the project after all 300 have been completed.

I've combed through the documentation for app_config.xml and cc_config.xml to see if there were any settings for mandating updates or something, but couldn't find anything. Is this a temporary issue with Milkyway or do you have any other advice for keeping a constant stream of GPU tasks fed to my GPU?"


This has been cussed and discussed before.
There are several ways to fix the delay. One is to use a batch file that runs boinccmd.exe in a loop with a time delay. That app can issue an update to a local or networked system.
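
The loop idea could look something like the Python sketch below instead of a batch file. It only relies on boinccmd's "--project URL update" operation plus the optional "--host" and "--passwd" switches. The project URL, the 5 minute interval, and the function names are my assumptions, so adjust them to match what your client shows:

```python
import subprocess
import time

# Assumed master URL; it must match the URL the client is attached with.
PROJECT_URL = "https://milkyway.cs.rpi.edu/milkyway/"


def build_update_cmd(project_url, host=None, passwd=None, exe="boinccmd"):
    """Build the boinccmd argument list that requests a project update."""
    cmd = [exe]
    if host:
        cmd += ["--host", host]      # remote client; omit for the local one
    if passwd:
        cmd += ["--passwd", passwd]  # GUI RPC password of that client
    cmd += ["--project", project_url, "update"]
    return cmd


def update_loop(interval_seconds=300, **kwargs):
    """Request an update, wait, and repeat until interrupted (Ctrl+C)."""
    cmd = build_update_cmd(PROJECT_URL, **kwargs)
    while True:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print("boinccmd failed:", result.stderr.strip())
        time.sleep(interval_seconds)


# To run against the local client:
# update_loop(interval_seconds=300)
```

Keep the interval reasonable so the project server is not hammered with requests.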

I put together a "mod" of BOINC version "7.15" that a number of users still work with. Check the statistics leaders. It is not compatible with 7.20, and I have updated the mod here:

https://github.com/JStateson/Milkyway-7-21

Replace boinc.exe with boinc--7-21.exe, renamed to boinc.exe.
2) Message boards : Number crunching : New Benchmark Thread - times wanted for any hardware, CPU or GPU, old or new! (Message 75341)
Posted 25 Apr 2023 by Profile Joseph Stateson
Post:
Hello,

I recently built a dedicated system for MW: https://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=963574

Used a Radeon Pro VII I got on eBay for $345 after tax paired with a 5900X. My advice would be to save a search for "pro vii" and then get alerts when new ones go up for auction. One just sold for $405 yesterday. It is an animal for separation tasks and doesn't get too hot when doing them. I have it running 4 tasks concurrently and they take an average of about 32 seconds each.

Best, Eric


Yeah, I picked one up for under $400 last year and the price is still dropping.

Unaccountably, I cannot run my VII using a riser, although the miner I bought it from had no problem with 8 of them on risers. Possibly the problem is Windows 10 but that is a guess.
I would like to add more but the motherboard I am using only has a single x16 socket.

I am also running 4 at a time. Occasionally I get a stuck task. Sometimes 3 to 4 a day. Other times I can go for weeks at a time with no problem. A suspend followed by a resume "unsticks" the tasks and I use boinctasks to handle the fixing automatically.
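
For anyone without boinctasks, the same suspend-then-resume can be issued with boinccmd, which has a "--task URL task_name {suspend | resume | abort}" operation. This is only a sketch of the idea, not how boinctasks does it. The function names and the 5 second pause are assumptions, and spotting which task is stuck is left out:

```python
import subprocess
import time


def task_cmd(project_url, task_name, op, exe="boinccmd"):
    """Build a boinccmd argument list for a suspend or resume of one task."""
    if op not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {op}")
    return [exe, "--task", project_url, task_name, op]


def unstick(project_url, task_name, pause_seconds=5, run=subprocess.run):
    """Suspend a task, wait briefly, then resume it."""
    run(task_cmd(project_url, task_name, "suspend"), check=True)
    time.sleep(pause_seconds)
    run(task_cmd(project_url, task_name, "resume"), check=True)


# Example call (the task name is a placeholder):
# unstick("https://milkyway.cs.rpi.edu/milkyway/", "some_task_name_0")
```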
3) Questions and Answers : Web site : Request for delete my account (Message 75331)
Posted 18 Apr 2023 by Profile Joseph Stateson
Post:
I've done worse.
The only stupid question is the one not asked.
4) Questions and Answers : Unix/Linux : NVIDIA Tesla T4 (Message 75303)
Posted 9 Apr 2023 by Profile Joseph Stateson
Post:
I just had my 20.04 Ubuntu drop my gtx-1060 after a restart of BOINC. Coincidence that this happened minutes after reading your post here. I came right back to add my 0.02c opinion.

I had RX-570 and GTX-1060 running just fine. I made a minor change in cc_config.xml and restarted BOINC minutes ago. System only shows the AMD card now.

I ran clinfo and the Nvidia is not reported.

I ran nvidia-smi and there is a library conflict. I am guessing I did an update some time ago and got a new library that Nvidia does not like after restarting BOINC.

root@dual-linux:~# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch


Your Tesla has the same GPU chip, TU104, as the RTX-2080 Super according to TechPowerUp. That new driver should have worked but then, this is Ubuntu, which sucks.



[edit] This was fixed by running "upgrade" and rebooting
5) Message boards : Number crunching : Tried running nbody: CPU temps way too low. (Message 75293)
Posted 6 Apr 2023 by Profile Joseph Stateson
Post:
Perhaps you have exotic cooling :D

Also the command switch is --nthreads

I guess it shouldn't cause concern if it does look like the task is progressing?


That was just a typo, the xml has --nthreads


The following are all the command arguments that I know of. I had not seen --nthreads before.

<cmdline>--non-responsive --verbose --gpu-target-frequency 1 --gpu-polling-mode -1 --gpu-wait-factor 0 --process-priority 4 --gpu-disable-checkpointing</cmdline>
6) Message boards : Number crunching : Tried running nbody: CPU temps way too low. (Message 75292)
Posted 6 Apr 2023 by Profile Joseph Stateson
Post:
Both are on separate closed liquid cooling. Temp shown by boinctasks is 32 deg. The "0" is the throttling percentage, not the temperature.

The app finished and is awaiting verification, so possibly the temps are OK!


https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=959265&offset=0&show_names=0&state=3&appid=2
7) Message boards : Number crunching : Tried running nbody: CPU temps way too low. (Message 75289)
Posted 6 Apr 2023 by Profile Joseph Stateson
Post:
I took a look at a couple of your hosts, are you sure you have selected the nbody application?


From boinc tasks

Milkyway@Home	1.82 Milkyway@home N-Body Simulation (mt)	de_nbody_02_27_2023_v182_pal5__data__3_1680661949_11332_0	02:15:26 (23:29:56)	130.12	71.177	00:54:50	4/18/2023 12:56:38 PM	32.0 °C	Running	8C	0	dual-linux	
Milkyway@Home	1.82 Milkyway@home N-Body Simulation (mt)	de_nbody_02_27_2023_v182_pal5__data__2_1680661949_11309_0	01:18:41 (12:19:55)	117.55	60.215	01:29:01	4/18/2023 12:56:38 PM	32.0 °C	Running	8C	0	dual-linux	


The "0" between the "8C" and "dual-linux" means there is no CPU throttling.

The Intel i9-7900x has a more reasonable temperature of 63.9 °C.
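
To make the columns easier to pick out, here is a small sketch that splits one of those tab-separated rows into named fields. Only the temperature, throttle, and computer columns are confirmed in this thread. The other field names are my guesses at what the boinctasks columns mean, and the order depends on how the view is configured:

```python
# Guessed column names for a copied boinctasks row (an assumption, see above).
FIELDS = [
    "project", "application", "name", "elapsed_cpu", "cpu_pct",
    "progress_pct", "time_left", "deadline", "temperature",
    "status", "use", "throttle_pct", "computer",
]


def parse_row(line):
    """Split a tab-separated row and pair each value with a field name."""
    parts = line.rstrip("\t\n").split("\t")
    return dict(zip(FIELDS, parts))


row = parse_row(
    "Milkyway@Home\t1.82 Milkyway@home N-Body Simulation (mt)\t"
    "de_nbody_02_27_2023_v182_pal5__data__3_1680661949_11332_0\t"
    "02:15:26 (23:29:56)\t130.12\t71.177\t00:54:50\t"
    "4/18/2023 12:56:38 PM\t32.0 °C\tRunning\t8C\t0\tdual-linux\t"
)

temp_c = float(row["temperature"].split()[0])
print(row["computer"], "temperature:", temp_c, "throttle:", row["throttle_pct"])
```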

Milkyway@Home	1.82 Milkyway@home N-Body Simulation (mt)	de_nbody_02_27_2023_v182_pal5__data__2_1674667492_1117989_3	01:00:41 (05:37:38)	69.55	98.758	00:00:45	4/18/2023 7:21:59 AM	63.9 °C	Running	8C	0	JYSArea51	
Milkyway@Home	1.82 Milkyway@home N-Body Simulation (mt)	de_nbody_02_27_2023_v182_pal5__data__1_1680661949_7646_0	00:25:56 (02:51:44)	82.77	15.649	02:19:48	4/18/2023 7:21:59 AM	63.9 °C	Running	8C	0	JYSArea51	
8) Message boards : Number crunching : Tried running nbody: CPU temps way too low. (Message 75287)
Posted 6 Apr 2023 by Profile Joseph Stateson
Post:
root@dual-linux:/var/lib/boinc/projects/milkyway.cs.rpi.edu_milkyway# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +26.0°C  (high = +80.0°C, crit = +96.0°C)
Core 1:       +29.0°C  (high = +80.0°C, crit = +96.0°C)
Core 2:       +32.0°C  (high = +80.0°C, crit = +96.0°C)
Core 8:       +32.0°C  (high = +80.0°C, crit = +96.0°C)
Core 9:       +28.0°C  (high = +80.0°C, crit = +96.0°C)
Core 10:      +28.0°C  (high = +80.0°C, crit = +96.0°C)

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx:        1.02 V
fan1:        2981 RPM  (min =    0 RPM, max = 3700 RPM)
edge:         +46.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       79.02 W  (cap =  90.00 W)

coretemp-isa-0001
Adapter: ISA adapter
Core 0:       +23.0°C  (high = +80.0°C, crit = +96.0°C)
Core 1:       +21.0°C  (high = +80.0°C, crit = +96.0°C)
Core 2:       +25.0°C  (high = +80.0°C, crit = +96.0°C)
Core 8:       +24.0°C  (high = +80.0°C, crit = +96.0°C)
Core 9:       +22.0°C  (high = +80.0°C, crit = +96.0°C)
Core 10:      +18.0°C  (high = +80.0°C, crit = +96.0°C)

intel5500-pci-00a3
Adapter: PCI adapter
temp1:        +78.5°C  (high = +100.0°C, hyst = +95.0°C)
                       (crit = +110.0°C)



Dual Xeon Ubuntu: total of 12 cores, 24 hyperthreads. Looks like both CPUs are idling. This cannot be right?
Is the command "--threads" documented anywhere? I tried setting it to 8, then leaving it off entirely, then 32. Made no difference. I assume that "threads" refers to system or kernel threads and not hyperthread allocation. I also assume "avg_ncpus" refers to hyperthreads, of which there are 24 available.

app_config.xml


<app_config>
 <app>
  <name>milkyway_nbody</name>
  <max_concurrent>2</max_concurrent>
 </app>
 <app_version>
  <app_name>milkyway_nbody</app_name>
  <plan_class>mt</plan_class>
  <avg_ncpus>8</avg_ncpus>
  <cmdline>--nthreads 32</cmdline>
 </app_version>
</app_config>
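
To compare the two numbers in that file, here is a minimal Python sketch (not part of BOINC) that reads the config and reports avg_ncpus next to the --nthreads value. As I understand the BOINC docs, avg_ncpus is the CPU count the client budgets for the app version and cmdline is passed straight to the app. Whether the mismatch has anything to do with the low temperatures is only a guess, and the function name is hypothetical:

```python
import shlex
import xml.etree.ElementTree as ET

APP_CONFIG = """
<app_config>
 <app>
  <name>milkyway_nbody</name>
  <max_concurrent>2</max_concurrent>
 </app>
 <app_version>
  <app_name>milkyway_nbody</app_name>
  <plan_class>mt</plan_class>
  <avg_ncpus>8</avg_ncpus>
  <cmdline>--nthreads 32</cmdline>
 </app_version>
</app_config>
"""


def thread_settings(xml_text):
    """Return (app_name, avg_ncpus, nthreads) for each <app_version>."""
    root = ET.fromstring(xml_text)
    found = []
    for version in root.findall("app_version"):
        avg_text = version.findtext("avg_ncpus")
        avg = float(avg_text) if avg_text else None
        args = shlex.split(version.findtext("cmdline", default=""))
        nthreads = None
        if "--nthreads" in args:
            idx = args.index("--nthreads")
            if idx + 1 < len(args):
                nthreads = int(args[idx + 1])
        found.append((version.findtext("app_name"), avg, nthreads))
    return found


for name, avg, nthreads in thread_settings(APP_CONFIG):
    verdict = "match" if nthreads is not None and nthreads == avg else "differ"
    print(f"{name}: avg_ncpus={avg}, --nthreads={nthreads} ({verdict})")
```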
9) Message boards : Number crunching : New Benchmark Thread - times wanted for any hardware, CPU or GPU, old or new! (Message 75286)
Posted 6 Apr 2023 by Profile Joseph Stateson
Post:
I kind of figured that with the majority of people using PCs, Macs, especially Apple Silicon Macs, would not be utilized to their potential, whatever that might be.

I clued in someone over at MacRumors about this and he took a look at the BOINC Manager and he was excited to see a version for the old Apple Power PC systems. He said that it was old but that he was willing to code a modern version of it. I pointed out who he should contact.


Many years ago, the company I worked for was getting rid of their Sun and Apollo workstations. I asked over at SETI if their app would run on them and was told they would provide me with the source code if I would build and support the app. That was the last time I asked about getting apps to work on old systems.
10) Message boards : Number crunching : New Benchmark Thread - times wanted for any hardware, CPU or GPU, old or new! (Message 75283)
Posted 6 Apr 2023 by Profile Joseph Stateson
Post:
The M1 processor is much faster than that old AMD architecture. However, unfortunately, you will find very little BOINC project support for the Apple architecture. Very few native applications and no GPU support. Some project apps can be run in x86 emulation mode if you install the Apple Rosetta x86 emulator.


I have not heard anything about that Microsoft SQ1 chip in their Surface Pro X and assume the same problem. Probably a lot more Macs will be sold than that "X".
11) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75178)
Posted 21 Mar 2023 by Profile Joseph Stateson
Post:
isn't this exactly what I said in the other thread?

Vega only supported by ROCm 4.5
ROCm 4.5 only supported on 20.04 with 5.11 kernel or 18.04 with 5.4 kernel.

I wasn't so specific, but I said you might need an older OS and to check the install docs, and the above is what the install docs say.


I did not see that (or rather, did not think about it). Got too involved getting the Mi25 to work.

It turned out that Windows was easier to work with. Going to post in the GPUUG over at Einstein.
12) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75175)
Posted 21 Mar 2023 by Profile Joseph Stateson
Post:
Think I found the problem. The Mi25 is recognized in 22.04 with ROCm 5.x but is not supported:
AMD Instinct MI25 End of Life
ROCm release v4.5 is the final release to support AMD Instinct MI25


When I tried amdgpu-install --rocmrelease=4.5.2
I got the error "the package was not found".
The 4.5.2 release was last used in 20.04. I cannot find a repo that has the 4.5.2 ROCm release.
Messing around with an updated repo from AMD, I ended up with broken packages I cannot fix.
Will try 20.04 and see if it can find a 4.5 ROCm version.
13) Questions and Answers : Unix/Linux : CL_OUT_OF_HOST_MEMORY with AMD RX 6600 XT on Xubuntu 20.04 (Message 75173)
Posted 20 Mar 2023 by Profile Joseph Stateson
Post:
i don't necessarily think it will solve the problem, but you could look for and try setting "Above 4G decoding" to enabled.


It is enabled. When I disabled that (mmio segment?) I got the BIOS warning "not enough PCIe resources to go around, remove some PCIe boards***" (I didn't take a picture of the exact wording).

I pulled all the s9150 and s9100 boards and the only difference was a lot less fan noise. Got the same error message.

The h110 can set gen 3 for "pcie #2" (of 0..7), all the rest of the slots can be gen1, gen2 or auto
The slots are numbered 0,1,2,3,4,....11
The x16 slot is 3, so probably 2 and 3 are gen3. The Mi25 started working (clinfo only) when gen3 was enabled. All the other slots are x1 with risers.
I tried disabling the h110 internal video thinking that might pull that mmio segment back in but it was worse. 22.04 would not boot. I tried a cheap AMD video board in slot 2 and the Mi25 in slot 3 and vice-versa but system hung.

Tried the following kernels. The one named "jyskern" took me over 7 hours to build.
The AMD driver I used was for 22.04.2
1   Advanced options for Ubuntu
    1>0 Ubuntu, with Linux 5.19.0-35-generic
    1>1 Ubuntu, with Linux 5.19.0-35-generic (recovery mode)
    1>2 Ubuntu, with Linux 5.19.0-32-generic
    1>3 Ubuntu, with Linux 5.19.0-32-generic (recovery mode)
    1>4 Ubuntu, with Linux 5.13.0jyskern
    1>5 Ubuntu, with Linux 5.13.0jyskern (recovery mode)


I read the following article where the user burned the wx9100 bios onto the Mi25 and got it working in kernel 5.13. He got the video to work also. Normally the Mi25 has no video.
The guide I used to downgrade the kernel is
https://youtu.be/IDYZ9Hm-p44
It took 7 hours to build on a dual Xeon with 24 threads.
However, when I looked at the Mi25 and wx9100 bios sizes there were huge differences, so I do not plan to flash the wx9100 bios unless I get advice from the author.
Plus it does not seem to work anyway.

-rw-rw-r--  1 jstateson jstateson   39201 Mar 20 16:08 218718.rom
-rw-rw-r--  1 jstateson jstateson  262144 Mar  8  2021 230670.rom
-rw-rw-r--  1 jstateson jstateson  262144 Apr 25  2022 245174.rom
-rw-rw-r--  1 jstateson jstateson 1048576 Feb  8  2019 AMD.RadeonVII.16384.190116.rom
-rwxrwxrwx  1 jstateson jstateson 1660176 May 20  2021 amdvbflash*
-rw-rw-r--  1 jstateson jstateson  620696 Mar 20 10:37 amdvbflash_linux_4.71.zip
-rw-r--r--  1 root      root      1048576 Mar 20 16:01 original_mi25


Another problem is I do not know if the rom is locked. Have not figured out exactly what the following denotes.

jstateson@mi25-john:~/flash$ sudo ./amdvbflash -checklock 0
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.

SW protection fail (0x9C)


The following error shows up in einstein slot 0

[15:30:32][2285][INFO ] Application startup - thank you for supporting Einstein@Home!
[15:30:32][2285][INFO ] Starting data processing...
[15:30:32][2285][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[15:30:32][2285][INFO ] Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
[15:30:32][2285][ERROR] Couldn't create OpenCL command queue (error: -6)!
[15:30:32][2285][INFO ] OpenCL shutdown complete!
[15:30:32][2285][ERROR] Demodulation failed (error: 2013)!
[15:30:32][2285][WARN ] Sorry, at the moment your system doesn't have enough free CPU/GPU memory to run this task!


System has 8 GB, which is not a problem for the AMD VII and 4 of the s91x0 types, so I suspect memory is not the problem.

****edit - That warning included the phrase that If I continue to boot there could be problems but there was no option to continue. There was no "press f1 to continue" or anything like that so not sure what the bios author meant to be done.
14) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75169)
Posted 20 Mar 2023 by Profile Joseph Stateson
Post:
I have "almost" got this to work. I forced Gen3 on the x16 slot and now both the VII and the Mi25 boards work in that slot.
Unfortunately, the VII will not work in any of the x1 slots and none of my s9150 boards work. The drivers for the Mi25 and VII seem to be missing what is needed for those older S boards.
15) Questions and Answers : Unix/Linux : CL_OUT_OF_HOST_MEMORY with AMD RX 6600 XT on Xubuntu 20.04 (Message 75168)
Posted 20 Mar 2023 by Profile Joseph Stateson
Post:

https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html

As reported by others, Einstein@Home GPU tasks work fine.
Best regards,

Samuel



I have just run into this problem testing out an AMD Mi25 that jpmboy sent me.

OpenCL: AMD/ATI GPU 0: Radeon Instinct MI25 (driver version 3513.0 (HSA1.1,LC), device version OpenCL 2.0, 16368MB, 16368MB available, 12288 GFLOPS peak)		
OS: Linux Ubuntu: Ubuntu 22.04.2 LTS [5.19.0-35-generic|libc 2.35]	
Task Ter5_4_cfbf00057_segment_13_dms_400_13200_243_250000_1 postponed for 900 seconds: Not enough free CPU/GPU memory available! Delaying next attempt for at least 15 minutes...


The above error is from an Einstein BREP7-opencl-ati task so it is not just Milkyway

Milkyway@home Separation v1.46 (opencl_ati_101)x86_64-pc-linux-gnu
Error getting device and context (-6): CL_OUT_OF_HOST_MEMORY
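
For reference, the "(-6)" in that line is the numeric OpenCL status code. A small lookup sketch with the first few codes from the OpenCL headers follows. The helper names are made up, and the regex is only meant for log lines shaped like the one above:

```python
import re

# First few status codes from the OpenCL headers (CL/cl.h).
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}


def extract_code(line):
    """Pull a code written as '(-6)' or '(error: -6)' out of a log line."""
    match = re.search(r"\(\s*(?:error:\s*)?(-?\d+)\s*\)", line)
    return int(match.group(1)) if match else None


def cl_error_name(code):
    """Translate a status code to its symbolic name, if it is in the table."""
    return CL_ERRORS.get(code, f"unknown ({code})")


line = "Error getting device and context (-6): CL_OUT_OF_HOST_MEMORY"
code = extract_code(line)
print(code, cl_error_name(code))
```

Note that the code refers to host (system) memory as seen by the OpenCL runtime, not to memory on the GPU board.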


I am guessing the problem has to do with some memory segment that my H110-btc motherboard has enabled. I have little experience with UEFI bios and there are a lot of settings. I was only able to get the Mi25 board to work by forcing GEN3 on the PCIe x16 slot it is in. I also had to disable CSM.
16) Questions and Answers : Unix/Linux : Problem switching from Windows to Ubuntu: ROCm question (Message 75163)
Posted 18 Mar 2023 by Profile Joseph Stateson
Post:
System with AMD VII and S91x0 boards that work fine in Windows 10. Want to switch to Ubuntu.
I installed 22.04 and AMD drivers using "=rocr,legacy". I disabled CSM in bios and the motherboard bios is UEFI
Only one slot is Gen3. The AMD VII board is in that one. The S91x0 boards are all in Gen2 slots. Gen 2 does not support atomics I read somewhere long ago.

Only the AMD VII board works
clinfo reports the other gfx9 boards (s9150, s9100) have no platform.
sudo lshw -c video mis-identifies the s9150 and s9100 as w9100 and w8100
However, I have seen Windows do the same and they work anyway.

According to some ROCm doc I cannot find at the moment, quote:

GFX9 GPUs (such as Vega 10) no longer require PCIe atomics


That Gen3 slot has atomics since it is Gen3. The Gen2 do not but supposedly ROCm does not need it for gfx9 boards. I specified legacy as the s91x0 boards are legacy but I might be mistaken.

Is anyone running ROCm in Gen 2 slots?
17) Message boards : Cafe MilkyWay : WCG Friends (Message 75133)
Posted 9 Mar 2023 by Profile Joseph Stateson
Post:
I have switched to SiDock for my COVID support when I ran out of WCG. Tasks usually take 2-3 days to complete, which is a bummer.
Last month I contributed (as best as I can calculate) at least 2 of the 8 points the GPU Users group got for SiDock.
https://www.boincgames.com/sprint_details.php?id=11

SiDock has only CPU tasks. I have seen no indication they are working on OpenCL or CUDA applications.
18) Message boards : Cafe MilkyWay : WCG Friends (Message 75130)
Posted 8 Mar 2023 by Profile Joseph Stateson
Post:
Latest update (around 16:30 UTC 8th March)

Update: As of this morning, the data center continues to work on booting the temporary replacement DSS 7000 storage system. They are attempting multiple alternative strategies to resolve current failures.
Sounds like there are a lot of cobwebs to blow out of that bit of kit! :-) -- As far as I can determine (from reseller's sites[1]) it uses a now-discontinued Xeon E5 processor model, so it's not exactly new tech.

Someone posted a picture of what appears to be a DSS 7000 unit and said "It's only 90 drives, what could go wrong?..." (although that's a fully populated 4-processor version; there's also a dual Xeon 45 drive version) -- if they manage to get this working, I wonder how long it will be before there's another failure; perhaps we should organize a collection for a brand new storage device?

And if anyone who reads this can actually confirm or correct the specification, feel free to do so :-)

Cheers - Al.

[1] I tried finding specs on Dell's site, but there wasn't anything immediately obvious and useful...


I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers but who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
19) Message boards : Number crunching : Some seconds in cuda crunching saved (Message 75125)
Posted 8 Mar 2023 by Profile Joseph Stateson
Post:
Hi, the system variable

CUDA_CACHE_MAXSIZE 4294967296

can improve CUDA crunching times, please test it for yourselves.

Greetings from Germany


Milkyway GPU apps are all OpenCL. CUDA is not listed.
20) Message boards : Number crunching : Future of Milkyway@Home (Message 75085)
Posted 27 Feb 2023 by Profile Joseph Stateson
Post:


The only, quasi-stable, config I have been able to do for linux for firepros (and Bristol Ridge R7):

Ubuntu 20.04 (either .1 or .2, I think).
Install and pull the internet during installation as even if you say don't upgrade: it will helpfully update you to kernel 5.15.XX which I haven't been able to get amdgpu to work on. period.
Uninstall snap. Disable unattended upgrades. Turn off crash reporting.
Never allow upgrading to go above kernel 5.4.0.XXX otherwise amdgpu 20.50 will break, at least in my experience.


I use amdgpu v20.50. Nothing else works. I only install --opencl=legacy. I don't bother with vega or later if the firepro is on the system.
Since I have dedicated crunchers, I will ssh into them as amdgpu tends to make video broken, but the firepro is seen and works. boinctui makes life bearable via cmdline.

Hope this helps.


Thanks, I will try it. I spent 1/2 day with a clean install of 20.04.2 but it had kernel 5.15
Currently I have 5.4.0-139-generic
and will try 20.50 but am not holding my breath.



©2024 Astroinformatics Group