Welcome to MilkyWay@home

AMD VII: Occasionally a task never finishes, and is the "hot spot" too high?

Message boards : Number crunching : AMD VII: Occasionally a task never finishes, and is the "hot spot" too high?
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 74919 - Posted: 18 Jan 2023, 18:29:24 UTC
Last modified: 18 Jan 2023, 18:31:42 UTC

I am running 4 work units per GPU on one AMD VII and several AMD S9xxx boards.

About once every 4-5 days a task hangs on the VII. It is easily fixed by suspending and then resuming the task. The VII is the fastest board, 2x as fast as the S9150, and the problem occurs only on the VII. I used a BoincTasks "rule" to automatically suspend any MW task that runs longer than 5 minutes, which lets the card keep processing 4 tasks as normal instead of 3 plus a hung task.
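For anyone without BoincTasks, the same idea can be sketched with `boinccmd`, the command-line tool that ships with BOINC. This is only a rough sketch under stated assumptions: the field labels (`name`, `project URL`, `active_task_state`, `elapsed task time`) and the per-task separator line are what `boinccmd --get_tasks` is assumed to print and may differ between client versions, and `PROJECT_KEY`, `parse_tasks`, `hung_tasks` and `bounce` are hypothetical names made up for the example.

```python
import subprocess

MAX_ELAPSED_S = 5 * 60      # the 5-minute limit mentioned in the post
PROJECT_KEY = "milkyway"    # assumed substring of the project URL


def parse_tasks(report):
    """Split `boinccmd --get_tasks` style text into one dict per task."""
    tasks, current = [], None
    for raw in report.splitlines():
        line = raw.strip()
        if line.endswith("-----------"):      # assumed separator, e.g. "1) -----------"
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def hung_tasks(tasks, limit_s=MAX_ELAPSED_S):
    """Return (project_url, task_name) for running tasks past the time limit."""
    hung = []
    for task in tasks:
        url = task.get("project URL", "")
        try:
            elapsed = float(task.get("elapsed task time", "0"))
        except ValueError:
            continue                           # skip anything that does not parse
        running = task.get("active_task_state") == "EXECUTING"  # assumed state label
        if running and PROJECT_KEY in url.lower() and elapsed > limit_s:
            hung.append((url, task.get("name", "")))
    return hung


def bounce(url, name):
    """Suspend and then resume one task, mirroring the manual fix."""
    for op in ("suspend", "resume"):
        subprocess.run(["boinccmd", "--task", url, name, op], check=True)


def run_once():
    """One polling pass; call this from a scheduler or a loop with a sleep."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    for url, name in hung_tasks(parse_tasks(out)):
        bounce(url, name)
```

Calling `run_once()` every minute or so would approximate the rule. Nothing runs on import, so the parsing can be checked against saved `boinccmd` output before the script is allowed to touch live tasks.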

1 - Has anyone seen a problem like this before?

2 - GPU-Z has a feature that shows the "hot spot". My RTX-2080Ti and the VII are the only cards that report a "hot spot"; my other Nvidia and AMD cards lack that feature, probably because they are too old. The VII has 3 fans and sits in an open-frame rack with a box fan cooling the rack. Its "hot spot" runs 102-107C. The RTX-2080Ti is in a cramped "Area51" case and shows 80C for its hot spot. Do these values seem OK?
.clair.

Joined: 3 Mar 13
Posts: 84
Credit: 779,527,671
RAC: 3,708
Message 74920 - Posted: 18 Jan 2023, 18:40:42 UTC
Last modified: 18 Jan 2023, 18:41:34 UTC

When one of my AMD 7970s got to 104C one day, it was dead the next week.
80C is still too hot for my liking, even short term.
I do everything I can to keep every bit of a hard-working GPU card below 70C,
as far below as possible,
if you want them to last.
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 74921 - Posted: 18 Jan 2023, 19:14:53 UTC - in response to Message 74920.  

Your HD-7970 uses the "Tahiti" chip, the same as the S9000 and S9050 cards.
None of the S9xxx cards have the so-called "hot spot" sensor capability.
The temps shown for them by GPU-Z or MSI Afterburner are all under 60C.

GPU-Z shows two temps for the VII "GPU".
One is low, around 60-70C, and is in the same location as the S9xxx series cards.
The other is the "Hot Spot", which jumps around a lot, anywhere from 70 to 107C.
The S9xxx cards do not have that sensor and are missing other features that GPU-Z can show for the VII.
HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,038,236
RAC: 5,987
Message 74922 - Posted: 19 Jan 2023, 1:40:56 UTC - in response to Message 74921.  

Can you use an IR thermometer to localize it? It also might be time to renew the thermal paste. I can easily see a hot spot developing when something loses contact with a heat sink.
Martin

Joined: 28 May 22
Posts: 17
Credit: 402,111,833
RAC: 0
Message 74923 - Posted: 19 Jan 2023, 2:15:11 UTC - in response to Message 74919.  

Some reviews I've read mention that the VII is heavily overvolted out of the box.

I undervolt my VII and run only 3 tasks at a time. Powering the VII with just 1006 mV and adjusting the fan curve so the noise is hardly noticeable, the temps haven't recently been above 88C for the hot spot (from what I've read, there are several "hotspot" sensors on the GPU chip, and the chip's logic reports whichever is hottest at the moment) and 63C for memory. Ambient temp is 73-74F.
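As a rough illustration of the "reports the hottest sensor" behaviour described above, here is a small sketch. The individual sensor readings are made up for the example, and `hot_spot` and `fahrenheit_to_celsius` are hypothetical helper names; only the 88C peak and the 73F ambient figure come from the post.

```python
def hot_spot(sensor_temps_c):
    """Return the hottest reading among several on-die sensors (degrees C)."""
    return max(sensor_temps_c)


def fahrenheit_to_celsius(temp_f):
    """Convert an ambient reading so it can be compared with GPU temps."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Hypothetical per-sensor readings; the reported hot spot is simply the maximum.
readings_c = [71.0, 88.0, 64.5, 79.0]
ambient_c = fahrenheit_to_celsius(73.0)

print(f"hot spot: {hot_spot(readings_c):.1f}C")
print(f"ambient: {ambient_c:.1f}C")
print(f"rise over ambient: {hot_spot(readings_c) - ambient_c:.1f}C")
```

Because the reported value is a maximum over several sensors, it can jump around far more than a single fixed-location sensor would, which would fit the 70 to 107C swings mentioned earlier in the thread.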

I mostly use HWMonitor to keep an eye on what's happening system-wide, and I'm always trying to tweak performance bit by bit.
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 74924 - Posted: 19 Jan 2023, 4:56:33 UTC - in response to Message 74923.  
Last modified: 19 Jan 2023, 5:01:52 UTC

I finally got the AMD performance software to work. The problem was that after the driver install I had 5 "ghost" GPU cards. I had to edit the coproc_info.xml file, remove the extra 5 GPUs, and then mark the file read-only so that BOINC could not add the "extra" GPUs back in.

The driver I used was "win10-radeon-pro-software-enterprise-21.Q2.1".
Possibly using a device driver cleaner and re-installing would fix the coproc_info.xml problem. I got duplicate GPUs because BOINC (or clinfo) saw two drivers instead of one, so it marked my system as having two OpenCL platforms, and I had almost 500 errored-out tasks before I could suspend the project and fix the problem.
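The "mark the file read-only" step can be scripted once the file has been edited by hand. The sketch below only covers the backup and the permission change. It does not touch the XML itself, since the exact layout of coproc_info.xml is not shown in this thread. `lock_file` and `unlock_file` are hypothetical helper names, and the path in the usage note is an assumption about a default Windows install.

```python
import os
import shutil
import stat


def lock_file(path):
    """Back up `path`, then clear its write permission.

    On Windows, os.chmod with stat.S_IREAD sets the read-only attribute;
    on POSIX it leaves the file readable by its owner only.
    """
    backup = path + ".bak"
    shutil.copy2(path, backup)       # keep a copy of the hand-edited file
    os.chmod(path, stat.S_IREAD)     # read-only from here on
    return backup


def unlock_file(path):
    """Restore write permission, e.g. before a driver or hardware change."""
    os.chmod(path, stat.S_IREAD | stat.S_IWRITE)
```

Usage would look something like `lock_file(r"C:\ProgramData\BOINC\coproc_info.xml")`, run with the BOINC client stopped. The file has to be unlocked again before BOINC can record a genuine hardware change, so this is a workaround rather than a fix for the duplicate-platform detection.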

What driver(s) are you using? Does one card have one version and a different card have another?

Anyway, this is the display of 4 of the 5 GPUs. The 5th would not fit in the screen capture. Note the junction temp, the so-called "hot spot".
Tuning is only available for the VII. I have not tried any tuning yet.
Is it the same app you are using?


Martin

Joined: 28 May 22
Posts: 17
Credit: 402,111,833
RAC: 0
Message 74925 - Posted: 19 Jan 2023, 6:26:59 UTC - in response to Message 74924.  

No, I'm using AMD Adrenalin Edition 22.6.1.0. I only have the one GPU, and it also drives the monitor. I have not had any big problems with this software.
HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,038,236
RAC: 5,987
Message 74950 - Posted: 28 Jan 2023, 16:22:13 UTC

After seeing the new posts here, I undervolted to 1006 mV. The hot spot is now at 75C, down quite a bit from the mid 80s. Running 5 WU in parallel. Still seeing a WU finish every 11 seconds or less. Adrenalin 22.11.2
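For a sense of scale, the figures above can be turned into a rough throughput estimate. This is back-of-the-envelope arithmetic that assumes a steady state with all 5 slots always busy; `tasks_per_day` and `wall_time_per_task` are hypothetical helper names.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tasks_per_day(seconds_between_completions):
    """Completed work units per day at a steady completion interval."""
    return SECONDS_PER_DAY / seconds_between_completions


def wall_time_per_task(seconds_between_completions, concurrent_tasks):
    """With N tasks in flight and one finishing every T seconds,
    each task spends roughly N * T seconds on the card."""
    return seconds_between_completions * concurrent_tasks


print(round(tasks_per_day(11)))      # about 7855 work units per day
print(wall_time_per_task(11, 5))     # about 55 seconds per work unit
```

Since the post says "11 seconds or less", both numbers are conservative: the daily count is a lower bound and the per-task time an upper bound.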

©2024 Astroinformatics Group