AMD VII: Occasionally a task never finishes, and is the "hot spot" too high?
Author | Message |
---|---|
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
I am running 4 work units per GPU on one AMD VII and several AMD S9xxx boards. About once every 4-5 days a task hangs on the VII. It is easily fixed by suspending and then resuming the task. The VII is the fastest board, 2x as fast as the S9150, and the problem occurs only on the VII. I used a BoincTasks "rule" to automatically suspend any MW task taking over 5 minutes, which lets the card keep processing 4 tasks as normal instead of 3 plus a hung task. 1 - Has anyone seen a problem like this before? 2 - GPU-Z has a feature that shows the "hot spot". My RTX 2080 Ti and the VII are the only cards that report a "hot spot". My other Nvidia and AMD cards lack that feature, probably because they are too old. The VII has 3 fans and sits in an open-frame rack, with a box fan cooling the rack. Its "hot spot" runs 102-107 C. The RTX 2080 Ti is in a cramped "Area51" case and shows 80 C for its hot spot. Do these values seem OK? |
Joined: 3 Mar 13 Posts: 84 Credit: 779,527,712 RAC: 0 |
When one of my AMD 7970s got to 104 C one day, it was dead the next week. 80 C is still too hot for my liking, even short term. I do everything I can to keep every bit of a hard-working GPU card below 70 C, as far below as possible. That matters if you want them to last. |
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
"When one of my AMD 7970s got to 104 C one day, it was dead the next week." Your HD 7970 is the "Tahiti" chip, the same as the S9000 and S9050 cards. None of the S9xxx cards have the so-called "hot spot" sensor capability. The temps shown for those cards by GPU-Z or MSI Afterburner are all under 60 C. GPU-Z shows two temps for the VII "GPU". One is low, around 60-70 C, and is in the same location as on the S9xxx series cards. The other is the "Hot Spot", which jumps around wildly, anywhere from 70 to 107 C. The S9xxx cards do not have that sensor and are missing other features that GPU-Z can show for the VII. |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
Can you use an IR thermometer to localize it? It might also be time to renew the thermal paste. I can easily see a hot spot developing when something loses contact with a heat sink. |
Joined: 28 May 22 Posts: 17 Credit: 402,111,833 RAC: 0 |
"I am running 4 work units per GPU: One AMD VII and several AMD S9xxx boards" Some reviews I've read mention that the VII is heavily overvolted out of the box. I undervolt my VII and run only 3 tasks at a time. Powering the VII with just 1006 mV and adjusting the fan curve so noise is hardly noticeable, I haven't recently seen temps above 88 C for the hot spot and 63 C for memory. (From what I've read, there are several "hot spot" sensors on the GPU chip, and the chip's logic reports whichever is hottest at the moment.) Ambient temp is 73-74 F. I mostly use HWMonitor to keep an eye on what's happening system-wide, and I'm always trying to tweak performance bit by bit. |
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
I finally got the AMD performance software to work. The problem was that after the driver install I had 5 "ghost" GPU cards. I had to edit the coproc_info.xml file, remove the extra 5 GPUs, and then mark the file read-only so that BOINC could not add the "extra" GPUs back in. The driver I used was "win10-radeon-pro-software-enterprise-21.Q2.1". Possibly using a device driver cleaner and re-installing would fix the coproc_info.xml problem. I got duplicate GPUs because BOINC (or clinfo) saw two drivers instead of one, so it marked my system as having two OpenCL platforms, and I had almost 500 errored-out tasks before I could suspend the project and fix the problem. What driver(s) are you using? Does one card have one version and a different card another? Anyway, this is the display of 4 of the 5 GPUs. The 5th would not fit in the screen capture. Note the junction temp, the so-called "hot spot". Tuning is only available for the VII. I have not tried any tuning yet. Is it the same app you are using? |
Joined: 28 May 22 Posts: 17 Credit: 402,111,833 RAC: 0 |
"I finally got the AMD performance software to work. Problem was after driver install, I had 5 'ghost' GPU cards. I had to edit the coproc_info.xml file, remove the extra 5 GPUs and then mark the file read only so that BOINC would not be able to add the 'extra' GPUs back in." No, I'm using AMD Adrenalin Edition 22.6.1.0. I only have the one GPU, and it also drives the monitor. I have not had any big problems with this software. |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
After seeing the new posts here, I undervolted to 1006 mV. The hot spot is now at 75 C, down quite a bit from the mid 80s. I'm running 5 WUs in parallel and still seeing a WU finish every 11 seconds or less. Driver: Adrenalin 22.11.2. |
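
For anyone without BoincTasks, the "over 5 minutes" rule from the first post could be approximated with a small script around `boinccmd`. This is only a rough sketch, not what BoincTasks does internally. It assumes that `boinccmd --get_tasks` prints one block per task, each starting with a line like `1) -----------` and containing `name:`, `project URL:`, `active_task_state:` and `elapsed time:` fields. Those field names are assumptions and may differ between client versions, so check the output on your own host first. The helper names (`parse_tasks`, `hung_tasks`, `task_command`, `kick`) and the `milkyway` URL substring used to pick out MW tasks are made up for illustration.

```python
import re
import subprocess

LIMIT_SECONDS = 5 * 60  # the thread's rule: a task running over 5 minutes is treated as hung


def parse_tasks(text):
    """Split `boinccmd --get_tasks` style output into one dict per task.

    Assumes each task block starts with a line such as "1) -----------"
    followed by indented "key: value" lines.
    """
    tasks, current = [], None
    for line in text.splitlines():
        if re.match(r"^\s*\d+\)\s*-+", line):
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, _, value = line.strip().partition(": ")
            current[key] = value
    return tasks


def hung_tasks(tasks, project_substr="milkyway", limit=LIMIT_SECONDS):
    """Return executing tasks of the matching project whose elapsed time exceeds the limit."""
    hung = []
    for task in tasks:
        try:
            elapsed = float(task.get("elapsed time", "0"))
        except ValueError:
            continue  # unexpected field format, skip rather than guess
        if (project_substr in task.get("project URL", "")
                and task.get("active_task_state") == "EXECUTING"
                and elapsed > limit):
            hung.append(task)
    return hung


def task_command(task, op):
    """Build the boinccmd argument list for one task; op is "suspend" or "resume"."""
    return ["boinccmd", "--task", task["project URL"], task["name"], op]


def kick(task):
    """Suspend and then resume a task, the manual fix described in the first post."""
    for op in ("suspend", "resume"):
        subprocess.run(task_command(task, op), check=True)
```

To use it, capture the output of `boinccmd --get_tasks` (for example with `subprocess.run(..., capture_output=True, text=True)`), pass it through `parse_tasks` and `hung_tasks`, and call `kick` on each result from a scheduled job every minute or so. The 5 minute limit suits a VII that normally finishes MW tasks in about a minute. Slower cards would need a larger limit.
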
©2024 Astroinformatics Group