Welcome to MilkyWay@home

5970 validate errors (20% ~30% WU's)

popo666
Joined: 18 Dec 09
Posts: 9
Credit: 236,002,715
RAC: 0
Message 56614 - Posted: 22 Dec 2012, 23:19:23 UTC
Last modified: 22 Dec 2012, 23:20:01 UTC

I have obtained an EAH5970 and started crunching this project on that card. The problem is that on some days more than 30% of WUs end in a validation error. The card is not overclocked, and after underclocking it produced approximately the same share of invalid results. I have the most up-to-date drivers. No anonymous platform. An EAH5870 in the same machine works flawlessly. I tried some other projects too (World Community Grid, SETI@home), and just two out of 900 WUs ended with a validation error (that is about 0.2%).
Any suggestions?

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,951,988
RAC: 21,328
Message 56616 - Posted: 23 Dec 2012, 11:56:27 UTC - in response to Message 56614.  

Yup. What else are you running at the same time? Are you perhaps gaming? Whatever it is, it sounds like it is also using the GPU and causing the problems. Try snoozing BOINC while you are gaming, or whatever; then, after about an hour, stop playing, unsnooze BOINC, and see whether the units pick back up and finish okay. You can snooze BOINC by right-clicking the icon down by the clock and selecting Snooze. You can snooze all of BOINC or just the GPU; I would start with all of BOINC and then test whether snoozing just the GPU works okay too.

Snoozing ONLY stops BOINC for 2 hours though, so if you 'play' longer than that in a session you may want to open the BOINC Manager and add the application to the exclusive applications list under Tools > Computing preferences > Exclusive applications. That will stop BOINC whenever that application is running, preventing future problems.

Gaming is the most common reason, but it could also be anything else that uses the GPU a lot. I personally even snooze BOINC when I burn a CD or DVD; it seems to keep the cache full instead of letting it drop down near zero.
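
The exclusive-application setting can also be written to the client's config file instead of being set in the Manager. A minimal sketch, assuming the BOINC client reads `<exclusive_gpu_app>` entries from `cc_config.xml` in its data directory (check the BOINC documentation for your client version). `game.exe` is a made-up executable name and `build_cc_config` is a helper invented for this example:

```python
import xml.etree.ElementTree as ET


def build_cc_config(exclusive_gpu_apps):
    """Return cc_config.xml text listing programs that should pause GPU work."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for exe in exclusive_gpu_apps:
        # One <exclusive_gpu_app> element per executable name.
        ET.SubElement(options, "exclusive_gpu_app").text = exe
    return ET.tostring(root, encoding="unicode")


# "game.exe" is a placeholder, not a program named in this thread.
config_text = build_cc_config(["game.exe"])
print(config_text)
```

The printed text would be saved as `cc_config.xml` in the BOINC data directory, after which the client needs to re-read its config files or be restarted. If CPU work should pause as well, `<exclusive_app>` is, as far as I know, the matching option.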

TJ
Joined: 12 Aug 09
Posts: 262
Credit: 92,631,041
RAC: 0
Message 56617 - Posted: 23 Dec 2012, 13:37:33 UTC

Well, I now have 1994 WUs, and 211 of them are invalid, with two 5870 cards.
No gaming, nothing running other than the two GPUs crunching MW and the CPU crunching Rosetta.

Almost two years without problems here at MW, but in the summer they made some changes in the searches and Travis became less involved, and that was when it went wrong. Latest drivers, Win7 Ultimate, i7, 12 GB RAM, three disks and latest BOINC.
There were also a lot of errors, but those were fixed by Travis in late September.
So no more errors for me, but a lot of invalids.

That's why I sometimes crunch Einstein@home on this machine, and that results in some invalids too, but way fewer.
Running Einstein@home 24/7 on an nVidia card on another machine has not given a single error or invalid.

Seems to me to be a combination of the ATI cards and the software (drivers).
Greetings from,
TJ
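
To put the numbers in this thread side by side, here is a small sketch of the invalid-rate arithmetic. The counts (211 of 1994 here, and 2 of 900 from the first post) are the ones quoted in the posts; the function name is only for illustration:

```python
def invalid_rate(invalid, total):
    """Percentage of work units that failed validation."""
    return 100.0 * invalid / total


tj_rate = invalid_rate(211, 1994)           # two 5870s on MilkyWay
other_projects_rate = invalid_rate(2, 900)  # popo666 on WCG and SETI@home

print(f"TJ on MilkyWay: {tj_rate:.1f}%")
print(f"popo666 on other projects: {other_projects_rate:.2f}%")
```

That works out to roughly one invalid in ten on MilkyWay against about one in 450 on the other projects.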

popo666
Message 56618 - Posted: 23 Dec 2012, 14:27:12 UTC
Last modified: 23 Dec 2012, 14:29:58 UTC

1: No games or other software are running when the errors occur. Errors are produced even after a clean boot of the system. Antivirus software is turned off; I'll turn off autostart of Steam just to be sure. Steam uses the Flash plugin as far as I know, and it has already crashed a few times. My system has a GPU integrated in the CPU and the display is connected directly to it, so the AMD cards have no cable attached to them and should not be used by the system for anything other than BOINC.
2: As far as I know, the 5970 has internal CrossFire. Is there a way to disable it?
3: What are acceptable temperatures for a 5970?
To TJ: the strange thing is that my 5870 works without errors; only the 5970 produces them.

mikey
Message 56629 - Posted: 24 Dec 2012, 12:44:00 UTC - in response to Message 56618.  

I don't know about temps as I don't have a 5970, sorry. And no I do not think you can separate the internal cards from each other.

But one thing you might try is downgrading your driver a notch or two; sometimes the latest and greatest is not the ideal one for crunching.

As for turning OFF your a/v, I would not do that, but I WOULD exclude the BOINC directories from scanning. IF BOINC gets a REAL virus, as opposed to a false positive, it should try to infect the rest of your system and get caught then. IF it is a false positive, who cares, and IF it is a REAL virus but never leaves the BOINC directories, who cares then either! That just means the projects have to deal with it, and if they don't have protection from the thousands of users that are sending files back and forth to them, shame on them!

I do not use Steam but my Flash crashes too and the units still crunch and finish up just fine, so I am not sure that is an issue for Boinc.

Are you overclocking by chance? How about the gpu, are you oc'ing it?

mikey
Message 56630 - Posted: 24 Dec 2012, 12:49:29 UTC - in response to Message 56617.  

You have your PCs hidden so I can't see what drivers you are using, but you too might try backing off a version or two and see if that is better. Are you overclocking your PC or GPU? Are you trying to run multiple units at once? If you answer yes to any of the above questions, try not doing it for a bit and see if that helps. I have two Nvidia 560Ti cards, and on another project one can run two units at once while the other can't; it turns out they are NOT identical, even though they are the same model! GPU crunching can be picky: it is designed to use a certain part of the PC in a certain way, and then we users try to change that for our own purposes. Sometimes it all works and sometimes it doesn't.

popo666
Message 56633 - Posted: 24 Dec 2012, 13:32:13 UTC - in response to Message 56629.  

- Older drivers allowed users with 4970 cards to turn off internal CrossFire; I hoped that it could still be done.
- I'll downgrade drivers as a last option, because I heard that no newer driver will be made for HD5xxx cards, just bugfixes for this one. As there has been no bugfix for this driver, I assume that it's pretty stable. Don't forget that the 5870 in this system crunches without problems, and other projects (SETI, WCG) do not suffer from the validation issue.
- The card is not overclocked, and after UNDERclocking it the error rate did not change, so I don't see temperatures or HW stability as the issue here. I'm now on stock clocks.
- My motherboard has an energy control unit (EPU). I had it in auto mode and have now switched it to HIGH PERFORMANCE mode; hopefully this will help. Has anyone experienced any problems with EPU?

mikey
Message 56639 - Posted: 25 Dec 2012, 12:22:33 UTC - in response to Message 56633.  

I haven't, but I always turn all that kind of stuff off when building my PCs. I even go in and change the settings to NOT allow the system to power down the network card. Everything is running at stock speeds for me, but nothing shuts down. In Win7 I even have the PCs running in performance mode so I can get the max crunching out of them. All of my PCs crunch, even the ones I use for other everyday things. My wife has a laptop from work that doesn't crunch and a Mac laptop that is her personal one; it doesn't crunch, but I don't work on it either. It is a deal we made: if it crunches I will fix it, if not then she is on her own, which means the Apple Store. Fortunately she hasn't had any problems yet.

TJ
Message 56652 - Posted: 27 Dec 2012, 23:31:27 UTC - in response to Message 56630.  

I haven't changed anything since September. It all started when there was a change in the searches.
I don't have the latest drivers; I never use them. They resolve some bugs but create new ones. I even try to stick with a BOINC version that does what I want, but sometimes a newer one is necessary to run certain projects.

I use F-Secure as AV and firewall and have nothing excluded. They solved all the issues with uploads and automatically fetching new work, and everything is running smoothly and safely.

Only MilkyWay produces a lot of invalids (mine can't validate, while my wingmen's results are validated).
Greetings from,
TJ

mikey
Message 56655 - Posted: 28 Dec 2012, 12:41:40 UTC - in response to Message 56652.  

Are you running MW, GPUGRID and Einstein on the same GPUs at the same time? If so, set the task switch interval so that, if you can, no unit gets suspended mid-task before BOINC switches to the next task. I never run more than one GPU project at a time on a single PC, which is a reason to start building towards a 'ranch', meaning 15 or more PCs! I tend to focus on one project, giving myself a goal and then switching my resources to the next project, etc.

popo666
Message 56660 - Posted: 28 Dec 2012, 14:00:16 UTC

But nothing mentioned here explains why just one of my cards produces invalid results, and only under this project. Can someone tell us how validity is measured? I know that there is some difference in the output numbers for each run, but how big does this difference have to be for a WU to be marked invalid? Thanks.
TJ: I'm thinking that the memory on our cards is suffering from extra heat and some errors occur in it. Because I have just the stock cooler, I'm not willing to raise the voltage. Nor can I underclock the memory: in Catalyst Control Center 1000 MHz is the minimum allowed, and this is also stock for the 5970.
Another one for TJ: how many PCIe slots does your motherboard have? Which slot (counting from the top) holds the 5870 that produces errors? I may try to swap the 5870 and 5970 to see if it's some PCIe-related error.

swiftmallard
Joined: 18 Jul 09
Posts: 300
Credit: 303,562,776
RAC: 0
Message 56661 - Posted: 28 Dec 2012, 20:05:40 UTC

You can underclock the memory in Catalyst if you go in and adjust the values in your profile.

TJ
Message 56666 - Posted: 29 Dec 2012, 23:51:25 UTC - in response to Message 56655.  
Last modified: 29 Dec 2012, 23:52:06 UTC

No, only one at a time (GPUGRID cannot run on ATI).
To me it seems to have gone wrong since September, when the new searches and OpenCL were introduced. That is when the invalids started to appear. Previously I had 600, 800, even more than 1300 consecutive valid tasks; now 3 or 5, and I have never seen 10 or more.
It also takes far longer to validate (days, even several weeks); previously it took minutes.
My 5870s are not the newest cards anymore, but until this summer they did great, without errors and with only about one validate error per 2000 tasks.

All systems I build myself in the future will have nVidia cards.
For popo666: the ATI works at 90-92 degrees Celsius with the fan at 50-55%, so no heat problems.
Greetings from,
TJ

popo666
Message 56707 - Posted: 3 Jan 2013, 18:05:05 UTC

New info: sometimes I get a message from the system that the GPU driver stopped responding and was successfully recovered. In that case, each running MilkyWay WU either fails immediately or continues computing indefinitely until I notice and manually abort it. After such an incident, validation errors and computation errors occur at a higher rate than before. The strange thing is that to fix this I don't have to reboot the PC; I just have to wake the monitor (move the mouse, hit a key). I have disabled the monitor timeout in Windows AND in the EPU control panel provided by Asus (the motherboard vendor).
TJ: 92+ Celsius seems like a pretty high temperature to me. My 5970 runs at 82 to 85 Celsius max (ambient temperature is around 20 at this time of year).

EDIT: I have now found that something had set the display timeout in the Windows power manager to 20 minutes. I have disabled it and will monitor whether it stays disabled.

Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 56736 - Posted: 4 Jan 2013, 22:43:30 UTC - in response to Message 56661.  

You can underclock the memory in Catalyst if you go in and adjust the values in your profile.

could you elaborate on this? what profile? where can i find it?

TIA,
Eric

swiftmallard
Message 56737 - Posted: 4 Jan 2013, 23:06:42 UTC
Last modified: 4 Jan 2013, 23:26:06 UTC

Try this at your own risk.

In Windows 7:

Stop Boinc and close Catalyst

Unhide your files/folders

Go to C:\Users\'username'\AppData\Local\ATI\ACE\Profiles

Find the profile you are using and open it in Notepad.

Scroll down and you will find a line that begins:
MemoryClockTarget_PCI_VEN_ ....

Beneath that line you will see something like this:
<Property name="Want_0" value="XXXXX" />
<Property name="Want_1" value="XXXXX" />
<Property name="Want_2" value="XXXXX" />

Whatever those values for XXXXX are, change them to your desired underclock speed in MHz times 100.

For example, if you want to underclock your memory speed to 500MHz, change all three values to 50000.

Change nothing else in the profile. Save and close the file.

Start Catalyst.

Go to the AMD Overdrive section.

Your GPU and memory clock speed sliders are now on the main screen.

Your memory clock slider should now be able to go down to the lowest value you set in the profile, divided by 100, in MHz.

Move the slider to the desired setting and press Apply.

Start Boinc.

If this does not work, uninstall and re-install Catalyst.

Try this at your own risk.
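
The manual edit described above could also be scripted. This is only a sketch under assumptions: the `<Profile>` wrapper, the shortened feature name, and the helper `set_memory_clock` are invented for the example, the real profile file has a lot more in it, and the code works on a string rather than on the file on disk. It uses the same scaling as the 500MHz example (value = MHz times 100):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on a Catalyst profile entry; the <Profile>
# root and the shortened feature name are placeholders, not the real layout.
SAMPLE = """<Profile>
<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_0000">
<Property name="Want_0" value="100000" />
<Property name="Want_1" value="100000" />
<Property name="Want_2" value="100000" />
</Feature>
</Profile>"""


def set_memory_clock(xml_text, target_mhz, name_contains=""):
    """Set each Want_N value in matching MemoryClockTarget features to MHz * 100."""
    root = ET.fromstring(xml_text)
    for feature in root.iter("Feature"):
        name = feature.get("name", "")
        if not name.startswith("MemoryClockTarget_") or name_contains not in name:
            continue
        for prop in feature.iter("Property"):
            if prop.get("name", "").startswith("Want_"):
                prop.set("value", str(target_mhz * 100))
    return ET.tostring(root, encoding="unicode")


patched = set_memory_clock(SAMPLE, 500)
print(patched)
```

With several `MemoryClockTarget` entries in one file, `name_contains` can narrow the change to one of them (for example by its `DEV_` part). Copying the original file somewhere safe first is still the sensible move.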

Sunny129
Message 56738 - Posted: 4 Jan 2013, 23:58:42 UTC

it turns out i have a number of such entries in my profile:

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_679A&amp;SUBSYS_E207174B&amp;REV_00_4&amp;416692D&amp;0&amp;0010A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="75000" />
</Feature>

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_68E1&amp;SUBSYS_30001043&amp;REV_00_4&amp;38A11392&amp;0&amp;0018A">
<Property name="Want_0" value="45000" />
<Property name="Want_1" value="45000" />
<Property name="Want_2" value="45000" />
</Feature>

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_679A&amp;SUBSYS_E207174B&amp;REV_00_4&amp;38A11392&amp;0&amp;0018A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="75000" />
</Feature>

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_68E1&amp;SUBSYS_30001043&amp;REV_00_4&amp;416692D&amp;0&amp;0010A">
<Property name="Want_0" value="45000" />
<Property name="Want_1" value="45000" />
<Property name="Want_2" value="45000" />
</Feature>

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_6719&amp;SUBSYS_31201682&amp;REV_00_4&amp;19D28DA3&amp;0&amp;0018A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="125000" />
</Feature>

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_6719&amp;SUBSYS_31201682&amp;REV_00_4&amp;416692D&amp;0&amp;0010A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="125000" />
<Property name="Want_2" value="125000" />
</Feature>

<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_6719&amp;SUBSYS_E182174B&amp;REV_00_4&amp;38A11392&amp;0&amp;0018A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="130000" />
<Property name="Want_2" value="130000" />
</Feature>


some of the values in some of the entries look familiar. for instance, in the last entry, the 130000 corresponds to default 1300MHz memory clock of my HD 7950. the 15000 i believe corresponds to my attempt to set the memory clock as low as 150MHz, even though it failed to lock in at that clock speed. the 75000 in the top entry corresponds to my HD 7950's current memory clock of 750MHz.

...i'm assuming that i just need to change the first/top "MemoryClockTarget" entry, and that the subsequent entries are just residuals from old GPU presets?
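
One way to see which entry belongs to which card is to decode the entries rather than read them by eye. A rough sketch, assuming the `DEV_xxxx` part of each feature name is the card's PCI device ID and that every value is the clock in MHz times 100 (which matches 130000 for 1300MHz and 75000 for 750MHz). The `<Profile>` wrapper and the function name are made up for the example, and only two of the entries above are included:

```python
import re
import xml.etree.ElementTree as ET

# Two of the entries posted above, wrapped in a placeholder root element.
SNIPPET = """<Profile>
<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_679A&amp;SUBSYS_E207174B&amp;REV_00_4&amp;416692D&amp;0&amp;0010A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="75000" />
</Feature>
<Feature name="MemoryClockTarget_PCI_VEN_1002&amp;DEV_6719&amp;SUBSYS_E182174B&amp;REV_00_4&amp;38A11392&amp;0&amp;0018A">
<Property name="Want_0" value="15000" />
<Property name="Want_1" value="130000" />
<Property name="Want_2" value="130000" />
</Feature>
</Profile>"""


def memory_clock_targets(xml_text):
    """Return (device_id, [clock_mhz, ...]) for each MemoryClockTarget feature."""
    results = []
    for feature in ET.fromstring(xml_text).iter("Feature"):
        name = feature.get("name", "")
        if not name.startswith("MemoryClockTarget_"):
            continue
        match = re.search(r"DEV_([0-9A-Fa-f]{4})", name)
        device_id = match.group(1) if match else "unknown"
        # Stored values are MHz * 100, so divide to get MHz back.
        clocks = [int(p.get("value")) / 100 for p in feature.iter("Property")]
        results.append((device_id, clocks))
    return results


targets = memory_clock_targets(SNIPPET)
for device_id, clocks in targets:
    print(device_id, clocks)
```

Grouping the output by device ID should show which entries belong to a card that is currently installed and which are probably leftovers.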

swiftmallard
Message 56739 - Posted: 5 Jan 2013, 0:10:40 UTC

That would be my guess. I only use the one so I don't have that many. I would pick the one you want to use and adjust that one.

Sunny129
Message 56741 - Posted: 5 Jan 2013, 2:17:13 UTC - in response to Message 56739.  

That would be my guess. I only use the one so I don't have that many. I would pick the one you want to use and adjust that one.

just so we're clear, i don't have several profiles.xml files - it's just a single profiles.xml file with multiple "MemoryClockTarget" entries. at any rate, i'll make a backup of the file just in case something goes wrong in changing the values and i forget how to get it back to its original state.

swiftmallard
Message 56742 - Posted: 5 Jan 2013, 2:20:31 UTC - in response to Message 56741.  

at any rate, i'll make a backup of the file just in case something goes wrong in changing the values and i forget how to get it back to its original state.

Good idea!