Welcome to MilkyWay@home

Still getting 10% invalid results

Message boards : Number crunching : Still getting 10% invalid results

magyarficko
Joined: 22 Jan 09
Posts: 35
Credit: 46,731,190
RAC: 0
Message 38819 - Posted: 18 Apr 2010, 6:38:59 UTC

So here's my situation ....

I left the project for a few days immediately after the new validator was released and the problem with the ATI HD58xx GPUs was discovered.

I've been back for a week or more and suffered with the rest of the folks thru the database crash and repair. So now that things finally seem to be settling down I've been reviewing recent results.

I have two hosts, one with two HD4850's using the provided stock application (no app_info.xml used). And the second host with two HD4890's also using stock application.

All four GPUs are mildly overclocked. The host with the HD4850's does not have a single invalid result listed amongst my results, but meanwhile the host with the HD4890's is suffering from an approximate 10% invalid results rate.

I was used to seeing these invalids before - when the invalid HD58xx results were clobbering mine. But now the invalid results I am seeing have many other HD48xx GPUs as wingmen so I'm not sure what's going on.

The only thing I can think of is it MIGHT be an overclock issue, but I am using the EXACT SAME clocks I used on these GPUs before the new validator and I never had this kind of problem.

I have no idea where or what to look at, but if someone more knowledgeable could have a look or offer suggestions I would be very appreciative.

My good host with NO invalids ...
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=159400&offset=0&show_names=0&state=4

and the other host suffering with an approximate 10% invalid result rate ...
http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=140918&offset=0&show_names=0&state=4


Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,229,009
RAC: 0
Message 38820 - Posted: 18 Apr 2010, 7:17:59 UTC - in response to Message 38819.  
Last modified: 18 Apr 2010, 7:19:00 UTC

Taken from your 'bad' host: "GPU core clock: 920 MHz"
Maybe it's just overclocked too high?
Try lowering your GPU clock to 750 to see if the errors disappear.

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38822 - Posted: 18 Apr 2010, 8:36:40 UTC
Last modified: 18 Apr 2010, 8:58:18 UTC

It's definitely overclocked way too high. This is a bit long, but I decided to give a fuller answer because many people are overclocking 4xxx cards way too high, which is the cause of the vast majority of errors with otherwise stable applications, and it may also help others.

The 4890 uses a slightly redesigned version of the RV770 found on the 4870. The resulting RV790 is pretty much the same, but had some changes allowing the reference clock to go to 850 MHz. The net effect was that AMD could give partners the go-ahead to sell overclocked cards above 900 MHz. They do that by overvolting the card.

Health and Wealth warning - Overvolting incorrectly will damage a card, do it at your own risk

The latter is probably why you are bombing out: you are running above 900 MHz without overvolting. As the card already has improved GPU speed over the 4870 built in, I would expect less room for overclocking on air without overvolting. You would probably get close to 900 MHz, but not over that; it will almost certainly bomb out, and virtually certainly if it was not being overvolted when running above 900 MHz.

Bring it back down to the stock speed of 850/975 and run a few WUs to check all is stable. Then go get "FurMark"; Guru3d.com carries it, it's free, and very good. Then run the card against FurMark and watch the temperature. FurMark (safely) stresses a card about 20% more than normal, so if the card performs with FurMark, it will be rock solid elsewhere. Increase the GPU clock by, say, 15 MHz at a time and watch for artifacts in FurMark, keeping an eye on temperature. Stop increasing the GPU clock as soon as you see just one artifact, back off 25 MHz and you're done; that's your card's limit.
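The step-up-and-back-off procedure above can be sketched in a few lines of Python. This is only an illustration of the search logic, not a tool that touches the card: `find_safe_core_clock` and its `is_stable` callback are hypothetical names, and in practice "is it stable?" is a person watching FurMark for artifacts (or checking WU results) at each clock. The 1000 MHz ceiling is an arbitrary safety stop assumed for the sketch.

```python
from typing import Callable


def find_safe_core_clock(
    stock_mhz: int,
    is_stable: Callable[[int], bool],
    step_mhz: int = 15,
    backoff_mhz: int = 25,
    ceiling_mhz: int = 1000,
) -> int:
    """Raise the clock in fixed steps until the first failure, then back off.

    is_stable is a hypothetical callback: it should return False as soon as
    a single artifact (or invalid result) is seen at the given clock.
    """
    if not is_stable(stock_mhz):
        raise ValueError("card is not stable even at stock clock")

    clock = stock_mhz
    while clock + step_mhz <= ceiling_mhz:
        candidate = clock + step_mhz
        if not is_stable(candidate):
            # First failing clock found: back off from it, never below stock.
            return max(stock_mhz, candidate - backoff_mhz)
        clock = candidate
    # Reached the assumed ceiling without seeing a failure.
    return clock


if __name__ == "__main__":
    # Pretend card that starts to misbehave at 915 MHz and above.
    print(find_safe_core_clock(850, lambda mhz: mhz < 915))
```

With the pretend card in the example, the search passes 865, 880, 895 and 910, fails at 925, and settles 25 MHz below that failing clock.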

That probably explains why you could run at 920 before. No two cards are alike; they come out of manufacturing with different results. A card is sold at a speed rating that leaves room to operate at the speed claimed on the box. That's why there is often room to overclock, and why cards of the same model overclock differently; it's pot luck. However, when overclocking, never leave it at the edge: find the upper limit and back off 25 MHz.

That's probably why you fell over: the MW app changed slightly and placed a higher stress on the GPU, which tipped you over the edge at 920. Meanwhile the improved validator picked up the bad results and marked them invalid. The Golden Rule when overclocking is to back off 25 MHz once the top limit is found; that way you will avoid problems with slight changes to the applications you run. Never ever live on the edge, it's not worth it: you never gain in the long run with failed WUs, and you are in danger of burning out the card.

Once you have found the desirable GPU clock rate, bring your card's memory speed down to 200-300 MHz; the lower the better, even below 200 if it will take it without slowing or errors, as that will save power and heat. MW does not need a high GPU memory speed; all cards at MW, whatever the model, should be reduced to 300 MHz, and lower if you can manage it. Running a higher memory speed at MW is a total waste of power and will just create an improved space heater.

I would expect a safe o/c level for that card without overvolting to be around 900/250, but test it ......

Regards
Zy

magyarficko
Joined: 22 Jan 09
Posts: 35
Credit: 46,731,190
RAC: 0
Message 38823 - Posted: 18 Apr 2010, 10:06:03 UTC
Last modified: 18 Apr 2010, 10:10:16 UTC

OK ... well thanks for the detailed explanation. I've turned the cards back down to default clocks and the invalids have gone away, so I guess it must be as you said. The app changed and pushed me over the edge ... because I never had this before.

I've also downloaded and run FurMark, but to be honest, I have no idea what I'm doing or what I'm looking for -- I wouldn't know an "artifact" if it bit me in the ass. Besides, that screen is so busy it's really hard to see anything.

Plus, what options do I use? I downloaded version 1.8 (the latest version ??) and I can start it in multi-GPU mode or not -- I have two HD4890's in the one box. Then there are options for "displacement mapping", "post fx", etc. I have no idea what that is?

As for dropping memory frequency, I've already been doing that, but I'm using Catalyst Control Center, and that doesn't let me take it lower than 490. Some team-mates suggested MSI Afterburner for over-volting so I may have a look at that, but again, I really don't know how far I can push things so I may not do anything other than looking at the software for now.

I'm glad the invalids are gone, but I'm REALLY bummed about running at default clocks. Wall clock run times have increased roughly 30 seconds for me which is approximately 15%.

Hmmmmmm .... all valid but 15% slower, or run them faster for a 10% error rate -- tough choice!

P.S. The FurMark "Extreme Burning Mode" didn't take my GPU temps anywhere near as high as the MilkyWay WUs do ????

Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 38824 - Posted: 18 Apr 2010, 10:58:59 UTC

Maybe the authors of Furmark should talk to Milkyway@Home to see how they're stressing the cards this thoroughly XD

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38825 - Posted: 18 Apr 2010, 11:37:44 UTC - in response to Message 38823.  
Last modified: 18 Apr 2010, 12:00:00 UTC

> OK ... well thanks for the detailed explanation. I've turned the cards back down to default clocks and the invalids have gone away, so I guess it must be as you said. The app changed and pushed me over the edge ... because I never had this before.

> I've also downloaded and run FurMark, but to be honest, I have no idea what I'm doing or what I'm looking for -- I wouldn't know an "artifact" if it bit me in the ass. Besides, that screen is so busy it's really hard to see anything.

> Plus, what options do I use? I downloaded version 1.8 (the latest version ??) and I can start it in multi-GPU mode or not -- I have two HD4890's in the one box. Then there are options for "displacement mapping", "post fx", etc. I have no idea what that is?

> As for dropping memory frequency, I've already been doing that, but I'm using Catalyst Control Center, and that doesn't let me take it lower than 490. Some team-mates suggested MSI Afterburner for over-volting so I may have a look at that, but again, I really don't know how far I can push things so I may not do anything other than looking at the software for now


Good choice - don't push anything to do with O/C or overvolting until you thoroughly understand the problems at the extreme end of things. Especially overvolting - you really will cook a card pretty quickly if overvolting is not done correctly.

> I'm glad the invalids are gone, but I'm REALLY bummed about running at default clocks. Wall clock run times have increased roughly 30 seconds for me which is approximately 15%.

> Hmmmmmm .... all valid but 15% slower, or run them faster for a 10% error rate -- tough choice!


No contest - stay on the side of the righteous :) It's always better to err on the side of caution. Take your recent experience: probably around 60 invalid WUs in total, at whatever credit per WU. Now add up the potential gain for 60 WUs at 15% - it would take a loooong time to make up the lost ground from going too far on the o/c.
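The trade-off can be put into rough numbers. The sketch below compares valid results per hour at the two settings, using only the approximate figures quoted in this thread: a 10% invalid rate when overclocked, and a 30 second increase that is about 15% of the run time. The 200 s and 230 s run times are assumptions derived from that "30 seconds is roughly 15%" remark, and `valid_per_hour` is a hypothetical helper name.

```python
def valid_per_hour(runtime_s: float, invalid_rate: float) -> float:
    """Expected number of *valid* WUs per hour on one GPU."""
    wus_per_hour = 3600.0 / runtime_s
    return wus_per_hour * (1.0 - invalid_rate)


# Assumed figures: about 200 s per WU overclocked, about 230 s at stock clocks.
overclocked = valid_per_hour(runtime_s=200.0, invalid_rate=0.10)
stock = valid_per_hour(runtime_s=230.0, invalid_rate=0.0)

print(f"overclocked: {overclocked:.2f} valid WUs/hour")
print(f"stock:       {stock:.2f} valid WUs/hour")
print(f"net gain from overclock: {(overclocked / stock - 1.0) * 100:.1f}%")
```

Under these assumed numbers the overclocked card returns only about 3 to 4% more valid work per hour rather than 15%, and that is before counting the wingmen's wasted re-runs or any risk to the hardware.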

> P.S. The FurMark "Extreme Burning Mode" didn't take my GPU temps anywhere near as high as the MilkyWay WUs do ????

It does have the reputation of being the toughest tester on the planet, it really hammers the GPU - it's going to depend on what you did when using it. The same goes as for overvolting: if there is initial doubt, don't use it; park the topic until you have time to research more. The guru3d forum is a great place to start on the extreme ends of computing; there are some really knowledgeable guys there, and provided you remember your "please" and "thank you", they will bend over backwards to help anyone.


I would suggest, from what you said, leaving the overvolting alone (don't go near it ....) until you have read up more about it. With FurMark, it sounds like you need to do some more research on how to use it (guru3d is your friend on this - use it and the forum).

For now, I suggest you downclock the GPU memory to 300, then slooooowly increase the GPU clock in 15 MHz steps - keep a sharp eye on the WU results. As soon as you spot an invalid, back off 15 MHz, then leave it like that. Don't be tempted to keep pushing and pushing; it will fall over in the end without watercooling and/or overvolting, as you found out. Don't get into the latter without more research.

There is a kind of weird e-pine some attach to o/c; don't get sucked into that. Mere mortals like us have more respect for our bank balance and the developers' hard work than to trash either by chasing an e-pine thing (not for a moment suggesting you went down that road, just an illustration to try and put across what I mean).

Memory down to 300, then gentle step-by-step increases of the GPU speed in 15 MHz steps - pausing for a while after each increase to crunch 30/40 WUs to see how it's going - should nail it without too many issues, as long as you keep an eye on temperature. Just watch for the first failed WU and back off, and you'll be fine. You will probably end up around 880-890/300 - maybe - it depends hugely on how lucky you are with the card's build standard.
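A quick calculation shows why a batch of 30 to 40 WUs per step is a sensible size. It is a sketch under a simplifying assumption that is not stated in the thread: each WU fails independently with the same fixed probability. `prob_clean_batch` is a hypothetical helper name.

```python
def prob_clean_batch(n_wus: int, invalid_rate: float) -> float:
    """Chance that n_wus independent WUs all validate at a given invalid rate."""
    return (1.0 - invalid_rate) ** n_wus


# How likely is it that an unstable clock slips through a whole batch unnoticed?
for n in (5, 10, 30, 40):
    p = prob_clean_batch(n, invalid_rate=0.10)
    print(f"{n:>2} WUs at a 10% invalid rate: {p * 100:.1f}% chance of seeing no invalid")
```

At a 10% invalid rate, a handful of WUs can easily come back clean by luck, while a clean batch of 30 to 40 is unlikely (a few percent or less). A much lower invalid rate would need a correspondingly longer run to show up, which is one more reason to back off rather than sit right at the limit.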

Just be conservative, take your time with it, and back off the precipice when you find it - don't stay on the edge.

Here is an overclock guide for a 5970; OK, it's a different card, but the principle is the same. It also shows you how to set the ATI overclock application for a lower memory speed via the ATI Overclock configuration file.

Overclock guide & MSI use & Config file changes

Overclock session on a 5970 at Guru3d.com (principles the same for any card):
5970 overclock session Guru3d.com

Regards
Zy

magyarficko
Joined: 22 Jan 09
Posts: 35
Credit: 46,731,190
RAC: 0
Message 38826 - Posted: 18 Apr 2010, 12:14:10 UTC

All good advice .... and it HASN'T fallen upon deaf ears :)

Thanks again.


Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38830 - Posted: 18 Apr 2010, 13:39:17 UTC - in response to Message 38826.  

Just came across this at Guru3d.com - told you it was your friend :)

It's an overclocking guide for a 4890 .... Have Fun!

Guru3d Overclocking guide for a 4890

Regards
Zy

SkyeHunter
Joined: 6 Mar 09
Posts: 41
Credit: 38,856,291
RAC: 0
Message 38834 - Posted: 18 Apr 2010, 16:24:01 UTC

Got a similar issue with one of my two 4870s. I lost a few percent (3 to 5), but it runs rock stable now .... The new app seems to stress the GPU a bit more. During the server problems, I discovered that DNETC also caused both my cards to produce errors...

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38837 - Posted: 18 Apr 2010, 16:58:17 UTC
Last modified: 18 Apr 2010, 16:58:56 UTC

Might be worth checking your overclock at DNETC; many problems with the "never ending bug" there are related to overclocking. Different apps stress a GPU in different ways, and what works with one app often will not with another when a card is stressed to its maximum.

In that case a slight application change - even within the same application - can stress the card in a different way, and if the card is clocked to the edge, it will fall over. The recent small app change here did stress cards more: I had to increase my fan speed by 5% to bring temperatures back down to 81-83 after they jumped to 89-91 with the new app.

It's likely that changing the speeds on your cards there will produce similar benefits. Apologies if you have already done that ....

Regards
Zy

©2024 Astroinformatics Group