Welcome to MilkyWay@home

Getting lots of Invalid Wus - Help?

Profile Drpop [BlackOps]
Joined: 3 May 10
Posts: 6
Credit: 104,596,950
RAC: 0
Message 42192 - Posted: 16 Sep 2010, 18:22:53 UTC

Hi, this is Jed Smith, I crunch for Team Seti.Usa under the username DrPop.

My ATI 4870 used to do about 95 - 100K per day, but for the last 4 days only around 40 - 60K. I checked my account, and it seems like I am getting TONS of invalid or inconclusive WUs now.

I never had this happen before - is there something I need to set up differently?
Thank you for any help,
DrPop
Profile Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 42194 - Posted: 16 Sep 2010, 18:51:32 UTC - in response to Message 42192.  

Hi, this is Jed Smith, I crunch for Team Seti.Usa under the username DrPop.

My ATI 4870 used to do about 95 - 100K per day, but for the last 4 days only around 40 - 60K. I checked my account, and it seems like I am getting TONS of invalid or inconclusive WUs now.

I never had this happen before - is there something I need to set up differently?
Thank you for any help,
DrPop


GPU core clock: 800 MHz, memory clock: 800 MHz

Reduce the core clock to stock settings or slightly above that, say 775 MHz.
Along with that, drop the memory clock to 500 MHz. That'll save energy and your card will run quite a bit cooler. MW doesn't depend on memory bandwidth.




Profile Drpop [BlackOps]
Joined: 3 May 10
Posts: 6
Credit: 104,596,950
RAC: 0
Message 42195 - Posted: 16 Sep 2010, 19:29:06 UTC - in response to Message 42194.  

Thanks, I will try that!

I wonder why it never bothered anything before? Strange.
I'll let you know how it goes.

Jed
Profile Drpop [BlackOps]
Joined: 3 May 10
Posts: 6
Credit: 104,596,950
RAC: 0
Message 42225 - Posted: 17 Sep 2010, 23:51:58 UTC - in response to Message 42195.  

Hmmm...not sure what's going on. I dropped the rates to 750 / 500 like you said, and now it runs REALLY cool - 56 degrees under load. Amazing.

But, still erroring WUs, I guess...says many of them are inconclusive. This never happened before.

Do you think I should reinstall the driver, or remove the opti app for a while or something?

Thanks for any tips.
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 42226 - Posted: 18 Sep 2010, 0:11:30 UTC
Last modified: 18 Sep 2010, 0:44:38 UTC

You're fine - inconclusive just means "waiting for a wingman" to validate against, hence it's chucked into pendings. Most get sorted within 2-10 days, although inevitably you'll get some stuck in pendings for 2-4 weeks waiting for timeouts on wingmen.

Pendings will climb for roughly 2 weeks or more, then level off as WUs going into pendings equal WUs coming out. RAC takes a temporary dip for that period, then all's well.
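
To illustrate why pendings climb and then level off, here is a rough sketch in Python. It assumes a fixed number of results reported per day and a single fixed wingman delay. Both figures are made up for illustration and are not taken from the project, and real wingmen respond at varying times.

```python
def pending_over_time(reported_per_day, wingman_delay_days, days):
    """Return the pending count at the end of each day.

    Simplifying assumption: every result validates exactly
    `wingman_delay_days` after it is reported.
    """
    pending = []
    for day in range(1, days + 1):
        reported = reported_per_day * day
        # Results older than the wingman delay have left the pending pool.
        validated = reported_per_day * max(0, day - wingman_delay_days)
        pending.append(reported - validated)
    return pending


# Hypothetical figures: 400 results a day, wingmen answer after 5 days.
history = pending_over_time(reported_per_day=400, wingman_delay_days=5, days=10)
print(history)
```

In this toy model the count rises for the first five days and then stays flat, because each day's new pendings are matched by the same number being validated.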

Alarm bells only need to ring if the invalid or error tab starts filling; pendings (inconclusive) are fine.

EDIT:
Forgot this bit:
Hmmm...not sure what's going on. I dropped the rates to 750 / 500 like you said, and now it runs REALLY cool - 56 degrees under load. Amazing.


Think of memory "speed" (it's really bandwidth, not speed) like a freeway. If you normally only have light traffic, a 3-lane freeway, say, will do. If it's mega busy with heavy city traffic, you probably need a 6-lane freeway. It's pointless building a 6-lane freeway for light traffic; it's a waste of resources.

Same with memory: the rating refers to bandwidth (lanes on the freeway). MW does not need high bandwidth for the data it passes through; only relatively light loads of data pass at any one instant, so a lower bandwidth is fine and uses less resource, so less heat is generated. Higher bandwidth is pointless because, like the car on an empty freeway, the work will not travel any faster (as such).

Regards
Zy
Profile banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 42228 - Posted: 18 Sep 2010, 2:40:05 UTC - in response to Message 42225.  

Hmmm...not sure what's going on. I dropped the rates to 750 / 500 like you said, and now it runs REALLY cool - 56 degrees under load. Amazing.

But, still erroring WUs, I guess...says many of them are inconclusive. This never happened before.

Do you think I should reinstall the driver, or remove the opti app for a while or something?

Thanks for any tips.


You only have one listed as invalid, you should be ok.
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.
Profile Drpop [BlackOps]
Joined: 3 May 10
Posts: 6
Credit: 104,596,950
RAC: 0
Message 42229 - Posted: 18 Sep 2010, 3:24:01 UTC - in response to Message 42228.  

Thanks for the replies, it's making more sense, then. Just have to wait for the wingmen to catch up.

I appreciate you explaining the memory too - I'm glad Crunch3r told me to drop it that low...I thought 800 was low already. :-)
It runs a good 10 to 15 degrees cooler now, depending on time of day and other factors.
Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 42231 - Posted: 18 Sep 2010, 9:57:42 UTC - in response to Message 42229.  
Last modified: 18 Sep 2010, 10:49:03 UTC

Just keep your eye on invalid/error WUs in the tasks tab until it all settles. If you still get regular invalids, even infrequently (around one or two a day), reduce your setting from 775 down to, say, 765 or 760. Probably not needed, as it's looking good at present; just keep your eye on it.

Don't be tempted to push the card to the absolute limit, running the highest possible settings within a few MHz of where errors appear - it will, eventually, burn out. It's like a car: travel at full speed and high revs forever and it wears out much faster than if it were driven conservatively. Don't be fooled by months of "it's ok" running at full stretch to the last possible MHz; all that's happening is that wear and tear is drastically increased, and it will die on you.

General rule of thumb: find the settings where there are no errors over time, then reduce by 15 MHz (or just leave it at default settings ...) and it will purr away happily forever. In a very general sense, if you overclock more than 5-10% without taking other precautions and giving it extra TLC, you are pushing your luck in the long run.
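
As a rough sketch of that rule of thumb, the Python below backs off 15 MHz from the highest error-free clock and caps the result at 10% over stock. The function name, the 750 MHz stock figure and the example clocks are assumptions for illustration only; check your own card's stock clock before using numbers like these.

```python
def safe_core_clock(highest_error_free_mhz, stock_mhz,
                    margin_mhz=15, max_overclock_pct=10):
    """Apply the rule of thumb: back off by a margin, then cap the overclock.

    All inputs are illustrative; this is not a tool that measures stability.
    """
    backed_off = highest_error_free_mhz - margin_mhz
    # Upper limit from the 5-10% guideline (10% used here).
    ceiling = stock_mhz * (100 + max_overclock_pct) / 100
    return min(backed_off, ceiling)


# Assumed stock clock of 750 MHz and a hypothetical error-free limit of 790 MHz.
print(safe_core_clock(790, 750))   # margin applies: 775
# A card stable up to 850 MHz is still capped at 10% over stock.
print(safe_core_clock(850, 750))   # cap applies: 825.0
```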

The card will run at those faster 5-10%+ speeds, but leave it alone; don't go there unless you get to the stage of really understanding overclocking - it's an expensive way to learn lessons (!).

edit:
Just thought ... for a bit of fun ... if you are looking at learning more about overclocking, take a look at the link below. It's an extreme overclock session with liquid nitrogen by one of the leading gurus on LN overclocking. During the session he broke three world records. Guru3d also do overclock sessions with each card as it comes out, so it's always worthwhile looking up your own card in past Guru3d reviews to get a feel for what it can do. It's basic everyday overclocking done by them, but useful to read up on; these guys know what they are doing.

Do be careful ... don't get carried away on a wave of enthusiasm for it (!) ... it's expensive going that learning route.

There is a forum attached to the site where there are some genuine uber-geek guys, the types who take subliminal learning whilst asleep about their latest nano computer build and think going to a seminar on the latest IBM registers is a holiday treat. They are a good bunch of guys, and will help anyone as long as you remember the pleases and thank-yous. Bowing and scraping not expected, just common decency and courtesy.

(Just watch out for the usual loudmouth wannabe surfacing there. The guys slap them down hard when they surface; just pause to make sure you are dealing with a reputable guy - the rest will dive in to protect you if they spot a numpty sounding off.)

Liquid Nitrogen Session at Guru3d

Regards
Zy
Profile Drpop [BlackOps]
Joined: 3 May 10
Posts: 6
Credit: 104,596,950
RAC: 0
Message 42286 - Posted: 21 Sep 2010, 16:17:55 UTC - in response to Message 42231.  

Thank you, I enjoyed the detailed explanation of all this. I will check into that link when I have some time to learn more.

Seems like things are going better on the card - have not had an invalid WU since 9/14 and my numbers are slowly going back up (did nearly 70K yesterday).

Thanks again,
Jed
James Nunley

Joined: 29 Nov 07
Posts: 39
Credit: 74,300,629
RAC: 0
Message 42305 - Posted: 22 Sep 2010, 18:39:19 UTC

What I find odd is your 4870 is running almost 400 seconds per work unit. My 4850 at 727 MHz is running right around 200 seconds per unit, so I'm not sure what is going on there.
Profile The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 42306 - Posted: 22 Sep 2010, 19:39:11 UTC - in response to Message 42305.  

What I find odd is your 4870 is running almost 400 seconds per work unit. My 4850 at 727 MHz is running right around 200 seconds per unit, so I'm not sure what is going on there.

Running 2 at once, most likely. DrPop may want to try going back to 1 WU at a time and see if the number of errors reduces.

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 42307 - Posted: 22 Sep 2010, 19:48:04 UTC

A GPU can take more than one WU at any one time, but they are not processed any faster. Each gets its "turn" individually for processing as time goes on. The time taken therefore appears to be doubled (400 secs), but on completion two WUs are ready, not one.

Therefore the effective throughput is still the same, although there is a minimal marginal advantage in overall loading time into the GPU. Two WUs loaded at the same time (via an app_info file) therefore, in this case, come out every 400 secs. The effective throughput is one WU per 200 seconds, precisely what you are getting.
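
The arithmetic can be sketched in a few lines of Python. The function names are made up for illustration; the 400 and 200 second figures are the ones quoted in this thread, and the sketch ignores the small load/unload saving mentioned above.

```python
SECONDS_PER_DAY = 86400


def effective_seconds_per_wu(wall_seconds, concurrent_wus):
    """Wall-clock time per batch divided by the WUs finished in that batch."""
    return wall_seconds / concurrent_wus


def wus_per_day(wall_seconds, concurrent_wus):
    """WUs completed in a day at a steady rate."""
    return SECONDS_PER_DAY * concurrent_wus / wall_seconds


# Two WUs at once, each taking 400 s of wall time.
print(effective_seconds_per_wu(400, 2))
# Same daily output as one WU at a time taking 200 s.
print(wus_per_day(400, 2), wus_per_day(200, 1))
```

Both configurations come out at the same number of WUs per day, which is why the 400 second run time is nothing to worry about on its own.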

The advantage of this method is that it cuts down on "thermal cycling". When a WU first starts, and when it finally finishes, there is a load/unload time. During that time the card cools, then rapidly reheats as the loaded WU(s) start crunching.

The worst enemies of computer silicon are the various forms of heat it puts up with. Overwhelmingly, it's the interior temperature in the case, and pushing the card too hard together with poor airflow.

However, thermal cycling plays its part as enemy number two, albeit in a far less abrasive way. That is why it's better from a longevity point of view to minimise turning PCs on and off during use.

So thermal cycling is not something to lose sleep over as such, but it's there, so if it is easily minimised, as in this case, why not ...

Regards
Zy
Profile Conan
Joined: 2 Jan 08
Posts: 122
Credit: 69,480,163
RAC: 1,419
Message 42425 - Posted: 28 Sep 2010, 9:04:00 UTC

I too am having problems with work units that error out.

I have three 5870 cards, all the same type with one in one computer and two in the other.
The single card is running just fine without a problem or an error.

The two card set up is the one with the issues.
I was wondering why my RAC was not moving up very fast on the dual card machine so I checked the work units.
A very large number were getting 0.00 and classed as invalid.
Nearly all were paired with a 64 bit computer and when another 64 bit computer was added to check results I was always kicked out of the Quorum.
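
For anyone unsure how a result ends up "kicked out" of a quorum, here is a minimal sketch in Python. It is not the project's actual validator: the function name, the tolerance, the host names and the result values are all assumptions made for illustration. The idea is simply that the largest group of agreeing results wins once it reaches the quorum size, and anything outside that group is marked invalid.

```python
def validate_quorum(results, tolerance=1e-6, min_quorum=2):
    """Return {host: is_valid}, or None while the WU is still inconclusive.

    `results` maps a host name to the single number it returned.
    Illustrative only; a real validator compares far more than one value.
    """
    best_group = []
    for value in results.values():
        # Hosts whose result agrees with this one within the tolerance.
        group = [h for h, v in results.items() if abs(v - value) <= tolerance]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) < min_quorum:
        return None  # no agreement yet, wait for another wingman
    return {host: host in best_group for host in results}


# Two disagreeing results: inconclusive, so a third host is needed.
print(validate_quorum({"my_gpu": -3.0012, "wingman_a": -3.0000}))

# The third host agrees with wingman_a, so my_gpu is outvoted.
print(validate_quorum({"my_gpu": -3.0012,
                       "wingman_a": -3.0000,
                       "wingman_b": -3.0000}))
```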

I have a large pending amount as well, this is probably related to the inconclusive results but all are now getting 0.00 as well when quorum is met.

Settings are the same on all cards, app_info file is at default settings.

I have now stopped these two cards from doing Milkyway and gone back to DNETC which is having no such problems at all, leaving my single card doing Milkyway.

I seem to recall I may have had this problem before with my 4870 cards, so I stopped them as well.

I now have new cards and updated the application as well but still have the same issue.

Not sure what is going on. I did downclock the single card, but as I was doing Collatz as a backup project I put it back to normal as it all slowed down.
Cthulhu

Joined: 22 Jan 10
Posts: 3
Credit: 71,094,774
RAC: 0
Message 42610 - Posted: 6 Oct 2010, 1:47:27 UTC

Lots of recent errors on my 5870 as well. Typically they are of two cases: either 1) "too many errors" where nobody can validate, which results in "Completed, marked as invalid" for me, or 2) I get ganged up on by a couple of v0.19 users compared to my 0.23 (ati13ati) version. I haven't changed video drivers or anything BOINC-related as things were running pretty smoothly since I put in the ATI 58x0 fix. Well, until perhaps about a week ago, when I noticed I was getting lots of "Completed, marked as invalid" and "Completed, can't validate." I have always been running the memory downclocked and have been dropping the core back towards stock speed, but errors are still occurring. Have there been any recent requirements to upgrade drivers for GPU crunching I may have missed posted on the message boards?

Thanks for any insight/advice,
Brent


©2024 Astroinformatics Group