Welcome to MilkyWay@home

Many "errors while computing" in the stats!

Message boards : Number crunching : Many "errors while computing" in the stats!

cenit
Joined: 16 Mar 09
Posts: 58
Credit: 1,129,612
RAC: 0
Message 38631 - Posted: 12 Apr 2010, 15:12:36 UTC

When I browse my tasks, I see that more or less every work unit has at least one "Error while computing" among the hosts processing it (often, if not always, after 0.00 seconds).
From your side, do you see that many bad configs? What could it be? On my side, MilkyWay has a history of full reliability (ATI app); I have never trashed a WU unless intentionally.

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38632 - Posted: 12 Apr 2010, 15:35:44 UTC
Last modified: 12 Apr 2010, 15:38:13 UTC

When did the errors start? The reason for asking is that version 0.23 came out recently, and it was harder on the GPU than previously and ran hotter. Some cards overclocked right to the edge will have been pushed over it by the change. That would be caused by too high a GPU clock rate and/or not enough overvolting, depending on the o/c being done.

If you do o/c, try a session for a couple of hours at stock speed, see what happens.

Along the lines of the new version, have you detached and reattached via BOINCstats Hosts? That will make sure you have all the correct, up-to-date files coming in.

Are other projects behaving? You might have a memory stick going on the mainboard; unlikely for sure, but they do go every now and then, albeit rarely.


Regards
Zy

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 38633 - Posted: 12 Apr 2010, 16:38:12 UTC
Last modified: 12 Apr 2010, 16:40:35 UTC

That's exactly what I have been getting for days but I may have solved it.

I was running the enhanced app 0.20b; I've now upgraded to 0.22. I was also running an earlier version of BOINC Manager (6.4.7), so I took out the <coproc> lines in the app_info file. Since then I updated to 6.10.43 but forgot to put the lines back. I've now done so. As I'm running Catalyst 8.12 for stability, I also made sure I downloaded the AMD versions of the enhanced apps.

It seems to have worked so far (2 hours); we will see. If the CPU and GPU workunits are different, then it could have been that the GPU was running the wrong ones.
Don't drink water, that's the stuff that rusts pipes
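
Editor's note: for anyone wanting to check whether the coprocessor lines mentioned in the post above are present in their own app_info file, here is a minimal sketch. It assumes the element is named `coproc` and sits inside each `app_version` block, as in BOINC's anonymous-platform format; the sample XML, the app name, and the version numbers are made up for illustration and are not taken from any real MilkyWay package.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_info.xml content; app name and version numbers are
# illustrative only, not copied from a real MilkyWay@home download.
SAMPLE = """
<app_info>
  <app><name>milkyway</name></app>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>23</version_num>
    <coproc><type>ATI</type><count>1</count></coproc>
  </app_version>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>22</version_num>
  </app_version>
</app_info>
"""


def versions_missing_coproc(xml_text):
    """Return (app_name, version_num) for each app_version without a coproc block."""
    root = ET.fromstring(xml_text)
    missing = []
    for app_version in root.iter("app_version"):
        # find() only looks at direct children, which is where coproc is assumed to live
        if app_version.find("coproc") is None:
            missing.append(
                (app_version.findtext("app_name"), app_version.findtext("version_num"))
            )
    return missing


if __name__ == "__main__":
    for name, version in versions_missing_coproc(SAMPLE):
        print(f"{name} version {version} has no coproc block")
```

To run it against a real file, read the file's text and pass it to `versions_missing_coproc`; an empty list would mean every `app_version` entry declares a coprocessor.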

Beyond
Joined: 15 Jul 08
Posts: 383
Credit: 729,293,740
RAC: 0
Message 38635 - Posted: 12 Apr 2010, 17:45:06 UTC - in response to Message 38633.  

I was running the enhanced app 0.20b; I've now upgraded to 0.22. I was also running an earlier version of BOINC Manager (6.4.7), so I took out the <coproc> lines in the app_info file. Since then I updated to 6.10.43 but forgot to put the lines back. I've now done so. As I'm running Catalyst 8.12 for stability, I also made sure I downloaded the AMD versions of the enhanced apps.

v0.23 is current. Earlier versions will give incorrect results especially with 58xx/59xx cards. Also try v10.3 drivers, the stability issues were solved long ago AFAIK.

http://milkyway.cs.rpi.edu/milkyway/apps.php


Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 363
Credit: 258,227,990
RAC: 0
Message 38636 - Posted: 12 Apr 2010, 17:45:14 UTC - in response to Message 38632.  
Last modified: 12 Apr 2010, 17:49:56 UTC

When did the errors start? The reason for asking is that version 0.23 came out recently, and it was harder on the GPU than previously and ran hotter. Some cards overclocked right to the edge will have been pushed over it by the change. That would be caused by too high a GPU clock rate and/or not enough overvolting, depending on the o/c being done.


I don't know where you got the impression that the "new" app works the gpu any harder than the previous one.

They're the same. The only thing that "changed" was the handling of the streams in regard to the 58xx series. That's all.


If you do o/c, try a session for a couple of hours at stock speed, see what happens.

Along the lines of the new version, have you detached and reattached via BOINCstats Hosts? That will make sure you have all the correct, up-to-date files coming in.

Are other projects behaving? You might have a memory stick going on the mainboard; unlikely for sure, but they do go every now and then, albeit rarely.


Regards
Zy


I don't think it's too much OC either. I suspect the 10.3 drivers are quite buggy/unstable and of no use here.

I had something similar happening here where 1.4.556 computed only garbage and after downgrading to 10.2 everything is fine again.

Join Support science! Join Team BOINC United now!

cenit
Joined: 16 Mar 09
Posts: 58
Credit: 1,129,612
RAC: 0
Message 38638 - Posted: 12 Apr 2010, 18:01:17 UTC - in response to Message 38636.  

Maybe I expressed myself wrongly.
I do not have any problem!
In fact, you can see that the few WUs I do on my 4870 are all done well, using the stock app and without OC.
I was only pointing out that many (all?) of my WUs have wingmen with problems. In many if not all of them, someone has errored out in no time. Those machines may be really stressful on the database, am I wrong?

I was curious how there could be so many hosts doing bad work (or, better said, no work at all, because they often error out after 0.00 seconds of computation, mainly on the CUDA and ATI apps).

Is it clearer now?

SkyeHunter
Joined: 6 Mar 09
Posts: 41
Credit: 38,856,291
RAC: 0
Message 38641 - Posted: 12 Apr 2010, 19:11:48 UTC - in response to Message 38636.  


I don't know where you got the impression that the "new" app works the gpu any harder than the previous one. They're the same. The only thing that "changed" was the handling of the streams in regard to the 58xx series. That's all.

I don't think it's too much OC either. ....


I clocked about 50 MHz back in order to keep the app stable; and still the GPU runs 4 to 5 degrees hotter than before...

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38657 - Posted: 12 Apr 2010, 22:05:35 UTC - in response to Message 38641.  

I had to increase the fan by 5% when I started the new 0.23, as it had jumped to running at 89-91 degrees. It is now back to running at 81-83 degrees.

Regards
Zy

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38660 - Posted: 12 Apr 2010, 22:54:32 UTC - in response to Message 38638.  

Maybe I expressed myself wrongly.
I do not have any problem!
In fact, you can see that the few WUs I do on my 4870 are all done well, using the stock app and without OC.
I was only pointing out that many (all?) of my WUs have wingmen with problems. In many if not all of them, someone has errored out in no time. Those machines may be really stressful on the database, am I wrong?

I was curious how there could be so many hosts doing bad work (or, better said, no work at all, because they often error out after 0.00 seconds of computation, mainly on the CUDA and ATI apps).

Is it clearer now?


I'm not seeing very many errors on my end. Also, the number of invalids the server is seeing is down to < 2% (including results that just error out). Swapping to the new application should drop this even farther.

cenit
Joined: 16 Mar 09
Posts: 58
Credit: 1,129,612
RAC: 0
Message 38667 - Posted: 13 Apr 2010, 8:14:45 UTC - in response to Message 38660.  

I'm not seeing very many errors on my end. Also, the number of invalids the server is seeing is down to < 2% (including results that just error out). Swapping to the new application should drop this even farther.

good!
thanks a lot!

Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,229,009
RAC: 0
Message 38668 - Posted: 13 Apr 2010, 9:12:33 UTC - in response to Message 38657.  

I had to increase the fan by 5% when I started the new 0.23, as it had jumped to running at 89-91 degrees. It is now back to running at 81-83 degrees.

Regards
Zy


Same experience here.
The temp. is something like +5 to +10 Celsius up with the new 0.23 apps,
but my new PowerColor Radeon HD 5870 1GB LCS will arrive today.
Fry on! :)

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 38670 - Posted: 13 Apr 2010, 9:31:03 UTC - in response to Message 38635.  

Many thanks for the response there Beyond, appreciated. I've checked again this morning and so far my 7 boxes with cards are reporting and validating OK. Crunch3r feels that the 10.3 drivers are buggy and thinks 10.2 is better. I'd be quite happy to upgrade to 0.23, but it needs driver 9.3 or above.

Any consensus out there as to which Catalyst driver, 9.3 or above, is now stable? The other problem I had was that some of the boxes have AGP cards and I had to use the 8.12 AGP hotfix version to get them to work. Is there a hotfix for AGP above 9.3?
Don't drink water, that's the stuff that rusts pipes

Zydor
Joined: 24 Feb 09
Posts: 620
Credit: 100,587,625
RAC: 0
Message 38672 - Posted: 13 Apr 2010, 9:43:18 UTC - in response to Message 38670.  
Last modified: 13 Apr 2010, 9:45:18 UTC

I've used 10.3 on a 5970 for a couple of weeks now and it has been stable for me, but clearly that's not an indication it works for all, as everyone has their own specific circumstances that the driver responds to.

I don't know the situation re AGP on 10.3. If 8.12 had an AGP hotfix, I would have thought that by now it would have been incorporated into the main driver; however, the latter is a guess only.

Give 10.2 or 10.3 a whirl on one AGP and one 5xxx/4xxx box and see how it goes for 24 hrs as a test. It would be a good thing overall to run all the boxes on a 10.x series driver in the longer term.

Regards
Zy

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 38674 - Posted: 13 Apr 2010, 10:15:09 UTC

Thanks Zydor, good advice, I'll upgrade a couple of boxes and see how I get on.
Don't drink water, that's the stuff that rusts pipes

The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 38676 - Posted: 13 Apr 2010, 10:40:49 UTC

With the recent MW outage I took the time to move my P4 with the 3850 AGP from an old box to a newer one, and while I was at it I upgraded it to 10.3, which didn't take, so I tried the 10.3 hotfix and it worked a treat.

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 38678 - Posted: 13 Apr 2010, 12:01:39 UTC

With the recent MW outage I took the time to move my P4 with the 3850 AGP from an old box to a newer one, and while I was at it I upgraded it to 10.3, which didn't take, so I tried the 10.3 hotfix and it worked a treat.


That is just what I wanted to hear! Well done GG, I owe you a beer! :-)
Don't drink water, that's the stuff that rusts pipes

arkayn
Joined: 14 Feb 09
Posts: 999
Credit: 74,932,619
RAC: 0
Message 38697 - Posted: 13 Apr 2010, 22:54:00 UTC

I have been running 10.3 since before it officially came out, but I needed it as the original drivers were 10.2 and did not completely support the card.

Conan
Joined: 2 Jan 08
Posts: 122
Credit: 69,479,692
RAC: 1,456
Message 38891 - Posted: 20 Apr 2010, 0:59:15 UTC

I have been away from Milkyway for a while as I was doing DNETC, but it started to throw "xfer_error" errors at a rate of 25%.
So I switched back to Milky to test the cards.
Not a good move unfortunately, as all I kept getting were VPU reset errors. I did not get those before on any project and was not getting them when I last left.
I updated to the latest 0.23 version but nothing changed: as soon as I try to do some work on the same Windows computer that is running Milky, the computer freezes and then a VPU reset error follows. Only the one card locks up, and a suspend/resume will allow the WU to start again with a much longer run time, as the process time does not stop; only the percent done stops.

I have now switched to Collatz and after again updating the app version, it is running great with no hassles and I can use the computer at the same time without it freezing.

I am using two 4800 cards, heat does not appear a problem and the cards are standard.

So I am at a loss as to why I can no longer run Milkyway.

Dan T. Morris
Joined: 17 Mar 08
Posts: 165
Credit: 410,228,216
RAC: 0
Message 38892 - Posted: 20 Apr 2010, 1:59:51 UTC - in response to Message 38891.  
Last modified: 20 Apr 2010, 2:02:42 UTC

The newer wu's seem to run hotter for me so I lowered my GPU mem speed and all is well now.

Lowering the mem speed did not affect my output very much.

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 38906 - Posted: 20 Apr 2010, 9:59:56 UTC

OK, over the w/e I upgraded an HD4850 box to catalyst 9.3 and the 0.23 ati app and all seems well so far. I'll give it a couple more days and then do some others. Haven't noticed any heat issues, but that's probably because I run Akasa Vortexx Neo after-market coolers :-)
Don't drink water, that's the stuff that rusts pipes

©2024 Astroinformatics Group