Welcome to MilkyWay@home

What is the cause of these 'validate errors'

Message boards : Number crunching : What is the cause of these 'validate errors'

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 62943 - Posted: 3 Jan 2015, 23:25:09 UTC

Well, I did also check to see if I can still run my 6990 with 12.4 and it still works, but anything past 12.8 fails miserably.

I wish someone somewhere (maybe the development folks at MW?) would nail down the 69xx series problems and dual GPU problems so more folks could participate.

Seems to ME we sorta PAID them to do it, didn't we?

How about some help guys!

Good crunching and Happy New Year!

8-)
ID: 62943 · Rating: 0

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 539,988,638
RAC: 87,054
Message 62944 - Posted: 4 Jan 2015, 4:40:13 UTC - in response to Message 62898.  

I just got back to crunching recently and am running into the same problem as you guys are. However I'm only running one 270x and am getting the random tasks not validating.

I'm not sure if it was a certain batch of WUs or what, but I'm down to just one invalid currently. I'll keep an eye on it and post back if it goes back up.


I saw a comment somewhere about what the problem might be. Essentially, the cleanup work done at the end of a computation gets interrupted by an overloaded computer, probably because you are also running other projects and there aren't enough resources to finish the task-completion step. The only solution is not to run other projects at the same time and/or to wait for the programmers to fix the underlying communications that are getting fouled.

Cheers, Keith
ID: 62944 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 62986 - Posted: 11 Jan 2015, 4:25:51 UTC - in response to Message 62812.  
Last modified: 11 Jan 2015, 4:26:27 UTC

That IS a lot of testing I wouldn't have the patience for. I run two identical Nvidia cards in Sabertooth 990FX motherboards. I had the issue with the GTX 670's and now still with the GTX 970's. The cards will run the tasks from SETI and Einstein with NO issues. Just have the problem with the Modified Fit tasks ... about a 3% error rate. No problem with the standard GPU tasks. I ran with the problem with the 670's, stopped the Modified Fit and retried with the 970's to see if the change in cards might have fixed the issue. Still have the problem so have turned off the Modified Fit again. I think, like you do, that the problem is with BOINC or the application itself. Running the latest BOINC and only a point version or two behind the current drivers. Had to update the Nvidia drivers so they could handle the 970's.

As I stated, the error rate is only about 3% on the Modified Fit. The majority of tasks finish correctly. I think there is some kind of contention problem going on. Whether it is with the application, BOINC or the hardware.... I don't know. I tried running single task, single project on the cards and it made no difference in the error rate. From your experiments, I kinda don't think it is a hardware issue. Wish the project app developers would chime in on this observed behavior and issue some kind of statement on what the issue is.

Cheers, Keith


With Einstein being down, I've been running MW on another system... a totally stable XEON E3-1230 V2 3.5GHZ on new Z77 mobo and have 182 invalid and 4155 valid on a 7970 using AMD 14.12 drivers and BOINC 7.4.36... looks like about 4.5% error rate at slightly less than stock speeds.
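
For anyone wanting to check the percentage, the invalid fraction is just invalid / (invalid + valid). A minimal Python sketch using only the counts quoted above; the function name `invalid_fraction` is made up for illustration:

```python
def invalid_fraction(invalid: int, valid: int) -> float:
    """Share of validated tasks that came back invalid."""
    total = invalid + valid
    if total == 0:
        raise ValueError("no validated tasks to compute a rate from")
    return invalid / total


# Counts quoted in the post: 182 invalid and 4155 valid on the 7970.
rate = invalid_fraction(182, 4155)
print(f"{rate:.1%} of validated tasks were invalid")
```

With these counts the fraction comes out a little over 4%, the same ballpark as the rough figure quoted in the post.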

Soo, there is for sure a problem with the tasks somehow and it would be really nice if some developer could figure it out... All these errors can't be helping the project...

8-)
ID: 62986 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 63054 - Posted: 19 Jan 2015, 18:41:52 UTC

Hmm, seems one of my systems lost a Mobo... the 8x8 PCIe 2.0 one... sooo, replaced it with a Sabertooth 16x16 and going to try the dual card thing again...

Seems the Mobo lost a PCIe lane on one slot... that is possibly due to a card that went bad some months ago.

It is possible the DUAL card tests are not so accurate because of that.

Anyways, we will see... running a 7970 and GTX 670 together on the Sabertooth to see how it goes..

8-)
ID: 63054 · Rating: 0

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 539,988,638
RAC: 87,054
Message 63055 - Posted: 19 Jan 2015, 21:37:13 UTC - in response to Message 63054.  

I have just accepted that I can live with a 3% error rate. That has stayed consistent for the tasks completed on MilkyWay. Only project that is giving me errors. None at Einstein. The ones at Seti are from known incorrectly formatted tasks. I try to abort those as soon as I find them so I don't process them and waste time and energy. I tried an experiment where I shut off the 1.36 tasks to see if only those gave errors since I didn't see any at first on the 1.02 tasks but soon found out that the 1.02 tasks were just as likely to have the 3% error rate too. So now crunching both types of tasks with the stable error rate. I don't think there is any solution other than wait for the developer to fix the apps so that the task completion mechanism isn't constrained by contention of GPU resources. I found out that even when just solely running MW by itself on a card that you get the same error rate.

I only give MW 15% of system resources anyway and 80% to Seti. Einstein gets the 5% leftover since their GPU app is so poorly optimized for Nvidia cards.

I have very good luck with my two Sabertooth motherboards. Hope you find the same.

Cheers, Keith
ID: 63055 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 63069 - Posted: 25 Jan 2015, 3:36:53 UTC - in response to Message 63055.  



I have very good luck with my two Sabertooth motherboards. Hope you find the same.

Cheers, Keith



So far so good on the Sabertooth... running like a clock since I replaced the suspect mobo...

I suppose the error rate is going to stay with us as well... mine is decreasing now and is less than 1% overall... happy with that! Like my OLD Sabertooth 990X as well...

15600 tasks, 80 errors...

8-)
ID: 63069 · Rating: 0

Ananas
Joined: 19 Aug 08
Posts: 12
Credit: 2,500,263
RAC: 0
Message 63072 - Posted: 25 Jan 2015, 12:35:00 UTC
Last modified: 25 Jan 2015, 12:36:09 UTC

Just out of curiosity: unlike all other projects, the validator here seems to grab a workunit when only one out of two required results is back. Is there a reason for that, and does a single result with a wingman that is still "in progress" even have a chance to be valid if the quorum is higher than 1?

To me this looks like a strange behaviour of the validator (or transitioner). I sometimes have results with quorum=2, initial replication=2, only one result back but that one has the status "Completed, validation inconclusive" instead of the status "Completed, waiting for validation" that I would expect for a result that isn't ready for a first validator attempt yet.
ID: 63072 · Rating: 0

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 63074 - Posted: 25 Jan 2015, 12:50:12 UTC - in response to Message 63072.  

My belief is that at some point in the dim and distant past, somebody set an obscure configuration policy on the server, and subsequently forgot all about it:

Adaptive Replication
ID: 63074 · Rating: 0

Ananas
Joined: 19 Aug 08
Posts: 12
Credit: 2,500,263
RAC: 0
Message 63075 - Posted: 25 Jan 2015, 14:25:05 UTC - in response to Message 63074.  

Thanks, that explains it. It also explains why results are "born" with quorum and replication = 1, which is increased during their lifetime.
ID: 63075 · Rating: 0

Ananas
Joined: 19 Aug 08
Posts: 12
Credit: 2,500,263
RAC: 0
Message 63078 - Posted: 25 Jan 2015, 23:07:06 UTC - in response to Message 63075.  
Last modified: 25 Jan 2015, 23:10:05 UTC

... results are "born" with quorum and replication = 1, which is increased during their lifetime.

Situation changed, my host seems to be classified "reliable" now and results are accepted without having a wingman.

So that configuration setting does have the potential to increase the overall project performance.
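
A rough sketch of the behaviour described in the last few posts, for illustration only. This is a toy model, not BOINC's server code: the class names, the `RELIABLE_AFTER` threshold, and the rule for becoming "reliable" are all assumptions made up here, and the real scheduler may still replicate some jobs from trusted hosts as a spot check.

```python
from dataclasses import dataclass

# Hypothetical threshold: consecutive valid results before a host is trusted.
RELIABLE_AFTER = 10


@dataclass
class Host:
    consecutive_valid: int = 0

    def record(self, valid: bool) -> None:
        # One invalid result resets the streak in this toy model.
        self.consecutive_valid = self.consecutive_valid + 1 if valid else 0

    def is_reliable(self) -> bool:
        return self.consecutive_valid >= RELIABLE_AFTER


@dataclass
class WorkUnit:
    # Workunits are "born" with quorum and replication = 1.
    quorum: int = 1
    replication: int = 1


def on_first_result(wu: WorkUnit, host: Host) -> bool:
    """Return True if the single result can be accepted without a wingman."""
    if host.is_reliable():
        return True
    # Untrusted host: raise quorum and replication so a wingman is needed.
    wu.quorum = 2
    wu.replication = 2
    return False
```

In this model a new host always gets a wingman, which matches the "validation inconclusive" status seen with only one result back, and a host with a long enough valid streak has its results accepted on their own.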
ID: 63078 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 63213 - Posted: 11 Mar 2015, 4:22:07 UTC - in response to Message 63078.  

... results are "born" with quorum and replication = 1, which is increased during their lifetime.

Situation changed, my host seems to be classified "reliable" now and results are accepted without having a wingman.

So that configuration setting does have the potential to increase the overall project performance.


Well, I'm still getting about 1.5% errors on Modified Fit tasks on ALL systems, even the E3-1230 setup...

The problem is NOT with my hardware, but rather with the MW code somehow... and still no response from the managers...

You know, we paid to keep this project going, would be NICE if they could support us a bit, unless they will be giving up after funds run out again...

Sheesh...
ID: 63213 · Rating: 0

Blurf
Volunteer moderator
Project administrator
Joined: 13 Mar 08
Posts: 804
Credit: 26,380,161
RAC: 0
Message 63214 - Posted: 12 Mar 2015, 3:04:47 UTC - in response to Message 63213.  
Last modified: 12 Mar 2015, 3:07:12 UTC

You know, we paid to keep this project going, would be NICE if they could support us a bit, unless they will be giving up after funds run out again...

Sheesh...


We're not "giving up"... Your concern has been escalated to try to get you an answer ASAP on this matter.

ID: 63214 · Rating: 0

Matthew
Volunteer moderator
Project developer
Project scientist
Joined: 6 May 09
Posts: 217
Credit: 6,856,375
RAC: 0
Message 63219 - Posted: 12 Mar 2015, 17:19:19 UTC - in response to Message 63213.  

Hey guys - any chance you can post the log output on the errored units? Or link us to one of these results? We need something to go on...

One of the advantages of the forums is that posts can get community support. Since we do not have access to every possible hardware configuration, this is usually the way to go. If you have an issue that you think needs to be addressed directly by the team, you should send it to our email address, which is displayed in various places: astro[at]cs.lists.rpi[dot]edu
ID: 63219 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 63224 - Posted: 13 Mar 2015, 12:27:16 UTC - in response to Message 63219.  
Last modified: 13 Mar 2015, 12:43:48 UTC

Hey guys - any chance you can post the log output on the errored units? Or link us to one of these results? We need something to go on...

One of the advantages of the forums is that posts can get community support. Since we do not have access to every possible hardware configuration, this is usually the way to go. If you have an issue that you think needs to be addressed directly by the team, you should send it to our email address, which is displayed in various places: astro[at]cs.lists.rpi[dot]edu


Sure...

http://milkyway.cs.rpi.edu/milkyway/results.php?userid=159224&offset=0&show_names=0&state=5&appid=

All my errors...

8-)

PS: I might add that last I checked, my 6970 or 6990 cards don't work with newer tasks with OpenCL drivers past 12.4 without major errors... but they work fine with later drivers on other projects... might have to test that again... In fact, I think later drivers mess up with 7970/ R9 280x as well..
ID: 63224 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 63228 - Posted: 14 Mar 2015, 9:32:06 UTC - in response to Message 63224.  
Last modified: 14 Mar 2015, 10:24:47 UTC

I've tried 2 versions of Windows (re-installed twice from scratch), tried three (3) different 7970 cards, tried everything I can think of, and am running pure stock speeds with NO overclocks, and I still get a lot of validate errors on Modified Fit tasks, as before...

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=612072&offset=0&show_names=0&state=5&appid=10

Hope this helps... other tasks seem fine... other projects no problem either.
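
For anyone wanting to link their own results in the same way, the query string above can be rebuilt from its parts. A small Python sketch using only the standard library; the parameter names and values are copied from the link above, while the helper name `results_url` is made up, and the reading that `state=5` selects invalid results and `appid=10` selects Modified Fit is an assumption from context, not something the project has confirmed.

```python
from urllib.parse import urlencode

# Base address taken from the link posted above.
BASE = "http://milkyway.cs.rpi.edu/milkyway/results.php"


def results_url(hostid: int, state: int = 5, appid: int = 10) -> str:
    """Build a results-page link for one host (hypothetical helper)."""
    query = urlencode(
        {
            "hostid": hostid,
            "offset": 0,
            "show_names": 0,
            "state": state,  # assumed to filter on invalid results
            "appid": appid,  # assumed to be the Modified Fit application
        }
    )
    return f"{BASE}?{query}"


print(results_url(612072))
```

Called with the host ID from the post, this reproduces the link given above.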

It's still only the Modified fit tasks giving me INVALID results... I have the E3-1230 and 2600K setups running ONLY modified fit tasks now for you... I also may add that from time to time a WU fails quickly and has caused a BSOD or more often causes the GPU to reset to a low clock speed and stay there... Driver versions make no difference.


8-)
ID: 63228 · Rating: 0

Dunx
Joined: 13 Feb 11
Posts: 31
Credit: 1,403,524,537
RAC: 0
Message 63232 - Posted: 15 Mar 2015, 10:09:24 UTC
Last modified: 15 Mar 2015, 10:09:37 UTC

I have two R9 280X cards and my error rate is similar to yours Tex....

I'm running two WU's per card and Cosmology on the free CPU cores.

dunx

P.S. Much as I dislike the waste of resources, it's a %age I'm able to tolerate....

P.P.S. NOT using X-Fire as ( like you I'm not a gamer ! ) - wonder if actually using X-Fire with two cards might work better overall ? ? ?
ID: 63232 · Rating: 0

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,939,765
RAC: 22,862
Message 63234 - Posted: 15 Mar 2015, 12:25:08 UTC - in response to Message 63232.  

I have two R9 280X cards and my error rate is similar to yours Tex....

I'm running two WU's per card and Cosmology on the free CPU cores.

dunx

P.S. Much as I dislike the waste of resources, it's a %age I'm able to tolerate....

P.P.S. NOT using X-Fire as ( like you I'm not a gamer ! ) - wonder if actually using X-Fire with two cards might work better overall ? ? ?


No. Since using both GPUs for the same unit is not possible in BOINC, as it is in gaming, X-Fire is NOT helpful and can in fact make things worse in some circumstances.
ID: 63234 · Rating: 0

Dunx
Joined: 13 Feb 11
Posts: 31
Credit: 1,403,524,537
RAC: 0
Message 63235 - Posted: 15 Mar 2015, 16:20:51 UTC - in response to Message 63234.  

Well, TBH having nearly reached 10^8 I'm ready to move to another project for credits....

dunx
ID: 63235 · Rating: 0

Tex1954
Joined: 22 Apr 11
Posts: 64
Credit: 899,267,854
RAC: 7,501
Message 63237 - Posted: 15 Mar 2015, 17:54:00 UTC - in response to Message 63235.  

Well, I've had some Modified Fit WU's get hung on 2 systems running them... abort them and the system hangs for some reason... reboot and things running better now. Seems there was something wrong with some of them.

Also, I was getting very strange GPU Usage waveforms from them. They seem to have settled down now...

All of this is obviously a coding problem on the project side OR something wrong with the BOINC control side or possibly both. I simply don't know...

Anyways, off to other places for a day or two then be back...

I hope the folks can figure out what is wrong...

8-)
ID: 63237 · Rating: 0

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,939,765
RAC: 22,862
Message 63240 - Posted: 16 Mar 2015, 11:01:54 UTC - in response to Message 63235.  

Well, TBH having nearly reached 10^8 I'm ready to move to another project for credits....

dunx


Personally I do that when I hit one billion at a project, but I am pretty sure MilkyWay appreciates your contributions to date and will still like it whether you decide to stay or go elsewhere.
ID: 63240 · Rating: 0