What is the cause of these 'validate errors'
Author | Message |
---|---|
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
Well, I did also check to see if I can still run my 6990 with 12.4 and it still works, but anything past 12.8 fails miserably. I wish someone somewhere (maybe the development folks at MW?) would nail down the 69xx series problems and dual GPU problems so more folks could participate. Seems to ME we sorta PAID them to do it, didn't we? How about some help guys! Good crunching and Happy New Year! 8-) |
Send message Joined: 24 Jan 11 Posts: 715 Credit: 555,493,277 RAC: 38,602 |
I just got back to crunching recently and am running into the same problem as you guys are. However, I'm only running one 270x and am getting random tasks that fail to validate. I saw a comment somewhere about what the problem might be. Essentially, the cleanup work done at the end of computation gets interrupted by an overbusy computer, probably because you are also running other projects and there aren't enough resources to finish the task completion transaction. The only solution is not to run other projects at the same time and/or wait for the programmers to fix the underlying communications that are getting fouled. Cheers, Keith |
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
That IS a lot of testing I wouldn't have the patience for. I run two identical Nvidia cards in Sabertooth 990FX motherboards. I had the issue with the GTX 670's and now still with the GTX 970's. The cards will run the tasks from SETI and Einstein with NO issues. Just have the problem with the Modified Fit tasks ... about a 3% error rate. No problem with the standard GPU tasks. I ran with the problem with the 670's, stopped the Modified Fit and retried with the 970's to see if the change in cards might have fixed the issue. Still have the problem so have turned off the Modified Fit again. I think, like you do, that the problem is with BOINC or the application itself. Running the latest BOINC and only a point version or two behind the current drivers. Had to update the Nvidia drivers so they could handle the 970's. With Einstein being down, I've been running MW on another system... a totally stable XEON E3-1230 V2 3.5 GHz on new Z77 mobo and have 182 invalid and 4155 valid on a 7970 using AMD 14.12 drivers and BOINC 7.4.36... looks like about 4.2% error rate at slightly less than stock speeds. Soo, there is for sure a problem with the tasks somehow and it would be really nice if some developer could figure it out... All these errors can't be helping the project... 8-) |
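As a quick check on the counts quoted above (182 invalid, 4155 valid), here is a minimal Python sketch. It assumes "error rate" means invalid results as a share of all validated results; the function name `error_rate` is made up for illustration and is not part of BOINC.

```python
def error_rate(invalid: int, valid: int) -> float:
    """Return invalid results as a percentage of all validated results."""
    total = invalid + valid
    if total == 0:
        raise ValueError("no results to evaluate")
    return 100.0 * invalid / total


# Counts quoted in the post for the 7970 on the E3-1230 system.
rate = error_rate(invalid=182, valid=4155)
print(f"{rate:.1f}%")  # prints 4.2%
```

Dividing by valid results only (182 / 4155) gives a slightly higher figure, so it is worth saying which denominator is meant when comparing rates between hosts.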
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
Hmm, seems one of my systems lost a Mobo... the 8x8 PCIe 2.0 one... sooo, replaced it with a Sabertooth 16x16 and going to try the dual card thing again... Seems the Mobo lost a PCIe lane on one slot... that is possibly due to a card that went bad some months ago. It is possible the DUAL card tests are not so accurate because of that. Anyways, we will see... running a 7970 and GTX 670 together on the Sabertooth to see how it goes.. 8-) |
Send message Joined: 24 Jan 11 Posts: 715 Credit: 555,493,277 RAC: 38,602 |
I have just accepted that I can live with a 3% error rate. That has stayed consistent for the tasks completed on MilkyWay. Only project that is giving me errors. None at Einstein. The ones at Seti are from known incorrectly formatted tasks. I try to abort those as soon as I find them so I don't process them and waste time and energy. I tried an experiment where I shut off the 1.36 tasks to see if only those gave errors since I didn't see any at first on the 1.02 tasks but soon found out that the 1.02 tasks were just as likely to have the 3% error rate too. So now crunching both types of tasks with the stable error rate. I don't think there is any solution other than wait for the developer to fix the apps so that the task completion mechanism isn't constrained by contention of GPU resources. I found out that even when just solely running MW by itself on a card that you get the same error rate. I only give MW 15% of system resources anyway and 80% to Seti. Einstein gets the 5% leftover since their GPU app is so poorly optimized for Nvidia cards. I have very good luck with my two Sabertooth motherboards. Hope you find the same. Cheers, Keith |
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
So far so good on the Sabertooth... running like a clock since I replaced the suspect mobo... I suppose the error rate is going to stay with us as well... mine is decreasing now and is less than 1% overall... happy with that! Like my OLD Sabertooth 990X as well... 15600 tasks, 80 errors... 8-) |
Send message Joined: 19 Aug 08 Posts: 12 Credit: 2,500,263 RAC: 0 |
Just out of curiosity: Unlike all other projects, the validator here seems to grab a workunit when only one out of two required results is back. Is there a reason for that, and does a single result with a wingman that is still "in progress" even have a chance to be valid if the quorum is higher than 1? To me this looks like strange behaviour of the validator (or transitioner). I sometimes have results with quorum=2, initial replication=2, only one result back, but that one has the status "Completed, validation inconclusive" instead of the status "Completed, waiting for validation" that I would expect for a result that isn't ready for a first validator attempt yet. |
Send message Joined: 4 Sep 12 Posts: 219 Credit: 456,474 RAC: 0 |
My belief is that at some point in the dim and distant past, somebody set an obscure configuration policy on the server, and subsequently forgot all about it: Adaptive Replication |
Send message Joined: 19 Aug 08 Posts: 12 Credit: 2,500,263 RAC: 0 |
Thanks, that explains it - and it explains why results are "born" with quorum and replication = 1, which is increased during their lifetime. |
Send message Joined: 19 Aug 08 Posts: 12 Credit: 2,500,263 RAC: 0 |
... results are "born" with quorum and replication = 1, which is increased during their lifetime. Situation changed, my host seems to be classified "reliable" now and results are accepted without having a wingman. So that configuration setting does have the potential to increase the overall project performance. |
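The behaviour described in the last few posts can be sketched as a toy model in Python. This is only an illustration of what was observed in this thread, not the BOINC server code: the names `Workunit`, `dispatch` and `report_result` are hypothetical, and it assumes every returned result agrees with the others.

```python
from dataclasses import dataclass


@dataclass
class Workunit:
    quorum: int = 1        # matching results needed to validate
    replication: int = 1   # copies issued so far
    returned: int = 0      # results reported back


def dispatch(wu: Workunit, host_is_reliable: bool) -> None:
    """Toy rule: sending to an unproven host forces a second copy (a wingman)."""
    if not host_is_reliable:
        wu.quorum = 2
        wu.replication = 2


def report_result(wu: Workunit) -> str:
    """Return a status once a result comes back (results assumed to agree)."""
    wu.returned += 1
    if wu.returned >= wu.quorum:
        return "valid"
    return "validation inconclusive"


# An unproven host: the first result has to wait for the wingman.
wu = Workunit()
dispatch(wu, host_is_reliable=False)
print(report_result(wu))  # prints validation inconclusive

# A host already classified as reliable: accepted without a wingman.
wu2 = Workunit()
dispatch(wu2, host_is_reliable=True)
print(report_result(wu2))  # prints valid
```

In this model the workunit starts with quorum and replication of 1, and both are raised only when the result goes to a host that has not yet earned "reliable" status, which matches the statuses reported above.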
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
... results are "born" with quorum and replication = 1, which is increased during their lifetime. Well, I'm still getting about 1.5% errors on Modified Fit tasks on ALL systems, even the E3-1230 setup... The problem is NOT with my hardware, but rather with the MW code somehow... and still no response from the managers... You know, we paid to keep this project going, would be NICE if they could support us a bit, unless they will be giving up after funds run out again... Sheesh... |
Send message Joined: 13 Mar 08 Posts: 804 Credit: 26,380,161 RAC: 0 |
You know, we paid to keep this project going, would be NICE if they could support us a bit, unless they will be giving up after funds run out again... We're not "giving up"...Your concern has been escalated to try to get you an answer ASAP on this matter. |
Send message Joined: 6 May 09 Posts: 217 Credit: 6,856,375 RAC: 0 |
Hey guys - any chance you can post the log output on the errored units? Or link us to one of these results? We need something to go on... One of the advantages of the forums is that posts can get community support. Since we do not have access to every possible hardware configuration, this is usually the way to go. If you have an issue that you think needs to be addressed directly by the team, you should send it to our email address, which is displayed in various places: astro[at]cs.lists.rpi[dot]edu |
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
Hey guys - any chance you can post the log output on the errored units? Or link us to one of these results? We need something to go on... Sure... http://milkyway.cs.rpi.edu/milkyway/results.php?userid=159224&offset=0&show_names=0&state=5&appid= All my errors... 8-) PS: I might add that last I checked, my 6970 or 6990 cards don't work with newer tasks with OpenCL drivers past 12.4 without major errors... but they work fine with later drivers on other projects... might have to test that again... In fact, I think later drivers mess up with 7970/ R9 280x as well.. |
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
I've tried 2 versions of Windows (re-installed twice from scratch), tried three (3) different 7970 cards, tried everything I can think of, and I'm running pure stock speeds with NO overclocks, and I still get a lot of validate errors on Modified Fit tasks as before... http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=612072&offset=0&show_names=0&state=5&appid=10 Hope this helps... other tasks seem fine... other projects no problem either. It's still only the Modified Fit tasks giving me INVALID results... I have the E3-1230 and 2600K setups running ONLY Modified Fit tasks now for you... I also may add that from time to time a WU fails quickly and has caused a BSOD, or more often causes the GPU to reset to a low clock speed and stay there... Driver versions make no difference. 8-) |
Send message Joined: 13 Feb 11 Posts: 31 Credit: 1,403,524,537 RAC: 0 |
I have two R9 280X cards and my error rate is similar to yours Tex.... I'm running two WU's per card and Cosmology on the free CPU cores. dunx P.S. Much as I dislike the waste of resources, it's a %age I'm able to tolerate.... P.P.S. NOT using X-Fire as ( like you I'm not a gamer ! ) - wonder if actually using X-Fire with two cards might work better overall ? ? ? |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
I have two R9 280X cards and my error rate is similar to yours Tex.... No. Since BOINC, unlike gaming, cannot use both GPUs for the same unit, X-Fire is NOT helpful and can in fact make things worse in some circumstances. |
Send message Joined: 13 Feb 11 Posts: 31 Credit: 1,403,524,537 RAC: 0 |
Well, TBH having nearly reached 10^8 I'm ready to move to another project for credits.... dunx |
Send message Joined: 22 Apr 11 Posts: 66 Credit: 904,197,220 RAC: 35,262 |
Well, I've had some Modified Fit WU's get hung on 2 systems running them... abort them and the system hangs for some reason... reboot and things running better now. Seems there was something wrong with some of them. Also, I was getting very strange GPU Usage waveforms from them. They seem to have settled down now... All of this is obviously a coding problem on the project side OR something wrong with the BOINC control side or possibly both. I simply don't know... Anyways, off to other places for a day or two then be back... I hope the folks can figure out what is wrong... 8-) |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Well, TBH having nearly reached 10^8 I'm ready to move to another project for credits.... Personally I do that when I hit one billion at a project, but I am pretty sure MilkyWay appreciates your contributions to date and will still like it whether you decide to stay or go elsewhere. |
©2024 Astroinformatics Group