What is the cause of these 'validate errors'
Author | Message |
---|---|
Joined: 24 Jan 11 Posts: 715 Credit: 555,491,260 RAC: 38,656 |
I haven't figured out what causes these validate errors. The tasks all finish with the correct exit code, but I notice that the majority of them have an empty stderr.txt output. In all, about 3% of my tasks come back Invalid: 32 invalids against 964 valid results. Does anyone have an idea what is going on? http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=257518&offset=0&show_names=0&state=5&appid= Cheers, Keith |
Joined: 8 Oct 07 Posts: 52 Credit: 5,866,162 RAC: 4,027 |
"I haven't figured out what causes these validate errors. The tasks all finish with the correct exit code, but I notice that the majority of them have an empty stderr.txt output. In all, about 3% of my tasks come back Invalid: 32 invalids against 964 valid results. Does anyone have an idea what is going on?" I was just checking up on one of my WUs and noticed that one of my wingmen is having similar validation issues. I know that's not much help, but at least you know you are not alone. |
Joined: 24 Jan 11 Posts: 715 Credit: 555,491,260 RAC: 38,656 |
Before I posted I searched the forums and didn't come up with anything specific to this problem, hence the inquiry. I did find this thread: http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3631 but it didn't match what I'm seeing with the empty stderr.txt outputs. Unless someone else points us to the specific problem, we might just have to live with the errors and chalk them up to a flaky app or something. Cheers, Keith |
Joined: 8 Oct 07 Posts: 52 Credit: 5,866,162 RAC: 4,027 |
"Unless someone else points us to the specific problem, we might just have to live with the errors and chalk them up to a flaky app or something." I presume your Nvidia driver is up to date. As I'm sure you know, other versions of the Modified Fit program have had, or still have, bugs (e.g., see this thread). Therefore, it seems logical to assume that there may be problems with the opencl_ati_101 and opencl_nvidia_101 versions as well. In other words, it might be a good idea for you to block Mod Fit units for a while. |
Joined: 24 Jan 11 Posts: 715 Credit: 555,491,260 RAC: 38,656 |
Yes, running the latest driver for my Nvidia cards. I only process GPU tasks. Thanks for the link to that thread. It seems to focus on ATI cards or CPU tasks, but like you said, there could be issues with the NV OpenCL apps too. I only seem to generate errors on the 1.36 OpenCL tasks. Currently 60 Invalids out of 2000, so about 3%, with the truncated stderr.txt outputs. That seems to match one of the reported problems in that thread, but not all situations. The only reason I noticed the problem is that my systems are currently running mainly MW and Einstein while SETI works its problems out. Usually the mix is heavily weighted toward SETI, and I only generate a small number of invalids in MW, which I put down to resource contention with SETI. It never bothered me because the daily amount was just one or two. It's only become noticeable because of the huge amount of work I've processed for MW in the last couple of days. If the SETI problem doesn't sort itself out this week, I will likely stop the 1.36 OpenCL Modified Fit work as you suggest. Cheers, Keith |
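The two "about 3%" figures quoted so far can be checked with a few lines of arithmetic. This is only a sketch: the helper name `invalid_rate` is made up for illustration, and because the first report does not say whether the 964 valid results include the 32 invalids, both readings are computed.

```python
def invalid_rate(invalid: int, total: int) -> float:
    """Return the share of invalid results as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * invalid / total


# Counts quoted in the thread.
# First report: 32 invalids measured against the 964 valid results.
first_report = invalid_rate(32, 964)
# Alternative reading (an assumption): invalids are counted in the total.
first_report_alt = invalid_rate(32, 32 + 964)
# Later report: 60 invalids out of 2000 tasks.
later_report = invalid_rate(60, 2000)

print(f"first report:      {first_report:.1f}%")
print(f"first report (alt): {first_report_alt:.1f}%")
print(f"later report:      {later_report:.1f}%")
```

Either reading of the first report lands slightly above 3%, so the rate looks roughly stable as the task count grows, which fits a persistent intermittent fault better than a single bad batch.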
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"Yes, running the latest driver for my Nvidia cards. I only process GPU tasks." So you have an 8-core processor and don't run ANY CPU units at all, from any project? What do your PCs do besides use the GPUs, then? I'm asking because the problem could be the GPU not getting enough resources when it needs them. Most people leave a CPU core free for each physical GPU they have in the machine. Also, do you game? If not, why are you using the latest GPU drivers? GPU makers' primary markets are mainframes and gamers; we crunchers are barely a blip on their radar, and the software they put out is geared towards their primary customers, not us. That means the latest drivers are often BAD for crunchers: they tweak things for better gaming, and that often slows crunching down. Some versions have actually made it worse by 10% or more. If you are a gamer, do you suspend BOINC when you are playing? If not, track your bad units and see whether they occur while you are gaming, since gaming uses ALL of the GPU, leaving nothing for BOINC. One last thing: did you put on the strap that connects the GPUs so they act as one? If so, and you are NOT a gamer, you can take it off, as BOINC does not use GPUs that way. Do you have a "cc_config.xml" file telling BOINC to use all your GPUs? I notice you have 2 GPUs in each PC; without that line BOINC can sometimes get confused. Also, do both GPUs crunch for the same project? That too can cause problems with Nvidia cards at some projects; splitting them so each runs its own project can be done through the same "cc_config.xml" file. |
Joined: 24 Jan 11 Posts: 715 Credit: 555,491,260 RAC: 38,656 |
I only process GPU tasks for MW and Einstein. I crunch both CPU and GPU tasks for SETI. I only use 6 of the 8 cores for the CPU tasks, keeping the other two cores to feed the GPUs. I don't game. I don't have any need for SLI. I have noticed that some people have remained on older drivers, but I've never seen any thread that definitively states that such-and-such driver is the best one for crunching. I just keep the drivers up to date, as according to the driver notes that seems to keep the math capabilities up to date too. And now, with some new hardware, I have to run later drivers for it to work in BOINC anyway. I use all co-processors in cc_config.xml. I have done so ever since I rebuilt these workstations back in 2011. I have been crunching SETI since 2001. I have switched off the MW 1.36 Mod Fit tasks for a while, since that app seems to be the only one erroring out occasionally. I just do the MW 1.02 OpenCL app now and the errors have stopped. I'll check back with the project every so often and see whether the developers have figured out the issues with the 1.36 app. Cheers, Keith |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"I only process GPU tasks for MW and Einstein. I crunch both CPU and GPU tasks for SETI. I only use 6 of the 8 cores for the CPU tasks, keeping the other two cores to feed the GPUs. I don't game. I don't have any need for SLI. I have noticed that some people have remained on older drivers, but I've never seen any thread that definitively states that such-and-such driver is the best one for crunching. I just keep the drivers up to date, as according to the driver notes that seems to keep the math capabilities up to date too. And now, with some new hardware, I have to run later drivers for it to work in BOINC anyway. I use all co-processors in cc_config.xml. I have done so ever since I rebuilt these workstations back in 2011. I have been crunching SETI since 2001. I have switched off the MW 1.36 Mod Fit tasks for a while, since that app seems to be the only one erroring out occasionally. I just do the MW 1.02 OpenCL app now and the errors have stopped. I'll check back with the project every so often and see whether the developers have figured out the issues with the 1.36 app." It sounds like you are good to go then. Congratulations, and keep on crunching! |
Joined: 22 Apr 11 Posts: 66 Credit: 904,194,460 RAC: 35,207 |
I get a ton of them as well, but ONLY with the modified fit WUs... everything else runs fine.. 8-) |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"I get a ton of them as well, but ONLY with the modified fit WUs... everything else runs fine.." Updating to the latest version of BOINC might help. If you use a zero resource share at some project, then the latest beta might be better, as several problems were fixed in it. You can get all the versions here: http://boinc.berkeley.edu/dl/?C=M;O=D |
Joined: 22 Apr 11 Posts: 66 Credit: 904,194,460 RAC: 35,207 |
I have gone through 5 versions of AMD drivers, lowered clock speeds a lot, raised them a little, re-installed Windoz, changed the MTU, gone from SSD to hard disk, underclocked and overclocked the CPU, changed memory sticks, disabled HD caching, tried several versions of BOINC, and pretty much tried everything except changing the CPU and motherboard, and I still get tons of errors on the Modified Fit tasks. The thing is, I see others are running it fine! The only two things that differ are the PCIe bus running in x8 mode because I have 2 GPUs plugged into it, and the CPU on some invalid tasks. I've done everything else I can and only managed to slightly reduce the error rate, which seems to be about 10% or so... Any more ideas???? 8-) |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"I have gone through 5 versions of AMD drivers, lowered clock speeds a lot, raised them a little, re-installed Windoz, changed the MTU, gone from SSD to hard disk, underclocked and overclocked the CPU, changed memory sticks, disabled HD caching, tried several versions of BOINC, and pretty much tried everything except changing the CPU and motherboard, and I still get tons of errors on the Modified Fit tasks." I'm guessing you already have both an app_config.xml and a cc_config.xml file. Add the needed lines to the cc_config.xml file and see if they help: `<cc_config> <options> <use_all_gpus>1</use_all_gpus> <exclude_gpu> <url>http://milkyway.cs.rpi.edu/milkyway/</url> <device_num>1</device_num> <type>ATI</type> </exclude_gpu> </options> </cc_config>` The exclude_gpu part will exclude one GPU from MW; you can then switch the device number to see which GPU is causing the problems. I'm ALSO assuming you are leaving 2 CPU cores free just to feed the 2 GPUs... right? One other thing: since this is MW, do not keep messing with the drivers. MW tests its apps to ensure they work with all driver versions, but that can't happen until the drivers are released, so don't always jump to the latest and greatest just because it is out. Your 6900 GPUs are using driver version 1.4.1720, while your 7800 cards are using driver version 1.4.1848. While they MAY work just fine, they also may not, because of some tweak that MW then needs to adjust for. Once you find a driver that works, stick with it, only switching if you see others saying a newer one is better. Of course, if you game, then all bets are off, as NOT upgrading can lessen your gaming experience. |
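The "exclude one GPU, then switch" test suggested above can be made less error-prone by generating the file instead of editing it by hand. The sketch below only builds the XML text using the same elements shown in the post (`use_all_gpus`, `exclude_gpu`, `url`, `device_num`, `type`). The function name `build_cc_config` is made up for illustration, and it assumes the two cards appear as devices 0 and 1 in the BOINC event log. Writing the result into the BOINC data directory, and getting the client to pick it up, is left out because those steps depend on the installation.

```python
import xml.etree.ElementTree as ET

# Project URL exactly as given in the thread.
PROJECT_URL = "http://milkyway.cs.rpi.edu/milkyway/"


def build_cc_config(excluded_device: int, gpu_type: str = "ATI") -> str:
    """Build cc_config.xml text that keeps one GPU away from MilkyWay."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "use_all_gpus").text = "1"

    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = PROJECT_URL
    ET.SubElement(exclude, "device_num").text = str(excluded_device)
    ET.SubElement(exclude, "type").text = gpu_type

    ET.indent(root, space="  ")  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Assumption: a two-GPU host whose cards are numbered 0 and 1.
for device in (0, 1):
    print(f"--- cc_config.xml excluding device {device} ---")
    print(build_cc_config(device))
```

Running MilkyWay for a few hours with each variant in turn, then with no exclusion at all, reproduces the three-way comparison described in the following posts.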
Joined: 22 Apr 11 Posts: 66 Credit: 904,194,460 RAC: 35,207 |
Well, I had thought to check whether one of the 7970s is causing problems... then forgot about it. And yes, I don't like to update if things are working well... I'll give it a go and see if it's board-specific. Thanks! 9-) |
Joined: 22 Apr 11 Posts: 66 Credit: 904,194,460 RAC: 35,207 |
Okay, 6 hours running on one GPU and then the other... no problems at all. Not a single error from either GPU. The INSTANT I run BOTH GPUs, it starts spitting out errors... It ONLY does this with Modified Fit tasks. It runs PrimeGrid, Einstein and Collatz just fine. Seems to me I either have a unique and weird mobo problem, or the Modified Fit tasks are somehow interfering with each other. 8-) Edit: Running PrimeGrid on one of the cards while MW is running on the other card (only Modified Fit WUs) causes no problems!!!! Both run fine! I think that eliminates any motherboard PCIe problem, as the PG tasks use a LOT more bandwidth on the PCIe bus than MW tasks do. I think it is certain now that some global parameter is getting clobbered, or some sort of variable/name/label is getting mixed up, when MW runs on BOTH GPUs at the same time... I think I have to leave it to the developers to solve this one. 8-) |
Joined: 22 Apr 11 Posts: 66 Credit: 904,194,460 RAC: 35,207 |
Okay, moved two 7970s to my Sabertooth 990FX and tried all the tests again... same kind of failures on this DUAL PCIe 16x16 setup. The PCIe bus width seems not to be an issue, merely the fact that two AMD cards are installed. IF I run ONLY one card or the other, it works fine. IF I run MW on either card and anything else on the other, it screws up also. Sooo, it seems there is definitely something wrong with Modified Fit tasks, since no problems of any sort show up on any other MW tasks or other projects with DUAL 7970s. To solve the problem, I swapped a GTX 670 from one system with one of the 7970s and swapped tasks between systems. This seems to work fine so far... MW on the 7970 and another project on the 670. Again, I am only running Modified Fit tasks... 8-) PS: FWIW, install the NVidia drivers, then re-install the AMD drivers, and make sure the AMD/ATI card is the MAIN display in the FIRST PCIe slot, and it all works fine. It doesn't seem to work in the reverse... AMD insists it be king, I suppose... LOL! |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
That's A LOT of testing!!! I am glad it is working now, though. I have heard of 2 Nvidia cards having problems at some projects, but those were BOINC-related. I had never heard of 2 AMD cards having problems. I AM glad it is only the Modified Fit units you are having problems with, though! |
Joined: 22 Apr 11 Posts: 66 Credit: 904,194,460 RAC: 35,207 |
I just wanted to narrow it down. I've also tested some overclocking, and it's very, very sensitive to anything not perfect. A very slight change up or down can make the WUs in question fail... and it doesn't seem to be a real problem per se; rather, I think it's somehow a timing problem on the PCIe bus. But that is speculation. It could easily be a base OpenCL driver problem that only shows up when two cards are being used. Personally, I have neither the debug tools nor the expertise to even begin to assess the problem area. In any case, I've done about all I can... I can always make the Modified Fit WUs fail with a marginal overclock... it's very touchy... but I can also overclock the boards 10% without problems. Sooo, somebody else will have to figure out the rest. 8-) |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"I just wanted to narrow it down. I've also tested some overclocking, and it's very, very sensitive to anything not perfect. A very slight change up or down can make the WUs in question fail... and it doesn't seem to be a real problem per se; rather, I think it's somehow a timing problem on the PCIe bus. But that is speculation." Sounds good to me. |
Joined: 24 Jan 11 Posts: 715 Credit: 555,491,260 RAC: 38,656 |
That IS a lot of testing, more than I would have the patience for. I run two identical Nvidia cards in Sabertooth 990FX motherboards. I had the issue with the GTX 670s and still have it with the GTX 970s. The cards will run the tasks from SETI and Einstein with NO issues. I just have the problem with the Modified Fit tasks... about a 3% error rate. No problem with the standard GPU tasks. I ran with the problem on the 670s, stopped the Modified Fit tasks, and retried with the 970s to see if the change in cards might have fixed the issue. I still have the problem, so I have turned off Modified Fit again. I think, like you do, that the problem is with BOINC or the application itself. I'm running the latest BOINC and am only a point version or two behind the current drivers. I had to update the Nvidia drivers so they could handle the 970s. As I stated, the error rate is only about 3% on the Modified Fit tasks. The majority of tasks finish correctly. I think there is some kind of contention problem going on. Whether it is with the application, BOINC or the hardware... I don't know. I tried running a single task from a single project on the cards and it made no difference in the error rate. From your experiments, I kinda don't think it is a hardware issue. I wish the project app developers would chime in on this observed behavior and issue some kind of statement on what the issue is. Cheers, Keith |
Joined: 21 Nov 09 Posts: 49 Credit: 20,942,758 RAC: 0 |
I just got back to crunching recently and am running into the same problem as you guys. However, I'm only running one 270x and am still getting random tasks that don't validate. I'm not sure if it was a certain batch of WUs or what, but I'm down to just one invalid currently. I'll keep an eye on it and post back if it goes back up. |
©2024 Astroinformatics Group