Apology for recent bad batches of workunits
Author | Message |
---|---|
Send message Joined: 24 Jul 12 Posts: 40 Credit: 7,123,301,054 RAC: 0 |
Link, Tackleway has some good points. I've been checking the errors from my system; there are fewer now than before, but they are still coming through. I check the WUs to see what other systems have failed. If I see 4 errors while computing for the work unit, I assume my system has no problem and the WU was bad. The failures are across many systems, CPUs, and GPUs. One recent failure on a "free_3" WU showed an error for an AMD CPU computer, 2 opencl-nvidia computers, and my opencl_amd_ati machine for a total of 4 (WU assumed bad). I've seen many combinations of CPU and GPU failures. This is NOT a GPU driver issue. Joe |
Send message Joined: 19 Jul 10 Posts: 624 Credit: 19,299,838 RAC: 2,590 |
"One recent failure on a 'free_3' WU showed an error for an AMD CPU computer, 2 opencl-nvidia computers, and my opencl_amd_ati machine for a total of 4 (WU assumed bad). I've seen many combinations of CPU and GPU failures." That I've seen too; I even posted an example of such a WU in message 55867 in this thread. Tackleway, however, wrote "all the tasks failures here were processed only by CPU". That's wrong. I don't think it's a driver issue either; see my post in NC or my message 55867 in this thread. |
Send message Joined: 17 Mar 10 Posts: 20 Credit: 5,641,904 RAC: 0 |
Sorry, maybe you have misunderstood my statement "all the tasks failures here were processed only by CPU". I was referring to my 4 machines at this location; therefore it is not wrong, as I do not use my GPUs to process for Milkyway. |
Send message Joined: 19 Jul 10 Posts: 624 Credit: 19,299,838 RAC: 2,590 |
Oops... my mistake, sorry. |
Send message Joined: 29 Aug 10 Posts: 25 Credit: 2,172,252,217 RAC: 0 |
"It is possible that the remaining errors are due to outdated BOINC applications or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has outdated GPU or BOINC versions." I have tried various different drivers and BOINC versions and things have not improved. With older cards in particular, the older drivers tend to be better: faster and more stable. The 7.x versions of BOINC are also crap for this project because the scheduler causes clients to sit idle half the time. |
Send message Joined: 19 Jul 10 Posts: 624 Credit: 19,299,838 RAC: 2,590 |
"The 7.x versions of BOINC are also crap for this project because the scheduler causes clients to sit idle half the time." That's usually because v7 needs different cache settings than v6 and older. What do you have set as "Maintain enough tasks to keep busy for at least X days" and "... and up to an additional Y days"? |
Send message Joined: 29 Aug 10 Posts: 25 Credit: 2,172,252,217 RAC: 0 |
I think they are both set to 1 day. I have tried various settings and it makes no difference. It rarely gets more than about 5 minutes of work at a time, and when it runs out it sits there doing nothing for ages. Now that there are so many errors it seems to defer for even longer, like several hours. Since these new WUs started, my output is less than half what it was before, and this is mainly due to the errors causing BOINC to sit idle for hours at a time. |
Send message Joined: 29 Aug 12 Posts: 31 Credit: 40,781,945 RAC: 0 |
I don't believe that really applies here. If you run more than one project, Milkyway will soon have no tasks to run. This happens for two reasons: the Milkyway limit on the number of tasks a computer is allowed at a time (I get 75 max.), and the scheduler filling all remaining time with tasks from other projects when Milkyway has none available. Other projects attempt to fill the X days and Y extra settings, but not Milkyway. This happens to me regularly and causes a lot of manual intervention where it shouldn't be required. |
Send message Joined: 6 Apr 08 Posts: 13 Credit: 139,088,163 RAC: 0 |
OK, loads of errors still. Is there an end in sight? I'm running at 270 in my stats for just one machine. What's happening? Cheers, folks |
Send message Joined: 25 Jan 11 Posts: 271 Credit: 346,072,284 RAC: 0 |
my error rate is right around 3.5% right now, up slightly from yesterday and the day before... |
Send message Joined: 5 Nov 10 Posts: 69 Credit: 15,064,831 RAC: 0 |
OK, a bit weird I know ... On this one PC all GPU WUs failed after 4 seconds with Computation Error. Tens of 'em. I sensed this because I was typing stuff on this PC and everything ran much quicker. LOL! So, checked GPU results - all fails. Suspended Project. Cold-booted. Next GPU WU completed OK. And the next one. And the next one is currently at +75% ... looks OK now. HTH, Ray XP Pro SP3, HD3650 AGP, BOINC 6.12.34, CCC 11.8 (8.881) ... all installed for a looong time on this one PC. |
Send message Joined: 25 Jan 11 Posts: 271 Credit: 346,072,284 RAC: 0 |
my error rate has risen to 4% (up from 3.5%) in the last 24 hours. i know there's no way to tell if a WU is bad until it has failed for multiple wingmen...but does anyone have an educated guess as to how much longer it'll be until all these "bad" WU's are flushed out of the system? |
Send message Joined: 6 Apr 08 Posts: 13 Credit: 139,088,163 RAC: 0 |
4.7% here, 323 errors in my one machine's stats, pretty constantly. Please let us know a little more news about these bad units. Also, the quorum appears too big: 4 machines have to show the same error before the unit is dumped, as far as I can see. The main problem is that once the BOINC client has an error computing, it is much less likely to keep the cache full, so unless you nurse your machines they sit there with empty caches. A little update would be appreciated :-) Cheers |
Send message Joined: 6 May 09 Posts: 217 Credit: 6,856,375 RAC: 0 |
Hi - sorry guys, I've been hit by job application season and I've been buried. :P This is a weird error; the binaries haven't been updated since before the last set of stable jobs, but the server code has. The server shouldn't be able to throw bad WU's that behave like these do. I've found that the "de_separation_22_3s_free_3" WUs seem to be returning unphysical likelihoods; it is possible that this run is responsible for the errors that people are seeing. Sorry for the inconvenience, we'll try to get this sorted out in the next day or 2. -Matthew |
Send message Joined: 6 Apr 08 Posts: 13 Credit: 139,088,163 RAC: 0 |
great, thanks for the update Matthew |
Send message Joined: 25 Jan 11 Posts: 271 Credit: 346,072,284 RAC: 0 |
looking forward to the fix... my error rate is now up to 5% |
Send message Joined: 24 Jul 12 Posts: 40 Credit: 7,123,301,054 RAC: 0 |
Matthew, It's not just the "de_separation_22_3s_free_3" WUs that have problems. Many "de_separation_22_3s_edge_3" WUs are also failing on AMD CPUs, Intel CPUs, Nvidia cards, and AMD cards. Joe |
Send message Joined: 1 Apr 10 Posts: 49 Credit: 171,863,025 RAC: 0 |
Matthew, I very much concur with Mr. Marshal's findings! Regards, John G |
Send message Joined: 24 Jul 12 Posts: 40 Credit: 7,123,301,054 RAC: 0 |
John, Good to get confirmation that others are seeing the same failures. Thanks. Matthew, Travis, I don't know if this is going to help, but instead of pulling my machines off MW, I'm going to reallocate my resources to MW (1 NVidia GTX 560Ti, 3 AMD HD 7950s, and 7 of 14 CPU cores spread across three machines). I'll either help clear out the bad WUs or add to the confusion. By the way, I haven't seen any failures where my wingmen didn't also have failures. That brings up a question: if these WUs have bad data or something else wrong, how confident are you that the WUs that validate are really good? Joe |
Send message Joined: 13 Mar 08 Posts: 804 Credit: 26,380,161 RAC: 0 |
John, Thank you Joe... |
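
The rule of thumb described in the thread (a WU that has errored on four hosts of different kinds is assumed to be bad, not a sign of a local hardware or driver fault) can be sketched roughly as below. This is only an illustration of the posters' reasoning: the threshold of 4 comes from their observations, and the function name, labels, and data layout are hypothetical, not taken from BOINC or the MilkyWay@home server.

```python
# Threshold taken from the posts above (four errored results); it is an
# assumption for illustration, not a value read from the project server.
ERRORS_BEFORE_ASSUMED_BAD = 4


def classify_workunit(results):
    """Classify a work unit from (host_kind, outcome) pairs.

    host_kind is a free-form label such as "amd_cpu" or "opencl_nvidia";
    outcome is "error" or "ok". Both are hypothetical field names.
    """
    errored_kinds = [kind for kind, outcome in results if outcome == "error"]
    spread_across_kinds = len(set(errored_kinds)) > 1
    if len(errored_kinds) >= ERRORS_BEFORE_ASSUMED_BAD and spread_across_kinds:
        # Many unrelated hosts failed: the WU itself is the likely culprit.
        return "wu_assumed_bad"
    if errored_kinds:
        # Too few errors to tell a bad WU from a local problem.
        return "inconclusive"
    return "ok"


# The "free_3" example from the thread: one AMD CPU host, two
# opencl-nvidia hosts, and one opencl_amd_ati host all errored.
free_3_example = [
    ("amd_cpu", "error"),
    ("opencl_nvidia", "error"),
    ("opencl_nvidia", "error"),
    ("opencl_amd_ati", "error"),
]
print(classify_workunit(free_3_example))
```

A check like this says nothing about why a WU is bad, and it does not answer the question raised above about whether the WUs that do validate are trustworthy.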
©2024 Astroinformatics Group