Apology for recent bad batches of workunits

Author	Message
JHMarshall Joined: 24 Jul 12 Posts: 40 Credit: 7,123,301,054 RAC: 0	Message 55894 - Posted: 21 Oct 2012, 1:59:12 UTC - in response to Message 55884. Link, Tackleway has some good points. I've been checking the errors from my system; there are fewer now than before, but they are still coming through. I check the WUs to see what other systems have failed. If I see 4 "error while computing" results for the work unit, I assume I have no problems with my system. The WU was bad. The failures are across many systems, CPUs, and GPUs. One recent failure on a "free_3" WU showed an error for an AMD CPU computer, 2 opencl-nvidia computers, and my opencl_amd_ati machine, for a total of 4 (WU assumed bad). I've seen many combinations of CPU and GPU failures. This is NOT a GPU driver issue. Joe
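Joe's rule of thumb in Message 55894 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Result record, its field names, the platform labels, and the helper functions are hypothetical, and the threshold of 4 is simply the number Joe mentions. It is not the project's validator or the BOINC database schema.

```python
from dataclasses import dataclass


# Hypothetical record for one returned result of a workunit.
# Field names are illustrative only, not BOINC's schema.
@dataclass(frozen=True)
class Result:
    host_id: int
    platform: str  # e.g. "amd_cpu", "opencl_nvidia", "opencl_amd_ati"
    errored: bool


def workunit_looks_bad(results, min_errors=4):
    """Return True when at least min_errors distinct hosts errored on the WU."""
    failed_hosts = {r.host_id for r in results if r.errored}
    return len(failed_hosts) >= min_errors


def failing_platforms(results):
    """List the distinct platforms that errored, to spot driver-specific patterns."""
    return sorted({r.platform for r in results if r.errored})


# The "free_3" example from the post: one AMD CPU host, two opencl-nvidia
# hosts, and one opencl_amd_ati host all errored.
example = [
    Result(host_id=1, platform="amd_cpu", errored=True),
    Result(host_id=2, platform="opencl_nvidia", errored=True),
    Result(host_id=3, platform="opencl_nvidia", errored=True),
    Result(host_id=4, platform="opencl_amd_ati", errored=True),
]

if __name__ == "__main__":
    print(workunit_looks_bad(example))
    print(failing_platforms(example))
```

If the failing platforms span CPUs and both GPU vendors, as in Joe's example, a single GPU driver is an unlikely common cause, which is the point he is making.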

Link Joined: 19 Jul 10 Posts: 812 Credit: 20,950,354 RAC: 6,108	Message 55895 - Posted: 21 Oct 2012, 7:52:50 UTC - in response to Message 55894. Last modified: 21 Oct 2012, 7:53:35 UTC [quote]One recent failure on a "free_3" WU showed an error for an AMD CPU computer, 2 opencl-nvidia computers, and my opencl_amd_ati machine for a total of 4 (WU assumed bad). I've seen many combinations of CPU and GPU failures. This is NOT a GPU driver issue.[/quote] I've seen that too; I even posted an example of such a WU in message 55867 in this thread. Tackleway, however, wrote "all the tasks failures here were processed only by CPU". That's wrong. I don't think it's a driver issue either; see my post in NC or my message 55867 in this thread.

Tackleway Joined: 17 Mar 10 Posts: 20 Credit: 5,641,904 RAC: 0	Message 55896 - Posted: 21 Oct 2012, 9:25:40 UTC - in response to Message 55895. Sorry, maybe you have misunderstood my statement "all the tasks failures here were processed only by CPU". I was referring to my 4 machines at this location, so it is not wrong, as I do not use my GPUs to process for Milkyway.

Link Joined: 19 Jul 10 Posts: 812 Credit: 20,950,354 RAC: 6,108	Message 55902 - Posted: 21 Oct 2012, 12:17:59 UTC - in response to Message 55896. Oops... my mistake, sorry.

Phil Joined: 29 Aug 10 Posts: 25 Credit: 2,172,252,217 RAC: 0	Message 55911 - Posted: 22 Oct 2012, 19:06:37 UTC - in response to Message 55863. [quote]It is possible that the remaining errors are due to outdated BOINC applications or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has outdated GPU or BOINC versions. I'm still looking into the problem, but if you are having errors it wouldn't hurt to update your drivers and/or BOINC app and see if that fixes the problem. Please let us know if it does.[/quote] I have tried various drivers and BOINC versions, and things have not improved. With older cards in particular, the older drivers tend to be better: faster and more stable. The 7.x versions of BOINC are also crap for this project because the scheduler causes clients to sit idle half the time.

Link Joined: 19 Jul 10 Posts: 812 Credit: 20,950,354 RAC: 6,108	Message 55912 - Posted: 22 Oct 2012, 20:01:24 UTC - in response to Message 55911. [quote]The 7.x versions of BOINC are also crap for this project because the scheduler causes clients to sit idle half the time.[/quote] That's usually because v7 needs different cache settings than v6 and older. What do you have set as "Maintain enough tasks to keep busy for at least X days" and "... and up to an additional Y days"?

Phil Joined: 29 Aug 10 Posts: 25 Credit: 2,172,252,217 RAC: 0	Message 55913 - Posted: 22 Oct 2012, 21:01:59 UTC I think they are both set to 1 day. I have tried various settings and it makes no difference. It rarely gets more than about 5 minutes of work at a time, and when it runs out it sits there doing nothing for ages. Now that there are so many errors it seems to defer for even longer, like several hours. Since these new WUs started, my output is less than half what it was before, and this is mainly due to the errors causing BOINC to sit idle for hours at a time.

GaryG Joined: 29 Aug 12 Posts: 31 Credit: 40,781,945 RAC: 0	Message 55914 - Posted: 22 Oct 2012, 21:04:39 UTC - in response to Message 55912. I don't believe that really applies here. If you run more than 1 project, Milkyway will soon have no tasks to run. This happens for two reasons: the Milkyway limit on the number of tasks a computer is allowed at a time (I get 75 max), and the scheduler filling all remaining time with tasks from other projects when Milkyway has none available. Other projects attempt to fill the X days and Y extra days settings, but not Milkyway. This happens to me regularly and causes a lot of manual intervention where it shouldn't be required.

Adrian Taylor Joined: 6 Apr 08 Posts: 13 Credit: 139,088,163 RAC: 0	Message 55916 - Posted: 22 Oct 2012, 22:19:16 UTC OK, loads of errors still. Is there an end in sight? I'm running at 270 in my stats for just one machine. What's happening? Cheers folks

Sunny129 Joined: 25 Jan 11 Posts: 271 Credit: 346,072,284 RAC: 0	Message 55917 - Posted: 22 Oct 2012, 22:35:42 UTC My error rate is right around 3.5% right now, up slightly from yesterday and the day before...

Ray_GTI-R Joined: 5 Nov 10 Posts: 69 Credit: 15,064,831 RAC: 0	Message 55918 - Posted: 23 Oct 2012, 0:49:47 UTC OK, a bit weird I know ... On this one PC all GPU WUs failed after 4 seconds with Computation Error. Tens of 'em. I sensed this because I was typing stuff on this PC and everything ran much quicker. LOL! So, checked GPU results - all fails. Suspended Project. Cold-booted. Next GPU WU completed OK. And the next one. And the next one is currently at +75% ... looks OK now. HTH, Ray XP Pro SP3, HD3650 AGP, BOINC 6.12.34, CCC 11.8 (8.881) ... all installed for a looong time on this one PC.

Sunny129 Joined: 25 Jan 11 Posts: 271 Credit: 346,072,284 RAC: 0	Message 55926 - Posted: 23 Oct 2012, 19:57:07 UTC Last modified: 23 Oct 2012, 19:57:38 UTC My error rate has risen to 4% (up from 3.5%) in the last 24 hours. I know there's no way to tell if a WU is bad until it has failed for multiple wingmen... but does anyone have an educated guess as to how much longer it'll be until all these "bad" WUs are flushed out of the system?

Adrian Taylor Joined: 6 Apr 08 Posts: 13 Credit: 139,088,163 RAC: 0	Message 55952 - Posted: 25 Oct 2012, 11:25:13 UTC 4.7% here, 323 errors in my one machine's stats, pretty constantly. Please let us know a little more news about these bad units. Also, the quorum appears too big: 4 machines have to show the same error before the unit is dumped. As far as I can see, the main problem is that once the BOINC client has an error computing, it is much less likely to keep the cache full, so unless you nurse your machines they sit there with empty caches. A little update would be appreciated :-) Cheers

Matthew Volunteer moderator Project developer Project scientist Joined: 6 May 09 Posts: 217 Credit: 6,856,375 RAC: 0	Message 55955 - Posted: 25 Oct 2012, 19:58:21 UTC - in response to Message 55952. Hi - sorry guys, I've been hit by job application season and I've been buried. :P This is a weird error; the binaries haven't been updated since before the last set of stable jobs, but the server code has. The server shouldn't be able to throw bad WUs that behave like these do. I've found that the "de_separation_22_3s_free_3" WUs seem to be returning unphysical likelihoods; it is possible that this run is responsible for the errors that people are seeing. Sorry for the inconvenience, we'll try to get this sorted out in the next day or 2. -Matthew

Adrian Taylor Joined: 6 Apr 08 Posts: 13 Credit: 139,088,163 RAC: 0	Message 55956 - Posted: 25 Oct 2012, 20:04:23 UTC - in response to Message 55955. Great, thanks for the update, Matthew

Sunny129 Joined: 25 Jan 11 Posts: 271 Credit: 346,072,284 RAC: 0	Message 55960 - Posted: 26 Oct 2012, 0:26:35 UTC Looking forward to the fix... my error rate is now up to 5%

JHMarshall Joined: 24 Jul 12 Posts: 40 Credit: 7,123,301,054 RAC: 0	Message 55964 - Posted: 26 Oct 2012, 5:29:37 UTC - in response to Message 55955. Matthew, It's not just the "de_separation_22_3s_free_3" WUs that have problems. Many "de_separation_22_3s_edge_3" WUs are also failing on AMD CPUs, Intel CPUs, Nvidia cards, and AMD cards. Joe

John G Joined: 1 Apr 10 Posts: 49 Credit: 171,863,025 RAC: 0	Message 55967 - Posted: 26 Oct 2012, 11:59:48 UTC - in response to Message 55964. [quote]Matthew, It's not just the "de_separation_22_3s_free_3" WUs that have problems. Many "de_separation_22_3s_edge_3" WUs are also failing on AMD CPUs, Intel CPUs, Nvidia cards, and AMD cards. Joe[/quote] I very much concur with Mr. Marshall's findings! Regards, John G

JHMarshall Joined: 24 Jul 12 Posts: 40 Credit: 7,123,301,054 RAC: 0	Message 55973 - Posted: 26 Oct 2012, 20:36:13 UTC - in response to Message 55967. John, good to get confirmation that others are seeing the same failures. Thanks. Matthew, Travis, I don't know if this is going to help, but instead of pulling my machines off MW, I'm going to reallocate my resources to MW (1 NVidia GTX 560Ti, 3 AMD HD 7950s, and 7 of 14 CPU cores spread across three machines). I'll either help clear out the bad WUs or add to the confusion. By the way, I haven't seen any failures where my wingmen didn't also have failures. That brings up a question. If these WUs have bad data or something else wrong, how confident are you that the WUs that validate are really good? Joe

Blurf Volunteer moderator Project administrator Joined: 13 Mar 08 Posts: 804 Credit: 26,380,161 RAC: 0	Message 55975 - Posted: 26 Oct 2012, 23:38:08 UTC - in response to Message 55973. [quote]John, good to get confirmation that others are seeing the same failures. Thanks. Matthew, Travis, I don't know if this is going to help, but instead of pulling my machines off MW, I'm going to reallocate my resources to MW (1 NVidia GTX 560Ti, 3 AMD HD 7950s, and 7 of 14 CPU cores spread across three machines). I'll either help clear out the bad WUs or add to the confusion. By the way, I haven't seen any failures where my wingmen didn't also have failures. That brings up a question. If these WUs have bad data or something else wrong, how confident are you that the WUs that validate are really good? Joe[/quote] Thank you Joe...