Welcome to MilkyWay@home

Apology for recent bad batches of workunits


JHMarshall
Joined: 24 Jul 12
Posts: 39
Credit: 2,342,251,823
RAC: 6,900,605
Message 55894 - Posted: 21 Oct 2012, 1:59:12 UTC - in response to Message 55884.  

Link,

Tackleway has some good points. I've been checking the errors from my system; there are fewer now than before, but they are still coming through. I check the WUs to see what other systems have failed. If I see 4 "error while computing" results for the work unit, I assume I have no problems with my system: the WU was bad. The failures are across many systems, CPUs, and GPUs.

One recent failure on a "free_3" WU showed an error for an AMD CPU computer, 2 opencl-nvidia computers, and my opencl_amd_ati machine for a total of 4 (WU assumed bad). I've seen many combinations of CPU and GPU failures.
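The rule of thumb above (four errors from different machines means the WU, not the host, is at fault) can be sketched in a few lines of Python. This is only an illustration: `TaskResult`, `looks_like_bad_workunit`, the host IDs, and the threshold default are hypothetical names and values, not anything from BOINC or the MilkyWay@home server.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskResult:
    """One host's result for a workunit (hypothetical record, not a BOINC type)."""
    host_id: int
    platform: str   # e.g. "cpu", "opencl_nvidia", "opencl_amd_ati"
    errored: bool


def looks_like_bad_workunit(results: Iterable[TaskResult], min_errors: int = 4) -> bool:
    """Return True when at least `min_errors` distinct hosts errored on the WU."""
    failed_hosts = {r.host_id for r in results if r.errored}
    return len(failed_hosts) >= min_errors


# The "free_3" case described above: one AMD CPU host, two opencl-nvidia
# hosts, and one opencl_amd_ati host, all reporting an error.
free_3_example = [
    TaskResult(host_id=1, platform="cpu", errored=True),
    TaskResult(host_id=2, platform="opencl_nvidia", errored=True),
    TaskResult(host_id=3, platform="opencl_nvidia", errored=True),
    TaskResult(host_id=4, platform="opencl_amd_ati", errored=True),
]

print(looks_like_bad_workunit(free_3_example))
```

Counting distinct hosts rather than raw error rows keeps one misbehaving machine that retried the same WU from tripping the threshold on its own.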

This is NOT a GPU driver issue.

Joe
ID: 55894 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 356
Credit: 16,317,754
RAC: 0
Message 55895 - Posted: 21 Oct 2012, 7:52:50 UTC - in response to Message 55894.  
Last modified: 21 Oct 2012, 7:53:35 UTC

One recent failure on a "free_3" WU showed an error for an AMD CPU computer, 2 opencl-nvidia computers, and my opencl_amd_ati machine for a total of 4 (WU assumed bad). I've seen many combinations of CPU and GPU failures.

This is NOT a GPU driver issue.

I've seen that too; I even posted an example of such a WU in message 55867 in this thread. Tackleway, however, wrote "all the tasks failures here were processed only by CPU". That's wrong.

I don't think it's a driver issue either, see my post in NC or my message 55867 in this thread.
ID: 55895 · Rating: 0
Tackleway
Joined: 17 Mar 10
Posts: 20
Credit: 5,172,701
RAC: 829
Message 55896 - Posted: 21 Oct 2012, 9:25:40 UTC - in response to Message 55895.  

Sorry, maybe you have misunderstood my statement. "all the tasks failures here were processed only by CPU" was referring to my 4 machines at this location; therefore it is not wrong, as I do not use my GPUs to process for Milkyway.

ID: 55896 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 356
Credit: 16,317,754
RAC: 0
Message 55902 - Posted: 21 Oct 2012, 12:17:59 UTC - in response to Message 55896.  

Oops... my mistake, sorry.
ID: 55902 · Rating: 0
Phil
Joined: 29 Aug 10
Posts: 25
Credit: 2,172,252,217
RAC: 0
Message 55911 - Posted: 22 Oct 2012, 19:06:37 UTC - in response to Message 55863.  

It is possible that the remaining errors are due to outdated BOINC applications or GPU drivers. So far, every erroring computer that I have looked at (which is not all of them) has outdated GPU or BOINC versions.

I'm still looking into the problem, but if you are having errors it wouldn't hurt to update your drivers and/or BOINC app and see if that fixes the problem. Please let us know if it does.


I have tried various different drivers and BOINC versions and things have not improved. With older cards in particular the older drivers tend to be better - faster and more stable. The 7.x versions of BOINC are also crap for this project because the scheduler causes clients to sit idle half the time.
ID: 55911 · Rating: 0
Link
Joined: 19 Jul 10
Posts: 356
Credit: 16,317,754
RAC: 0
Message 55912 - Posted: 22 Oct 2012, 20:01:24 UTC - in response to Message 55911.  

The 7.x versions of BOINC are also crap for this project because the scheduler causes clients to sit idle half the time.

That's usually because v7 needs other cache settings than v6 and older. What do you have set as "Maintain enough tasks to keep busy for at least X days" and "... and up to an additional Y days"?
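For reference, those two web preferences correspond, as far as I know, to the `work_buf_min_days` and `work_buf_additional_days` elements that the BOINC client reads from `global_prefs_override.xml` in its data directory. Below is a minimal Python sketch that builds such a fragment; the helper name `build_cache_override` and the values 1.0 and 0.25 are illustrative assumptions, not recommended settings for this project.

```python
import xml.etree.ElementTree as ET


def build_cache_override(min_days: float, additional_days: float) -> str:
    """Build a global_prefs_override.xml fragment with the two cache settings.

    min_days        -> "Maintain enough tasks to keep busy for at least X days"
    additional_days -> "... and up to an additional Y days"
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:g}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:g}"
    return ET.tostring(root, encoding="unicode")


# Illustrative values only: 1 day minimum plus a quarter day extra.
print(build_cache_override(1.0, 0.25))
```

After saving the output to the override file, the client has to re-read its local preferences (or be restarted) before the new cache size takes effect.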
ID: 55912 · Rating: 0
Phil
Joined: 29 Aug 10
Posts: 25
Credit: 2,172,252,217
RAC: 0
Message 55913 - Posted: 22 Oct 2012, 21:01:59 UTC

I think they are both set to 1 day. I have tried various settings and it makes no difference. It rarely gets more than about 5 minutes of work at a time, and when it runs out it sits there doing nothing for ages. Now that there are so many errors it seems to defer for even longer, like several hours. Since these new WUs started, my output is less than half what it was before, mainly because the errors cause BOINC to sit idle for hours at a time.
ID: 55913 · Rating: 0
GaryG
Joined: 29 Aug 12
Posts: 31
Credit: 40,781,945
RAC: 0
Message 55914 - Posted: 22 Oct 2012, 21:04:39 UTC - in response to Message 55912.  

I don't believe that really applies here. If you run more than one project, Milkyway will soon have no tasks to run. This happens for two reasons: the Milkyway limit on the number of tasks a computer is allowed at a time (I get 75 max.), and the scheduler filling all remaining time with tasks from other projects when Milkyway has none available. Other projects attempt to fill the X days and Y extra settings, but not Milkyway. This happens to me regularly and requires a lot of manual intervention where it shouldn't be needed.
ID: 55914 · Rating: 0
Adrian Taylor
Joined: 6 Apr 08
Posts: 13
Credit: 139,088,163
RAC: 0
Message 55916 - Posted: 22 Oct 2012, 22:19:16 UTC

OK
loads of errors still
is there an end in sight ?

I'm at 270 errors in my stats for just one machine.

What's happening?

cheers folks
ID: 55916 · Rating: 0
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 55917 - Posted: 22 Oct 2012, 22:35:42 UTC

my error rate is right around 3.5% right now, up slightly from yesterday and the day before...
ID: 55917 · Rating: 0
Ray_GTI-R
Joined: 5 Nov 10
Posts: 69
Credit: 15,062,785
RAC: 0
Message 55918 - Posted: 23 Oct 2012, 0:49:47 UTC

OK, a bit weird I know ...

On this one PC all GPU WU's failed after 4 seconds with Computation Error. Tens of 'em. I sensed this because I was typing stuff on this PC and everything ran much quicker. LOL!

So, checked GPU results - all fails.
Suspended Project.
Cold-booted.

Next GPU WU completed OK.
And the next one.
And the next one is currently at +75% ... looks OK now.


HTH, Ray
XP Pro SP3, HD3650 AGP, BOINC 6.12.34, CCC 11.8 (8.881) ... all installed for a looong time on this one PC.

ID: 55918 · Rating: 0
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 55926 - Posted: 23 Oct 2012, 19:57:07 UTC
Last modified: 23 Oct 2012, 19:57:38 UTC

my error rate has risen to 4% (up from 3.5%) in the last 24 hours.

i know there's no way to tell if a WU is bad until it has failed for multiple wingmen...but does anyone have an educated guess as to how much longer it'll be until all these "bad" WU's are flushed out of the system?
ID: 55926 · Rating: 0
Adrian Taylor
Joined: 6 Apr 08
Posts: 13
Credit: 139,088,163
RAC: 0
Message 55952 - Posted: 25 Oct 2012, 11:25:13 UTC

4.7% here
323 errors in my one machine's stats pretty constantly

Please let us know a little more news about these bad units

Also, the quorum appears too big: 4 machines have to show the same error before the unit is dumped, as far as I can see

the main problem is that once the BOINC client has a computing error, it is much less likely to keep the cache full, so unless you nurse your machines they sit there with empty caches

a little update would be appreciated :-)

cheers
ID: 55952 · Rating: 0
Matthew
Volunteer moderator
Project developer
Project scientist
Joined: 6 May 09
Posts: 217
Credit: 6,856,375
RAC: 0
Message 55955 - Posted: 25 Oct 2012, 19:58:21 UTC - in response to Message 55952.  

Hi - sorry guys, I've been hit by job application season and I've been buried. :P

This is a weird error; the binaries haven't been updated since before the last set of stable jobs, but the server code has. The server shouldn't be able to throw bad WU's that behave like these do.

I've found that the "de_separation_22_3s_free_3" WUs seem to be returning unphysical likelihoods; it is possible that this run is responsible for the errors that people are seeing.

Sorry for the inconvenience, we'll try to get this sorted out in the next day or 2.

-Matthew
ID: 55955 · Rating: 0
Adrian Taylor
Joined: 6 Apr 08
Posts: 13
Credit: 139,088,163
RAC: 0
Message 55956 - Posted: 25 Oct 2012, 20:04:23 UTC - in response to Message 55955.  

great,

thanks for the update Matthew
ID: 55956 · Rating: 0
Sunny129
Joined: 25 Jan 11
Posts: 271
Credit: 346,072,284
RAC: 0
Message 55960 - Posted: 26 Oct 2012, 0:26:35 UTC

looking forward to the fix...my error rate is now up to 5%
ID: 55960 · Rating: 0
JHMarshall
Joined: 24 Jul 12
Posts: 39
Credit: 2,342,251,823
RAC: 6,900,605
Message 55964 - Posted: 26 Oct 2012, 5:29:37 UTC - in response to Message 55955.  

Matthew,

It's not just the "de_separation_22_3s_free_3" WUs that have problems. Many "de_separation_22_3s_edge_3" WUs are also failing on AMD CPUs, Intel CPUs, Nvidia cards, and AMD cards.

Joe
ID: 55964 · Rating: 0
John G
Joined: 1 Apr 10
Posts: 49
Credit: 171,863,025
RAC: 0
Message 55967 - Posted: 26 Oct 2012, 11:59:48 UTC - in response to Message 55964.  

Matthew,

It's not just the "de_separation_22_3s_free_3" WUs that have problems. Many "de_separation_22_3s_edge_3" WUs are also failing on AMD CPUs, Intel CPUs, Nvidia cards, and AMD cards.

Joe


I very much concur with Mr. Marshall's findings!

Regards

John G
ID: 55967 · Rating: 0
JHMarshall
Joined: 24 Jul 12
Posts: 39
Credit: 2,342,251,823
RAC: 6,900,605
Message 55973 - Posted: 26 Oct 2012, 20:36:13 UTC - in response to Message 55967.  

John,

Good to get confirmation that others are seeing the same failures.

Thanks,

Matthew, Travis,

I don't know if this is going to help, but instead of pulling my machines off MW, I'm going to reallocate my resources to MW (1 NVidia GTX 560Ti, 3 AMD HD 7950s, and 7 of 14 CPU cores spread across three machines). I'll either help clear out the bad WUs or add to the confusion.

By the way, I haven't seen any failures where my wingmen didn't also have failures. That brings up a question. If these WUs have bad data or something else wrong, how confident are you that the WUs that validate are really good?

Joe
ID: 55973 · Rating: 0
Blurf
Volunteer moderator
Project administrator
Joined: 13 Mar 08
Posts: 804
Credit: 26,380,161
RAC: 0
Message 55975 - Posted: 26 Oct 2012, 23:38:08 UTC - in response to Message 55973.  

John,

Good to get confirmation that others are seeing the same failures.

Thanks,

Matthew, Travis,

I don't know if this is going to help, but instead of pulling my machines off MW, I'm going to reallocate my resources to MW (1 NVidia GTX 560Ti, 3 AMD HD 7950s, and 7 of 14 CPU cores spread across three machines). I'll either help clear out the bad WUs or add to the confusion.

By the way, I haven't seen any failures where my wingmen didn't also have failures. That brings up a question. If these WUs have bad data or something else wrong, how confident are you that the WUs that validate are really good?

Joe


Thank you Joe...

ID: 55975 · Rating: 0

©2019 Astroinformatics Group