Welcome to MilkyWay@home

Getting validate errors



Message boards : Number crunching : Getting validate errors

JohnDK
Joined: 18 Feb 10
Posts: 19
Credit: 187,827,998
RAC: 161,334
Message 70394 - Posted: 19 Jan 2021, 10:28:44 UTC

I've been getting some validate errors on my Linux host over the last few days. At first I thought it was the RTX 2080 Super card failing, but it also happens with the GTX 1080 card.

I have done some updates on the host lately: Mint 20.1, the Nvidia driver to 460.x, and the Linux kernel. I tried going back to Nvidia 450.x, but that didn't help. I have also updated the kernel since I first noticed the errors, but no go.

So any idea what could be wrong?

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=816425&offset=0&show_names=0&state=5&appid=

alanb1951
Joined: 16 Mar 10
Posts: 65
Credit: 65,346,251
RAC: 43,857
Message 70396 - Posted: 19 Jan 2021, 15:10:11 UTC - in response to Message 70394.  
Last modified: 19 Jan 2021, 15:11:09 UTC

See Tom Donlon's announcement in the News thread regarding the latest Separation runs. This was a possibility for stripe 85, and it has just become a reality.

Nothing wrong with your kit - nothing to worry about as long as it is only happening on modfit_85 and you are only getting validate errors...

Cheers - Al.

P.S. I've got a few dozen validate errors too!...

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70400 - Posted: 19 Jan 2021, 17:03:36 UTC

Hey All,

For whatever reason, the Stripe 84 and 85 Separation runs sometimes start getting these validate errors after a couple of weeks. I want to try to crunch workunits for these stripes as long as possible, but I also don't want to waste people's cycles. What percentage of runs end up in these validate errors? If it is greater than 10% or so, I am willing to take the runs down, but if it is only a few workunits here and there I would like to try to get more optimization done.

Tom

Holdolin
Joined: 9 Dec 11
Posts: 33
Credit: 1,041,621,794
RAC: 0
Message 70403 - Posted: 19 Jan 2021, 17:38:50 UTC - in response to Message 70400.  

Tom Donlon wrote:
> Hey All,
>
> For whatever reason, the Stripe 84 and 85 Separation runs sometimes start getting these validate errors after a couple of weeks. I want to try to crunch workunits for these stripes as long as possible, but I also don't want to waste people's cycles. What percentage of runs end up in these validate errors? If it is greater than 10% or so, I am willing to take the runs down, but if it is only a few workunits here and there I would like to try to get more optimization done.
>
> Tom

Personally, I'm seeing around 0.5% validate errors relative to my valid work units. I was a bit concerned about the raw number of validate errors until I compared them to my valid results, and with you saying that anything under roughly 10% is OK (obviously the less the better), I'm feeling better about it.

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70404 - Posted: 19 Jan 2021, 17:46:11 UTC - in response to Message 70403.  
Last modified: 19 Jan 2021, 17:47:40 UTC

I'll admit the 10% line is an arbitrary value that I thought was reasonable, but I'm just trying to walk the tightrope between getting science done and keeping the volunteers happy. I don't want you to spend cycles crunching workunits that cause problems or don't get you any credits!
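As a rough illustration of the check being discussed, the share of validate errors can be worked out from a host's task counts and compared against the 10% line. This is only a sketch, not project code: the function names and the sample counts are hypothetical, and the 10% threshold is simply the arbitrary value mentioned above.

```python
def validate_error_rate(valid: int, validate_errors: int) -> float:
    """Return validate errors as a fraction of all tasks that reached validation."""
    total = valid + validate_errors
    if total == 0:
        return 0.0  # no tasks yet, so there is nothing to report
    return validate_errors / total


def over_threshold(valid: int, validate_errors: int, threshold: float = 0.10) -> bool:
    """True when the error share exceeds the (assumed) take-down threshold."""
    return validate_error_rate(valid, validate_errors) > threshold


# Hypothetical counts for one host, chosen only for illustration.
rate = validate_error_rate(valid=1990, validate_errors=10)
print(f"{rate:.1%} validate errors, over threshold: {over_threshold(1990, 10)}")
```

Counting against the total (valid plus failed) rather than against valid tasks alone keeps the figure between 0% and 100%, which makes it easier to compare across hosts.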

I'm glad to hear that the relative number of problematic workunits is small and that you are all understanding.

Keith Myers
Joined: 24 Jan 11
Posts: 474
Credit: 355,586,873
RAC: 273,920
Message 70405 - Posted: 19 Jan 2021, 17:51:52 UTC - in response to Message 70404.  

I'm at 5% validate errors on these stripe 85 tasks. Climbing trend.

alanb1951
Joined: 16 Mar 10
Posts: 65
Credit: 65,346,251
RAC: 43,857
Message 70406 - Posted: 20 Jan 2021, 4:15:13 UTC

Of about 30 stripe 85 tasks I've processed in the last 48 hours [and for which data is still visible on the web site] almost every one has either had at least one Invalid return (not always mine!) or is sitting waiting for validation with two or more already failing to match.

Looks like the platform/driver differences (in rounding behaviour or handling of very small values?) are starting to bite as they did in previous runs... Ah, well, we can but try!
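To make the suspected mechanism concrete: if two hosts add up the same terms in a different order, their results can differ in the last digit, and a strict comparison would then reject the pair. The sketch below is only an illustration in Python. The `naive_sum` and `results_match` helpers, the sample values, and the tolerances are all assumptions, not the project's actual validator.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point accumulation (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def results_match(a: float, b: float, rel_tol: float = 1e-12) -> bool:
    """Compare two results with a relative tolerance (hypothetical validator rule)."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


# The same three terms, accumulated in a different order on two "hosts".
host_a = naive_sum([0.1, 0.2, 0.3])
host_b = naive_sum([0.3, 0.2, 0.1])

print(host_a == host_b)                               # strict equality fails
print(results_match(host_a, host_b))                  # a loose tolerance accepts the pair
print(results_match(host_a, host_b, rel_tol=1e-17))   # an overly tight tolerance rejects it
```

The point is only that the outcome depends on how tight the comparison is relative to the rounding noise; near convergence, where the quantities being compared are very small, the same noise is proportionally larger.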

Tom - good luck deciding when to stop this stripe; it is definitely getting worse, but if it's really close to converging and the majority of tasks only have one user failing to validate it's probably only annoying if it's always happening to the same user(s)! For me it's nowhere near as annoying as having much longer Einstein@home tasks failing to validate with no real clue as to why!

Cheers - Al.

Vester
Joined: 30 Dec 14
Posts: 33
Credit: 907,367,162
RAC: 340
Message 70410 - Posted: 20 Jan 2021, 12:55:36 UTC
Last modified: 20 Jan 2021, 12:57:15 UTC

My validate errors: Currently 1845. This is a stable Windows 10 rig (ASUS B250 Mining Expert) currently running three Radeon HD 7990 video cards (six GPUs) with Blockchain Beta video drivers.

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70411 - Posted: 20 Jan 2021, 15:45:24 UTC

Stripes 84 and 85 are getting somewhat close to converging, but they are not where I would like them to be yet. I'm keeping a close eye on these threads, though, to make sure things don't break down while we try to get some more runs done.

Vester
Joined: 30 Dec 14
Posts: 33
Credit: 907,367,162
RAC: 340
Message 70422 - Posted: 22 Jan 2021, 13:32:05 UTC

I changed from Blockchain to Radeon driver 20.12.1 (WHQL) of 12/4/2020 and changed the number of WUs to 1 per GPU, but saw no improvement. I have reverted to 5 WUs per GPU. My errors persist. The rig runs VRAM at 150 MHz instead of 1500 MHz to aid cooling, but VRAM speed did not affect output in the past and I do not expect that to be the issue. In reviewing the WUs involved, I see no correlation among other users that would indicate that OpenCL 2.1 is an issue.

I'll be glad when stripe 85 is done! :-)

Keith Myers
Joined: 24 Jan 11
Posts: 474
Credit: 355,586,873
RAC: 273,920
Message 70427 - Posted: 22 Jan 2021, 22:35:14 UTC - in response to Message 70422.  

Vester wrote:
> I'll be glad when stripe 85 is done! :-)

It's done. Tom pulled them an hour ago. Never converged.

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70428 - Posted: 22 Jan 2021, 23:25:07 UTC

Yeah, the Stripe got pulled. Stripes 84 and 85 have gaps in them due to interstellar dust problems, and I think these gaps cause problems when running these workunits. Between the several runs that we did of these stripes we should have enough data to make educated guesses about what is going on in the stripes though. :)

Dirk Riesener
Joined: 19 Aug 19
Posts: 3
Credit: 7,529,662
RAC: 1,521
Message 70444 - Posted: 26 Jan 2021, 12:55:16 UTC - in response to Message 70396.  

Today I already got 2 validation errors on modfit_84. Nothing I like very much...
Dirk

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70446 - Posted: 26 Jan 2021, 14:45:10 UTC

Stripe 84 has already been pulled; the validation errors you got are probably from validation workunits trickling in since I stopped the run.

Keith Myers
Joined: 24 Jan 11
Posts: 474
Credit: 355,586,873
RAC: 273,920
Message 70453 - Posted: 27 Jan 2021, 2:17:11 UTC

The stripe 84 and 85 resends continue to pour in. I try to abort as many as I can every few hours in the hope that the server will pull them for too many errors.
Tedious work to go through my caches.

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70456 - Posted: 27 Jan 2021, 17:50:55 UTC

I'm not aware of any way to terminate existing workunits besides manually going through each one and deleting it from the database, which would also cause problems with our results. Unfortunately we will just have to be patient until the runs finish validating.

Holdolin
Joined: 9 Dec 11
Posts: 33
Credit: 1,041,621,794
RAC: 0
Message 70457 - Posted: 27 Jan 2021, 17:55:54 UTC - in response to Message 70456.  
Last modified: 27 Jan 2021, 17:56:54 UTC

No worries here. Just crunchin away on whatever comes next. Ya did what ya could and from my personal point of view it's all good. Thanks for your work :)

Besides, my 84 and 85 trickle is way down now so life is good in the cosmos.

Vester
Joined: 30 Dec 14
Posts: 33
Credit: 907,367,162
RAC: 340
Message 70558 - Posted: 9 Feb 2021, 7:45:46 UTC
Last modified: 9 Feb 2021, 7:48:33 UTC

Workunit 21351206, de_modfit_85_bundle4_4s_south4s_bgset_7_1612355394_3477041, failed to validate on two AMD GPUs including one of my HD 7990s. This was my first invalid WU in seven days.

Tom Donlon
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 148
Credit: 60,667,208
RAC: 141,494
Message 70559 - Posted: 9 Feb 2021, 13:37:19 UTC - in response to Message 70558.  

So it begins. Please keep monitoring the workunits that get validate errors; when they reach ~10% of the stripe 84 and 85 workunits, I'll take the runs down.

Holdolin
Joined: 9 Dec 11
Posts: 33
Credit: 1,041,621,794
RAC: 0
Message 70561 - Posted: 9 Feb 2021, 16:03:27 UTC - in response to Message 70559.  

I've seen like 2 since the maintenance. At that rate it could be something totally unrelated to the WUs themselves, such as upload/download corruption or some such. Will most def keep an eye on things though. Great to see others are doing the same :)

©2021 Astroinformatics Group