Getting validate errors
Author | Message |
---|---|
Joined: 18 Feb 10 Posts: 53 Credit: 221,720,860 RAC: 4,505 |
I've been getting some validate errors on my Linux host over the last few days. At first I thought it was the RTX 2080 Super card failing, but it also happens with the GTX 1080 card. I have done some updates on the host lately: Mint 20.1, the Nvidia driver to 460.x, and the Linux kernel. I tried going back to Nvidia 450.x, but that didn't help. I have also updated the kernel since I first noticed the errors, but no go. Any idea what could be wrong? https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=816425&offset=0&show_names=0&state=5&appid= |
Joined: 16 Mar 10 Posts: 210 Credit: 106,067,881 RAC: 24,034 |
See Tom Donlon's announcement in the News thread regarding the latest Separation runs. This was a possibility for stripe 85, and it has just become a reality.. Nothing wrong with your kit - nothing to worry about as long as it is only happening on modfit_85 and you are only getting validate errors... Cheers - Al. P.S. I've got a few dozen validate errors too!... |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hey All, For whatever reason, the Stripe 84 and 85 Separation runs sometimes start getting these validate errors after a couple of weeks. I want to try to crunch workunits for these stripes as long as possible, but I also don't want to waste people's cycles. What percentage of runs end up in these validate errors? If it is greater than 10% or so, I am willing to take the runs down, but if it is only a few workunits here and there I would like to try to get more optimization done. Tom |
Joined: 9 Dec 11 Posts: 38 Credit: 1,497,896,956 RAC: 0 |
Hey All, Personally, I'm seeing around 0.5% validate errors relative to valid work units. I was a bit concerned about the raw number of validate errors until I compared them to my valid results, and with you saying that anything under about 10% is OK (obviously the less the better), I'm feeling better about it. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I'll admit the 10% line is an arbitrary value that I thought was reasonable, but I'm just trying to walk the tightrope between getting science done and keeping the volunteers happy. I don't want you to spend cycles crunching workunits that cause problems or don't get you any credits! I'm glad to hear that the relative number of problematic workunits is small and that you are all understanding. |
Joined: 24 Jan 11 Posts: 708 Credit: 543,855,181 RAC: 126,906 |
I'm at 5% validate errors on these 85 stripes. Climbing trend. |
Joined: 16 Mar 10 Posts: 210 Credit: 106,067,881 RAC: 24,034 |
Of about 30 stripe 85 tasks I've processed in the last 48 hours [and for which data is still visible on the web site] almost every one has either had at least one Invalid return (not always mine!) or is sitting waiting for validation with two or more already failing to match. Looks like the platform/driver differences (in rounding behaviour or handling of very small values?) are starting to bite as they did in previous runs... Ah, well, we can but try! Tom - good luck deciding when to stop this stripe; it is definitely getting worse, but if it's really close to converging and the majority of tasks only have one user failing to validate it's probably only annoying if it's always happening to the same user(s)! For me it's nowhere near as annoying as having much longer Einstein@home tasks failing to validate with no real clue as to why! Cheers - Al. |
Joined: 30 Dec 14 Posts: 34 Credit: 909,988,687 RAC: 67 |
My validate errors: Currently 1845. This is a stable Windows 10 rig (ASUS B250 Mining Expert) currently running three Radeon HD 7990 video cards (six GPUs) with Blockchain Beta video drivers. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Stripes 84 and 85 are getting somewhat close to converging, but they are not where I would like them to be yet. I'm keeping a close eye on these threads though in order to make sure things don't break down while we try to get some more runs done. |
Joined: 30 Dec 14 Posts: 34 Credit: 909,988,687 RAC: 67 |
I changed from Blockchain to Radeon driver 20.12.1 (WHQL) of 12/4/2020 and changed the number of WUs to 1 per GPU, but saw no improvement. I have reverted to 5 WUs per GPU. My errors persist. The rig runs VRAM at 150 MHz instead of 1500 MHz to aid cooling, but VRAM speed did not affect output in the past and I do not expect that to be the issue. In reviewing the WUs involved, I see no correlation among other users that would indicate that OpenCL 2.1 is an issue. I'll be glad when stripe 85 is done! :-) |
Joined: 24 Jan 11 Posts: 708 Credit: 543,855,181 RAC: 126,906 |
"I'll be glad when stripe 85 is done! :-)" It's done. Tom pulled them an hour ago. Never converged. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Yeah, the Stripe got pulled. Stripes 84 and 85 have gaps in them due to interstellar dust problems, and I think these gaps cause problems when running these workunits. Between the several runs that we did of these stripes we should have enough data to make educated guesses about what is going on in the stripes though. :) |
Joined: 19 Aug 19 Posts: 3 Credit: 7,529,662 RAC: 0 |
Today I already got 2 validation errors on modfit_84. Nothing I like very much... Dirk |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Stripe 84 has already been pulled; the validation errors you got are probably from validation workunits trickling in since I stopped the run. |
Joined: 24 Jan 11 Posts: 708 Credit: 543,855,181 RAC: 126,906 |
The stripe 84 and 85 resends continue to pour in. I try and abort them as much as I can every few hours in the hopes they will be pulled by the server for too many errors. Tedious work to go through my caches. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I'm not aware of any way to terminate existing workunits besides manually going through each one and deleting it from the database, which would also cause problems with our results. Unfortunately we will just have to be patient until the runs finish validating. |
Joined: 9 Dec 11 Posts: 38 Credit: 1,497,896,956 RAC: 0 |
No worries here. Just crunchin away on whatever comes next. Ya did what ya could and from my personal point of view it's all good. thanks for your work :) Besides, my 84 and 85 trickle is way down now so life is good in the cosmos. |
Joined: 30 Dec 14 Posts: 34 Credit: 909,988,687 RAC: 67 |
Workunit 21351206, de_modfit_85_bundle4_4s_south4s_bgset_7_1612355394_3477041, failed to validate on two AMD GPUs including one of my HD 7990s. This was my first invalid WU in seven days. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
So it begins. Please keep monitoring the workunits that get validate errors; when they reach ~10% of stripe 84 and 85, I'll take them down. |
Joined: 9 Dec 11 Posts: 38 Credit: 1,497,896,956 RAC: 0 |
I've seen like 2 since the maintenance. At that rate it could be something totally unrelated to the WUs themselves, such as upload/download corruption or some such. Will most def keep an eye on things though. Great to see others are doing the same :) |
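
For anyone who wants to check their own host against the rough ~10% line discussed in this thread, here is a minimal sketch of the arithmetic. It assumes you read the "valid" and "validate error" task counts off your host's task pages by hand. The function names, the example counts, and the choice to divide by valid plus validate-error tasks are illustrative assumptions, not part of the project's tooling or Tom's actual decision process.

```python
# Rule-of-thumb threshold mentioned in the thread (described there as arbitrary).
THRESHOLD = 0.10


def validate_error_rate(valid: int, validate_errors: int) -> float:
    """Return validate errors as a fraction of (valid + validate-error) tasks.

    Assumption: pending and in-progress tasks are left out of the total.
    """
    total = valid + validate_errors
    if total == 0:
        return 0.0  # no finished tasks yet, so nothing to report
    return validate_errors / total


def exceeds_threshold(valid: int, validate_errors: int,
                      threshold: float = THRESHOLD) -> bool:
    """True when the validate-error share is at or above the threshold."""
    return validate_error_rate(valid, validate_errors) >= threshold


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    rate = validate_error_rate(valid=398, validate_errors=2)
    print(f"{rate:.1%}")  # prints 0.5%
    print(exceeds_threshold(valid=398, validate_errors=2))  # prints False
```

A handful of errors against a few hundred valid results stays well under the line, which is why comparing the raw error count to the valid count matters more than the raw count alone.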
©2024 Astroinformatics Group