Welcome to MilkyWay@home

New Separation Runs 6/9/2021

paris
Joined: 26 Apr 08
Posts: 87
Credit: 64,801,496
RAC: 0
Message 70921 - Posted: 26 Jun 2021, 14:46:03 UTC

Same problem here. 50 invalids over the past three or four days.


Plus SETI Classic = 21,082 WUs
ID: 70921 · Rating: 0
Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 70928 - Posted: 27 Jun 2021, 13:34:20 UTC - in response to Message 70858.  

Well, if you wanted invalids on the gapfix_bgset3 runs, you got them by the hundreds. Personally, I don't like them. They are mainly occurring on CPU tasks, but some show up on GPU tasks.
ID: 70928 · Rating: 0
Vester
Joined: 30 Dec 14
Posts: 34
Credit: 909,988,687
RAC: 133
Message 70932 - Posted: 27 Jun 2021, 19:01:45 UTC

I currently have 452 invalid tasks on three different computers. The invalids are GPU tasks on AMD GPUs and CPU tasks on a Dell T-420 server with two Intel Xeon processors. I've given up trying to figure out whether I have hardware problems.
ID: 70932 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70933 - Posted: 27 Jun 2021, 19:11:57 UTC

Ugh, it looks like the invalid problems may not be fixed after all. I'll have to look into things in the next couple days.
ID: 70933 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70936 - Posted: 28 Jun 2021, 21:17:43 UTC
Last modified: 28 Jun 2021, 21:18:13 UTC

These validation problems are happening because different types of machines and setups are returning (slightly) different values for the likelihood of each workunit. The likelihoods are only off by a few parts in a million, and luckily we don't actually need the results to be that precise. I am looking into changing the validator for the project instead of taking down the runs, so that these stripes can get crunched. That may take a couple of days though, so thanks for your patience in the meantime.
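
To make "a few parts in a million" concrete, here is a minimal Python sketch of a tolerance check. It is not the project's validator: the two likelihood values, the function names, and both tolerances are made up for illustration.

```python
import math


def relative_difference(a: float, b: float) -> float:
    """Return |a - b| scaled by the larger magnitude (0.0 if both are zero)."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


def results_agree(a: float, b: float, rel_tol: float) -> bool:
    """Treat two likelihoods as matching if they differ by at most rel_tol."""
    return math.isclose(a, b, rel_tol=rel_tol)


# Hypothetical likelihoods for the same workunit from two different hosts.
cpu_likelihood = -3.1234567
gpu_likelihood = -3.1234630

# Express the disagreement in parts per million.
diff_ppm = relative_difference(cpu_likelihood, gpu_likelihood) * 1e6

# A very strict tolerance rejects the pair; a looser one accepts it.
strict = results_agree(cpu_likelihood, gpu_likelihood, rel_tol=1e-9)
relaxed = results_agree(cpu_likelihood, gpu_likelihood, rel_tol=1e-5)

print(f"difference: {diff_ppm:.2f} ppm, strict={strict}, relaxed={relaxed}")
```

With these made-up numbers the two results differ by roughly two parts per million, so whether they validate depends entirely on where the tolerance is set.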
ID: 70936 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 542,929,839
RAC: 151,984
Message 70937 - Posted: 29 Jun 2021, 0:17:31 UTC - in response to Message 70936.  

Will relaxing the validator limits hurt the science results for future task sets?
ID: 70937 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3319
Credit: 520,258,023
RAC: 20,480
Message 70938 - Posted: 29 Jun 2021, 1:52:19 UTC - in response to Message 70937.  

Will relaxing the validator limits hurt the science results for future task sets?


If so, maybe they can match up the PCs doing the crunching so that, e.g., Linux PCs only validate against each other, and the same with Windows PCs. With Windows being MUCH more represented in the stats page, it really shouldn't matter very much.
ID: 70938 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 542,929,839
RAC: 151,984
Message 70939 - Posted: 29 Jun 2021, 3:28:53 UTC - in response to Message 70938.  

Not sure where in the BOINC server code you can configure the scheduler for that change. I don't think it's possible. Hope a dev will chime in and tell me I'm wrong.
ID: 70939 · Rating: 0
alanb1951
Joined: 16 Mar 10
Posts: 210
Credit: 105,878,763
RAC: 26,220
Message 70940 - Posted: 29 Jun 2021, 3:32:07 UTC - in response to Message 70938.  

Will relaxing the validator limits hurt the science results for future task sets?


If so, maybe they can match up the PCs doing the crunching so that, e.g., Linux PCs only validate against each other, and the same with Windows PCs. With Windows being MUCH more represented in the stats page, it really shouldn't matter very much.

That might help quite a lot with CPU-to-CPU comparisons, as it would [hopefully] eliminate rounding inconsistencies due to different compilers doing different optimizations and ordering of operations within complicated expressions.

However, it might not help as much when a CPU job is matched against a GPU job, especially if the OpenCL is using fused multiply-add... Different GPUs have different attitudes to rounding and how close one can get to zero before a value is treated as zero - and that can be a "feature" of hardware versions from the same supplier! And I've seen tasks where a Windows CPU result wouldn't validate with a Windows GPU job, so splitting out by Operating System alone may not be enough...
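
A small Python sketch of the two effects described above: the order of operations changing a sum, and a fused multiply-add (one rounding) differing from a separate multiply and add (two roundings). The fused result is emulated here with exact rational arithmetic rather than real hardware FMA, and the input values are chosen purely to make the difference visible. None of this is taken from the project's code.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Round the product to a double, then round the sum (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: evaluate a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# 1) Same three numbers, different grouping, different last bit.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# 2) The exact product is 1 - 2**-54, which is not representable as a double.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

unfused = separate_multiply_add(a, b, c)  # product rounds to 1.0 first
fused = fused_multiply_add(a, b, c)       # keeps the tiny residual

print(left_to_right, right_to_left)
print(unfused, fused)
```

Differences like these are tiny on a single operation, but a likelihood is built from a very large number of them, which is one plausible way results from different compilers or devices could drift apart by a few parts in a million.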

I suspect that if Tom relaxes the validation constraints slightly we might still see some issues, but probably nowhere near as many. Without knowing the current required precision one can only speculate - so good luck to Tom!

Cheers - Al.
ID: 70940 · Rating: 0
Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 542,929,839
RAC: 151,984
Message 70941 - Posted: 29 Jun 2021, 5:13:11 UTC

Well, for an example of what is possible: I have not a single invalid or inconclusive task at TN-Grid, where there are 3 separate CPU applications in play on the same datasets: SSE2, AVX and FMA. I have validated against every other type of application run by my wingmen.

So it is possible to get the same results, or results close enough to be considered valid, if you set up the validator with relaxed enough parameters.
ID: 70941 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70942 - Posted: 30 Jun 2021, 2:52:11 UTC
Last modified: 30 Jun 2021, 2:54:46 UTC

There is a way to set up BOINC so that machines will only validate against other machines of the same type, but it seems like a pain and I believe it would force us to reconfigure a large portion of the project. Definitely not something I'm trying to do.
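
For readers wondering what "only validate against machines of the same type" would mean in practice, here is a rough Python sketch of the idea. BOINC documents a feature of this kind under the name homogeneous redundancy, but the code below does not use BOINC's API: the Result class, the host names, the grouping key, and the likelihood values are all hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    host: str          # hypothetical host label
    os_name: str       # e.g. "windows" or "linux"
    device: str        # e.g. "cpu" or "gpu"
    likelihood: float  # made-up value returned for one workunit


def group_by_machine_class(results):
    """Bucket results so only like-for-like machines are compared."""
    groups = defaultdict(list)
    for result in results:
        groups[(result.os_name, result.device)].append(result)
    return dict(groups)


def validated_hosts(results, rel_tol):
    """Return hosts whose result matches another result in the same class."""
    valid = set()
    for members in group_by_machine_class(results).values():
        for i, first in enumerate(members):
            for second in members[i + 1:]:
                if math.isclose(first.likelihood, second.likelihood, rel_tol=rel_tol):
                    valid.update({first.host, second.host})
    return valid


results = [
    Result("win-a", "windows", "cpu", -3.1234567),
    Result("win-b", "windows", "cpu", -3.1234567),
    Result("lin-a", "linux", "gpu", -3.1234630),
    Result("lin-b", "linux", "gpu", -3.1234630),
    Result("solo", "linux", "cpu", -3.1234601),
]

# Even with a very strict tolerance, like-for-like pairs agree.
validated = validated_hosts(results, rel_tol=1e-9)
print(sorted(validated))
```

The sketch also shows one cost of this approach: a result with no partner in its own class (the "solo" host here) cannot validate until another machine of the same type returns the same workunit.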

When I was digging through workunit validations, it looked like the failed validations were all working but were just off by something like +/-0.00001, probably due to inconsistencies in CPU vs GPU or CPU compiler options, etc. I am planning on turning down (up?) the tolerance so that those slightly different results are accepted. I was looking through the server code today to try to find where to do that, but didn't see it. I will keep working on this so that hopefully things will be fixed soon.

I'm not sure if a lowered validation threshold could potentially cause problems with the project science in the future... I'm not immediately sure why it would, as long as the threshold isn't set to be too lenient. It could just as easily be that we are already using too stringent a validation threshold right now! Anyway, if it causes problems with the science, hopefully it would just be a matter of changing the tolerance back to the value it was before.
ID: 70942 · Rating: 0
Nigel Conway
Joined: 11 Feb 14
Posts: 4
Credit: 321,527
RAC: 0
Message 70943 - Posted: 30 Jun 2021, 16:38:39 UTC - in response to Message 70858.  

Hi! FYI, I have had validate errors for tasks 231095611 and 231095541!
Nigel.
ID: 70943 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70944 - Posted: 1 Jul 2021, 18:03:35 UTC

The good news is that I've found the tolerance setting. The bad news is that in order to change it, I have to rebuild a couple binaries. I'm going to be working on this for a little while, but I think that it may require a server restart (or at least a temporary outage of a few services). I'll keep everyone up to date with progress as I figure more out.

The validation errors seem to be holding steady at <8% for the time being, which isn't ideal, but it's not a lot and it doesn't seem to be increasing very quickly.
ID: 70944 · Rating: 0
Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 70948 - Posted: 10 Jul 2021, 0:58:50 UTC - in response to Message 70944.  

I am seeing 11.44% invalids and it is increasing daily. I have seen reports where four computers (2 Windows 10 and 2 Linux) tried to evaluate 4 completed tasks. 2 tasks were judged to be invalid: 1 Windows 10 and 1 Linux. That makes me think the problem doesn't reside in the operating system.
Keep smilin'.
ID: 70948 · Rating: 0
AquaBoy
Joined: 30 Apr 18
Posts: 4
Credit: 26,042,464
RAC: 0
Message 70951 - Posted: 10 Jul 2021, 9:02:23 UTC

Maybe it would be better to eventually assign credits for workunits that have validation errors? The point is that our PCs spend their time calculating results, so they deserve credits anyway. As far as I know, some projects support this feature.
ID: 70951 · Rating: 0
paris
Joined: 26 Apr 08
Posts: 87
Credit: 64,801,496
RAC: 0
Message 70952 - Posted: 10 Jul 2021, 13:16:04 UTC

My RAC has decreased from around 33,500 to under 25,000. I'm running about 20% invalid. I hope something can be done and that this is not going to be the "cost of doing business". Nevertheless, I enjoy running the project but I would like to see all of my work count.


Plus SETI Classic = 21,082 WUs
ID: 70952 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 70953 - Posted: 10 Jul 2021, 17:14:53 UTC

I have made changes to the validator and it is currently being run on the test server. I apologize for the high rate of invalids in the meantime, but a fix is coming. I would rather make sure that this new validator works than put a broken validator up on the server.
ID: 70953 · Rating: 0
Jimbocous
Joined: 7 Mar 20
Posts: 22
Credit: 104,948,317
RAC: 12,248
Message 70957 - Posted: 11 Jul 2021, 8:21:07 UTC
Last modified: 11 Jul 2021, 8:23:33 UTC

Seeing about a 20% invalid rate on my running hosts as well
(1 linux, 1 win10, both running CPU tasks only).
Good to see in the forum here the issue's being worked.
I'll take another look when I hear the change hits production.
Thanks!
ID: 70957 · Rating: 0
Cavalary
Joined: 23 Aug 11
Posts: 33
Credit: 11,062,253
RAC: 0
Message 70959 - Posted: 13 Jul 2021, 2:16:30 UTC

Sure got worried when I saw my RAC drop by some 40% and a whole bunch of validate errors. But maybe it's just this issue and not my computer...
ID: 70959 · Rating: 0
Siran d'Vel'nahr
Joined: 1 Jul 08
Posts: 88
Credit: 25,079,058
RAC: 0
Message 70960 - Posted: 13 Jul 2021, 9:15:27 UTC - in response to Message 70951.  

Maybe it would be better to eventually assign credits for workunits that have validation errors? The point is that our PCs spend their time calculating results, so they deserve credits anyway. As far as I know, some projects support this feature.

Hi AB,

I'm beginning to agree with you in that our PCs do the work without errors just to have the validation system err for whatever reason and we should be compensated for the work done. I don't remember the project I was on some time ago, but they had the system set to credit us for the work done on tasks that had validation errors. My RAC here was finally back up to 121K and now it has dropped to just over 102K, soon to drop under that. :-(

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO - L L & P _\\//
USS Vre'kasht NCC-33187
Winders 10 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 70960 · Rating: 0