Large surge of Invalid results and Validate errors on ALL machines

Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 68754 - Posted: 19 May 2019, 7:27:22 UTC
Last modified: 19 May 2019, 7:28:06 UTC

I've been checking the top hosts and noticed thousands of Invalid results on ALL machines, including mine. They started to appear in numbers four days ago and have only been increasing since then. Since it's impossible that all machines have gone bad at the same time, I guess it's a validation problem of some sort which needs an urgent fix - we are wasting a lot of computation power right now.
ID: 68754

PoppaGeek
Joined: 23 Feb 09
Posts: 4
Credit: 270,159,721
RAC: 0
Message 68755 - Posted: 19 May 2019, 10:00:31 UTC - in response to Message 68754.  

Same here. I thought something had gone wrong on mine.
ID: 68755

JHMarshall
Joined: 24 Jul 12
Posts: 40
Credit: 7,123,301,054
RAC: 0
Message 68757 - Posted: 19 May 2019, 19:50:46 UTC

Ditto!! Validate errors on both Nvidia and AMD (ati) tasks.
ID: 68757

Ulrich Metzner
Joined: 11 Apr 15
Posts: 58
Credit: 63,291,127
RAC: 0
Message 68767 - Posted: 21 May 2019, 13:07:49 UTC

ID: 68767

CoalZombik
Joined: 17 Apr 19
Posts: 1
Credit: 8,487,423
RAC: 0
Message 68768 - Posted: 21 May 2019, 14:41:42 UTC

I have the same problem - some of my Separation results are invalid.
I tested the stability of my Intel CPU and nVidia GPU, but there were no errors during the test.
ID: 68768

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68769 - Posted: 21 May 2019, 17:14:18 UTC
Last modified: 21 May 2019, 17:32:58 UTC

I have noticed for some time that the number of invalids fluctuates. I had thought it was the driver I was using.
I tried to find where the "results" are calculated so I could do my own validation,
using the sources that I found here.
Could one of the developers comment on my last item at the bottom?

My Analysis---

the program "separation_main" calls a worker that iterates through 4 "work units".
---right there is an indication there should be 4 results, not just a single item to validate.

the worker calls "evaluate" before cleaning up and exiting.

evaluate, toward its end, calculates a likelihood and then does a "printSeparationResults"
before cleaning up and exiting.
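As a toy rendering of that flow (my own sketch with made-up, simplified names, not the project's actual code), each task is really a small loop over its bundled fits, and each fit prints its own result block:

/* Toy sketch only: hypothetical simplified names, not the real separation source. */
#include <stdio.h>
#include <math.h>

#define N_BUNDLED 4   /* four fits bundled into one task */

/* stand-in for the real integral + likelihood computation */
static double evaluate(int i)
{
    double likelihood = -3.5 - 0.1 * i;   /* dummy value */
    if (!isfinite(likelihood))            /* the real app logs "Non-finite result: setting likelihood to -999" */
        likelihood = -999.0;
    /* stand-in for printSeparationResults(): each bundled fit prints its own block */
    printf("<search_likelihood%d> %.15f </search_likelihood%d>\n", i, likelihood, i);
    return likelihood;
}

int main(void)   /* stand-in for separation_main calling the worker */
{
    for (int i = 0; i < N_BUNDLED; ++i)   /* the worker iterates over the bundled fits */
        evaluate(i);
    return 0;
}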

Out of curiosity I looked at my wingmen's results for the work unit.
There were 2 invalid tasks (one of them mine, the 3rd one below) and 2 valid tasks for the work unit.
This will be gone from the database eventually, but the workunit is here.

I was surprised to see that the "printSeparationResults" output from all 4 systems either differed only after the 12th decimal place in every result or was identical in every digit.


task 224802410, nvidia 1080TI VALID ===================

Running likelihood with 31815 stars
Likelihood time = 0.856660 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911870 43.219889454989044 -0.000000000000001 2.115804393012327 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 -1.#IND00000000000 -109.139579663895260 </stream_only_likelihood3>


task 224932122 ATI RX560 INVALID ====================

Running likelihood with 31815 stars
Likelihood time = 1.268224 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911757 43.219889454989008 -0.000000000000001 2.115804393012326 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 nan -109.139579663895262 </stream_only_likelihood3>


task 224990755 ATI S9000 INVALID======================

Running likelihood with 31815 stars
Likelihood time = 1.024323 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911870 43.219889454989044 -0.000000000000001 2.115804393012327 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 -1.#IND00000000000 -109.139579663895260 </stream_only_likelihood3>


task 225048095 ATI RX470 VALID=======================

Running likelihood with 31815 stars
Likelihood time = 0.988171 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911870 43.219889454989044 -0.000000000000001 2.115804393012327 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 -1.#IND00000000000 -109.139579663895260 </stream_only_likelihood3>


=================IN CONCLUSION FOR WHAT IT IS WORTH===========
Results 3 & 4 above are identical to every decimal digit, but only the last one is valid.
Results 1 & 2 differ only at the 12th or 13th decimal digit, but only the first one is valid.

Since there seem to be 4 sub-results bundled in each work unit, maybe there is additional testing at the server end when the result arrives.
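For what it's worth, the server-side check could be as simple as a relative-tolerance compare of the reported numbers. This is only a guess at the kind of test involved - the tolerance and the NaN handling below are my assumptions, not the MilkyWay validator's actual logic:

/* Hypothetical fuzzy compare, NOT the project's validator. */
#include <stdio.h>
#include <math.h>
#include <stdbool.h>

static bool values_match(double a, double b, double relTol)
{
    if (isnan(a) || isnan(b))                  /* "nan" and "-1.#IND" are both NaN, just printed differently */
        return isnan(a) && isnan(b);
    double scale = fmax(fabs(a), fabs(b));
    return fabs(a - b) <= relTol * fmax(scale, 1.0);
}

int main(void)
{
    /* stream_integral3 values from tasks 1 and 2 above: 13th-digit noise should pass */
    printf("%d\n", values_match(136.269058837911870, 136.269058837911757, 1e-10));   /* prints 1 */
    /* a genuinely different value should fail */
    printf("%d\n", values_match(-3.680025259872701, -3.679, 1e-10));                 /* prints 0 */
    return 0;
}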


====================SOME FOOD FOR THOUGHT===============

There is a "WTF" moment in "prob_ok" in the file "separation_utils":
/* FIXME: WTF? */
/* FIXME: lack of else leads to possibility of returned garbage */
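To spell out the hazard that second comment is warning about (a generic illustration, not the actual prob_ok body):

/* Generic illustration of a missing-else bug, not the real prob_ok(). */
#include <stdio.h>
#include <stdlib.h>

static double pick_weight(double p)
{
    double w;                 /* note: never initialized */
    if (drand48() < p)
        w = 1.0;
    /* no else branch: when the test fails, w is returned uninitialized ("garbage") */
    return w;
}

int main(void)
{
    for (int i = 0; i < 5; ++i)
        printf("%g\n", pick_weight(0.5));   /* roughly half the prints are indeterminate */
    return 0;
}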

According to GitHub this file has not been changed in 7 years, so one can draw one of the following conclusions:
(1) It has not been looked at since the comment was made 7 years ago.
(2) It was looked at and analyzed, but it made no difference in the outcome and was not worth the trouble of changing the comment.
(3) It was looked at, but nobody could figure out WTF was going on, so it was left for another grad student to fix.
ID: 68769

alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 108,372,750
RAC: 3,985
Message 68770 - Posted: 22 May 2019, 2:35:14 UTC - in response to Message 68769.  

=================IN CONCLUSION FOR WHAT IT IS WORTH===========
Results 3 & 4 above are identical to every decimal digit, but only the last one is valid.
Results 1 & 2 differ only at the 12th or 13th decimal digit, but only the first one is valid.

Since there seem to be 4 sub-results bundled in each work unit, maybe there is additional testing at the server end when the result arrives.



Not quite, I'm afraid... If you look a little earlier in the two invalid results, you'll discover significant discrepancies in the third stream_only_likelihood values as indicated below (I've only cited one of the validated results...)

task 224802410, nvidia 1080TI VALID ===================

<stream_only_likelihood> -4.108396938608555 -3.238645743416023 -224.702400078412690 -59.273665947044705 </stream_only_likelihood>
<search_likelihood> -2.699214438485444 </search_likelihood>
...
<stream_only_likelihood1> -3.569118173756952 -3.226581525684845 -1.#IND00000000000 -88.910516593562207 </stream_only_likelihood1>
<search_likelihood1> -999.000000000000000 </search_likelihood1>

task 224932122 ATI RX560 INVALID ====================

<stream_only_likelihood1> -3.569118173756952 -3.226581525684845 -223.633156025982856 -88.910516593562207 </stream_only_likelihood1>
<search_likelihood1> -2.696904747021611 </search_likelihood1>

task 224990755 ATI S9000 INVALID======================

<stream_only_likelihood> -4.108396938608555 -3.238645743416023 -224.852854109725940 -59.273665947044705 </stream_only_likelihood>
<search_likelihood> -2.700466143836502 </search_likelihood>

That is consistent with what I've been finding in my recent cluster of invalid results -- it always seems to be a significant difference in that third stream_only_likelihood value; in some cases (as in one of the above), one result has been recognized as non-finite and another hasn't, whilst in others that value is typically up around the -227 to -230 level and differs significantly.

I suspect we have data and parameters that are producing results on the edge of what can be calculated; some of them diverge and not all GPUs diverge at quite the same rate, possibly because of different chunk sizes in use on cards with different amounts of available memory(?) If that is the case, I have no idea whether anything can be done about it :-(
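As a toy demonstration of how chunking alone can move a result (nothing to do with the real kernels, just ordinary floating-point non-associativity; the totals typically differ only in the last digit or two):

/* Toy demo: summing the same numbers with different chunk sizes gives
   slightly different totals, because FP addition is not associative. */
#include <stdio.h>

static double chunked_sum(const double *x, int n, int chunk)
{
    double total = 0.0;
    for (int i = 0; i < n; i += chunk) {
        double partial = 0.0;
        for (int j = i; j < n && j < i + chunk; ++j)
            partial += x[j];
        total += partial;      /* different chunk sizes -> different rounding */
    }
    return total;
}

int main(void)
{
    enum { N = 100000 };
    static double x[N];
    for (int i = 0; i < N; ++i)
        x[i] = 1.0 / (1.0 + i);   /* values spanning several orders of magnitude */

    printf("chunk=   1: %.17g\n", chunked_sum(x, N, 1));
    printf("chunk=  64: %.17g\n", chunked_sum(x, N, 64));
    printf("chunk=4096: %.17g\n", chunked_sum(x, N, 4096));
    return 0;
}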

Cheers - Al.
ID: 68770

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68772 - Posted: 22 May 2019, 3:24:36 UTC - in response to Message 68770.  
Last modified: 22 May 2019, 4:12:11 UTC


Not quite, I'm afraid... If you look a little earlier in the two invalid results, you'll discover significant discrepancies in the third stream_only_likelihood values as indicated below (I've only cited one of the validated results...)


Thanks, I missed that; I did not purposely omit it. However, the bottom two are identical: my S9000 (invalid) and then the valid RX470.

Your comment about the chunk size and different convergences is correct, especially since the algorithm uses random numbers as part of its attempts to predict a likelihood.
ID: 68772

JohnMD
Joined: 11 Jul 08
Posts: 13
Credit: 10,015,444
RAC: 0
Message 68778 - Posted: 22 May 2019, 20:24:12 UTC

I have only CPU tasks - about 15% failure.
One common factor is that ALL my affected WUs give 244 credits.
In addition, many of the WUs have other error tasks - mostly other GPUs but also CPUs.
This leads me to the conclusion that the application is not consistent for the larger WUs across different environments, where differing precision in intermediate values could be the culprit.
ID: 68778

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 68779 - Posted: 24 May 2019, 16:47:35 UTC

Hey all,

I don't have any answers for you at the moment as to what the problem with the separation application is. The current separation runs have been up for weeks, so it's a bit confusing that they're only now producing invalid results. I haven't changed anything in the source code or the runs since they were put up 24 days ago.

Please know that I'm looking into it and hopefully will have a solution or explanation of the problem soon. All of your work is extremely appreciated and your assistance debugging helps a lot.

Best,
Tom
ID: 68779

kksplace
Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68780 - Posted: 25 May 2019, 11:03:39 UTC

My validate errors seem to be of one specific work unit type -- the de_modfit_84_xxxxx series (crunching GPU only for MilkyWay). The vast majority of these work units result in validate errors on both of my hosts. The only other consistency I see (I am not a coder) is that both my hosts are Linux (Mint), and when looking at the wingmen who end up validating the WU, they are all Windows. I have seen several cases where another Linux host has failed validation on the same work unit. All the other MilkyWay GPU WUs seem to be doing fine.

Hopefully this helps, or someone can point me towards an individual solution.

For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.
ID: 68780

PoppaGeek
Joined: 23 Feb 09
Posts: 4
Credit: 270,159,721
RAC: 0
Message 68781 - Posted: 25 May 2019, 17:40:29 UTC

I've been getting a lot more than normal and I am running GPU tasks on Windows 8.1.
ID: 68781

Ulrich Metzner
Joined: 11 Apr 15
Posts: 58
Credit: 63,291,127
RAC: 0
Message 68782 - Posted: 25 May 2019, 19:53:13 UTC - in response to Message 68780.  

(...)
For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

That's the way to go since nobody intervenes!
Aloha, Uli

ID: 68782

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 555,620,146
RAC: 41,860
Message 68783 - Posted: 26 May 2019, 0:17:14 UTC - in response to Message 68780.  

My validate errors seem to be of one specific work unit type -- the de_modfit_84_xxxxx series (crunching GPU only for MilkyWay). The vast majority of these work units result in validate errors on both of my hosts. The only other consistency I see (I am not a coder) is that both my hosts are Linux (Mint), and when looking at the wingmen who end up validating the WU, they are all Windows. I have seen several cases where another Linux host has failed validation on the same work unit. All the other MilkyWay GPU WUs seem to be doing fine.

Hopefully this helps, or someone can point me towards an individual solution.

For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

Have we found another corner case where the parameter string is too long for the BOINC client? This was the case for BOINC versions earlier than 7.6.31. I had to abort all 4 bundle tasks and only run 6 bundle tasks when I was running BOINC version 7.4.44 or all the 4 bundle tasks would fail. It got to be too much work managing aborting work so I just acquiesced and updated to BOINC 7.8.3.
It was explained to me that the problem had been discovered a long time ago: the parameter string was too long for the older clients. I first posted about this error in "New Linux system trashes all tasks", and the reason why those tasks fail was provided by AlanB (https://milkyway.cs.rpi.edu/milkyway/show_user.php?userid=94054)
in message https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4288&postid=67510
ID: 68783

kksplace
Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68785 - Posted: 26 May 2019, 11:01:07 UTC - in response to Message 68783.  
Last modified: 26 May 2019, 11:14:21 UTC

Have we found another corner case where the parameter string is too long for the BOINC client? This was the case for BOINC versions earlier than 7.6.31. I had to abort all 4 bundle tasks and only run 6 bundle tasks when I was running BOINC version 7.4.44 or all the 4 bundle tasks would fail. It got to be too much work managing aborting work so I just acquiesced and updated to BOINC 7.8.3.


I am using BOINC 7.9.3, which is the 'current' one in the Mint Software Manager repository. Is there another later version to install?
ID: 68785

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68786 - Posted: 26 May 2019, 11:47:56 UTC - in response to Message 68785.  
Last modified: 26 May 2019, 11:50:19 UTC



I am using BOINC 7.9.3, which is the 'current' one in the Mint Software Manager repository. Is there another later version to install?


https://boinc.berkeley.edu/forum_thread.php?id=12973

but watch for problems with AMD driver install if using recent boards
ID: 68786

Ulrich Metzner
Joined: 11 Apr 15
Posts: 58
Credit: 63,291,127
RAC: 0
Message 68788 - Posted: 26 May 2019, 20:39:47 UTC - in response to Message 68782.  

(...)
For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

That's the way to go since nobody intervenes!

...and it works - down to 13 invalids and still going down!
Please do something about it!
Aloha, Uli

ID: 68788

kksplace
Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68790 - Posted: 27 May 2019, 20:37:50 UTC - in response to Message 68786.  

Thank you BeemerBiker for pointing me at the ppa for the BOINC package.

I updated to BOINC 7.14.2 (Linux) and let everything run for about a day now.

Bottom line: not every de_modfit_84_xxxx WU is invalidated, but still a high percentage are. It helped a little, but hasn't 'fixed it'.

Keith Myers, it looks like you are running BOINC 7.15 with no problems. I do not see 7.15 at the ppa. Can you tell me where you got it from?
ID: 68790

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 68791 - Posted: 28 May 2019, 15:49:59 UTC

There's a possibility that the command line is being barely overflowed by the de_modfit_84_xxxx workunits. When we release runs, we estimate the number of characters that the program will use in a typical command, then divide the total number of characters that can go in a command line by that estimate to get the bundle size. This is why bundling 5 workunits invalidated many of them, while nobody had problems with 4 bundled workunits for these runs (until now). We might have reached some strange point in the optimization where the command line is being just barely overflowed for the 84th stripe (which would explain why results are off by only a couple of decimal places).
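As a rough worked example of that sizing rule (all numbers below are made up for illustration, not the project's real limits):

/* Made-up numbers, only to illustrate how a "safe" bundle can just barely overflow. */
#include <stdio.h>

int main(void)
{
    const int budget        = 4096;   /* assumed maximum command-line length */
    const int typicalPerFit = 950;    /* assumed estimate for a typical fit's parameters */
    const int actualPerFit  = 1030;   /* assumed length for a de_modfit_84-style fit */

    int plannedBundle = budget / typicalPerFit;        /* 4 fits look safe */
    int actualLength  = plannedBundle * actualPerFit;  /* what the run really emits */

    printf("planned bundle size: %d fits\n", plannedBundle);
    printf("actual command length: %d of %d -> %s\n",
           actualLength, budget, actualLength > budget ? "overflow" : "ok");
    return 0;
}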

I will be taking these runs down soon (they're fairly optimized by this point), which will solve any problems we are having at the moment. In the future I will bundle fewer workunits together (expect quicker runtimes and a corresponding drop in credits per bundle) and see if that resolves the issue.

My goal is to be as quick and transparent with these issues as possible. Thank you for your help debugging and your continued support.

- Tom
ID: 68791

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68792 - Posted: 28 May 2019, 17:40:11 UTC - in response to Message 68790.  

Thank you BeemerBiker for pointing me at the ppa for the BOINC package.

I updated to BOINC 7.14.2 (Linux) and let everything run for about a day now.

Bottom line: not every de_modfit_84_xxxx WU is invalidated, but still a high percentage are. It helped a little, but hasn't 'fixed it'.

Keith Myers, it looks like you are running BOINC 7.15 with no problems. I do not see 7.15 at the ppa. Can you tell me where you got it from?



I used a binary editor on "boinc.exe", looked for and changed 7.14.2 into 7.15.2. The program ran correctly but it still showed 7.14 so Keith must know someone special.
ID: 68792

©2024 Astroinformatics Group