Large surge of Invalid results and Validate errors on ALL machines

Vortac
Joined: 22 Apr 09
Posts: 95
Credit: 4,808,181,963
RAC: 0
Message 68754 - Posted: 19 May 2019, 7:27:22 UTC
Last modified: 19 May 2019, 7:28:06 UTC

I've been checking the top hosts and noticed thousands of Invalid results on ALL machines, including mine. They started to appear in large numbers four days ago and have only been increasing since then. Since it's impossible that all machines went bad at the same time, I suspect it's a validation problem of some sort that needs an urgent fix - we are wasting a lot of computation power right now.

PoppaGeek
Joined: 23 Feb 09
Posts: 4
Credit: 270,159,721
RAC: 0
Message 68755 - Posted: 19 May 2019, 10:00:31 UTC - in response to Message 68754.  

Same here. I thought something had gone wrong on mine.

JHMarshall
Joined: 24 Jul 12
Posts: 40
Credit: 7,123,301,054
RAC: 0
Message 68757 - Posted: 19 May 2019, 19:50:46 UTC

Ditto!! Validate errors on both Nvidia and AMD (ATI) tasks.

Ulrich Metzner
Joined: 11 Apr 15
Posts: 58
Credit: 63,291,127
RAC: 0
Message 68767 - Posted: 21 May 2019, 13:07:49 UTC


CoalZombik
Joined: 17 Apr 19
Posts: 1
Credit: 8,487,423
RAC: 0
Message 68768 - Posted: 21 May 2019, 14:41:42 UTC

I have the same problem - some of my Separation results are invalid.
I tested the stability of my Intel CPU and Nvidia GPU, but found no errors during the test.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68769 - Posted: 21 May 2019, 17:14:18 UTC
Last modified: 21 May 2019, 17:32:58 UTC

I have noticed for some time that the number of invalids fluctuates. I had thought it was the driver I was using.
I tried to find where the "results" are calculated so I could do my own validation, using the sources that I found here.
Could one of the developers comment on my last item at the bottom?

My Analysis---

The program "separation_main" calls a worker that iterates through 4 "work units".
---right there is an indication that there should be 4 results, not just a single item to validate.

the worker calls "evaluate" before cleaning up and exiting.

evaluate, toward its end, calculates a likelihood and then does a "printSeparationResults"
before cleaning up and exiting.
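
For illustration only, here is a compilable toy version of that flow. This is NOT the actual MilkyWay source; the real functions take many more arguments, and the signatures below are invented just to show why one task produces 4 result blocks.

#include <stdio.h>

#define N_BUNDLED 4   /* the bundled "work units" observed per task */

/* Stand-ins for the real evaluate()/printSeparationResults(); the actual
   functions do the integral and likelihood work and take far more arguments. */
static double evaluate(int wu_index)
{
    return -3.0 - 0.1 * wu_index;   /* placeholder likelihood */
}

static void printSeparationResults(int wu_index, double likelihood)
{
    printf("<search_likelihood%d> %.15f </search_likelihood%d>\n",
           wu_index, likelihood, wu_index);
}

int main(void)   /* plays the role of separation_main's worker loop */
{
    for (int i = 0; i < N_BUNDLED; ++i)
        printSeparationResults(i, evaluate(i));   /* one result block per bundle */
    return 0;
}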

Out of curiosity I looked at my wingmen's results (the work unit).
There were 2 invalid tasks (one was mine, the 3rd one) and 2 valid tasks for the work unit.
This will be gone from the database eventually, but the workunit is here.

I was surprised to see that the output of "printSeparationResults" for all 4 systems either differed only after the 12th decimal place in every result or was identical in every digit.


task 224802410, nvidia 1080TI VALID ===================

Running likelihood with 31815 stars
Likelihood time = 0.856660 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911870 43.219889454989044 -0.000000000000001 2.115804393012327 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 -1.#IND00000000000 -109.139579663895260 </stream_only_likelihood3>


task 224932122 ATI RX560 INVALID ====================

Running likelihood with 31815 stars
Likelihood time = 1.268224 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911757 43.219889454989008 -0.000000000000001 2.115804393012326 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 nan -109.139579663895262 </stream_only_likelihood3>


task 224990755 ATI S9000 INVALID======================

Running likelihood with 31815 stars
Likelihood time = 1.024323 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911870 43.219889454989044 -0.000000000000001 2.115804393012327 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 -1.#IND00000000000 -109.139579663895260 </stream_only_likelihood3>


task 225048095 ATI RX470 VALID=======================

Running likelihood with 31815 stars
Likelihood time = 0.988171 s
Non-finite result: setting likelihood to -999
<background_integral3> 0.000008466303075 </background_integral3>
<stream_integral3> 136.269058837911870 43.219889454989044 -0.000000000000001 2.115804393012327 </stream_integral3>
<background_likelihood3> -3.680025259872701 </background_likelihood3>
<stream_only_likelihood3> -3.541128263071147 -3.118475664249611 -1.#IND00000000000 -109.139579663895260 </stream_only_likelihood3>


=================IN CONCLUSION FOR WHAT IT IS WORTH===========
Results 3 & 4 above are identical exactly to all decimal digits but only the last one is valid
Results 1 & 2 differ at only the 12 or 13th decimal digit but only the first one is valid.

Since there seem to be 4 "work units" in each "work unit" maybe there is additional testing at the server end when the result arrives
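
For comparison, here is a generic sketch of how a validator could compare two result values with a relative tolerance. This is NOT MilkyWay's actual validator code, and the 1e-10 tolerance is only an assumed example: a mismatch in the 12th-13th decimal place would pass, while a NaN on only one side would not.

#include <math.h>
#include <stdio.h>

/* Generic relative-tolerance comparison -- NOT the MilkyWay validator.
   The 1e-10 tolerance is an assumed example value. */
static int values_match(double a, double b, double rel_tol)
{
    if (isnan(a) || isnan(b))           /* a NaN on one side never matches */
        return isnan(a) && isnan(b);
    double scale = fmax(fabs(a), fabs(b));
    return fabs(a - b) <= rel_tol * fmax(scale, 1.0);
}

int main(void)
{
    /* differs only around the 13th decimal place: passes */
    printf("%d\n", values_match(-109.139579663895260, -109.139579663895262, 1e-10));
    /* NaN on one side only (the "1.#IND" / "nan" vs finite case): fails */
    printf("%d\n", values_match(NAN, -223.633156025982856, 1e-10));
    return 0;
}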


====================SOME FOOD FOR THOUGHT===============

There is a "WTF" moment in the function "prob_ok" in the file "separation_utils":
/* FIXME: WTF? */
/* FIXME: lack of else leads to possibility of returned garbage */

According to GitHub, this file has not been changed in 7 years.
...one can make the following conclusions...
(1) Not looked at since the comment was made 7 years ago.
(2) Looked at and analyzed, but it made no difference in the outcome and wasn't worth the trouble to change the comment.
(3) Looked at, but nobody could figure out WTF was going on, so it was left for another grad student to fix.
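
To illustrate what the second FIXME is warning about (this is NOT the actual prob_ok code, just the same shape of bug): a function whose assignments all sit inside if branches can fall through and return an uninitialized value.

#include <stdio.h>

/* Same *shape* of bug as the FIXME describes -- not the actual prob_ok code. */
static double pick_value(double p)
{
    double result;              /* never given a default value */
    if (p < 0.3)
        result = 1.0;
    else if (p < 0.6)
        result = 2.0;
    /* missing final else: for p >= 0.6 the function returns an
       uninitialized (garbage) value -- undefined behavior in C */
    return result;
}

int main(void)
{
    printf("%f\n", pick_value(0.9));   /* may print anything */
    return 0;
}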

alanb1951
Joined: 16 Mar 10
Posts: 210
Credit: 105,885,060
RAC: 26,010
Message 68770 - Posted: 22 May 2019, 2:35:14 UTC - in response to Message 68769.  

=================IN CONCLUSION FOR WHAT IT IS WORTH===========
Results 3 & 4 above are identical exactly to all decimal digits but only the last one is valid
Results 1 & 2 differ at only the 12 or 13th decimal digit but only the first one is valid.

Since there seem to be 4 "work units" in each "work unit" maybe there is additional testing at the server end when the result arrives



Not quite, I'm afraid... If you look a little earlier in the two invalid results, you'll discover significant discrepancies in the third stream_only_likelihood values as indicated below (I've only cited one of the validated results...)

task 224802410, nvidia 1080TI VALID ===================

<stream_only_likelihood> -4.108396938608555 -3.238645743416023 -224.702400078412690 -59.273665947044705 </stream_only_likelihood>
<search_likelihood> -2.699214438485444 </search_likelihood>
...
<stream_only_likelihood1> -3.569118173756952 -3.226581525684845 -1.#IND00000000000 -88.910516593562207 </stream_only_likelihood1>
<search_likelihood1> -999.000000000000000 </search_likelihood1>

task 224932122 ATI RX560 INVALID ====================

<stream_only_likelihood1> -3.569118173756952 -3.226581525684845 -223.633156025982856 -88.910516593562207 </stream_only_likelihood1>
<search_likelihood1> -2.696904747021611 </search_likelihood1>

task 224990755 ATI S9000 INVALID======================

<stream_only_likelihood> -4.108396938608555 -3.238645743416023 -224.852854109725940 -59.273665947044705 </stream_only_likelihood>
<search_likelihood> -2.700466143836502 </search_likelihood>

That is consistent with what I've been finding in my recent cluster of invalid results -- it always seems to be a significant difference in that third stream_only_likelihood value; in some cases (as in one of the above), one result has been recognized as non-finite and another hasn't, whilst in others that value is typically up around the -227 to -230 level and differs significantly.

I suspect we have data and parameters that are producing results on the edge of what can be calculated; some of them diverge and not all GPUs diverge at quite the same rate, possibly because of different chunk sizes in use on cards with different amounts of available memory(?) If that is the case, I have no idea whether anything can be done about it :-(
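
As a toy demonstration of that chunk-size idea (this is not the separation integration code, just an illustration): summing the same numbers in different chunk sizes, as different GPUs might depending on available memory, typically gives totals that disagree in the last few digits.

#include <stdio.h>

/* Toy model of "different chunk sizes on different GPUs" -- not the
   actual integration code. */
static double chunked_sum(const double *x, int n, int chunk)
{
    double total = 0.0;
    for (int start = 0; start < n; start += chunk) {
        int end = (start + chunk < n) ? start + chunk : n;
        double partial = 0.0;            /* per-chunk partial sum */
        for (int i = start; i < end; ++i)
            partial += x[i];
        total += partial;                /* combine the partials */
    }
    return total;
}

int main(void)
{
    enum { N = 100000 };
    static double x[N];
    for (int i = 0; i < N; ++i)
        x[i] = 1.0 / (1.0 + i * 1e-3);   /* values spanning a few magnitudes */

    /* the two totals typically agree only to ~12-15 significant digits */
    printf("chunk   64: %.17g\n", chunked_sum(x, N, 64));
    printf("chunk 4096: %.17g\n", chunked_sum(x, N, 4096));
    return 0;
}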

Cheers - Al.

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68772 - Posted: 22 May 2019, 3:24:36 UTC - in response to Message 68770.  
Last modified: 22 May 2019, 4:12:11 UTC


Not quite, I'm afraid... If you look a little earlier in the two invalid results, you'll discover significant discrepancies in the third stream_only_likelihood values as indicated below (I've only cited one of the validated results...)


Thanks, I missed that; I did not purposely omit it. However, the bottom two are identical: my S9000 (invalid) and the valid RX470.

Your comment about the chunk size and different convergences is correct, especially since the algorithm uses random numbers as part of its attempts to predict a likelihood.

JohnMD
Joined: 11 Jul 08
Posts: 13
Credit: 10,015,444
RAC: 0
Message 68778 - Posted: 22 May 2019, 20:24:12 UTC

I have only CPU tasks - about 15% failure.
One common factor is that ALL my affected WUs give 244 credits.
In addition, many of the WUs have other error tasks - mostly other GPUs but also CPUs.
This leads me to the conclusion that the application is not consistent for the larger WUs across different environments, where differing precision in intermediate values could be the culprit.
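
As a toy illustration of that precision point (not the separation application itself): accumulating the same terms with a single-precision intermediate versus a double-precision intermediate gives results that disagree after only a handful of digits.

#include <stdio.h>

/* Toy illustration of differing intermediate precision -- not the
   separation application itself. */
int main(void)
{
    float  sum_f = 0.0f;   /* single-precision intermediate accumulator */
    double sum_d = 0.0;    /* double-precision intermediate accumulator */

    for (int i = 1; i <= 1000000; ++i) {
        double term = 1.0 / (double)i;
        sum_f += (float)term;
        sum_d += term;
    }

    /* the two "environments" disagree well before the 15th digit */
    printf("float  accumulator: %.15f\n", (double)sum_f);
    printf("double accumulator: %.15f\n", sum_d);
    return 0;
}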

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 68779 - Posted: 24 May 2019, 16:47:35 UTC

Hey all,

I don't have any answers for you at the moment as to what the problem with the separation application is. The current separation runs have been up for weeks, so it's a bit confusing that they're only now throwing invalid results. I haven't changed anything in the source code or the runs since they were put up 24 days ago.

Please know that I'm looking into it and hopefully will have a solution or explanation of the problem soon. All of your work is extremely appreciated and your assistance debugging helps a lot.

Best,
Tom

kksplace
Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68780 - Posted: 25 May 2019, 11:03:39 UTC

My validate errors seem to come from one specific work unit type -- the de_modfit_84_xxxxx series (I crunch GPU only for MilkyWay). The vast majority of these work units result in validate errors on both of my hosts. The only other consistency I see (I am not a coder) is that both my hosts run Linux (Mint), and the wingmen who end up validating the WU are all Windows. I have seen several cases where another Linux host failed validation on the same work unit. All the other MilkyWay GPU WUs seem to be doing fine.

Hopefully this helps, or someone can point me towards an individual solution.

For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

PoppaGeek
Joined: 23 Feb 09
Posts: 4
Credit: 270,159,721
RAC: 0
Message 68781 - Posted: 25 May 2019, 17:40:29 UTC

I've been getting a lot more than normal and I am running GPU tasks on Windows 8.1.

Ulrich Metzner
Joined: 11 Apr 15
Posts: 58
Credit: 63,291,127
RAC: 0
Message 68782 - Posted: 25 May 2019, 19:53:13 UTC - in response to Message 68780.  

(...)
For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

That's the way to go since nobody intervenes!
Aloha, Uli


Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 542,964,539
RAC: 150,539
Message 68783 - Posted: 26 May 2019, 0:17:14 UTC - in response to Message 68780.  

My validate errors seem to come from one specific work unit type -- the de_modfit_84_xxxxx series (I crunch GPU only for MilkyWay). The vast majority of these work units result in validate errors on both of my hosts. The only other consistency I see (I am not a coder) is that both my hosts run Linux (Mint), and the wingmen who end up validating the WU are all Windows. I have seen several cases where another Linux host failed validation on the same work unit. All the other MilkyWay GPU WUs seem to be doing fine.

Hopefully this helps, or someone can point me towards an individual solution.

For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

Have we found another corner case where the parameter string is too long for the BOINC client? This was the case for BOINC versions earlier than 7.6.31. I had to abort all 4 bundle tasks and only run 6 bundle tasks when I was running BOINC version 7.4.44 or all the 4 bundle tasks would fail. It got to be too much work managing aborting work so I just acquiesced and updated to BOINC 7.8.3.
It was explained to me that the problem had been discovered a long time ago: the parameter string was too long for the older clients. I first posted about this error in "New Linux system trashes all tasks", and the reason the tasks fail was provided by AlanB (https://milkyway.cs.rpi.edu/milkyway/show_user.php?userid=94054) in message https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4288&postid=67510.

kksplace
Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68785 - Posted: 26 May 2019, 11:01:07 UTC - in response to Message 68783.  
Last modified: 26 May 2019, 11:14:21 UTC

Have we found another corner case where the parameter string is too long for the BOINC client? This was the case for BOINC versions earlier than 7.6.31. I had to abort all 4 bundle tasks and only run 6 bundle tasks when I was running BOINC version 7.4.44 or all the 4 bundle tasks would fail. It got to be too much work managing aborting work so I just acquiesced and updated to BOINC 7.8.3.


I am using BOINC 7.9.3, which is the 'current' one in the Mint Software Manager repository. Is there another later version to install?

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68786 - Posted: 26 May 2019, 11:47:56 UTC - in response to Message 68785.  
Last modified: 26 May 2019, 11:50:19 UTC



I am using BOINC 7.9.3, which is the 'current' one in the Mint Software Manager repository. Is there another later version to install?


https://boinc.berkeley.edu/forum_thread.php?id=12973

but watch for problems with the AMD driver install if you are using recent boards.

Ulrich Metzner
Joined: 11 Apr 15
Posts: 58
Credit: 63,291,127
RAC: 0
Message 68788 - Posted: 26 May 2019, 20:39:47 UTC - in response to Message 68782.  

(...)
For the moment, I am attempting to abort the xxx_84_xxx WUs when I see them; lots of computing time not useful otherwise.

That's the way to go since nobody intervenes!

...and it works, down to 13 invalids and still falling!
Please do something about it!
Aloha, Uli


kksplace
Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68790 - Posted: 27 May 2019, 20:37:50 UTC - in response to Message 68786.  

Thank you BeemerBiker for pointing me at the ppa for the BOINC package.

I updated to BOINC 7.14.2 (Linux) and let everything run for about a day now.

Bottom line: not every de_modfit_84_xxxx WU is invalidated, but still a high percentage are. It helped a little, but hasn't 'fixed it'.

Keith Myers, it looks like you are running BOINC 7.15 with no problems. I do not see 7.15 in the PPA. Can you tell me where you got it?

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 68791 - Posted: 28 May 2019, 15:49:59 UTC

There's a possibility that the command line is being barely overflowed by the de_modfit_84_xxxx workunits. When we release runs we estimate the number of characters that the program will use in a typical command, then divide the total number of characters that can go in a command line by that estimate. This is why bundling 5 workunits invalidated many workunits, while nobody had problems with 4 bundled workunits for these runs (until now). We might have reached some strange point in the optimization where the command line is being just barely overflowed for the 84th stripe (which would explain why results are off by only a couple of decimal places).
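
As a worked sketch of that estimate with made-up numbers (the real character budget and per-workunit parameter length are not stated here): divide the command-line budget by the estimated characters per bundled workunit, and note how a slightly longer parameter string pushes the same bundle size just past the limit.

#include <stdio.h>

/* Worked sketch of the bundling estimate described above.  The numbers
   (4096-character budget, 900/1100 characters per bundled workunit) are
   made up for illustration; the project's real figures are not given here. */
int main(void)
{
    const int cmdline_budget   = 4096;  /* assumed command-line character limit */
    const int est_chars_per_wu = 900;   /* assumed typical per-workunit parameters */

    int bundle_size = cmdline_budget / est_chars_per_wu;   /* 4096 / 900 = 4 */
    printf("estimated safe bundle size: %d\n", bundle_size);

    /* if the optimizer drifts toward longer parameter strings, the same
       bundle size can just barely overflow the budget */
    const int actual_chars_per_wu = 1100;
    int actual_length = bundle_size * actual_chars_per_wu;  /* 4 * 1100 = 4400 */
    printf("actual command line: %d chars (budget %d) -> %s\n",
           actual_length, cmdline_budget,
           actual_length > cmdline_budget ? "overflow" : "fits");
    return 0;
}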

I will be taking these runs down soon (they're fairly optimized by this point), which will solve any problems we are having at the moment. In the future I will bundle fewer workunits together (expect quicker runtimes and a corresponding drop in credits per bundle) and see if that resolves the issue.

My goal is to be as quick and transparent with these issues as possible. Thank you for your help debugging and your continued support.

- Tom

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 68792 - Posted: 28 May 2019, 17:40:11 UTC - in response to Message 68790.  

Thank you BeemerBiker for pointing me at the ppa for the BOINC package.

I updated to BOINC 7.14.2 (Linux) and let everything run for about a day now.

Bottom line: not every de_modfit_84_xxxx WU is invalidated, but still a high percentage are. It helped a little, but hasn't 'fixed it'.

Keith Myers, it looks like you are running BOINC 7.15 with no problems. I do not see 7.15 in the PPA. Can you tell me where you got it?



I used a binary editor on "boinc.exe" and changed 7.14.2 to 7.15.2. The program ran correctly but it still showed 7.14, so Keith must know someone special.