Large surge of Invalid results and Validate errors on ALL machines

Keith Myers

Joined: 24 Jan 11
Posts: 708
Credit: 543,910,003
RAC: 126,205
Message 68793 - Posted: 28 May 2019, 18:27:07 UTC - in response to Message 68792.  

I compile the current BOINC master from source. The master currently carries the 7.15.0 development version. BOINC does not release odd-numbered versions to the public; the developers are testing all the fixes in 7.15.0 ahead of a feature freeze and the eventual public release as 7.16.0.
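For anyone who wants to try the same thing, the Linux build boils down to roughly the following (these are just the steps and configure options I use; check the BOINC build documentation for your own distro and prerequisites):

    git clone https://github.com/BOINC/boinc.git
    cd boinc
    ./_autosetup
    ./configure --disable-server --enable-client --enable-manager
    make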

I needed the latest master because it contains the fix for the problem I documented where max_concurrent was incompatible with gpu_exclude. I discovered that when I needed to block my Turing cards from GPUGrid.net, since that project doesn't have an app that works with Turing. That fix in turn broke work fetch, so work fetch had to be fixed as well. The latest master also contains the much-needed fix for the long-standing "finish file present too long" error. Everything is working fine now.
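For reference, the gpu_exclude side of that is just a cc_config.xml entry along these lines (the device number here is only a placeholder; use whatever device index and master URL your own client reports):

    <cc_config>
      <options>
        <exclude_gpu>
          <url>https://www.gpugrid.net/</url>
          <type>NVIDIA</type>
          <device_num>0</device_num>
        </exclude_gpu>
      </options>
    </cc_config>

The max_concurrent half of the combination lives in each project's app_config.xml, which is what was tripping over the exclusion before the fix.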

Finally, I also modify the code to spoof more GPUs than are physically present, to get a larger work cache for SETI.
ID: 68793
kksplace

Joined: 12 May 19
Posts: 4
Credit: 5,938,419
RAC: 0
Message 68794 - Posted: 28 May 2019, 21:45:29 UTC - in response to Message 68791.  

I will be taking these runs down soon (they're fairly optimized by this point), which will solve any problems we are having at the moment....My goal is to be as quick and transparent with these issues as possible. Thank you for your help debugging and your continued support.


Thank you for the response! I just want to make sure that I am not somehow part of the problem, and that I continue being useful to the project.

BOINC does not release odd-numbered versions to the public; the developers are testing all the fixes in 7.15.0 ahead of a feature freeze and the eventual public release as 7.16.0.


That makes sense. Thank you, Keith, for testing it out -- looking forward to the 7.16 release.
ID: 68794
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,079,614
RAC: 24,023
Message 68795 - Posted: 29 May 2019, 1:55:06 UTC - in response to Message 68791.  
Last modified: 29 May 2019, 2:05:26 UTC

There's a possibility that the command line is being barely overflowed by the de_modfit_84_xxxx workunits. When we release runs we estimate the number of characters that the program will use in a typical command, then divide the total number of characters that can go in a command line by that estimate. This is why bundling 5 workunits invalidated many of them, but nobody had problems with 4 bundled workunits for these runs (until now). We might have reached some strange point in the optimization where the command line is being just barely overflowed for the 84th stripe (which is why results are off by only a couple of decimal places).

I will be taking these runs down soon (they're fairly optimized by this point), which will solve any problems we are having at the moment. In the future I will bundle fewer workunits together (expect quicker runtimes and a corresponding drop in credits per bundle) and see if that resolves the issue.

My goal is to be as quick and transparent with these issues as possible. Thank you for your help debugging and your continued support.

- Tom
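
(If I've read that right, the bundle size is basically an estimate-and-divide calculation; the numbers below are invented purely to make that step concrete.)

    # Invented numbers, only to illustrate the estimate-and-divide step described above
    max_cmdline_chars = 1023   # hypothetical limit on the command line
    chars_per_subtask = 230    # hypothetical estimate for one parameter set
    bundle_size = max_cmdline_chars // chars_per_subtask
    print(bundle_size)         # 4 -- so bundling 5 would risk overflowing the line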


Tom,

I'm confused... I thought the command-line parameters were in sets (of 26 in this case) per task (first 26 for sub-task "0", next 26 for sub-task "1", and so on). If so, a command-line issue doesn't seem to explain why it isn't always the last sub-task that has the result mismatches.
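
To be explicit about the model in my head (this is just an illustration of the layout I'm assuming, not the actual separation code):

    # How I imagine a bundle4 command line is consumed: four consecutive sets of 26 parameters
    params = list(range(104))          # stand-in for the 4 x 26 numbers on the command line
    per_subtask = 26
    subtasks = [params[i*per_subtask:(i+1)*per_subtask] for i in range(4)]
    print([len(s) for s in subtasks])  # [26, 26, 26, 26]
    # If the command line were truncated, only the final slice (sub-task 3) should come up short.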

As an example, consider this work unit (1762004863) which I found in my "Validation Inconclusive" group. I noted the values that were subject to significant variation in the table below:

Task name de_modfit_84_bundle4_4s_south4s_0_1556550902_9326578
==============================================================

Results for almost every field were in agreement to almost every digit, except the third item in the <stream_only_likelihood> section and (as a result) the <search_likelihood> value.

Third items in stream_only_likelihood for workunit 1762004863

 Task #                   sub-task 0           sub-task 1           sub-task 2           sub-task 3
228606523 (linux_nv) -227.210679809332987 -226.058958276256107 -226.637965999408493 -225.448975231683107
228734898 (win_ati)  -227.031087670723000 -225.861406271732960 -226.378959825809260 -224.653666491325540
228769453 (win_ati)  -227.031087670723000 -225.861406271732960 -226.378959825809260 -225.219521513209800
228880978 (mac_cpu)  -227.366355524707785 -226.327009414887982 -226.773215523370538 -225.661657335596914
229879322 (win_nv)   -227.031087670723000 -225.861406271732960 -226.378959825809260 -224.906795326118360
230050421 (win_cpu)  -227.210679809332990 -226.149044739582390 -226.378959825809260 -225.331140445405400

Search likelihoods
 Task #                   sub-task 0           sub-task 1           sub-task 2           sub-task 3
228606523 (linux_nv)   -2.701811909377543   -2.697248226571820   -2.698127277409402   -2.699543526455799
228734898 (win_ati)    -2.700467703517479   -2.696294728154231   -2.696355810233853   -2.693811168003535
228769453 (win_ati)    -2.700467703517479   -2.696294728154231   -2.696355810233853   -2.698124073672830
228880978 (mac_cpu)    -2.702729910785179   -2.699130448070281   -2.699309267449726   -2.700545829750069
229879322 (win_nv)     -2.700467703517479   -2.696294728154231   -2.696355810233853   -2.695800028583053
230050421 (win_cpu)    -2.701811909377544   -2.697998681670587   -2.696355810233853   -2.698904228937406

All of the above were on client 7.14.2 except the Windows CPU one (7.12.1). Both Windows 7 and Windows 10 were in evidence.

Note how the Linux NVIDIA one (mine!) doesn't agree with ANY of the others except the Windows CPU one, where it agrees on sub-task 0 only.
The Mac CPU job doesn't agree with any of the others at all.
The three Windows GPU jobs agree on all but sub-task 3 (on which nobody agrees!).
The Windows CPU job agrees with mine on sub-task 0 and with the Windows GPU jobs on sub-task 2; no agreement with anyone on sub-task 1 or 3.
There's another Windows CPU job out there, but IT isn't likely to resolve anything unless it agrees wholeheartedly with the other Windows CPU job...

As I said, I'd've thought that if this was a minor command-line issue the errors would only manifest on the last sub-task, but maybe it doesn't use the parameters the way I think it does, so I'm willing to be put right about that!

I have several tasks from the offending group on my machine at the moment, and their command line parameter lists have between 880 and 920 characters (so shouldn't cause any problems, I'd've thought) - I'll keep an eye on these when they run and see how they do...
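
(In case anyone else wants to check their own queued tasks, something along the lines of the sketch below will pull the command-line lengths out of client_state.xml. It's only a rough illustration: the path is the Debian/Ubuntu default, it will also pick up other projects' command lines, and the exact file layout may differ on your machine.)

    # Rough sketch: print the length of each <command_line> found in client_state.xml
    import re

    state = '/var/lib/boinc-client/client_state.xml'   # default path on Debian/Ubuntu; adjust as needed
    text = open(state).read()
    for cmd in re.findall(r'<command_line>(.*?)</command_line>', text, re.S):
        cmd = ' '.join(cmd.split())                    # collapse any line wrapping
        if cmd:
            print(len(cmd), cmd[:60])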

If it isn't a command-line issue causing the problems, it would be a shame to shorten the parameter lists and hence increase the number of workunits - it rather defeats the original purpose of batching the work, after all :-). And I do wonder if these errors have only started to show up when the batch in question is getting close to a finish, in which case perhaps it's just a part of getting near the boundaries of what's computable (the butterfly effect?)

Hoping the above helps in some way, and thanking you for your efforts - Al.

[Edited for typos]
ID: 68795
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 68797 - Posted: 29 May 2019, 15:37:41 UTC - in response to Message 68795.  
Last modified: 29 May 2019, 15:38:04 UTC



Tom,

I'm confused... I thought the command-line parameters were in sets (of 26 in this case) per task (first 26 for sub-task "0", next 26 for sub-task "1", and so on). If so, a command-line issue doesn't seem to explain why it isn't always the last sub-task that has the result mismatches.

[...]

As I said, I'd've thought that if this was a minor command-line issue the errors would only manifest on the last sub-task, but maybe it doesn't use the parameters the way I think it does, so I'm willing to be put right about that!

I have several tasks from the offending group on my machine at the moment, and their command line parameter lists have between 880 and 920 characters (so shouldn't cause any problems, I'd've thought) - I'll keep an eye on these when they run and see how they do...

If it isn't a command-line issue causing the problems, it would be a shame to shorten the parameter lists and hence increase the number of workunits - it rather defeats the original purpose of batching the work, after all :-). And I do wonder if these errors have only started to show up when the batch in question is getting close to a finish, in which case perhaps it's just a part of getting near the boundaries of what's computable (the butterfly effect?)

Hoping the above helps in some way, and thanking you for your efforts - Al.

[Edited for typos]


Al,

Thank you for your reply, it helps a lot. From what others had been saying, it sounded like just one subtask (de_modfit_84_xxx) was being consistently invalidated, so I thought maybe there was a problem with that specific task. From your data it looks like there are mismatches across all workunits and machines.

I think you might be correct in your analysis that we may just be reaching the optimization point. We know that the likelihood surface is rather volatile and sensitive to small changes when close to the optimum. Perhaps this is a numerical precision issue, which I thought had been resolved before I joined the project, but maybe not. I'll look into this some more.

- Tom
ID: 68797
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,079,614
RAC: 24,023
Message 68800 - Posted: 30 May 2019, 3:00:16 UTC - in response to Message 68797.  
Last modified: 30 May 2019, 3:32:07 UTC

Al,

Thank you for your reply, it helps a lot. From what others had been saying, it sounded like just one subtask (de_modfit_84_xxx) was being consistently invalidated, so I thought maybe there was a problem with that specific task. From your data it looks like there are mismatches across all workunits and machines.

I think you might be correct in your analysis that we may just be reaching the optimization point. We know that the likelihood surface is rather volatile and sensitive to small changes when close to the optimum. Perhaps this is a numerical precision issue, which I thought had been resolved before I joined the project, but maybe not. I'll look into this some more.

- Tom


Tom,

Firstly, to eliminate any possible confusion: I've only seen errors for de_modfit_84 tasks, but the mismatches don't seem to have a pattern (e.g. the same GPU type on the same O/S might produce discrepancies, though usually pairs of "identical" Windows jobs seem to validate if anything does...).

I think I can safely eliminate command-line length now, as I've been watching all the tasks I've processed over the last 24 hours and I've got validated de_modfit_81/2/3 tasks with more characters (over 890) in their parameter lists than some of the de_modfit_84 ones that have been failing validation (as few as 880 characters).

Also, I'd only been looking at the parts of the invalid tasks that showed big differences (always that third stream-only likelihood and the search likelihood) and hadn't been paying much attention to the actual integrals. When I paid more attention to the stream integral values, I noticed that every sub-task I looked at that showed this problem had a third stream integral of zero (or very close to zero); if that is a characteristic of de_modfit_84 reaching the optimization point, perhaps there are precision problems down in the small numbers.
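
(To show why a near-zero integral worries me, here's a toy illustration -- nothing to do with the actual MilkyWay code -- of how a tiny absolute wobble between machines blows up once you take the log of a value that close to zero:)

    # Toy example: the same absolute difference matters far more near zero
    import math

    healthy = 1.0e-9    # a comfortably non-zero integral (made-up value)
    nearzero = 1.0e-17  # an integral that has collapsed towards zero (made-up value)
    eps = 1.0e-18       # the kind of wobble different hardware/summation order might give

    for value in (healthy, nearzero):
        print(value, abs(math.log(value + eps) - math.log(value)))
    # The log of the healthy value moves by ~1e-9; the log of the near-zero one
    # moves by ~0.095, far beyond a few-decimal-places validation tolerance.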

None of the non-84 tasks seem to have such small integral values or such large stream-only likelihood values, and none of those seem to be failing to validate.

I seem to recall there were a lot of validation errors around the time of the server upgrade too, and I've just found a few results I scraped from that time. Where there was a discrepancy then, it was typically in the fourth integral and likelihood, not the third. I've just been looking at one where four out of five sub-tasks matched nicely but the last one had a fourth stream integral of zero; my result showed a -227.xxxx value whilst another reported Not-A-Number... Another work unit from back then had all five sub-tasks showing that fourth integral as zero, and all the likelihoods were around -227.xxx (with two Win ATI tasks validating, and a Win NVIDIA task and my Linux NVIDIA task going invalid).

So it has happened before. Unfortunately, I can't tell what the task names were, as they don't seem to get written to the log, though I do have the task numbers if they're of any use.

I'm not sure how you might be able to resolve this (and I have to confess I've only ever looked at the source code to try to work out why it was having parameter issues [that problem with old clients]). That said, if there's anything I can do to help (even if it's just looking at results like I have been doing!) let me know; what's more, I'm sure there are others here who will be equally willing to pitch in (and some of them process far more work units a day than I do!)

Good luck - Al.

[Edited to fix the typos I've spotted :-) and to include (limited) information about earlier failures]
ID: 68800