Validation inconclusive

Author	Message
GolfSierra Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0
Message 73043 - Posted: 20 Apr 2022, 14:54:11 UTC - in response to Message 73041. Last modified: 20 Apr 2022, 14:55:37 UTC

"We cannot send out 2 or 3 WUs immediately, because (i) not every WU gets checked against wingmen,"

Are you serious? How would you ever be able to verify whether a single returned result is OK or wrong? To prove it, you need at least 3 returns for each WU. There is a very high chance that two identical results are OK while a third one that differs is wrong. A single result can be wrong or OK; you'll never know.

"(ii) The process that oversees this is semi-random and based on the user's statistics (how often they return invalid WUs), so it cannot be determined ahead of time,"

This makes absolutely no sense. I'm talking about WU packages here (2 or more copies of the same WU sent out in parallel). This has nothing to do with a client's or user's statistics of invalid or timed-out results. I don't mean that the same user should get the same WU two or three times; I mean that a WU should be sent out in multiple copies to different users.

"and (iii) There is no guarantee that all of those WUs would get returned in a timely manner to be checked against each other, and we would often have to send out more WUs in addition to those initial 2 or 3 anyways."

Right, for all results that don't make it in time this is true and standard, but there is a difference in the setup for MW NBody and MW Separation WUs. Packages containing just one WU occur only for NBody, which to me indicates a misconfiguration. MW Separation tasks are sent out with two copies, 2 days apart. Why isn't this possible for NBody WUs?

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0
Message 73044 - Posted: 20 Apr 2022, 15:35:42 UTC - in response to Message 73043. Last modified: 20 Apr 2022, 15:38:36 UTC

"'We cannot send out 2 or 3 WUs immediately, because (i) not every WU gets checked against wingmen,' Are you serious? How would you ever be able to verify whether a single returned result is OK or wrong? To prove it, you need at least 3 returns for each WU. There is a very high chance that two identical results are OK while a third one that differs is wrong. A single result can be wrong or OK; you'll never know."

"'(ii) The process that oversees this is semi-random and based on the user's statistics (how often they return invalid WUs), so it cannot be determined ahead of time,' This makes absolutely no sense. I'm talking about WU packages here (2 or more copies of the same WU sent out in parallel). This has nothing to do with a client's or user's statistics of invalid or timed-out results. I don't mean that the same user should get the same WU two or three times; I mean that a WU should be sent out in multiple copies to different users."

"'and (iii) There is no guarantee that all of those WUs would get returned in a timely manner to be checked against each other, and we would often have to send out more WUs in addition to those initial 2 or 3 anyways.' Right, for all results that don't make it in time this is true and standard, but there is a difference in the setup for MW NBody and MW Separation WUs. Packages containing just one WU occur only for NBody, which to me indicates a misconfiguration. MW Separation tasks are sent out with two copies, 2 days apart. Why isn't this possible for NBody WUs?"

This is all STANDARD procedure, as far as I can tell at the moment, for ALL projects. This "way of doing things" has also been discussed several times; see the other interesting threads and posts.

AND it makes sense. Or are you suggesting that this project has been running "wrong" for years? I'm sure our experts can explain to you in detail how things work. Have a nice day ...

Kiska Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0
Message 73045 - Posted: 20 Apr 2022, 16:25:39 UTC - in response to Message 73043.

"'We cannot send out 2 or 3 WUs immediately, because (i) not every WU gets checked against wingmen,' Are you serious? How would you ever be able to verify whether a single returned result is OK or wrong? To prove it, you need at least 3 returns for each WU. There is a very high chance that two identical results are OK while a third one that differs is wrong. A single result can be wrong or OK; you'll never know."

"'(ii) The process that oversees this is semi-random and based on the user's statistics (how often they return invalid WUs), so it cannot be determined ahead of time,' This makes absolutely no sense. I'm talking about WU packages here (2 or more copies of the same WU sent out in parallel). This has nothing to do with a client's or user's statistics of invalid or timed-out results. I don't mean that the same user should get the same WU two or three times; I mean that a WU should be sent out in multiple copies to different users."

"'and (iii) There is no guarantee that all of those WUs would get returned in a timely manner to be checked against each other, and we would often have to send out more WUs in addition to those initial 2 or 3 anyways.' Right, for all results that don't make it in time this is true and standard, but there is a difference in the setup for MW NBody and MW Separation WUs. Packages containing just one WU occur only for NBody, which to me indicates a misconfiguration. MW Separation tasks are sent out with two copies, 2 days apart. Why isn't this possible for NBody WUs?"

Since it has been said multiple times already, I'll point you towards the BOINC server documentation:
https://boinc.berkeley.edu/trac/wiki/AdaptiveReplication
This project employs this replication strategy.

AndreyOR Joined: 13 Oct 21 Posts: 44 Credit: 230,443,468 RAC: 21,819
Message 73046 - Posted: 20 Apr 2022, 19:10:51 UTC - in response to Message 73045.

I read the Adaptive Replication server documentation that Kiska linked above. I may be missing something, but it seems like Adaptive Replication is on for Separation but off for N-Body. Almost all of my Separation (GPU) tasks get validated without a second task, but N-Body ones need a second one. According to the docs, if a host (a PC) returns at least 10 consecutive valid results, the server trusts that host with 90+% probability, so most of my N-Body tasks shouldn't need checking, as I have almost 1000 consecutive valids:
https://milkyway.cs.rpi.edu/milkyway/host_app_versions.php?hostid=925487
Isn't that right?
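To put numbers on the "90+%" figure mentioned above, here is a rough sketch of the arithmetic. It assumes, and this is only an assumption not quoted anywhere in this thread, that a host with a streak of N consecutive valid results is trusted with probability 1 - 1/N once N reaches 10. The function name and the `min_streak` parameter are made up for illustration and are not BOINC identifiers.

```python
def trust_probability(consecutive_valid: int, min_streak: int = 10) -> float:
    """Hypothetical chance that a result is accepted without a wingman.

    Assumption for illustration only: no trust below `min_streak`
    consecutive valid results, and 1 - 1/N at or above it.
    """
    if consecutive_valid < min_streak:
        return 0.0
    return 1.0 - 1.0 / consecutive_valid


if __name__ == "__main__":
    # Compare a new host, a host that just reached the streak, and long streaks.
    for n in (5, 10, 100, 1000):
        print(n, round(trust_probability(n), 3))
```

Under that assumption a host with close to 1000 consecutive valid results would almost never need a wingman, which is why N-Body tasks always waiting for a second result looks like a different configuration rather than bad luck.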

Robert Coplin Joined: 23 Sep 13 Posts: 19 Credit: 36,224,642 RAC: 1
Message 73047 - Posted: 20 Apr 2022, 20:08:35 UTC

In the last 2 1/2 days I have had almost 1000 NBody work units validated, and almost another 15,000 are in the validation inconclusive list.

GolfSierra Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0
Message 73048 - Posted: 20 Apr 2022, 21:20:21 UTC - in response to Message 73044. Last modified: 20 Apr 2022, 21:36:02 UTC

deleted

alanb1951 Joined: 16 Mar 10 Posts: 218 Credit: 110,420,422 RAC: 3,848
Message 73051 - Posted: 20 Apr 2022, 23:06:08 UTC

@GolfSierra - regarding sending multiple tasks per work unit...

There are many different sorts of project using BOINC. For some of them, lots of work units are created with the same data set and [most] parameters, the only difference between them being one or more seeds for randomization, whilst others may only have one work unit (and no randomization) per data-set/parameters combination. The need for validation by comparison differs in these two cases.

For the first type of project, if there are enough work units with only seed differences there may be no need of result-matching for validation at all, especially if the project is CPU-only. Examples of this include climateprediction.net and some of the projects at WCG. CPDN never issues a retry unless the previous task failed, and the WCG projects of this type use a variant of adaptive replication. The goal in all cases is to reduce the number of tasks required to get to a canonical result.

For the second type of project, verification is mandatory (especially if GPU-based!). Examples of this type would include the Africa Rainfall Project at WCG, where each work unit represents a separate area of Africa and a two-day simulation with "fixed" parameters. Such projects will usually ship out two (or more) tasks at once.

For projects that use adaptive replication, new clients have to earn "trusted status" before they can be considered for validation without replication; "trusted" clients are expected to produce results that are unlikely to fail validation anyway! However, every so often, that assumption is tested by making them wait for a wingman anyway...

We saw an unwanted side-effect of "trusted status" when loads of tasks became "ghosts" because of the connectivity issues around 20th March: as soon as the ghosts were flagged as errors (Timed out - no response), trusted status was lost (as it should have been if these had been client errors!). So every task required a wingman and we ended up with all those "needs three to validate" tasks!

Unfortunately, it appears that with the default adaptive replication mechanism one can only have one retry out for processing at once; however, this is only a problem if there's a long delay in releasing tasks, as there had been here since the disk crash. Now that normal service seems to have been resumed here, retries are [usually] being served within half a day of the returned result that triggered the retry request, which is [surely] more than adequate.

Cheers - Al.

P.S. The adaptive replication mechanism is very simplistic! It tracks how many consecutive valid results have been returned, and if that number exceeds a partly-randomized threshold the initial task will be declared valid; otherwise the quorum is boosted by 1, and that should persuade the transitioner to queue a retry.
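The mechanism described in the P.S. above, together with the loss of trusted status after the "ghost" errors, can be sketched roughly as follows. This is a toy model and not the BOINC server code: the names, the base threshold of 10, and the size of the random jitter are all assumptions picked for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class WorkUnit:
    min_quorum: int = 1
    validated: bool = False


def record_outcome(streak: int, valid: bool) -> int:
    """Update a host's consecutive-valid streak; any error resets it to zero."""
    return streak + 1 if valid else 0


def on_first_result(wu: WorkUnit, streak: int, rng: random.Random,
                    base_threshold: int = 10, jitter: int = 5) -> bool:
    """Handle the first returned result; return True if a retry is needed.

    Hypothetical rule: the threshold is a base value plus a random amount,
    standing in for the "partly-randomized threshold" in the post.
    """
    threshold = base_threshold + rng.randint(0, jitter)
    if streak > threshold:
        wu.validated = True      # accepted without a wingman
        return False
    wu.min_quorum += 1           # quorum boost should trigger a retry
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    streak = 1000
    print(on_first_result(WorkUnit(), streak, rng))   # long streak: no retry
    streak = record_outcome(streak, valid=False)      # one "ghost" timeout
    print(on_first_result(WorkUnit(), streak, rng))   # streak is 0: retry
```

In this model a single timed-out task wipes out a long streak, which matches the observation that every task needed a wingman after the March connectivity problems.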

Weyland Joined: 16 Dec 17 Posts: 2 Credit: 23,773,776 RAC: 0
Message 73053 - Posted: 21 Apr 2022, 5:30:39 UTC

I have a little over 1000 tasks in the Validation inconclusive pile from the March 20 to April 17 period. All the WUs were sent to me only and, judging by the "Didn't need" status, they aren't going to be sent to anyone else. Does this mean all those results are getting discarded? Not pointing fingers or anything, just curious about what has happened and whether it works as intended or not.

GolfSierra Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0
Message 73055 - Posted: 21 Apr 2022, 6:00:47 UTC - in response to Message 73053.

"I have a little over 1000 tasks in the Validation inconclusive pile from the March 20 to April 17 period. All the WUs were sent to me only and, judging by the "Didn't need" status, they aren't going to be sent to anyone else. Does this mean all those results are getting discarded? Not pointing fingers or anything, just curious about what has happened and whether it works as intended or not."

And did you receive any credits for them? I guess not.

Septimus Joined: 8 Nov 11 Posts: 205 Credit: 2,905,403 RAC: 0
Message 73056 - Posted: 21 Apr 2022, 7:04:55 UTC - in response to Message 73055.

"'I have a little over 1000 tasks in the Validation inconclusive pile from the March 20 to April 17 period. All the WUs were sent to me only and, judging by the "Didn't need" status, they aren't going to be sent to anyone else. Does this mean all those results are getting discarded? Not pointing fingers or anything, just curious about what has happened and whether it works as intended or not.' And did you receive any credits for them? I guess not."

I have a few hundred as well with the same flag, no credit.

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0
Message 73057 - Posted: 21 Apr 2022, 7:26:15 UTC

As long as the status of a task is "Validation inconclusive", there is obviously no credit to award.

GolfSierra Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0
Message 73058 - Posted: 21 Apr 2022, 7:54:21 UTC - in response to Message 73057.

"As long as the status of a task is "Validation inconclusive", there is obviously no credit to award."

True, and as Tom posted, "We cannot send out 2 or 3 WUs immediately, because (i) not every WU gets checked against wingmen," so it will never happen. But I was told that this is standard.

alanb1951 Joined: 16 Mar 10 Posts: 218 Credit: 110,420,422 RAC: 3,848
Message 73059 - Posted: 21 Apr 2022, 8:06:37 UTC - in response to Message 73053.

"I have a little over 1000 tasks in the Validation inconclusive pile from the March 20 to April 17 period. All the WUs were sent to me only and, judging by the "Didn't need" status, they aren't going to be sent to anyone else. Does this mean all those results are getting discarded? Not pointing fingers or anything, just curious about what has happened and whether it works as intended or not."

Being curious about this, I've just been trying to make sense of the transitioner code to see what happens to work units in the state these tasks are in. I looked at the code-base for the server version cited on the MW Server status page, and I'm assuming that the MW folks haven't got a customized version thereof :-)

I'm not sure about this, but it looks as if the transitioner will get round to looking at these work units again about 10 days after it last looked at them (or perhaps a little longer, depending on some system parameters which are only known to the SysAdmin!). That said, I also can't be sure that it will actually generate another task then. It looks as if it should, but I don't have the full picture of the flags and state values in the work-unit and result database entries for these items to confirm it (and I am highly unlikely to get a chance to query the MW database as a SysAdmin!).

So I guess we may be waiting until around the end of April to find out what happens next!

Cheers - Al.

P.S. I am not a BOINC SysAdmin...

mikey Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0
Message 73062 - Posted: 21 Apr 2022, 10:40:28 UTC - in response to Message 73058.

"'As long as the status of a task is "Validation inconclusive", there is obviously no credit to award.' True, and as Tom posted, 'We cannot send out 2 or 3 WUs immediately, because (i) not every WU gets checked against wingmen,' so it will never happen. But I was told that this is standard."

That may not be true. If the server can get its act in gear like before the crash, then a lot of tasks could be validated very quickly: since people have a large backlog of them, it could be really easy to get to 10 valid tasks in a row, meaning a 'trusted host', and boom, most of their tasks get validated.

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0
Message 73064 - Posted: 21 Apr 2022, 10:44:04 UTC - in response to Message 73062.

@mikey: +1

Kiska Joined: 31 Mar 12 Posts: 96 Credit: 152,502,225 RAC: 0
Message 73065 - Posted: 21 Apr 2022, 15:40:30 UTC - in response to Message 73046.

"I read the Adaptive Replication server documentation that Kiska linked above. I may be missing something, but it seems like Adaptive Replication is on for Separation but off for N-Body. Almost all of my Separation (GPU) tasks get validated without a second task, but N-Body ones need a second one."

Going to slightly dispute that. Going by the documentation, tasks are set for a minimum quorum of 1. So, using that logic, Separation (going off my task list) is not using adaptive replication, as seen in these examples:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=419779533
vs
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=419779593

Whereas NBody is using adaptive replication, as evidenced here:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=425993147
and in this inconclusive WU:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=414659426

Notice how the initial replication number changes when a result is marked inconclusive?

Weyland Joined: 16 Dec 17 Posts: 2 Credit: 23,773,776 RAC: 0
Message 73066 - Posted: 21 Apr 2022, 18:07:15 UTC - in response to Message 73055.

"'I have a little over 1000 tasks in the Validation inconclusive pile from the March 20 to April 17 period. All the WUs were sent to me only and, judging by the "Didn't need" status, they aren't going to be sent to anyone else. Does this mean all those results are getting discarded? Not pointing fingers or anything, just curious about what has happened and whether it works as intended or not.' And did you receive any credits for them? I guess not."

I haven't received any credits for those tasks, but that's fair enough if the validator needs to double-check the results. I am more concerned with their limbo status. From what I have read in this thread, this should not be happening.

alanb1951 Joined: 16 Mar 10 Posts: 218 Credit: 110,420,422 RAC: 3,848
Message 73067 - Posted: 21 Apr 2022, 18:23:58 UTC - in response to Message 73065. Last modified: 21 Apr 2022, 18:44:35 UTC

Kiska,

The key place to look at quorum and replication numbers is probably In progress, as it is indicative of the most recent task being out in the field...

I've just looked at some of my In progress tasks for both Separation and NBody, considering only those that are _0 tasks and hence candidates for adaptive replication instant validation. In both cases the work unit web page says
minimum quorum 1 initial replication 1

If I then consider only those that are _1 tasks (i.e. the initial task was marked inconclusive), the web page says
minimum quorum 2 initial replication 2

And if I look at a _2 task that has two previous results marked inconclusive, it has replication 3 as one would expect, but the quorum remains at 2. (I can only check this for Separation. It appears that NBody always validates on two successful returns, so any second or third retries would be because of task failure of some form...)

All those make sense as is, I think, so what of adaptive replication and validation without wingmen?

I have lots of validated Separation tasks which have no wingman, and the only way that can happen with the MW validator code someone linked to earlier is if it uses the adaptive replication logic. The code has a distinct path for handling the first result returned; it will set the result_state to the "Valid" value and identify the result as canonical if it passes the "trusted status" test, but it also increments the minimum quorum whether it has marked the first result valid or not!

So the work unit web page for all these "validated without a wingman" tasks says
minimum quorum 2 initial replication 1
which is a potential cause for confusion :-) Of course, the ones that aren't validated get an extra replication when the transitioner sees that there is not yet a canonical result, so the quorum bump is right for them.

However, I haven't had a single NBody task validate without a wingman. I've returned enough consecutive valid tasks that I should be getting some validations without a wingman if it's using adaptive replication, so I have to conclude that it isn't, for one reason or another (or, of course, that it's using a very different algorithm, but I couldn't find another validator...).

Now, I'm not a BOINC SysAdmin so I could be completely misunderstanding things, but...

Cheers - Al.

[Edit] In the above I ignored Separation work units where the _0 task failed for some reason. I've just checked up on a few of those and they behave as if the _1 task was the first task instead: some validated _1 at once (and showed quorum 2, replications 1) whilst others went on to need three non-error results before validating...
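The quorum and replication numbers reported above can be reproduced with a small toy model. It is only a sketch of the behaviour as described in this post, not the MW validator or BOINC transitioner code; the class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkUnitPage:
    min_quorum: int = 1
    initial_replication: int = 1
    has_canonical: bool = False


def validator_sees_first_result(wu: WorkUnitPage, host_trusted: bool) -> None:
    """First-result path: a trusted host's result becomes canonical, but the
    minimum quorum is incremented whether or not the result was accepted."""
    if host_trusted:
        wu.has_canonical = True
    wu.min_quorum += 1


def transitioner_pass(wu: WorkUnitPage) -> None:
    """Issue one more task only while there is still no canonical result."""
    if not wu.has_canonical and wu.initial_replication < wu.min_quorum:
        wu.initial_replication += 1


def page_line(wu: WorkUnitPage) -> str:
    return (f"minimum quorum {wu.min_quorum} "
            f"initial replication {wu.initial_replication}")


if __name__ == "__main__":
    for trusted in (True, False):
        wu = WorkUnitPage()
        validator_sees_first_result(wu, host_trusted=trusted)
        transitioner_pass(wu)
        print(trusted, "->", page_line(wu))
    # True -> minimum quorum 2 initial replication 1
    # False -> minimum quorum 2 initial replication 2
```

The trusted case gives the confusing "quorum 2, replication 1" page for a task that validated without a wingman, while the untrusted case gives the "quorum 2, replication 2" page seen for _1 tasks.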

GolfSierra Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0
Message 73070 - Posted: 22 Apr 2022, 7:52:01 UTC - in response to Message 73067.

Well, WCG over at Krembil announced that the restart was postponed to May 9, what a pity. Looks like I'll stay here for another couple of days.

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0
Message 73073 - Posted: 22 Apr 2022, 12:03:21 UTC - in response to Message 73070.

Everything is working fine here on MW ...