testing new validator

Author	Message
Gary Roberts Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0	Message 38066 - Posted: 5 Apr 2010, 9:55:08 UTC - in response to Message 38062. Oh well...NNT until the problems with the validator have been overcome. I think that's probably a very wise move. I'm now seeing quite a lot of examples of quorums that are giving the "Too many total results" error message when there are 6 successful results that apparently don't agree closely enough. There has got to be some sort of a problem with the validator, I would guess. I've seen a few examples where the 6 results have been split 3/3 (or 4/2 or 2/4) between 48xx and 58xx GPUs and yet the validator can't seem to find 3 that agree closely enough. There's something wrong with the validation process somewhere. Here's an example of a 3/3 split that can't validate and gives the 'Too many total results'. Cheers, Gary. ID: 38066 · Rating: 0 · rate: / Reply Quote

Emanuel Send message Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0	Message 38067 - Posted: 5 Apr 2010, 10:00:29 UTC I think all these examples are definitely helping, at least. Other than that, it may be best to wait for the new application versions if you hate losing crunching time (if you don't, I'm sure Travis would appreciate your continued help testing). ID: 38067 · Rating: 0 · rate: / Reply Quote

Gary Roberts Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0	Message 38068 - Posted: 5 Apr 2010, 10:09:22 UTC - in response to Message 38065. I'm still gathering up a lot of "can't validate" messages. What does "check skipped" mean anyway? The validator will skip the validation check if any of the limits are exceeded. In your case I think you will find that the IR has gone to 7 and the WU as a whole has errored out with the 'Too many total results' error message. Click on the WU ID and look at what the error message for the WU as a whole actually says. Cheers, Gary. ID: 38068 · Rating: 0 · rate: / Reply Quote

Arif Mert Kapicioglu Send message Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0	Message 38069 - Posted: 5 Apr 2010, 10:16:22 UTC - in response to Message 38068. Last modified: 5 Apr 2010, 10:21:40 UTC http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90246907 This happens in stock application? Interesting ID: 38069 · Rating: 0 · rate: / Reply Quote

Gary Roberts Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0	Message 38070 - Posted: 5 Apr 2010, 10:24:07 UTC - in response to Message 38069. http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=96603214 It says wu error and lists a series of app_info.xml . I'm running stock applications. That's the task ID and NOT the WU ID. If you have the task ID open you can actually see the WU ID on the second line of that particular page of output. However, when you are looking at all your results on the website you can see both task ID and WU ID side by side. Click on the WU ID to see what I'm talking about. Cheers, Gary. ID: 38070 · Rating: 0 · rate: / Reply Quote

Arif Mert Kapicioglu Send message Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0	Message 38071 - Posted: 5 Apr 2010, 10:29:32 UTC - in response to Message 38070. Ok. I edited the previous post as you asked me to do. It says too many total results. Status: Completed, can't validate ID: 38071 · Rating: 0 · rate: / Reply Quote

uwe Send message Joined: 6 Nov 09 Posts: 2 Credit: 1,500,164 RAC: 0	Message 38073 - Posted: 5 Apr 2010, 10:35:46 UTC What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour? ID: 38073 · Rating: 0 · rate: / Reply Quote

Gary Roberts Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0	Message 38077 - Posted: 5 Apr 2010, 10:55:25 UTC - in response to Message 38071. Ok. I edited the previous post ... That's OK. I obviously captured for posterity what you wrote before editing :-). as you asked me to do. I certainly didn't ask you to edit your post. I was trying to explain to you the difference between a 'task' and a 'WU'. There is no error in any of the six tasks that make up the complete WU, but the WU has errored out simply because the IR went to 7, i.e. there were more than six tasks in total making up the workunit. If you look at any of the six tasks in the WU, none of them actually has a problem that's visible. However, we can deduce that the validator couldn't find three that agreed closely enough before the IR was bumped from 6 to 7. As soon as it was bumped, the limit of 6 total tasks was exceeded, the whole WU was marked as an error, and any further validation checks were skipped. The problem is more likely to be with validation than with how your machine crunched your particular task. It says too many total results. Status: Completed, can't validate Which is exactly what it should say, seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from? Cheers, Gary. ID: 38077 · Rating: 0 · rate: / Reply Quote
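The limit check Gary describes can be sketched roughly as follows. This is a simplified model written for illustration, not the BOINC server's actual code: the `Limits` class, the `workunit_error` function, and the wording of the "Too many errors" message are assumptions. Only the "Too many total results" and "Too many success results" messages and the 3/6/6 settings come from the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    """Per-workunit limits, in the thread's errors/total/success order (names are hypothetical)."""
    max_error: int
    max_total: int
    max_success: int


def workunit_error(limits: Limits, n_error: int, n_total: int, n_success: int) -> Optional[str]:
    """Return a workunit-level error message, or None if no limit is exceeded.

    n_total counts every task in the workunit, including one the server is
    about to create.
    """
    if n_error > limits.max_error:
        return "Too many errors"  # message wording assumed, not taken from the thread
    if n_total > limits.max_total:
        return "Too many total results"
    if n_success > limits.max_success:
        return "Too many success results"
    return None


limits = Limits(max_error=3, max_total=6, max_success=6)
# Six successful tasks sit exactly at the limits, so nothing trips yet.
six_tasks = workunit_error(limits, n_error=0, n_total=6, n_success=6)
# Bumping the replication to 7 exceeds the total limit and errors out the workunit.
seventh_task = workunit_error(limits, n_error=0, n_total=7, n_success=6)
```

In this model no individual task fails. The workunit as a whole errors out only when the seventh copy is requested, which matches the "Completed, can't validate" status shown on otherwise successful tasks.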

Gary Roberts Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0	Message 38079 - Posted: 5 Apr 2010, 11:15:56 UTC - in response to Message 38073. What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour? I'm seeing that on all recent tasks that do validate and are being crunched by stock apps anyway (no app_info.xml file) so I'm assuming it is something in new tasks that is not right but otherwise is actually harmless. Only Travis can sort that out and he's obviously not listening at the moment - probably in bed. Cheers, Gary. ID: 38079 · Rating: 0 · rate: / Reply Quote

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38089 - Posted: 5 Apr 2010, 14:08:06 UTC - in response to Message 38057. Here's another example of the 'Too many success results' bug. Note that one of the victims actually invested over a day of CPU time for no reward. I guess he won't be particularly impressed. I wonder why people still persist with slow CPUs on a project like this? And here's one that is rather more important that I've just noticed. Looks like Travis has set 3,6,6 for errors/total/success and this quorum has failed with the error message "Too many total results". However there are only 6 tasks listed in the quorum, one of which is a 'client detached'. Perhaps that triggered an attempt to send out a 7th copy which junked the whole quorum. Because of the conflict between 48xx and 58xx, there mustn't have been 3 agreeing results at the time the attempt was made to send out the 7th copy. Until things are sorted regarding validation, perhaps it should be 3,9,6 rather than 3,6,6 to prevent this problem. EDIT: If you think about it, it makes sense to have the 'total' equal to the sum of 'errors' and 'success' so that all bases are covered. Yeah, I bumped it up to 3,9,6 for the time being. ID: 38089 · Rating: 0 · rate: / Reply Quote
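The rule of thumb quoted above (make 'total' at least the sum of 'errors' and 'success') is simple arithmetic. A minimal sketch, using a hypothetical helper name and the thread's errors/total/success ordering:

```python
def total_covers_worst_case(max_error: int, max_total: int, max_success: int) -> bool:
    """True if the total-task limit cannot trip before the error or success limits do.

    Arguments follow the thread's errors/total/success order. The function name
    is made up for this example.
    """
    return max_total >= max_error + max_success


old_setting = total_covers_worst_case(3, 6, 6)  # the original 3,6,6
new_setting = total_covers_worst_case(3, 9, 6)  # the 3,9,6 Travis switched to
```

With 3,6,6 a single detached or errored task plus disagreeing results can exhaust the total before six successes are in. With 3,9,6 there is room for three errors and six successes.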

Furlozza Send message Joined: 7 Feb 09 Posts: 9 Credit: 25,983,618 RAC: 0	Message 38090 - Posted: 5 Apr 2010, 14:18:41 UTC - in response to Message 38079. Dumb Question time: 1) as multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met? 2) As some of the errors/failure to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there anyway to send wus to the same version of cards? 3) What is the precise reason for the failure of the above cards to actually agree. Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx 4)I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded) Dumb statement: 58xx cards validated regardless of GPU or Memory frequencies or App used or CAL Runtime: 1.4.5xx have been predominant. The above may not be useful, but am looking at it from the what has been validated, not the what hasn't or is in pending; the majority of pending giving rise to Q 1). As for pending, it appears to be mainly the clash between cards (nVidia/ATI and ATI series) and CPU produced results and the Validator trying to get two or three matching results. And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number". As I said above, dumb questions...... ID: 38090 · Rating: 0 · rate: / Reply Quote

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38091 - Posted: 5 Apr 2010, 14:27:35 UTC - in response to Message 38089. I've updated the validator so it will add the fitness that the results reported to the server at the end of the standard output. This way you guys can check the fitness your results are reporting (and compare them to other results). ID: 38091 · Rating: 0 · rate: / Reply Quote

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38092 - Posted: 5 Apr 2010, 14:28:17 UTC - in response to Message 38073. What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour? That's just getting ready for the new version of the application. The new application will take the parameters it's using from the command line (that way we don't have to generate a new parameter file for each workunit). ID: 38092 · Rating: 0 · rate: / Reply Quote

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38093 - Posted: 5 Apr 2010, 14:36:26 UTC - in response to Message 38090. Dumb Question time: 1) as multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met? One workunit is being used, and multiple results are generated for each workunit until it reaches a quorum. Right now the way it works is one copy of the workunit is initially issued; when it comes back, we check to see if it needs validation. If it does, we send out 2 more copies (to try and get a quorum of 3). If those 2 come back and there is no quorum, we send out a 4th copy, then a 5th, etc. until we have a quorum of 3. 2) As some of the errors/failure to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there anyway to send wus to the same version of cards? This doesn't solve the problem because we need the accuracy specified. We need to find out which cards are sending back incorrect results and update the applications accordingly. 3) What is the precise reason for the failure of the above cards to actually agree. Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx That's what we're trying to figure out. I just updated the validator to append information about the fitness returned from your results to the std_err field -- so you can see it when you look at a task. Hopefully this will help. 4)I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded) Not quite sure what this issue is... seems maybe BOINC client related? And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number". This is because we're still not validating EVERY workunit. We validate every workunit that will improve the searches we're running (if there's a better fitness found than what we currently know about). If the fitness isn't going to improve the search, we're still validating those workunits 50% of the time. This is so we can get these accuracy issues worked out and so people can't scam the server for credit using single precision GPU applications or other things. ID: 38093 · Rating: 0 · rate: / Reply Quote
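The issuing and validation policy Travis describes might be sketched like this. It is a rough model for illustration only: the function names are hypothetical, and treating a larger fitness value as "better" is an assumption the thread does not confirm.

```python
import random


def copies_to_send(n_issued: int, needs_validation: bool, has_quorum: bool) -> int:
    """How many new copies of a workunit to issue next, per the description above."""
    if n_issued == 0:
        return 1   # one copy goes out first
    if not needs_validation or has_quorum:
        return 0   # nothing more to do
    if n_issued == 1:
        return 2   # try for a quorum of 3
    return 1       # then a 4th, a 5th, and so on


def needs_validation(fitness: float, best_known: float, rng=random.random) -> bool:
    """Always validate a result that would improve the search; otherwise validate half the time.

    Assumption: a larger fitness value counts as an improvement.
    """
    if fitness > best_known:
        return True
    return rng() < 0.5
```

The 50% branch would explain why some results are granted credit without any second result to compare against.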

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38095 - Posted: 5 Apr 2010, 15:00:35 UTC - in response to Message 38077. Which is exactly what it should say seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from? Yeah I was passed out. But I found part of the problem. The validator was actually trying to get a quorum of 4. I was looking for matches == getMinQuorum(), and since I wasn't comparing a workunit to itself, what I actually needed was matches == getMinQuorum()-1. Debugging at 4am is bad news, lol. Was pretty obvious with a good night's sleep. ID: 38095 · Rating: 0 · rate: / Reply Quote
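The off-by-one Travis describes is easy to reproduce in a small sketch. This is not the project's validator: the function names, the tolerance, and the fitness values are made up for illustration, and the sketch uses `>=` where the post mentions `==`.

```python
MIN_QUORUM = 3


def matches_for(i: int, fitnesses: list, tol: float) -> int:
    """Count the OTHER results that agree with result i (a result is never compared with itself)."""
    return sum(
        1
        for j, f in enumerate(fitnesses)
        if j != i and abs(f - fitnesses[i]) <= tol
    )


def has_quorum(fitnesses: list, tol: float, required_matches: int) -> bool:
    """True if some result has at least required_matches other results agreeing with it."""
    return any(
        matches_for(i, fitnesses, tol) >= required_matches
        for i in range(len(fitnesses))
    )


# Hypothetical fitness values: two groups of three results that agree internally.
results = [-3.1, -3.1, -3.1, -3.2, -3.2, -3.2]

# Buggy check: asking for MIN_QUORUM other matches really demands four agreeing results.
buggy = has_quorum(results, tol=1e-11, required_matches=MIN_QUORUM)
# Fixed check: a result plus MIN_QUORUM - 1 others makes a quorum of three.
fixed = has_quorum(results, tol=1e-11, required_matches=MIN_QUORUM - 1)
```

With a 3/3 split, each result has only two partners, so the buggy check never finds a quorum even though three results agree. The fixed check does.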

Furlozza Send message Joined: 7 Feb 09 Posts: 9 Credit: 25,983,618 RAC: 0	Message 38096 - Posted: 5 Apr 2010, 15:18:10 UTC - in response to Message 38095. Well, here's the first "magic number" failure I have in my records: Job No 90254161 - completed - can't Validate 3,6,6
GPU series | CAL runtime | Fitness | App
47xx/47xx | 1.4.415 | -3.169546804554361 | standard app
58xx | 1.4.533 | -3.169546898027065 | standard app
58xx | 1.4.553 | -3.169546898027031 | standard app
47xx/48xx | 1.4.515 | -3.169546804554371 | Anon app
58xx | 1.4.515 | -3.169546898027031 | Standard app
47xx/48xx | 1.4.553 | -3.169546804554361 | standard app
(hoping it makes sense after posting) Basically, of the six issues, there were two agreements in both 47xx/48xx and also 58xx. In both cases, the drivers made no difference, as both 'valid' and 'invalid' results within the same series missed out for some other reason whilst using the same CAL runtime as one of the 'valid' wus. Or to put it another way..... it may not be directly related to drivers, but the architecture within the actual cards. You should see the world from my eyes. ID: 38096 · Rating: 0 · rate: / Reply Quote
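To see how the six fitness values in this post cluster, one can count how many results lie within a tolerance of any single result. The sketch below is illustrative only. The function name is hypothetical, and the 1e-11 tolerance is an assumption chosen for the example, not the validator's confirmed setting at the time.

```python
# The six fitness values reported for job 90254161, in the order posted.
FITNESSES = [
    -3.169546804554361,  # 47xx/47xx
    -3.169546898027065,  # 58xx
    -3.169546898027031,  # 58xx
    -3.169546804554371,  # 47xx/48xx
    -3.169546898027031,  # 58xx
    -3.169546804554361,  # 47xx/48xx
]


def largest_agreeing_group(values: list, tol: float) -> int:
    """Largest number of results (including the result itself) within tol of any one result."""
    return max(
        sum(1 for other in values if abs(other - v) <= tol)
        for v in values
    )


exact = largest_agreeing_group(FITNESSES, 0.0)    # bit-for-bit identical values only
loose = largest_agreeing_group(FITNESSES, 1e-11)  # assumed tolerance, see above
```

Exact comparison finds only pairs, which fits the "two agreements" observation. At the assumed 1e-11 tolerance each card family forms a group of three, but no group reaches four.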

magyarficko Send message Joined: 22 Jan 09 Posts: 35 Credit: 46,731,190 RAC: 0	Message 38098 - Posted: 5 Apr 2010, 15:28:54 UTC - in response to Message 38053. Last modified: 5 Apr 2010, 15:30:14 UTC 82% valid tasks, is not going to work in the long run, obviously. But I'll hang around for the shakedown. Well I'm out of here for the time being as 82% is not satisfactory for me! I realize that MilkyWay is still classed (as far as I know) as an Alpha project, but IMHO it is mature enough that they shouldn't be running tests in a production environment - at least some of these bugs (if not the majority of them) SHOULD have been caught in testing before releasing this new version validator into the wild. See y'all later. Right now it looks like the problem isn't the validator but the (optimized?) GPU applications. I'm not using an optimized app, I'm using what is given to me by the project. I have NEVER before had invalid results, after the "upgrade" I was getting many. ID: 38098 · Rating: 0 · rate: / Reply Quote

Brian Silvers Send message Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0	Message 38100 - Posted: 5 Apr 2010, 16:18:46 UTC Last modified: 5 Apr 2010, 16:19:58 UTC Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...
name: de_s11_3s_free_6_1544383_1270347650
application: MilkyWay@Home
created: 4 Apr 2010 2:20:50 UTC
minimum quorum: 3
initial replication: 4
max # of error/total/success tasks: 3, 6, 1
errors: Too many success results
Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
95656877 | 26452 | 4 Apr 2010 2:22:25 UTC | 5 Apr 2010 5:49:34 UTC | Completed, can't validate | 0.00 | 16,347.27 | 70.47 | 0.00 | Anonymous platform
96480761 | 26133 | 5 Apr 2010 5:50:54 UTC | 5 Apr 2010 7:17:50 UTC | Completed, can't validate | 216.20 | 212.47 | 1.11 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
96480762 | 141414 | 5 Apr 2010 5:50:38 UTC | 5 Apr 2010 6:27:01 UTC | Completed, can't validate | 89.52 | 87.25 | 0.61 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
ID: 38100 · Rating: 0 · rate: / Reply Quote

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38103 - Posted: 5 Apr 2010, 16:27:53 UTC - in response to Message 38100. Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...
name: de_s11_3s_free_6_1544383_1270347650
application: MilkyWay@Home
created: 4 Apr 2010 2:20:50 UTC
minimum quorum: 3
initial replication: 4
max # of error/total/success tasks: 3, 6, 1
errors: Too many success results
Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
95656877 | 26452 | 4 Apr 2010 2:22:25 UTC | 5 Apr 2010 5:49:34 UTC | Completed, can't validate | 0.00 | 16,347.27 | 70.47 | 0.00 | Anonymous platform
96480761 | 26133 | 5 Apr 2010 5:50:54 UTC | 5 Apr 2010 7:17:50 UTC | Completed, can't validate | 216.20 | 212.47 | 1.11 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
96480762 | 141414 | 5 Apr 2010 5:50:38 UTC | 5 Apr 2010 6:27:01 UTC | Completed, can't validate | 89.52 | 87.25 | 0.61 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
This was one of the older WUs sent out with bad values for max error/total/success: max # of error/total/success tasks 3, 6, 1. That issue shouldn't happen anymore. I've also loosened up the validation a little bit, which may help keep some workunits from being flagged invalid.
If we can't figure out a good solution to the 48xx vs 58xx ATI GPUs issue, I'll probably lower the validation to having fitness within 10e-10 (or 10e-9) to see if that helps. The new application will be 10e-11 however. ID: 38103 · Rating: 0 · rate: / Reply Quote
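A note on the notation: written as a floating-point literal, 10e-10 is 10 × 10^-10, i.e. 1e-9. The sketch below checks the tolerances mentioned in this post against the gap between one 47xx/48xx fitness and one 58xx fitness from Furlozza's post for job 90254161. It illustrates that single workunit only and says nothing about other workunits.

```python
# One 47xx/48xx fitness and one 58xx fitness, as posted for job 90254161.
fitness_48xx = -3.169546804554361
fitness_58xx = -3.169546898027031

# Gap between the two card families for this workunit, roughly 9.3e-8.
gap = abs(fitness_48xx - fitness_58xx)

# The tolerances mentioned in the post, written exactly as posted.
tolerances = {"10e-11": 10e-11, "10e-10": 10e-10, "10e-9": 10e-9}

# Would the two families agree at each tolerance?
would_agree = {label: gap <= tol for label, tol in tolerances.items()}
```

For these posted values the gap is larger than even the loosest of the three tolerances, so loosening the check to 10e-9 alone would not make these two results agree.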

Travis Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0	Message 38104 - Posted: 5 Apr 2010, 16:29:48 UTC - in response to Message 38103. On another note, does anyone know if the 48xx or the 58xx ATI GPU is the one validating correctly vs the stock application? ID: 38104 · Rating: 0 · rate: / Reply Quote