Message boards :
News :
testing new validator
Author | Message |
---|---|
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
Oh well...NNT until the problems with the validator have been overcome. I think that's probably a very wise move. I'm now seeing quite a lot of examples of quorums that are giving the "Too many total results" error message when there are 6 successful results that apparently don't agree closely enough. There has got to be some sort of a problem with the validator, I would guess. I've seen a few examples where the 6 results have been split 3/3 (or 4/2 or 2/4) between 48xx and 58xx GPUs and yet the validator can't seem to find 3 that agree closely enough. There's something wrong with the validation process somewhere. Here's an example of a 3/3 split that can't validate and gives the 'Too many total results'. Cheers, Gary. |
Emanuel Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
I think all these examples are definitely helping, at least. Other than that, it may be best to wait for the new application versions if you hate losing crunching time (if you don't, I'm sure Travis would appreciate your continued help testing). |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
I'm still gathering up a lot of "can't validate" messages. What does "check skipped" mean anyway? The validator will skip the validation check if any of the limits are exceeded. In your case I think you will find that the IR has gone to 7 and the WU as a whole has errored out with the 'Too many total results' error message. Click on the WU ID and look at what the error message for the WU as a whole actually says. Cheers, Gary. |
Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0 |
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90246907 This happens in stock application? Interesting |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=96603214 That's the task ID and NOT the WU ID. If you have the task ID open you can actually see the WU ID on the second line of that particular page of output. However, when you are looking at all your results on the website you can see both task ID and WU ID side by side. Click on the WU ID to see what I'm talking about. Cheers, Gary. |
Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0 |
Ok. I edited the previous post as you asked me to do. It says too many total results. Status: Completed, can't validate |
Joined: 6 Nov 09 Posts: 2 Credit: 1,500,164 RAC: 0 |
What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour? |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
"Ok. I edited the previous post ..." That's OK. I obviously captured for posterity what you wrote before editing :-). "as you asked me to do." I certainly didn't ask you to edit your post. I was trying to explain to you the difference between a 'task' and a 'WU'. There is no error in any of the six tasks that make up the complete WU, but the WU has errored out simply because the IR went to 7, i.e. there were more than six tasks in total making up the workunit. If you look at any of the six tasks in the WU, none of them actually have a problem that's visible. However, we can deduce that the validator couldn't find three that agreed closely enough before the IR was bumped from 6 to 7. As soon as it was bumped, the limit of 6 total tasks was exceeded, the whole WU was marked as an error, and any further validation checks were skipped. The problem is more likely to be with validation than with how your machine crunched your particular task. "It says too many total results. Status: Completed, can't validate" Which is exactly what it should say, seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from? Cheers, Gary. |
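The limit check Gary describes can be sketched roughly as follows. This is only an illustration, not the project's or BOINC's actual code: the `Limits` class, the function name, and the "Too many error results" string are hypothetical, and it assumes a workunit errors out as soon as a count exceeds its limit.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Per-workunit limits, in the error/total/success order used in the thread."""
    max_error: int
    max_total: int
    max_success: int


def workunit_errors(n_error: int, n_total: int, n_success: int, limits: Limits) -> list:
    """Return the error messages for a workunit whose counts exceed its limits."""
    errors = []
    if n_error > limits.max_error:
        errors.append("Too many error results")  # hypothetical wording
    if n_total > limits.max_total:
        errors.append("Too many total results")
    if n_success > limits.max_success:
        errors.append("Too many success results")
    return errors


# Six successful tasks, then a seventh is created: the total limit of 6 is exceeded.
print(workunit_errors(n_error=0, n_total=7, n_success=6, limits=Limits(3, 6, 6)))
```

Under the same assumption, limits of 3, 6, 1 with three successful results would report "Too many success results", because the success limit of 1 is exceeded even though every task ran cleanly.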
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
"What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?" I'm seeing that on all recent tasks that do validate and are being crunched by stock apps anyway (no app_info.xml file), so I'm assuming it is something in new tasks that is not right but otherwise is actually harmless. Only Travis can sort that out and he's obviously not listening at the moment - probably in bed. Cheers, Gary. |
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"Here's another example of the 'Too many success results' bug. Note that one of the victims actually invested over a day of CPU time for no reward. I guess he won't be particularly impressed." Yeah, I bumped it up to 3,9,6 for the time being. |
Joined: 7 Feb 09 Posts: 9 Credit: 25,983,618 RAC: 0 |
Dumb Question time:
1) As multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met?
2) As some of the errors/failures to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there any way to send wus to the same version of cards?
3) What is the precise reason for the failure of the above cards to actually agree? Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx?
4) I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded).
Dumb statement: 58xx cards validated regardless of GPU or Memory frequencies or App used or CAL Runtime: 1.4.5xx have been predominant. The above may not be useful, but I am looking at it from what has been validated, not what hasn't or is in pending; the majority of pending giving rise to Q 1). As for pending, it appears to be mainly the clash between cards (nVidia/ATI and ATI series) and CPU produced results and the Validator trying to get two or three matching results. And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number". As I said above, dumb questions...... |
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I've updated the validator so it will add the fitness that the results reported to the server at the end of the standard output. This way you guys can check the fitness your results are reporting (and compare them to other results). |
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?" That's just getting ready for the new version of the application. The new application will take the parameters it's using from the command line (that way we don't have to generate a new parameter file for each workunit). |
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Dumb Question time: One workunit is being used, and multiple results are generated for each workunit until it reaches a quorum. Right now the way it works is one copy of the workunit is initially issued, when it comes back, we check to see if it needs validation. If it does we send out 2 more copies (to try and get a quorum of 3). If those 2 come back and there is no quorum, we send out a 4th copy, then a 5th, etc until we have a quorum of 3.
This doesn't solve the problem because we need the accuracy specified. We need to find out which cards are sending back incorrect results and update the applications accordingly.
That's what we're trying to figure out. I just updated the validator to append information about the fitness returned from your results to the std_err field -- so you can see it when you look at a task. Hopefully this will help.
Not quite sure what this issue is... seems maybe BOINC client related?
This is because we're still not validating EVERY workunit. We validate every workunit that will improve the searches we're running (if there's a better fitness found than what we currently know about). If the fitness isn't going to improve the search, we're still validating those workunits 50% of the time. This is so we can get these accuracy issues worked out and so people can't scam the server for credit using single precision GPU applications or other things. |
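The issuing and validation policy described in this post can be sketched as below. It is a simplified illustration under stated assumptions, not the server's code: both function names are hypothetical, `max_total` defaults to the limit of 6 mentioned earlier in the thread, and the sketch simply stops issuing at that limit instead of modelling the workunit erroring out.

```python
import random


def needs_validation(improves_search: bool, rng: random.Random) -> bool:
    """Always validate results that improve a search; otherwise validate about half."""
    return improves_search or rng.random() < 0.5


def copies_to_issue(issued: int, validating: bool, quorum_reached: bool,
                    max_total: int = 6) -> int:
    """Number of new copies to send once all `issued` copies have reported."""
    if issued == 0:
        return 1  # a single copy goes out first
    if not validating or quorum_reached or issued >= max_total:
        return 0  # nothing to check, already agreed, or limit reached
    if issued == 1:
        return 2  # first result needs checking: aim for a quorum of 3
    return 1      # after that, one extra copy at a time


rng = random.Random()
print(needs_validation(improves_search=True, rng=rng))
print([copies_to_issue(n, validating=True, quorum_reached=False) for n in range(7)])
```

Results that are never selected for validation receive no further copies in this sketch, which would be consistent with a result being validated on its own.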
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Yeah, I was passed out. But I found part of the problem. The validator was actually trying to get a quorum of 4. I was looking for matches == getMinQuorum(), and since I wasn't comparing a workunit to itself, what I actually needed was matches == getMinQuorum()-1. Debugging at 4am is bad news, lol. Was pretty obvious with a good night's sleep. |
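The off-by-one described here can be reproduced with a small sketch. The function names, the absolute-difference comparison, and the fitness values are all hypothetical; only the two comparisons (`matches == getMinQuorum()` versus `matches == getMinQuorum()-1`) come from the post.

```python
def matches_for(index, fitnesses, tolerance):
    """Count the OTHER results whose fitness agrees with fitnesses[index]."""
    mine = fitnesses[index]
    return sum(
        1
        for j, other in enumerate(fitnesses)
        if j != index and abs(other - mine) <= tolerance
    )


def quorum_found(fitnesses, min_quorum, tolerance, fixed=True):
    """True if some result has the required number of agreeing partners."""
    # A result is never compared with itself, so it only needs
    # min_quorum - 1 partners; the buggy check asked for min_quorum.
    needed = min_quorum - 1 if fixed else min_quorum
    return any(
        matches_for(i, fitnesses, tolerance) == needed
        for i in range(len(fitnesses))
    )


# Hypothetical 3/3 split: two groups of three agreeing results.
split = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
print(quorum_found(split, min_quorum=3, tolerance=1e-11, fixed=False))  # buggy check
print(quorum_found(split, min_quorum=3, tolerance=1e-11, fixed=True))   # corrected check
```

The first call prints `False` and the second prints `True`: with the buggy comparison, a 3/3 split can never validate, because each result has only two partners. The sketch keeps the `==` from the post; a `>=` comparison would be a more defensive variant.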
Joined: 7 Feb 09 Posts: 9 Credit: 25,983,618 RAC: 0 |
Well, here's the first "magic number" failure I have in my records:
Job No 90254161 - completed - can't Validate 3,6,6
47xx/47xx 1.4.415 -3.169546804554361 standard app
58xx 1.4.533 -3.169546898027065 standard app
58xx 1.4.553 -3.169546898027031 standard app
47xx/48xx 1.4.515 -3.169546804554371 Anon app
58xx 1.4.515 -3.169546898027031 Standard app
47xx/48xx 1.4.553 -3.169546804554361 standard app
(hoping it makes sense after posting)
Basically, of the six issues, there were two agreements in both 47xx/48xx and also 58xx. In both cases, the drivers made no difference, as both 'valid' and 'invalid' results within the same series missed out for some other reason whilst using the same CAL runtime as one of the 'valid' wus. Or to put it another way..... it may not be directly related to drivers, but the architecture within the actual cards. You should see the world from my eyes. |
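One way to look at the six fitness values in this post is to group them by how closely they agree. The sketch below is an illustration only: the grouping rule (sort, then split wherever the gap exceeds a tolerance) and the tolerance of 1e-11 are assumptions, and nothing in the thread confirms that the validator compares results this way.

```python
def agreement_groups(fitnesses, tolerance):
    """Sort the values and start a new group wherever the gap exceeds tolerance."""
    groups = []
    for value in sorted(fitnesses):
        if groups and value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


# The six fitness values from the post, in posted order.
reported = [
    -3.169546804554361,
    -3.169546898027065,
    -3.169546898027031,
    -3.169546804554371,
    -3.169546898027031,
    -3.169546804554361,
]

for group in agreement_groups(reported, tolerance=1e-11):
    print(len(group), min(group), max(group))
```

With this tolerance the values fall into two groups of three, one per card family. With `tolerance=0` (exact equality) each family contains only one exactly matching pair, which may be what "two agreements" refers to.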
Joined: 22 Jan 09 Posts: 35 Credit: 46,731,190 RAC: 0 |
82% valid tasks is not going to work in the long run, obviously. But I'll hang around for the shakedown. I'm not using an optimized app; I'm using what is given to me by the project. I have NEVER before had invalid results; after the "upgrade" I was getting many. |
Brian Silvers Joined: 21 Aug 08 Posts: 625 Credit: 558,425 RAC: 0 |
Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...
name: de_s11_3s_free_6_1544383_1270347650
application: MilkyWay@Home
created: 4 Apr 2010 2:20:50 UTC
minimum quorum: 3
initial replication: 4
max # of error/total/success tasks: 3, 6, 1
errors: Too many success results
This is displayed on the workunit page.
Task 95656877: computer 26452; sent 4 Apr 2010 2:22:25 UTC; reported 5 Apr 2010 5:49:34 UTC; status Completed, can't validate; run time 0.00 sec; CPU time 16,347.27 sec; claimed credit 70.47; granted credit 0.00; application Anonymous platform
Task 96480761: computer 26133; sent 5 Apr 2010 5:50:54 UTC; reported 5 Apr 2010 7:17:50 UTC; status Completed, can't validate; run time 216.20 sec; CPU time 212.47 sec; claimed credit 1.11; granted credit 0.00; application MilkyWay@Home v0.21 (ati13ati)
Task 96480762: computer 141414; sent 5 Apr 2010 5:50:38 UTC; reported 5 Apr 2010 6:27:01 UTC; status Completed, can't validate; run time 89.52 sec; CPU time 87.25 sec; claimed credit 0.61; granted credit 0.00; application MilkyWay@Home v0.21 (ati13ati) |
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum..." This was one of the older WUs sent out with bad values for max error/total/success:
That issue shouldn't happen anymore. I've also loosened up the validation a little bit, which may help keep some workunits from being flagged invalid. If we can't figure out a good solution to the 48xx vs 58xx ATI GPUs issue, I'll probably lower the validation to having fitness within 10e-10 (or 10e-9) to see if that helps. The new application will be 10e-11 however. |
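To make the thresholds concrete, here is a small sketch of an absolute-tolerance comparison. The function name is hypothetical and the absolute-difference rule is an assumption about how "within" is meant. The two sample values are a 47xx/48xx result and a 58xx result from the workunit posted earlier in the thread; whether other 48xx/58xx disagreements are of the same size is not known from this thread.

```python
def fitness_agrees(a: float, b: float, tolerance: float) -> bool:
    """True if two reported fitness values differ by no more than tolerance."""
    return abs(a - b) <= tolerance


# Read literally as a floating-point literal, 10e-10 is the same number as 1e-9.
print(float("10e-10") == 1e-9)

older_card = -3.169546804554361  # a 47xx/48xx result from the earlier post
newer_card = -3.169546898027065  # a 58xx result from the same workunit

for tolerance in (1e-11, 1e-10, 1e-9, 1e-8):
    print(tolerance, fitness_agrees(older_card, newer_card, tolerance))
```

The first line prints `True`, and every comparison in the loop prints `False`: for this one workunit the two families differ by roughly 9.3e-8, which is larger than any of the tolerances tried here. Whether "10e-10" in the post was meant literally or as shorthand for 1e-10 is not clear from the thread.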
Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
On another note, does anyone know if the 48xx or the 58xx ATI GPU is the one validating correctly vs the stock application? |