Welcome to MilkyWay@home

testing new validator


Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0
Message 38066 - Posted: 5 Apr 2010, 9:55:08 UTC - in response to Message 38062.  

Oh well...NNT until the problems with the validator have been overcome.

I think that's probably a very wise move.

I'm now seeing quite a lot of examples of quorums that are giving the "Too many total results" error message when there are 6 successful results that apparently don't agree closely enough. There has got to be some sort of a problem with the validator, I would guess. I've seen a few examples where the 6 results have been split 3/3 (or 4/2 or 2/4) between 48xx and 58xx GPUs and yet the validator can't seem to find 3 that agree closely enough. There's something wrong with the validation process somewhere.

Here's an example of a 3/3 split that can't validate and gives the 'Too many total results'.

Cheers,
Gary.

Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 38067 - Posted: 5 Apr 2010, 10:00:29 UTC

I think all these examples are definitely helping, at least. Other than that, it may be best to wait for the new application versions if you hate losing crunching time (if you don't, I'm sure Travis would appreciate your continued help testing).

Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0
Message 38068 - Posted: 5 Apr 2010, 10:09:22 UTC - in response to Message 38065.  

I'm still gathering up a lot of "can't validate" messages. What does "check skipped" mean anyway?

The validator will skip the validation check if any of the limits are exceeded. In your case I think you will find that the IR has gone to 7 and the WU as a whole has errored out with the 'Too many total results' error message. Click on the WU ID and look at what the error message for the WU as a whole actually says.

Cheers,
Gary.

Arif Mert Kapicioglu
Joined: 14 Dec 09
Posts: 161
Credit: 589,318,064
RAC: 10,014
Message 38069 - Posted: 5 Apr 2010, 10:16:22 UTC - in response to Message 38068.  
Last modified: 5 Apr 2010, 10:21:40 UTC

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90246907

This happens in stock application? Interesting

Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0
Message 38070 - Posted: 5 Apr 2010, 10:24:07 UTC - in response to Message 38069.  

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=96603214

It says wu error and lists a series of app_info.xml . I'm running stock applications.

That's the task ID and NOT the WU ID. If you have the task ID open you can actually see the WU ID on the second line of that particular page of output. However, when you are looking at all your results on the website you can see both task ID and WU ID side by side. Click on the WU ID to see what I'm talking about.

Cheers,
Gary.

Arif Mert Kapicioglu
Joined: 14 Dec 09
Posts: 161
Credit: 589,318,064
RAC: 10,014
Message 38071 - Posted: 5 Apr 2010, 10:29:32 UTC - in response to Message 38070.  

Ok. I edited the previous post as you asked me to do. It says too many total results. Status: Completed, can't validate

uwe
Joined: 6 Nov 09
Posts: 2
Credit: 1,500,164
RAC: 0
Message 38073 - Posted: 5 Apr 2010, 10:35:46 UTC

What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?

Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0
Message 38077 - Posted: 5 Apr 2010, 10:55:25 UTC - in response to Message 38071.  

Ok. I edited the previous post ...

That's OK. I obviously captured for posterity what you wrote before editing :-).

as you asked me to do.

I certainly didn't ask you to edit your post. I was trying to explain to you the difference between a 'task' and a 'WU'. There is no error in any of the six tasks that make up the complete WU, but the WU has errored out simply because the IR went to 7, i.e. there were more than six tasks in total making up the workunit. If you look at any of the six tasks in the WU, none of them actually has a visible problem. However, we can deduce that the validator couldn't find three that agreed closely enough before the IR was bumped from 6 to 7. As soon as it was bumped, the limit of 6 total tasks was exceeded, the whole WU was marked as an error, and any further validation checks were skipped. The problem is more likely to be with validation than with how your machine crunched your particular task.

It says too many total results. Status: Completed, can't validate

Which is exactly what it should say seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from?
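
The limit logic described above can be sketched roughly as follows. This is a simplified model for illustration only, not the BOINC server code: the class, the function, and the order of the checks are all assumptions, and only the three limits (error/total/success) and the error strings come from this thread.

```python
from dataclasses import dataclass


@dataclass
class WorkunitLimits:
    # Hypothetical container for the error/total/success triple, e.g. 3,6,6.
    max_error: int
    max_total: int
    max_success: int


def workunit_error(limits, n_error, n_total, n_success, has_quorum):
    """Return a workunit-level error string, or None if the WU is still OK.

    n_total is the task count including any copy the server is about to add.
    """
    if has_quorum:
        return None
    if n_error > limits.max_error:
        return "Too many error results"
    if n_success > limits.max_success:
        return "Too many success results"
    if n_total > limits.max_total:
        return "Too many total results"
    return None


# Six successful tasks, no quorum, and a 7th copy about to be created:
print(workunit_error(WorkunitLimits(3, 6, 6), 0, 7, 6, False))
print(workunit_error(WorkunitLimits(3, 9, 6), 0, 7, 6, False))
```

Under these assumptions the first call reports "Too many total results" while the second, with the total limit raised to 9, leaves the workunit open for another copy.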

Cheers,
Gary.

Gary Roberts
Joined: 1 Mar 09
Posts: 56
Credit: 1,984,937,272
RAC: 0
Message 38079 - Posted: 5 Apr 2010, 11:15:56 UTC - in response to Message 38073.  

What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?

I'm seeing that on all recent tasks that do validate and are being crunched by stock apps anyway (no app_info.xml file) so I'm assuming it is something in new tasks that is not right but otherwise is actually harmless. Only Travis can sort that out and he's obviously not listening at the moment - probably in bed.
Cheers,
Gary.

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38089 - Posted: 5 Apr 2010, 14:08:06 UTC - in response to Message 38057.  

Here's another example of the 'Too many success results' bug. Note that one of the victims actually invested over a day of CPU time for no reward. I guess he won't be particularly impressed.

I wonder why people still persist with slow CPUs on a project like this?

And here's one that is rather more important that I've just noticed. Looks like Travis has set 3,6,6 for errors/total/success and this quorum has failed with the error message "Too many total results". However there are only 6 tasks listed in the quorum, one of which is a 'client detached'. Perhaps that triggered an attempt to send out a 7th copy which junked the whole quorum. Because of the conflict between 48xx and 58xx, there mustn't have been 3 agreeing results at the time the attempt was made to send out the 7th copy. Until things are sorted regarding validation, perhaps it should be 3,9,6 rather than 3,6,6 to prevent this problem.

EDIT: If you think about it, it makes sense to have the 'total' equal to the sum of 'errors' and 'success' so that all bases are covered.


Yeah, I bumped it up to 3,9,6 for the time being.

Furlozza
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38090 - Posted: 5 Apr 2010, 14:18:41 UTC - in response to Message 38079.  

Dumb Question time:

1) as multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met?

2) As some of the errors/failures to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there any way to send wus to the same version of cards?

3) What is the precise reason for the failure of the above cards to actually agree? Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx?

4) I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded).

Dumb statement: 58xx cards validated regardless of GPU or Memory frequencies or App used or CAL Runtime: 1.4.5xx have been predominant.

The above may not be useful, but am looking at it from the what has been validated, not the what hasn't or is in pending; the majority of pending giving rise to Q 1). As for pending, it appears to be mainly the clash between cards (nVidia/ATI and ATI series) and CPU produced results and the Validator trying to get two or three matching results.

And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number".

As I said above, dumb questions......

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38091 - Posted: 5 Apr 2010, 14:27:35 UTC - in response to Message 38089.  

I've updated the validator so it will add the fitness that the results reported to the server at the end of the standard output.

This way you guys can check the fitness your results are reporting (and compare them to other results).

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38092 - Posted: 5 Apr 2010, 14:28:17 UTC - in response to Message 38073.  

What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?


That's just getting ready for the new version of the application. The new application will take the parameters it's using from the command line (that way we don't have to generate a new parameter file for each workunit).

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38093 - Posted: 5 Apr 2010, 14:36:26 UTC - in response to Message 38090.  

Dumb Question time:

1) as multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met?


One workunit is being used, and multiple results are generated for each workunit until it reaches a quorum. Right now the way it works is one copy of the workunit is initially issued, when it comes back, we check to see if it needs validation. If it does we send out 2 more copies (to try and get a quorum of 3). If those 2 come back and there is no quorum, we send out a 4th copy, then a 5th, etc until we have a quorum of 3.
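
As a rough illustration of the issue order just described (one copy, then two more, then one at a time), here is a minimal sketch. The function name and the `max_total` cap are assumptions added for the example; the real scheduler also depends on which results have come back and whether they agree.

```python
def issue_schedule(max_total):
    """Yield the cumulative number of copies issued after each round,
    assuming no quorum is ever reached and every copy is returned."""
    total = 1                      # one copy goes out first
    yield total
    if total + 2 <= max_total:     # first result back: send 2 more, aiming for 3
        total += 2
        yield total
    while total + 1 <= max_total:  # still no quorum: add one copy at a time
        total += 1
        yield total


print(list(issue_schedule(6)))
print(list(issue_schedule(9)))
```

With a total limit of 6 the sketch stops after the sixth copy, which is the point where the workunits discussed earlier in this thread errored out.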



2) As some of the errors/failures to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there any way to send wus to the same version of cards?


This doesn't solve the problem because we need the accuracy specified. We need to find out which cards are sending back incorrect results and update the applications accordingly.


3) What is the precise reason for the failure of the above cards to actually agree? Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx?


That's what we're trying to figure out. I just updated the validator to append information about the fitness returned from your results to the std_err field -- so you can see it when you look at a task. Hopefully this will help.


4) I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded).


Not quite sure what this issue is... seems maybe BOINC client related?


And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number".


This is because we're still not validating EVERY workunit. We validate every workunit that will improve the searches we're running (if a better fitness is found than what we currently know about). If the fitness isn't going to improve the search, we're still validating those workunits 50% of the time. This is so we can get these accuracy issues worked out and so people can't scam the server for credit using single precision GPU applications or other things.
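
The selection rule described here could look something like the sketch below. It is only an illustration: the function name and parameters are hypothetical, and treating a higher fitness as "better" is an assumption, since the thread does not say which direction counts as an improvement.

```python
import random


def needs_validation(fitness, best_known, sample_rate=0.5, rng=random.random):
    """Decide whether a returned result should go through quorum validation."""
    if fitness > best_known:     # assumption: larger fitness improves the search
        return True              # improving results are always validated
    return rng() < sample_rate   # the rest are spot-checked about half the time


# A result that beats the current best is always checked:
print(needs_validation(-3.0, best_known=-3.1))
```

This also explains why some results are granted credit without any wingman: they did not improve the search and fell on the unchecked side of the coin flip.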

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38095 - Posted: 5 Apr 2010, 15:00:35 UTC - in response to Message 38077.  


Which is exactly what it should say seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from?


Yeah I was passed out. But I found part of the problem. The validator was actually trying to get a quorum of 4. I was looking for matches == getMinQuorum(), and since I wasn't comparing a workunit to itself, what I actually needed was matches == getMinQuorum()-1.

Debugging at 4am is bad news, lol. Was pretty obvious with a good night's sleep.
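
For anyone curious, the off-by-one can be reproduced with a small sketch. This is not the project's validator: apart from the idea behind getMinQuorum(), every name here is made up for the example, and it uses >= where the post above quotes ==.

```python
def count_matches(results, index, tol):
    """Count the OTHER results that agree with results[index] within tol."""
    ref = results[index]
    return sum(
        1
        for j, other in enumerate(results)
        if j != index and abs(ref - other) <= tol
    )


def has_quorum(results, tol, min_quorum=3, subtract_self=True):
    """True if some result, together with its matches, forms a quorum.

    subtract_self=False mimics the bug: a result is never compared to
    itself, so asking for min_quorum matches really asks for min_quorum + 1
    agreeing results.
    """
    needed = min_quorum - 1 if subtract_self else min_quorum
    return any(count_matches(results, i, tol) >= needed for i in range(len(results)))


three_agreeing = [-3.1695, -3.1695, -3.1695]
print(has_quorum(three_agreeing, 1e-11, subtract_self=False))  # buggy check
print(has_quorum(three_agreeing, 1e-11))                       # corrected check
```

With three identical results the buggy check fails and the corrected one passes, which matches the "quorum of 4" behaviour described above.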

Furlozza
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38096 - Posted: 5 Apr 2010, 15:18:10 UTC - in response to Message 38095.  

Well, here's the first "magic number" failure I have in my records:

Job No 90254161 - completed - can't Validate 3,6,6

GPU series   CAL runtime   Fitness               App
47xx/47xx    1.4.415       -3.169546804554361    standard app
58xx         1.4.533       -3.169546898027065    standard app
58xx         1.4.553       -3.169546898027031    standard app
47xx/48xx    1.4.515       -3.169546804554371    Anon app
58xx         1.4.515       -3.169546898027031    Standard app
47xx/48xx    1.4.553       -3.169546804554361    standard app

(hoping it makes sense after posting)

Basically, of the six issues, there were two agreements in both 47xx/48xx and also 58xx. In both cases, the drivers made no difference, as both 'valid' and 'invalid' results within the same series missed out for some other reason whilst using the same CAL runtime as one of the 'valid' wus.

Or to put it another way..... it may not be directly related to drivers, but the architecture within the actual cards.
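
The six fitness values listed above can be compared directly. The sketch below is only a quick check in Python, not the validator's method; the helper name and the tolerances tried are assumptions chosen to bracket the figures mentioned in this thread.

```python
# Fitness values copied from the list above (job 90254161).
results = [
    ("47xx/47xx", -3.169546804554361),
    ("58xx",      -3.169546898027065),
    ("58xx",      -3.169546898027031),
    ("47xx/48xx", -3.169546804554371),
    ("58xx",      -3.169546898027031),
    ("47xx/48xx", -3.169546804554361),
]


def largest_agreeing_group(values, tol):
    """Size of the largest set of results within tol of one reference result
    (the reference itself is counted)."""
    return max(
        sum(1 for other in values if abs(ref - other) <= tol)
        for ref in values
    )


values = [fitness for _, fitness in results]
for tol in (1e-11, 1e-9, 1e-7):
    print(tol, largest_agreeing_group(values, tol))
```

Within each card series the values differ by a few parts in 1e-14, while the two series differ from each other by roughly 9.3e-8. So at tolerances of 1e-11 or 1e-9 the results fall into two groups of three, and all six only agree once the tolerance is loosened to about 1e-7.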
You should see the world from my eyes.

magyarficko
Joined: 22 Jan 09
Posts: 35
Credit: 46,731,190
RAC: 0
Message 38098 - Posted: 5 Apr 2010, 15:28:54 UTC - in response to Message 38053.  
Last modified: 5 Apr 2010, 15:30:14 UTC

82% valid tasks, is not going to work in the long run, obviously. But I'll hang around for the shakedown.


Well I'm out of here for the time being as 82% is not satisfactory for me! I realize that MilkyWay is still classed (as far as I know) as an Alpha project, but IMHO it is mature enough that they shouldn't be running tests in a production environment - at least some of these bugs (if not the majority of them) SHOULD have been caught in testing before releasing this new version validator into the wild. See y'all later.


Right now it looks like the problem isn't the validator but the (optimized?) GPU applications.


I'm not using an optimized app, I'm using what is given to me by the project. I have NEVER before had invalid results, after the "upgrade" I was getting many.

Brian Silvers
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 38100 - Posted: 5 Apr 2010, 16:18:46 UTC
Last modified: 5 Apr 2010, 16:19:58 UTC

Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...

name de_s11_3s_free_6_1544383_1270347650
application MilkyWay@Home
created 4 Apr 2010 2:20:50 UTC
minimum quorum 3
initial replication 4
max # of error/total/success tasks 3, 6, 1
errors Too many success results

Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
95656877 | 26452 | 4 Apr 2010 2:22:25 UTC | 5 Apr 2010 5:49:34 UTC | Completed, can't validate | 0.00 | 16,347.27 | 70.47 | 0.00 | Anonymous platform
96480761 | 26133 | 5 Apr 2010 5:50:54 UTC | 5 Apr 2010 7:17:50 UTC | Completed, can't validate | 216.20 | 212.47 | 1.11 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
96480762 | 141414 | 5 Apr 2010 5:50:38 UTC | 5 Apr 2010 6:27:01 UTC | Completed, can't validate | 89.52 | 87.25 | 0.61 | 0.00 | MilkyWay@Home v0.21 (ati13ati)

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38103 - Posted: 5 Apr 2010, 16:27:53 UTC - in response to Message 38100.  

Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...

name de_s11_3s_free_6_1544383_1270347650
application MilkyWay@Home
created 4 Apr 2010 2:20:50 UTC
minimum quorum 3
initial replication 4
max # of error/total/success tasks 3, 6, 1
errors Too many success results

Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
95656877 | 26452 | 4 Apr 2010 2:22:25 UTC | 5 Apr 2010 5:49:34 UTC | Completed, can't validate | 0.00 | 16,347.27 | 70.47 | 0.00 | Anonymous platform
96480761 | 26133 | 5 Apr 2010 5:50:54 UTC | 5 Apr 2010 7:17:50 UTC | Completed, can't validate | 216.20 | 212.47 | 1.11 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
96480762 | 141414 | 5 Apr 2010 5:50:38 UTC | 5 Apr 2010 6:27:01 UTC | Completed, can't validate | 89.52 | 87.25 | 0.61 | 0.00 | MilkyWay@Home v0.21 (ati13ati)



This was one of the older WUs sent out with bad values for max error/total/success:


max # of error/total/success tasks 3, 6, 1


That issue shouldn't happen anymore. I've also loosened up the validation a little bit, which may help keep some workunits from being flagged invalid. If we can't figure out a good solution to the 48xx vs 58xx ATI GPU issue, I'll probably lower the validation requirement to fitness within 10e-10 (or 10e-9) to see if that helps.

The new application will be 10e-11 however.

Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 2046
Credit: 26,480
RAC: 0
Message 38104 - Posted: 5 Apr 2010, 16:29:48 UTC - in response to Message 38103.  

On another note, does anyone know if the 48xx or the 58xx ATI GPU is the one validating correctly vs the stock application?

©2019 Astroinformatics Group