Welcome to MilkyWay@home

Dissecting the new validator.

The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 38081 - Posted: 5 Apr 2010, 11:22:46 UTC

Quoting Travis from the front page: "Validation will now work as follows: Every result that could improve one of our searches will be validated (with a min quorum of 3 -- and the accuracy of the fitness reported must be within 10e-11 of the quorum results, this means that single precision GPU results will be flagged invalid). Results that won't improve a search will be validated 50% of the time until the error rates of hosts stabilizes in the database (this will probably take a couple weeks). Afterwards, for the results that don't improve our searches, we'll be using BOINC's adaptive validation based on hosts error rates (which will be between 10% and 100% depending on how many errors the host typically has)."
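To make the quorum rule concrete, here is a minimal sketch of how a fitness comparison with that tolerance could behave. It is not the project's validator: the function names, the sample fitness value and the "find any group of 3 that agree" logic are assumptions for illustration. Only the min quorum of 3 and the 10e-11 tolerance come from the announcement.

```python
import struct

TOLERANCE = 10e-11   # figure quoted in the announcement (same as 1e-10)
MIN_QUORUM = 3       # min quorum quoted in the announcement


def to_single(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def validate(fitnesses, min_quorum=MIN_QUORUM, tol=TOLERANCE):
    """Return one bool per result (True = agrees with the quorum),
    or None if no group of min_quorum agreeing results exists yet.
    Hypothetical logic, not the real validator."""
    for ref in fitnesses:
        agree = [abs(f - ref) <= tol for f in fitnesses]
        if sum(agree) >= min_quorum:
            return agree
    return None


# Made-up fitness value; three double-precision hosts and one host
# whose result was computed (here: rounded) in single precision.
double_result = -3.123456789012345
results = [double_result, double_result, to_single(double_result), double_result]
print(validate(results))  # [True, True, False, True]
```

Single precision carries only about 7 significant decimal digits, so a single-precision result is off by far more than the tolerance and falls outside the quorum, which matches the "flagged invalid" remark in the quote.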

Point 1. "Every result that could improve one of our searches will be validated (with a min quorum of 3 -- and the accuracy of the fitness reported must be within 10e-11 of the quorum results, this means that single precision GPU results will be flagged invalid)."

Excellent. Get rid of the cheating scum!

Point 2. "Results that won't improve a search will be validated 50% of the time until the error rates of hosts stabilizes in the database (this will probably take a couple weeks)."

Does this mean that work completed with a valid app on valid hardware will be denied credit 50% of the time through no fault of its own, solely because the parameters assigned to it didn't 'improve' the genetic mutation towards a better outcome? (Hopefully I've got the gist of how this all works basically right.) Surely this is totally unfair!

Point 3. "Afterwards, for the results that don't improve our searches, we'll be using BOINC's adaptive validation based on hosts error rates (which will be between 10% and 100% depending on how many errors the host typically has)."

OK, this is slightly better than the above, but it's still unfair.

Do you really want people to drop MW like a hot potato once their credit starts dropping like a stone?

I can’t agree more with Point 1, but Points 2 and 3 are over the top!

David Glogau*
Joined: 12 Aug 09
Posts: 172
Credit: 645,240,165
RAC: 0
Message 38084 - Posted: 5 Apr 2010, 11:38:05 UTC

Great post Gas Giant.

I am running with an error rate of about 20%. If it is only for a couple of weeks I can live with it.

If it runs on longer, I will also live with it, but many others won't. For me we are talking about a loss of half a million credits per day, enough to be noticed!

It seems only fair to grant credit for all valid crunching. After all, we would not be setting a precedent here; this is what CPDN does, whether the WU is used or not.

Regards

Chris S
Joined: 20 Sep 08
Posts: 1391
Credit: 203,563,566
RAC: 0
Message 38085 - Posted: 5 Apr 2010, 12:41:41 UTC

All of my rigs have suddenly got these error messages ..... Anyone know why?


<core_client_version>6.10.18</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home ATI GPU application version 0.20b (Win32, SSE2, CAL 1.3) by Gipsel
ignoring unknown input argument in app_info.xml: -np
ignoring unknown input argument in app_info.xml: 20
ignoring unknown input argument in app_info.xml: -p
ignoring unknown input argument in app_info.xml: 0.8204714877234080000000000
ignoring unknown input argument in app_info.xml: 6.2644417249787670000000000
ignoring unknown input argument in app_info.xml: -1.1275059800827940000000000
ignoring unknown input argument in app_info.xml: 171.3489450424705200000000000
ignoring unknown input argument in app_info.xml: 25.5968204295114100000000000
ignoring unknown input argument in app_info.xml: 0.4638059718261400000000000
ignoring unknown input argument in app_info.xml: 6.2831853071795860000000000
ignoring unknown input argument in app_info.xml: 6.7755620233591070000000000
ignoring unknown input argument in app_info.xml: -7.1501757118781240000000000
ignoring unknown input argument in app_info.xml: 179.3048252880004400000000000
ignoring unknown input argument in app_info.xml: 38.0862970666705200000000000
ignoring unknown input argument in app_info.xml: 2.5868948022107680000000000
ignoring unknown input argument in app_info.xml: 4.7261433922311420000000000
ignoring unknown input argument in app_info.xml: 7.0656503094651130000000000
ignoring unknown input argument in app_info.xml: -13.5765285909904500000000000
ignoring unknown input argument in app_info.xml: 211.2459262577329200000000000
ignoring unknown input argument in app_info.xml: 15.0673173424509500000000000
ignoring unknown input argument in app_info.xml: 0.0000000000000000000000000
ignoring unknown input argument in app_info.xml: 6.2592519338112480000000000
ignoring unknown input argument in app_info.xml: 12.94646831325726500000

Don't drink water, that's the stuff that rusts pipes

UBT - Ben
Joined: 8 Mar 08
Posts: 17
Credit: 4,411,459
RAC: 0
Message 38086 - Posted: 5 Apr 2010, 12:43:45 UTC

I'm having that problem also.. :S

Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 38088 - Posted: 5 Apr 2010, 13:53:01 UTC - in response to Message 38081.  
Last modified: 5 Apr 2010, 13:57:19 UTC

Point 2. "Results that won't improve a search will be validated 50% of the time until the error rates of hosts stabilizes in the database (this will probably take a couple weeks)."

Does this mean that work completed with a valid app on valid hardware will be denied credit 50% of the time through no fault of its own, solely because the parameters assigned to it didn't 'improve' the genetic mutation towards a better outcome? (Hopefully I've got the gist of how this all works basically right.) Surely this is totally unfair!

I think any successfully completed WU will get credit as before. It may take a bit longer before getting credit because it has to be checked against other results, but it will get its credits eventually. Think of it this way - currently, WUs that don't improve their search aren't used at all - but they still get credits. So why would the extra validation requirement change this? (if they could predict in advance which parameters would yield better results they wouldn't need WUs in the first place, so it's not like the work is wasted)

Point 3. "Afterwards, for the results that don't improve our searches, we'll be using BOINC's adaptive validation based on hosts error rates (which will be between 10% and 100% depending on how many errors the host typically has)."

OK, this is slightly better than the above, but it's still unfair.

Do you really want people to drop MW like a hot potato once their credit starts dropping like a stone?

Again, I don't think this will affect which WUs get credits - it merely means that if your host has a large error rate, its results will be checked against others' more often. Now, if you were getting credits for invalid results before due to the lax validation, obviously you won't be getting credits for those anymore - but that's the whole point of the new validator.
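As a rough illustration of that reading, the sketch below separates "how often a result is cross-checked" from "whether a valid result earns credit". The mapping from error rate to check probability is an assumption made up for this example (the error rate simply clamped to the 10%..100% range from Travis's post). It is not BOINC's actual adaptive validation algorithm, and the function names are hypothetical.

```python
import random

MIN_RATE = 0.10  # lower bound quoted by Travis
MAX_RATE = 1.00  # upper bound quoted by Travis


def check_probability(host_error_rate):
    """Hypothetical mapping: clamp the host's error rate into [10%, 100%]."""
    return max(MIN_RATE, min(MAX_RATE, host_error_rate))


def needs_cross_check(improves_search, host_error_rate, rng=random):
    """Decide whether a result is compared against other hosts' results.
    Results that could improve a search are always checked; the rest
    are checked with a probability tied to the host's error history."""
    if improves_search:
        return True
    return rng.random() < check_probability(host_error_rate)


# A reliable host is checked rarely, an error-prone one more often.
print(check_probability(0.0))   # 0.1
print(check_probability(0.2))   # 0.2
print(check_probability(2.0))   # 1.0
```

Under this reading, the check rate only decides how often a result is compared with others. A correct result still validates whenever it is checked.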

Of course, I don't speak for Travis; but I don't think your interpretation of his post makes sense.

Werkstatt
Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 38094 - Posted: 5 Apr 2010, 14:48:11 UTC - in response to Message 38088.  

Many of my WUs have that problem too.
I'm running MW 0.20b and 0.22 from the optimized apps site. Is something wrong with these apps? Which apps work perfectly?

CTAPbIi
Joined: 4 Jan 10
Posts: 86
Credit: 51,753,924
RAC: 0
Message 38097 - Posted: 5 Apr 2010, 15:21:41 UTC - in response to Message 38094.  

Is something wrong with these apps?

I think something is wrong with the validator.

Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 38102 - Posted: 5 Apr 2010, 16:24:36 UTC
Last modified: 5 Apr 2010, 16:25:15 UTC

The validator is still very much in flux. Things should get better in the next few days as more of the issues are tracked down and as new application versions are readied for release. Remember that Travis has to sleep too - but at least some of the issues reported last night have now been fixed.

Additionally, there appears to be a problem with the ATI GPU apps which is holding things up - at this point it's not clear whether the issue can be worked around on the application level or if it is due to driver/SDK bugs that ATI will have to fix; let's hope for the former.

If you're worried about losing crunching time, you may want to set Milkyway to No New Tasks until things stabilize. If you want to help out with testing, just keep crunching as normal and report any oddities you see (if they haven't been reported already).

uwe
Joined: 6 Nov 09
Posts: 2
Credit: 1,500,164
RAC: 0
Message 38110 - Posted: 5 Apr 2010, 17:31:10 UTC - in response to Message 38085.  

All of my rigs have suddenly got these error messages ..... Anyone know why?


<core_client_version>6.10.18</core_client_version>
<![CDATA[
<stderr_txt>
Running Milkyway@home ATI GPU application version 0.20b (Win32, SSE2, CAL 1.3) by Gipsel
ignoring unknown input argument in app_info.xml: -np
...
ignoring unknown input argument in app_info.xml: 12.94646831325726500000


In the news thread "testing new validator", Travis answered this question:
That's just getting ready for the new version of the application. The new application will take the parameters it's using from the command line (that way we don't have to generate a new parameter file for each workunit).
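For anyone curious what "taking the parameters from the command line" might look like, here is a small sketch. The flag names -np and -p and the sample values are taken from the stderr output quoted above. Everything else (the function name, the assumption that -np is the count of values following -p, the use of Python's argparse) is guesswork for illustration and not the project's code.

```python
import argparse


def parse_search_args(argv):
    """Parse a hypothetical '-np <count> -p <v1> <v2> ...' argument list."""
    parser = argparse.ArgumentParser(description="hypothetical parameter parser")
    parser.add_argument("-np", type=int, required=True,
                        help="number of search parameters (assumed meaning)")
    parser.add_argument("-p", type=float, nargs="+", required=True,
                        help="the search parameter values")
    args = parser.parse_args(argv)
    # Sanity check: the count must match the number of values supplied.
    if len(args.p) != args.np:
        raise ValueError(f"expected {args.np} parameters, got {len(args.p)}")
    return args.np, args.p


# First three values from the log above, with -np shortened to 3 for brevity.
count, params = parse_search_args(
    ["-np", "3", "-p",
     "0.8204714877234080000000000",
     "6.2644417249787670000000000",
     "-1.1275059800827940000000000"])
print(count, params)
```

An older application that does not know these flags would skip them, which would be consistent with the "ignoring unknown input argument" lines in the log.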


Werkstatt
Joined: 19 Feb 08
Posts: 350
Credit: 141,284,369
RAC: 0
Message 38112 - Posted: 5 Apr 2010, 18:09:10 UTC - in response to Message 38102.  

Additionally, there appears to be a problem with the ATI GPU apps which is holding things up - at this point it's not clear whether the issue can be worked around on the application level or if it is due to driver/SDK bugs that ATI will have to fix; let's hope for the former.

If you're worried about losing crunching time, you may want to set Milkyway to No New Tasks until things stabilize.


Travis wrote somewhere that it looks like 48xx and 58xx cards produce different results. I use both, and WUs from both cards produce results that are not granted credit. Both cards are in the same machine, so CAL is the same. If it is a driver problem, it can be solved by an update. If it is within CAL, we might have a problem for the next few weeks or so.
I've seen some posts about apps running in single precision. So my question is: have apps 0.20b and 0.22 been tested and confirmed correct?
It is nice to have credit, but it is much nicer to have a deeper understanding of our Milky Way. The main goal is to have a working computer grid, and sometimes this means trial and error. No, I will not stop running WUs.

Regards,
Alexander

SkyeHunter
Joined: 6 Mar 09
Posts: 41
Credit: 38,856,291
RAC: 0
Message 38115 - Posted: 5 Apr 2010, 18:56:15 UTC

In order to be as 'compliant' as possible, I migrated my 2 systems with a 4870 from the optimized application (v22) to the stock application (v21). That resulted in 160 invalid, 300 valid and 100 pending results. Besides some practical issues (the quadXPC system is highly unstable since I returned airflow to normal), credits are about a third of what they used to be...

I hope we may expect the situation to return to 'normal' sometime this week ...

BarryAZ
Joined: 1 Sep 08
Posts: 520
Credit: 302,524,931
RAC: 15
Message 38118 - Posted: 5 Apr 2010, 19:03:55 UTC

As I see these 'failed validation' results popping up on all of my workstations (4850 GPU and CPU MW clients alike), at this point I plan to mark ALL of my MW workstations as 'no new work' and let the current queues flush out.

Hopefully the project will post a news update when it needs processed work again. (The new validation scheme suggests they have more data than they need or want, and that by imposing it they seek to migrate folks out of the project.)


banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 524,164
RAC: 0
Message 38122 - Posted: 5 Apr 2010, 19:23:53 UTC

I'm thinking it's validating whatever result comes in first and then rejecting the later ones.

In this example, I finished first and then the others were sent out: a 4800 came in second, followed by a 5800 and another 4800. The 5800 was rejected.
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90067556
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Blurf
Volunteer moderator
Project administrator
Joined: 13 Mar 08
Posts: 804
Credit: 26,380,161
RAC: 0
Message 38167 - Posted: 6 Apr 2010, 3:05:25 UTC

Quorum Down to 2

The database is having a bit of trouble keeping up with all the new results due to a quorum of 3, so for the time being I'm dropping it to a quorum of 2.
On another note, we should have source code for the new application available tomorrow.
6 Apr 2010 2:11:10 UTC


Emanuel
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 0
Message 38180 - Posted: 6 Apr 2010, 10:58:41 UTC

It also looks like the problem with HD5800 series cards has been tracked down. It's not a problem with the cards or the SDK, but a poorly publicized change in recommended programming practices for how to properly load floating point values (hope I'm saying that right) that only applies to the newer cards. It'll hopefully be fixed soon (I'm also hoping the CUDA applications will get the same accuracy from project-side, but they're at least well within the required range).
