testing new validator

Author Message
Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38016 - Posted: 4 Apr 2010 | 23:53:12 UTC
Last modified: 5 Apr 2010 | 1:47:07 UTC

I've started up the new validator, so please be patient as I get all the kinks worked out over the next few days. Validation will now work as follows: Every result that could improve one of our searches will be validated (with a min quorum of 3 -- and the accuracy of the fitness reported must be within 10e-11 of the quorum results, which means that single precision GPU results will be flagged invalid). Results that won't improve a search will be validated 50% of the time until the error rates of hosts stabilize in the database (this will probably take a couple of weeks). Afterwards, for the results that don't improve our searches, we'll be using BOINC's adaptive validation based on hosts' error rates (the validation rate will be between 10% and 100% depending on how many errors the host typically has).
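
The rules above can be sketched roughly as follows. This is only an illustration, not the project's validator: the function names are hypothetical, and the mapping from a host's error rate to a validation probability (a simple clamp into the 10%..100% range) is an assumption, since the post does not say how BOINC's adaptive scheme computes it.

```python
import random

MIN_QUORUM = 3               # minimum agreeing results, per the post
FITNESS_TOLERANCE = 10e-11   # as written in the post (numerically 1e-10)
BOOTSTRAP_RATE = 0.5         # 50% sampling while host error rates stabilize


def validation_probability(improves_search, host_error_rate, bootstrapping):
    """Chance that a result is sent through quorum validation."""
    if improves_search:
        return 1.0  # results that could improve a search are always validated
    if bootstrapping:
        return BOOTSTRAP_RATE
    # Assumption: clamp the host's error rate into the 10%..100% range.
    return min(1.0, max(0.10, host_error_rate))


def should_validate(improves_search, host_error_rate, bootstrapping, rng=random):
    """Draw a random number and decide whether this result gets validated."""
    p = validation_probability(improves_search, host_error_rate, bootstrapping)
    return rng.random() < p
```

Under these assumptions a reliable host (say a 2% error rate) would have about one in ten of its non-improving results checked once the bootstrap period is over, while an unreliable host would be checked far more often.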

On a side note, we'll also be updating the applications this week. We've made new background models for the Milky Way that we want to test. Additionally, there are some server-related performance improvements that should help the server response time. I'm hoping to have the source code available by Tuesday so people can compile their own applications, and then make the full swap over to the new application sometime early next week.
____________

Donnie
Avatar
Send message
Joined: 19 Jul 08
Posts: 67
Credit: 272,086,462
RAC: 56
Message 38019 - Posted: 5 Apr 2010 | 0:42:20 UTC

Are these only test units, just to verify the validator? I've had 2 issues with them: first "Completed, validation inconclusive", and then, when the consensus does come in, "Completed, can't validate" due to the settings of max # of error/total/success tasks 1, 6, 1 (errors: Too many success results).

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38020 - Posted: 5 Apr 2010 | 0:45:43 UTC - in response to Message 38019.

Are these only test units just to verify the validator? I've had 2 issues with them - Completed, validation inconclusive and when the consensus does come in, I get - Completed, can't validate due to the settings of - max # of error/total/success tasks 1, 6, 1 errors Too many success results


Ahh, that's the issue. I was wondering why WUs weren't coming back for additional validation. This should be fixed with new WUs.
____________

Donnie
Avatar
Send message
Joined: 19 Jul 08
Posts: 67
Credit: 272,086,462
RAC: 56
Message 38021 - Posted: 5 Apr 2010 | 0:54:02 UTC - in response to Message 38020.

Are these only test units just to verify the validator? I've had 2 issues with them - Completed, validation inconclusive and when the consensus does come in, I get - Completed, can't validate due to the settings of - max # of error/total/success tasks 1, 6, 1 errors Too many success results


Ahh, that's the issue. I was wondering why WUs weren't coming back for additional validation. This should be fixed with new WUs.


Thanks, I'll give 'em a try after I quit banging on Slicker's server.

zombie67 [MM]
Avatar
Send message
Joined: 29 Aug 07
Posts: 112
Credit: 205,877,087
RAC: 18,862
Message 38022 - Posted: 5 Apr 2010 | 1:53:34 UTC - in response to Message 38016.
Last modified: 5 Apr 2010 | 1:54:54 UTC

Validation will now work as follows: Every result that could improve one of our searches will be validated (with a min quorum of 3[...]


Why a quorum of 3? Why not two, and if they don't match closely enough, a third task is sent? It seems like a waste of resources.
____________

Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 38023 - Posted: 5 Apr 2010 | 1:57:50 UTC

So what about the WUs that say "Checked, but no consensus yet" and also come up as pending? Will they be granted credit eventually?
____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38024 - Posted: 5 Apr 2010 | 1:58:25 UTC - in response to Message 38022.

Validation will now work as follows: Every result that could improve one of our searches will be validated (with a min quorum of 3[...]


Why a quorum of 3? Why not two, and if they don't match closely enough, a third task is sent? It seems like a waste of resources.


Depending on how the validation goes I'll probably bring it down to 2. But right now I want to flush out all the clients returning bad results, which means a higher quorum -- so there's less chance of two bad clients returning results for the same WU and getting credit.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38025 - Posted: 5 Apr 2010 | 1:59:26 UTC - in response to Message 38023.

So what about the wus that say "Checked, but no consensus yet" also come up as pending. Will they be granted credit eventually?


They'll be granted credit when there's a quorum of 3. Checked but no consensus yet means that the result was looked at but there weren't 2 other similar results to validate it.
____________

zombie67 [MM]
Avatar
Send message
Joined: 29 Aug 07
Posts: 112
Credit: 205,877,087
RAC: 18,862
Message 38027 - Posted: 5 Apr 2010 | 2:31:48 UTC
Last modified: 5 Apr 2010 | 2:35:42 UTC

Most of mine (edit: many, not most) are ending up with "Completed, validation inconclusive". 4 results, with a max of 4, and then no one gets any credit. Something doesn't seem right with this scheme.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=82607813

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=87913982
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38028 - Posted: 5 Apr 2010 | 2:38:34 UTC - in response to Message 38027.

Most of mine (edit: many, not most) are ending up with "Completed, validation inconclusive". 4 results, with a max of 4, and then no one gets any credit. Something doesn't seem right with this scheme.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=82607813

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=87913982


The max results should be 6. These must be from some of the older workunits (from the old validator), I'll update the database so hopefully they'll be fixed.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38029 - Posted: 5 Apr 2010 | 2:45:41 UTC - in response to Message 38028.

There also seems to be a set of applications out there which are giving close, but not close enough results -- accurate to about 10e-8, when we really need 10e-11 or better. I'm not sure if this is due to overclocking, single precision GPUs, or maybe older optimized versions of the application which need to be updated.

Not quite sure what to do about this, as it's going to throw off validation of some results. In the meantime I'm hoping people running the offending applications will update to more recent versions.

As a better fix I'll be putting out updated application code this week and we're going to release it as a different application. So people are going to have to update to use that new application (either the stock version or a new optimized application), which should fix these validation issues.
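
To get a feel for why single precision is one of the suspects, here is a small sketch that rounds a double precision value to single precision and measures the error. The fitness value used is made up for illustration, and this only demonstrates one of the possible causes listed above; it says nothing about what the affected applications actually do.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def single_precision_error(x):
    """Absolute error introduced by storing x in single precision."""
    return abs(x - to_single(x))


# Hypothetical fitness value, chosen only for illustration.
example_fitness = -3.123456789012345
error = single_precision_error(example_fitness)
print(f"error after rounding to single precision: {error:.3e}")
```

For a value of this size the rounding error is on the order of 1e-8 to 1e-7, several orders of magnitude larger than the tolerance quoted in this thread, so a result that passed through single precision at any point could not be expected to validate.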
____________

zombie67 [MM]
Avatar
Send message
Joined: 29 Aug 07
Posts: 112
Credit: 205,877,087
RAC: 18,862
Message 38030 - Posted: 5 Apr 2010 | 4:37:51 UTC
Last modified: 5 Apr 2010 | 4:39:24 UTC

Okay, I've run across a few with:

minimum quorum 3
initial replication 5
max # of error/total/success tasks 1, 6, 6

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90198132
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90151888

Max # of error 1....is that really what you want? Also, why is my machine marked as invalid? It is using the stock app, with a 5870.
____________

Profile kashi
Send message
Joined: 30 Dec 07
Posts: 309
Credit: 148,432,104
RAC: 0
Message 38031 - Posted: 5 Apr 2010 | 4:55:50 UTC - in response to Message 38029.
Last modified: 5 Apr 2010 | 4:58:20 UTC

Looks to be more hardware related than application related to me. Many of the results marked invalid are using the stock application that is automatically downloaded.

Cypress class 58xx cards are validating against other Cypress if they are in the majority reported first and 48xx cards are being marked as invalid. If 48xx cards are the majority reported first then they validate against other 48xx and 58xx are marked invalid. Uncertain about NVIDIA, but I think they validate against 48xx but not 58xx.

Not sure if this is exactly correct in all details but it's something like that.

zombie67 [MM]
Avatar
Send message
Joined: 29 Aug 07
Posts: 112
Credit: 205,877,087
RAC: 18,862
Message 38033 - Posted: 5 Apr 2010 | 5:06:50 UTC
Last modified: 5 Apr 2010 | 5:18:03 UTC

Ouch. If correct, that's going to make this messy. Does HR (homogeneous redundancy) exist for GPUs?

Edit: From what I have just learned, the answer is maybe.
____________

Profile Crunch3r
Volunteer developer
Avatar
Send message
Joined: 17 Feb 08
Posts: 358
Credit: 256,958,638
RAC: 2,724
Message 38034 - Posted: 5 Apr 2010 | 5:11:26 UTC - in response to Message 38031.
Last modified: 5 Apr 2010 | 5:13:31 UTC

Looks to be more hardware related than application related to me. Many of the results marked invalid are using the stock application that is automatically downloaded.

Cypress class 58xx cards are validating against other Cypress if they are in the majority reported first and 48xx cards are being marked as invalid. If 48xx cards are the majority reported first then they validate against other 48xx and 58xx are marked invalid. Uncertain about NVIDIA, but I think they validate against 48xx but not 58xx.

Not sure if this is exactly correct in all details but it's something like that.


Now that you mention it... I think I read somewhere that the 58xx cards give incorrect results with the latest SDK.




As stated in corresponding thread, 5xxx ability to work is very questionable right now.
Bugs in ATI's OpenCL SDK implementation. They promised to fix those in new SDK release, will see...
For now 5xxx cards (both with and w/o double precision support) can be used with Brook-based ATI AP.


Source -> http://setiathome.berkeley.edu/forum_thread.php?id=59506&nowrap=true#986347

Since OpenCL is just some sort of wrapper for CAL/brook...

Seems to me that someone should do a standalone test and compare results between 48xx and 58xx again, just to make sure everything works properly.
____________

Join BOINC United now!

zombie67 [MM]
Avatar
Send message
Joined: 29 Aug 07
Posts: 112
Credit: 205,877,087
RAC: 18,862
Message 38035 - Posted: 5 Apr 2010 | 5:19:07 UTC - in response to Message 38034.

Seems to me that someone should do a standalone test and compare results between 48xx and 58xx again, just to make sure everything works properly.


And if they do produce the same results, then perhaps the validator needs to be tested as well.
____________

Profile Crunch3r
Volunteer developer
Avatar
Send message
Joined: 17 Feb 08
Posts: 358
Credit: 256,958,638
RAC: 2,724
Message 38036 - Posted: 5 Apr 2010 | 5:25:51 UTC - in response to Message 38035.

Seems to me that someone should do a standalone test and compare results between 48xx and 58xx again, just to make sure everything works properly.


And if they do produce the same results, then perhaps the validator needs to be tested as well.


I'd guess comparing a few numbers against each other shouldn't be that hard to do... But you never know ;)

____________

Join BOINC United now!

Brian Priebe
Send message
Joined: 27 Nov 09
Posts: 98
Credit: 172,335,484
RAC: 80,913
Message 38037 - Posted: 5 Apr 2010 | 5:27:08 UTC - in response to Message 38034.

As stated in corresponding thread, 5xxx ability to work is very questionable right now. Bugs in ATI's OpenCL SDK implementation. They promised to fix those in new SDK release, will see...
I recall GPUGRID was saying that ATI OpenCL was completely unusable. Kept locking up the machine at random. Also major problems with 4xxx performance that rendered them useless for any purpose.

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38039 - Posted: 5 Apr 2010 | 5:58:02 UTC - in response to Message 38031.
Last modified: 5 Apr 2010 | 6:09:14 UTC

Looks to be more hardware related than application related to me. Many of the results marked invalid are using the stock application that is automatically downloaded.

Cypress class 58xx cards are validating against other Cypress if they are in the majority reported first and 48xx cards are being marked as invalid. If 48xx cards are the majority reported first then they validate against other 48xx and 58xx are marked invalid. Uncertain about NVIDIA, but I think they validate against 48xx but not 58xx.

Not sure if this is exactly correct in all details but it's something like that.

My own observations seem to tie in pretty much with the above.

I've checked a number of my results (48xx series - stock app) and every time so far that I'm teamed up with non-58xx GPUs or even CPUs, the results are valid. If there are three 58xx GPUs, my result is always invalid.

I've not yet seen a quorum where both 48xx and 58xx GPUs validate against each other. It does take time to check so I haven't looked at enough quorums yet to be absolutely sure.

Here's a quorum that is a bit strange. There are two 48xx results that validate against each other and there are three 58xx results that have been declared invalid. These three did come in last but how did the two manage to trump them when there are supposed to be three for a quorum?

Also, the use of 1,6,6 for the error/total/success numbers is a bit strange. If the min quorum is 3 then the max errors should really be 3 also since you could still get 3 successful results and form a quorum. By leaving the errors at 1, a second error will immediately junk an otherwise potentially successful quorum.

EDIT: Does anyone know if this is the bit of the returned data that is used for validation purposes?

probability calculation (stars)
Calculated about 3.34818e+009 floatingpoint ops on FPU.


If not, what exactly is used?
____________
Cheers,
Gary.

Profile Furlozza
Avatar
Send message
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38040 - Posted: 5 Apr 2010 | 6:08:04 UTC - in response to Message 38039.

HD5870 running ati13ati app. factory oc (875, 1250).

Getting a 1 invalid/1 pending/1 valid split at the moment (roughly).

Also noticing the 48xx/cpu relationship as well.

Is the validator looking at time taken?

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38041 - Posted: 5 Apr 2010 | 6:11:26 UTC - in response to Message 38037.

As stated in corresponding thread, 5xxx ability to work is very questionable right now. Bugs in ATI's OpenCL SDK implementation. They promised to fix those in new SDK release, will see...
I recall GPUGRID was saying that ATI OpenCL was completely unusable. Kept locking up the machine at random. Also major problems with 4xxx performance that rendered them useless for any purpose.


We have an OpenCL version of the MW@Home GPU application... and it's about 10x slower on both NVIDIA and ATI cards. OpenCL still needs a lot of work it seems...

If someone with both cards could do some comparison the numbers would be very helpful.

When I release the code for the new application I'll have some real-sized workunit examples and the output that will be required (it will have to be within at least 10e-11). Hopefully this will help us figure out the problem.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38042 - Posted: 5 Apr 2010 | 6:13:21 UTC - in response to Message 38039.


Also, the use of 1,6,6 for the error/total/success numbers is a bit strange. If the min quorum is 3 then the max errors should really be 3 also since you could still get 3 successful results and form a quorum. By leaving the errors at 1, a second error will immediately junk an otherwise potentially successful quorum.


The 1 max error is because our application really shouldn't error out. Chances are if there's an error it was our fault (ie, a badly generated or specified workunit), and we don't want to send out more bad WUs.

I don't mind upping it to 3 if people would prefer that, however.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38043 - Posted: 5 Apr 2010 | 6:14:07 UTC - in response to Message 38040.


Is the validator looking at time taken?


It knows the time taken but doesn't use this for validation. I'm not quite sure how that would be helpful.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38044 - Posted: 5 Apr 2010 | 6:16:00 UTC - in response to Message 38039.


Here's a quorum that is a bit strange. There are two 48xx results that validate against each other and there are three 58xx results that have been declared invalid. These three did come in last but how did the two manage to trump them when there are supposed to be three for a quorum?


Good catch. There was a small bug in the check_set code for the validator. This shouldn't happen anymore.



EDIT: Does anyone know if this is the bit of the returned data that is used for validation purposes?

probability calculation (stars)
Calculated about 3.34818e+009 floatingpoint ops on FPU.


If not, what exactly is used?


The only thing used is the fitness value reported by the application. If the fitness returned is within 10e-11 of 2 other fitnesses for the quorum, it's valid.
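
A minimal sketch of that rule, for anyone trying to reason about their own quorums. This is not the project's check_set code: the function name is hypothetical, and it assumes the comparison is an absolute difference between fitness values, which the post does not spell out.

```python
FITNESS_TOLERANCE = 10e-11  # as written in the thread (numerically 1e-10)
MIN_QUORUM = 3


def find_valid_results(fitnesses, tolerance=FITNESS_TOLERANCE, min_quorum=MIN_QUORUM):
    """Return the indices of results that agree with at least min_quorum - 1 others."""
    valid = []
    for i, fi in enumerate(fitnesses):
        # Count the other results whose fitness lies within the tolerance.
        partners = sum(
            1
            for j, fj in enumerate(fitnesses)
            if j != i and abs(fi - fj) <= tolerance
        )
        if partners >= min_quorum - 1:
            valid.append(i)
    return valid


# Hypothetical fitness values: two results from one group, three from another.
a = -3.1
b = -3.1 + 1e-8
print(find_valid_results([a, a, b, b, b]))
```

With a rule like this, a group of only two agreeing results can never form a quorum of 3 on its own, which is why the two-versus-three case reported earlier in the thread pointed to a bug.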
____________

Microcruncher*
Avatar
Send message
Joined: 1 Jul 09
Posts: 7
Credit: 1,734,500
RAC: 0
Message 38045 - Posted: 5 Apr 2010 | 6:26:14 UTC
Last modified: 5 Apr 2010 | 6:32:13 UTC

Shouldn't it read:

Completed, waiting for validation

instead of

Completed, validation inconclusive

if one result is returned and no other results are reported?

[EDIT]: Fixed a typo...

zombie67 [MM]
Avatar
Send message
Joined: 29 Aug 07
Posts: 112
Credit: 205,877,087
RAC: 18,862
Message 38046 - Posted: 5 Apr 2010 | 6:29:15 UTC

Just counting pages at 20 tasks each, I am currently at 9 valid and 2 invalid. 82% valid tasks is not going to work in the long run, obviously. But I'll hang around for the shakedown.
____________

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38048 - Posted: 5 Apr 2010 | 6:40:55 UTC - in response to Message 38044.

If not, what exactly is used?


The only thing used is the fitness value reported by the application. If the fitness returned is within 10e-11 of 2 other fitnesses for the quorum, it's valid.

Thanks very much for the reply.

All we can see in the data returned is what's shown below. This is one of the invalids from the quorum I linked previously. Can't see any 'fitness' value in there so can you advise if it's possible to get that value from somewhere? I imagine you could trawl the slot directory and find it there for your own host before the result is uploaded but that doesn't help with finding the fitness for each of your wingmen.

Device 0: ATI Radeon HD5800 series (Cypress) 1024 MB local RAM (remote 2047 MB cached + 2047 MB uncached)
GPU core clock: 850 MHz, memory clock: 1200 MHz
1600 shader units organized in 20 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision

Starting WU on GPU 0

main integral, 640 iterations
predicted runtime per iteration is 123 ms (33.3333 ms are allowed), dividing each iteration in 4 parts
borders of the domains at 0 400 800 1200 1600
Calculated about 3.28897e+013 floatingpoint ops on GPU, 2.47165e+008 on FPU. Approximate GPU time 84.7168 seconds.

probability calculation (stars)
Calculated about 3.34818e+009 floatingpoint ops on FPU.

WU completed.
CPU time: 3.04202 seconds, GPU time: 84.7168 seconds, wall clock time: 86.535 seconds, CPU frequency: 2.87056 GHz

</stderr_txt>


____________
Cheers,
Gary.

Microcruncher*
Avatar
Send message
Joined: 1 Jul 09
Posts: 7
Credit: 1,734,500
RAC: 0
Message 38049 - Posted: 5 Apr 2010 | 6:45:53 UTC
Last modified: 5 Apr 2010 | 6:49:09 UTC

Here is another one:

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=89444466

errors: Too many success results

One 0.19 result on a CPU and two 0.20b results on a HD 47xx/48xx and on a HD 58xx lead to this weird result.

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38050 - Posted: 5 Apr 2010 | 6:59:01 UTC - in response to Message 38049.

Here is another one:

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=89444466

errors: Too many success results

Bug already fixed.

Check the third post in this thread.

____________
Cheers,
Gary.

Microcruncher*
Avatar
Send message
Joined: 1 Jul 09
Posts: 7
Credit: 1,734,500
RAC: 0
Message 38051 - Posted: 5 Apr 2010 | 7:02:43 UTC - in response to Message 38050.

Here is another one:

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=89444466

errors: Too many success results

Bug already fixed.

Check the third post in this thread.


Thank you. I think I need more coffee...

Profile magyarficko
Send message
Joined: 22 Jan 09
Posts: 35
Credit: 46,731,190
RAC: 0
Message 38052 - Posted: 5 Apr 2010 | 7:14:18 UTC - in response to Message 38046.

82% valid tasks, is not going to work in the long run, obviously. But I'll hang around for the shakedown.


Well I'm out of here for the time being as 82% is not satisfactory for me! I realize that MilkyWay is still classed (as far as I know) as an Alpha project, but IMHO it is mature enough that they shouldn't be running tests in a production environment - at least some of these bugs (if not the majority of them) SHOULD have been caught in testing before releasing this new version validator into the wild. See y'all later.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38053 - Posted: 5 Apr 2010 | 7:22:40 UTC - in response to Message 38052.
Last modified: 5 Apr 2010 | 7:25:39 UTC

82% valid tasks, is not going to work in the long run, obviously. But I'll hang around for the shakedown.


Well I'm out of here for the time being as 82% is not satisfactory for me! I realize that MilkyWay is still classed (as far as I know) as an Alpha project, but IMHO it is mature enough that they shouldn't be running tests in a production environment - at least some of these bugs (if not the majority of them) SHOULD have been caught in testing before releasing this new version validator into the wild. See y'all later.


Right now it looks like the problem isn't the validator but the (optimized?) GPU applications. I don't think it will take us too long to sort this out.

And honestly, I put the new validator out tonight and only screwed up a few workunits. I don't think that's too bad :P There are a lot of things you just can't catch until you put that kind of thing out in the wild anyway.

Like I mentioned in the previous post, I rewrote the assimilator/validator code from the ground up in Java. This is going to make debugging and testing a LOT easier (yay garbage collection, exceptions and no more segmentation faults), and the validator much more stable (no memory leaks or writes to bad areas of memory).

Oddly enough, it seems to be using significantly less CPU than the older version (which was C/C++).
____________

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38054 - Posted: 5 Apr 2010 | 7:27:41 UTC - in response to Message 38042.

The 1 max error is because our application really shouldn't error out. Chances are if there's an error it was our fault (ie, a badly generated or specified workunit), and we don't want to send out more bad WUs.

With an IR of 3, if the whole WU is bad all 3 will be bad and you'll quickly hit the 3 error results limit.

You shouldn't underestimate the ability of the average cruncher to trash the tasks even if the app itself really shouldn't error out :-).

Also, it's very frustrating to the CPU crunchers to see many hours of work down the drain just because of a second error result in a quorum before the third success result has had a chance to come in. What problem is there in sending out an extra copy or two of the task to see if you can get a quorum?

I don't mind upping it to 3 if people would prefer that, however.

Well, at least make it 2 so as to give a bit more protection to those who have invested their resources (and put a memo on your monitor bezel to "Not send out any bad WUs" :-).

____________
Cheers,
Gary.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38055 - Posted: 5 Apr 2010 | 7:32:54 UTC - in response to Message 38054.
Last modified: 5 Apr 2010 | 7:33:35 UTC

The 1 max error is because our application really shouldn't error out. Chances are if there's an error it was our fault (ie, a badly generated or specified workunit), and we don't want to send out more bad WUs.

With an IR of 3, if the whole WU is bad all 3 will be bad and you'll quickly hit the 3 error results limit.

You shouldn't underestimate the ability of the average cruncher to trash the tasks even if the app itself really shouldn't error out :-).

Also, it's very frustrating to the CPU crunchers to see many hours of work down the drain just because of a second error result in a quorum before the third success result has had a chance to come in. What problem is there in sending out an extra copy or two of the task to see if you can get a quorum?

I don't mind upping it to 3 if people would prefer that, however.

Well, at least make it 2 so as to give a bit more protection to those who have invested their resources (and put a memo on your monitor bezel to "Not send out any bad WUs" :-).


Good points. I upped the max error results to 3. This should be reflected in all the current (and new) workunits.
____________

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38057 - Posted: 5 Apr 2010 | 7:59:07 UTC
Last modified: 5 Apr 2010 | 8:06:10 UTC

Here's another example of the 'Too many success results' bug. Note that one of the victims actually invested over a day of CPU time for no reward. I guess he won't be particularly impressed.

I wonder why people still persist with slow CPUs on a project like this?

And here's one that is rather more important that I've just noticed. Looks like Travis has set 3,6,6 for errors/total/success and this quorum has failed with the error message "Too many total results". However there are only 6 tasks listed in the quorum, one of which is a 'client detached'. Perhaps that triggered an attempt to send out a 7th copy which junked the whole quorum. Because of the conflict between 48xx and 58xx, there mustn't have been 3 agreeing results at the time the attempt was made to send out the 7th copy. Until things are sorted regarding validation, perhaps it should be 3,9,6 rather than 3,6,6 to prevent this problem.

EDIT: If you think about it, it makes sense to have the 'total' equal to the sum of 'errors' and 'success' so that all bases are covered.
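
The interplay of the three limits can be sketched like this. The function is hypothetical and assumes each limit trips as soon as the corresponding count exceeds it while no consensus has been reached; that matches the behaviour described in this thread but is not taken from the server code.

```python
def workunit_failure(n_error, n_success, n_total, max_error, max_total, max_success):
    """Return the error message for a workunit without consensus, or None if it can continue."""
    if n_error > max_error:
        return "Too many error results"
    if n_total > max_total:
        return "Too many total results"
    if n_success > max_success:
        return "Too many success results"
    return None


# One detached client plus five successes that do not agree, so a 7th copy is needed.
print(workunit_failure(1, 5, 7, max_error=3, max_total=6, max_success=6))
print(workunit_failure(1, 5, 7, max_error=3, max_total=9, max_success=6))
```

With 3,6,6 the seventh copy pushes the total over its limit before a quorum can form, whereas with 3,9,6 the workunit stays alive.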
____________
Cheers,
Gary.

Profile kashi
Send message
Joined: 30 Dec 07
Posts: 309
Credit: 148,432,104
RAC: 0
Message 38058 - Posted: 5 Apr 2010 | 8:05:44 UTC

So it is possible that most of these results would be accurate to 10e-11 if compared only against an unoptimised CPU application but the results from ATI 48xx, NVIDIA and optimised CPU applications are on one side of the required fitness value and the results from ATI 58xx and 5970 are on the other side. Therefore the difference between these two sets of hardware is less accurate than 10e-11, even though individual results compared against an unoptimised CPU application may still have the required accuracy.

And this is the reason that some projects that validate results with a quorum need to use homogeneous redundancy to ensure accurate results on different types of hardware?
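
A tiny numeric illustration of that scenario, using made-up fitness values (the reference value and the 0.7 factor are purely hypothetical): each group is within tolerance of the reference, but the two groups are not within tolerance of each other.

```python
TOLERANCE = 10e-11  # as written in the thread (numerically 1e-10)

reference = -3.0                        # hypothetical unoptimised CPU fitness
group_a = reference - 0.7 * TOLERANCE   # e.g. 48xx, NVIDIA, optimised CPU
group_b = reference + 0.7 * TOLERANCE   # e.g. 58xx, 5970


def agrees(x, y, tolerance=TOLERANCE):
    """True if two fitness values differ by no more than the tolerance."""
    return abs(x - y) <= tolerance


print(agrees(group_a, reference))  # True
print(agrees(group_b, reference))  # True
print(agrees(group_a, group_b))    # False
```

Agreement within a tolerance is not transitive, so which group wins a mixed quorum would depend on which hardware happens to report in the majority.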

Profile Furlozza
Avatar
Send message
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38059 - Posted: 5 Apr 2010 | 8:20:19 UTC - in response to Message 38057.
Last modified: 5 Apr 2010 | 8:23:39 UTC

Is the "Canonical" result used in any way in determining the validity of results? I haven't checked too many, but have noted that the first result in sometimes determines validity or invalidity.

Strangely enough, the main variance that I can see is if the first in is either 48xx or 57xx/58xx and made the Canonical result, then all WUs returned with that series of card are validated whereas higher (or lower) cards are invalidated. All other data showing in the text file we get to see is usually the same. It is especially annoying when the seen calcs return 10e-13 on the GPU FPU, the required figure on the GPU, as with the canonical whereas we are ruled invalid. Stars are usually at 10-e9

Possibly invalid opinion, but sometimes a coincidence ......

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38061 - Posted: 5 Apr 2010 | 8:45:56 UTC - in response to Message 38059.

Is the "Canonical" result used in anyway in determining the validity of results? I haven't checked too many, but have noted that the first result in sometimes determines validity or invalidity.

I think it's the other way around. The validator selects those results that agree (within specification) and one of them (perhaps the first one) is nominated as 'canonical'. Maybe it's the one whose answer is the closest to the average of all valid results for that quorum. I guess it depends on how the validator has been written.

....
It is especially annoying when the seen calcs return 10e-13 on the GPU FPU, the required figure on the GPU, as with the canonical whereas we are ruled invalid. Stars are usually at 10-e9.

Take a look more closely. The numbers you are quoting are 'flops' not 'fitness' and they are e+013 and e+009 rather than 'minus'. Travis has already said that only 'fitness' is used for validation but he hasn't answered (yet) about where we might be able to observe the actual 'fitness' values for results in a quorum. I suspect we can't access those values which will make it rather unsatisfactory for anyone trying to understand why results are being deemed invalid.
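
This can be checked by pulling the numbers out of the stderr text. The sketch below uses lines copied from the task output posted earlier in the thread; the function name is made up for illustration.

```python
import re

STDERR_SAMPLE = """\
Calculated about 3.28897e+013 floatingpoint ops on GPU, 2.47165e+008 on FPU. Approximate GPU time 84.7168 seconds.
probability calculation (stars)
Calculated about 3.34818e+009 floatingpoint ops on FPU.
"""


def op_counts(text):
    """Extract every number written in exponent notation from the stderr text."""
    return [float(m) for m in re.findall(r"\d+\.\d+e[+-]\d+", text)]


for count in op_counts(STDERR_SAMPLE):
    print(f"{count:.5e}")
```

All three values have positive exponents: they are counts of floating point operations, in the billions or trillions, and not differences between fitness values.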

Seeing as the program is being modified at the moment, it might be a good opportunity to add some code to display on the website the fitness value returned by each successful task.


____________
Cheers,
Gary.

Profile The Gas Giant
Avatar
Send message
Joined: 24 Dec 07
Posts: 1947
Credit: 240,865,573
RAC: 0
Message 38062 - Posted: 5 Apr 2010 | 8:57:10 UTC
Last modified: 5 Apr 2010 | 8:57:42 UTC

Oh well...NNT until the problems with the validator have been overcome.

Profile Arif Mert Kapicioglu
Send message
Joined: 14 Dec 09
Posts: 158
Credit: 572,866,449
RAC: 2,430
Message 38065 - Posted: 5 Apr 2010 | 9:47:20 UTC

I'm still gathering up a lot of "can't validate" messages. What does "check skipped" mean anyway?

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38066 - Posted: 5 Apr 2010 | 9:55:08 UTC - in response to Message 38062.

Oh well...NNT until the problems with the validator have been overcome.

I think that's probably a very wise move.

I'm now seeing quite a lot of examples of quorums that are giving the "Too many total results" error message when there are 6 successful results that apparently don't agree closely enough. There has got to be some sort of a problem with the validator, I would guess. I've seen a few examples where the 6 results have been split 3/3 (or 4/2 or 2/4) between 48xx and 58xx GPUs and yet the validator can't seem to find 3 that agree closely enough. There's something wrong with the validation process somewhere.

Here's an example of a 3/3 split that can't validate and gives the 'Too many total results'.

____________
Cheers,
Gary.

Emanuel
Send message
Joined: 18 Nov 07
Posts: 280
Credit: 2,442,757
RAC: 439
Message 38067 - Posted: 5 Apr 2010 | 10:00:29 UTC

I think all these examples are definitely helping, at least. Other than that, it may be best to wait for the new application versions if you hate losing crunching time (if you don't, I'm sure Travis would appreciate your continued help testing).

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38068 - Posted: 5 Apr 2010 | 10:09:22 UTC - in response to Message 38065.

I'm still gathering up a lot of "can't validate" messages. What does "check skipped" mean anyway?

The validator will skip the validation check if any of the limits are exceeded. In your case I think you will find that the IR has gone to 7 and the WU as a whole has errored out with the 'Too many total results' error message. Click on the WU ID and look at what the error message for the WU as a whole actually says.

____________
Cheers,
Gary.

Profile Arif Mert Kapicioglu
Send message
Joined: 14 Dec 09
Posts: 158
Credit: 572,866,449
RAC: 2,430
Message 38069 - Posted: 5 Apr 2010 | 10:16:22 UTC - in response to Message 38068.
Last modified: 5 Apr 2010 | 10:21:40 UTC

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90246907

This happens in stock application? Interesting

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38070 - Posted: 5 Apr 2010 | 10:24:07 UTC - in response to Message 38069.

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=96603214

It says wu error and lists a series of app_info.xml . I'm running stock applications.

That's the task ID and NOT the WU ID. If you have the task ID open you can actually see the WU ID on the second line of that particular page of output. However, when you are looking at all your results on the website you can see both task ID and WU ID side by side. Click on the WU ID to see what I'm talking about.

____________
Cheers,
Gary.

Profile Arif Mert Kapicioglu
Send message
Joined: 14 Dec 09
Posts: 158
Credit: 572,866,449
RAC: 2,430
Message 38071 - Posted: 5 Apr 2010 | 10:29:32 UTC - in response to Message 38070.

Ok. I edited the previous post as you asked me to do. It says too many total results. Status: Completed, can't validate

Profile uwe
Send message
Joined: 6 Nov 09
Posts: 2
Credit: 1,500,164
RAC: 0
Message 38073 - Posted: 5 Apr 2010 | 10:35:46 UTC

What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38077 - Posted: 5 Apr 2010 | 10:55:25 UTC - in response to Message 38071.

Ok. I edited the previous post ...

That's OK. I obviously captured for posterity what you wrote before editing :-).

as you asked me to do.

I certainly didn't ask you to edit your post. I was trying to explain to you the difference between a 'task' and a 'WU'. There is no error in any of the six tasks that make up the complete WU but the WU has errored out simply because the IR went to 7, i.e. there were more than six tasks in total making up the workunit. If you look at any of the six tasks in the WU, none of them actually have a problem that's visible. However we can deduce that the validator couldn't find three that agreed closely enough, before the IR was bumped from 6 to 7. As soon as it was bumped, the limit of 6 total tasks was exceeded and the whole WU was marked as an error and any further validation checks were skipped. The problem is more likely to be with validation rather than how your machine crunched your particular task.
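
The limit check described here can be sketched roughly as follows. This is a simplified illustration and not BOINC's actual server code: the function name `limit_error` and its parameters are hypothetical, and the wording of the error-count message is an assumption (only the 'total' and 'success' messages are quoted in this thread).

```python
def limit_error(n_error, n_total, n_success,
                max_error, max_total, max_success):
    """Return the workunit-level error message if a limit is exceeded, else None.

    Hypothetical helper: the three limits correspond to the
    "max # of error/total/success tasks" triple shown on the workunit page.
    """
    if n_error > max_error:
        return "Too many error results"  # assumed wording, not quoted in the thread
    if n_success > max_success:
        return "Too many success results"
    if n_total > max_total:
        return "Too many total results"
    return None


# Six successful tasks, no quorum yet, and a 7th copy is generated with limits 3,6,6.
status_366 = limit_error(n_error=0, n_total=7, n_success=6,
                         max_error=3, max_total=6, max_success=6)

# The same situation with the limits raised to 3,9,6.
status_396 = limit_error(n_error=0, n_total=7, n_success=6,
                         max_error=3, max_total=9, max_success=6)
```

Under these assumptions the first call reports "Too many total results" while the second reports no error, which matches the reasoning for raising the total limit.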

It says too many total results. Status: Completed, can't validate

Which is exactly what it should say seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from?

____________
Cheers,
Gary.

Profile Gary Roberts
Send message
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38079 - Posted: 5 Apr 2010 | 11:15:56 UTC - in response to Message 38073.

What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?

I'm seeing that on all recent tasks that do validate and are being crunched by stock apps anyway (no app_info.xml file) so I'm assuming it is something in new tasks that is not right but otherwise is actually harmless. Only Travis can sort that out and he's obviously not listening at the moment - probably in bed.
____________
Cheers,
Gary.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38089 - Posted: 5 Apr 2010 | 14:08:06 UTC - in response to Message 38057.

Here's another example of the 'Too many success results' bug. Note that one of the victims actually invested over a day of CPU time for no reward. I guess he won't be particularly impressed.

I wonder why people still persist with slow CPUs on a project like this?

And here's one that is rather more important that I've just noticed. Looks like Travis has set 3,6,6 for errors/total/success and this quorum has failed with the error message "Too many total results". However there are only 6 tasks listed in the quorum, one of which is a 'client detached'. Perhaps that triggered an attempt to send out a 7th copy which junked the whole quorum. Because of the conflict between 48xx and 58xx, there mustn't have been 3 agreeing results at the time the attempt was made to send out the 7th copy. Until things are sorted regarding validation, perhaps it should be 3,9,6 rather than 3,6,6 to prevent this problem.

EDIT: If you think about it, it makes sense to have the 'total' equal to the sum of 'errors' and 'success' so that all bases are covered.


Yeah, I bumped it up to 3,9,6 for the time being.
____________

Profile Furlozza
Avatar
Send message
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38090 - Posted: 5 Apr 2010 | 14:18:41 UTC - in response to Message 38079.

Dumb Question time:

1) as multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met?

2) As some of the errors/failure to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there any way to send wus to the same version of cards?

3) What is the precise reason for the failure of the above cards to actually agree? Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx?

4) I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded)

Dumb statement: 58xx cards validated regardless of GPU or Memory frequencies or App used or CAL Runtime: 1.4.5xx have been predominant.

The above may not be useful, but am looking at it from the what has been validated, not the what hasn't or is in pending; the majority of pending giving rise to Q 1). As for pending, it appears to be mainly the clash between cards (nVidia/ATI and ATI series) and CPU produced results and the Validator trying to get two or three matching results.

And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number".

As I said above, dumb questions......

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38091 - Posted: 5 Apr 2010 | 14:27:35 UTC - in response to Message 38089.

I've updated the validator so it will add the fitness that the results reported to the server at the end of the standard output.

This way you guys can check the fitness your results are reporting (and compare them to other results).
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38092 - Posted: 5 Apr 2010 | 14:28:17 UTC - in response to Message 38073.

What they have in common is ignoring unknown input argument in app_info.xml. Normal behaviour?


That's just getting ready for the new version of the application. The new application will take the parameters it's using from the command line (that way we don't have to generate a new parameter file for each workunit).
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38093 - Posted: 5 Apr 2010 | 14:36:26 UTC - in response to Message 38090.

Dumb Question time:

1) as multiple validations are now required, are duplicate/triplicate wus being issued, or is just one being issued, returned and then re-issued until the result quorum is met?


One workunit is being used, and multiple results are generated for each workunit until it reaches a quorum. Right now the way it works is one copy of the workunit is initially issued, when it comes back, we check to see if it needs validation. If it does we send out 2 more copies (to try and get a quorum of 3). If those 2 come back and there is no quorum, we send out a 4th copy, then a 5th, etc until we have a quorum of 3.
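
As a rough sketch of the issuing policy described above (one copy first, then two more if validation is needed, then one at a time until a quorum of 3 is reached): the function name `copies_to_send` and its arguments are hypothetical, and the real server logic is certainly more involved than this.

```python
def copies_to_send(issued, returned, needs_validation, has_quorum, min_quorum=3):
    """How many new copies of a workunit to issue right now.

    Hypothetical illustration of the policy described in the post:
      - nothing issued yet: send the first copy;
      - copies still outstanding: wait;
      - first result back and it needs validation: send enough to reach the quorum size;
      - all copies back but no quorum: send one more.
    """
    if issued == 0:
        return 1
    if returned < issued:
        return 0  # still waiting on outstanding copies
    if not needs_validation or has_quorum:
        return 0  # nothing more to do for this workunit
    if issued == 1:
        return min_quorum - 1  # 2 more copies, to try for a quorum of 3
    return 1  # 4th copy, then 5th, and so on
```

For example, `copies_to_send(1, 1, True, False)` gives 2 and `copies_to_send(3, 3, True, False)` gives 1.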



2) As some of the errors/failure to validate appear to be 'clashes' between 47xx/48xx and 58xx series cards, is there any way to send wus to the same version of cards?


This doesn't solve the problem because we need the accuracy specified. We need to find out which cards are sending back incorrect results and update the applications accordingly.


3) What is the precise reason for the failure of the above cards to actually agree? Could it be something as simple as using older drivers/CAL runtime 1.4.3/4xx?


That's what we're trying to figure out. I just updated the validator to append information about the fitness returned from your results to the std_err field -- so you can see it when you look at a task. Hopefully this will help.


4) I have noticed in the 20-odd validations I have based these questions on that nVidia 260s and 285s are erroring out or completing the wu but not being granted credit. In the <stderr_txt>, the wu knows where the GPU is located but can't use it, and yet in some cases there are times given in the job summary (where all returned results are recorded)


Not quite sure what this issue is... seems maybe BOINC client related?


And yet, I've had results Validated on just my own wu, without any correlation from another source. So maybe there is a "magic number".


This is because we're still not validating EVERY workunit. We validate every workunit that will improve the searches we're running (if there's a better fitness found than what we currently know about). If the fitness isn't going to improve the search, we're still validating those workunits 50% of the time. This is so we can get these accuracy issues worked out and so people can't scam the server for credit using single precision GPU applications or other things.
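
The decision described above could be sketched like this. It is only an illustration: `should_validate` is a hypothetical name, and the code assumes that a larger fitness value is the "better" one, which the thread does not state.

```python
import random


def should_validate(fitness, best_known, spot_check_rate=0.5, rng=random):
    """Decide whether a returned result gets sent for quorum validation.

    Assumptions (not from the project's code): a larger fitness is better,
    and non-improving results are spot-checked at a fixed rate.
    """
    if fitness > best_known:
        return True  # could improve the search: always validate
    return rng.random() < spot_check_rate  # otherwise validate about 50% of the time


# An improving result is always validated, whatever the spot-check rate.
improving = should_validate(-3.10, best_known=-3.17, spot_check_rate=0.0)

# A non-improving result is validated only as often as the spot-check rate allows.
never = should_validate(-3.20, best_known=-3.17, spot_check_rate=0.0)
always = should_validate(-3.20, best_known=-3.17, spot_check_rate=1.0)
```

With adaptive validation, `spot_check_rate` would presumably vary per host (between 10% and 100%, according to the first post) rather than being fixed.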
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38095 - Posted: 5 Apr 2010 | 15:00:35 UTC - in response to Message 38077.


Which is exactly what it should say seeing as Travis has set the total tasks (results) limit for a WU to be 6. The real question is why the hell the validator can't find three results that agree when it has six to choose from?


Yeah I was passed out. But I found part of the problem. The validator was actually trying to get a quorum of 4. I was looking for matches == getMinQuorum(), and since I wasn't comparing a workunit to itself, what I actually needed was matches == getMinQuorum()-1.

Debugging at 4am is bad news, lol. Was pretty obvious with a good night's sleep.
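
A minimal sketch of that off-by-one, assuming the validator counts how many OTHER results agree with a given result. Only the idea of comparing against `getMinQuorum()` versus `getMinQuorum()-1` comes from the post; the names `count_matches` and `has_quorum` are hypothetical, the tolerance is the 10e-11 figure from the first post, and `>=` is used here instead of `==` for simplicity.

```python
TOLERANCE = 10e-11  # as written in the thread; numerically this is 1e-10


def count_matches(fitnesses, i, tol=TOLERANCE):
    """Number of OTHER results whose fitness is within tol of result i."""
    return sum(1 for j, f in enumerate(fitnesses)
               if j != i and abs(f - fitnesses[i]) <= tol)


def has_quorum(fitnesses, min_quorum, buggy=False):
    """True if some result has enough agreeing partners to form a quorum.

    A result is never compared to itself, so a quorum of N needs only
    N - 1 matches. The buggy variant asks for N matches, which silently
    demands N + 1 agreeing results.
    """
    needed = min_quorum if buggy else min_quorum - 1
    return any(count_matches(fitnesses, i) >= needed
               for i in range(len(fitnesses)))


# Three results agree closely, one is far off.
sample = [1.0, 1.0 + 1e-12, 1.0 - 1e-12, 2.0]
fixed_result = has_quorum(sample, min_quorum=3)              # True
buggy_result = has_quorum(sample, min_quorum=3, buggy=True)  # False
```

With three agreeing results the corrected check finds a quorum, while the buggy one keeps asking for a fourth.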
____________

Profile Furlozza
Avatar
Send message
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38096 - Posted: 5 Apr 2010 | 15:18:10 UTC - in response to Message 38095.

Well, here's the first "magic number" failure I have in my records:

Job No 90254161 - completed - can't Validate 3,6,6

47xx/47xx 1.4.415 -3.169546804554361 standard app
58xx 1.4.533 -3.169546898027065 standard app
58xx 1.4.553 -3.169546898027031 standard app
47xx/48xx 1.4.515 -3.169546804554371 Anon app
58xx 1.4.515 -3.169546898027031 Standard app
47xx/48xx 1.4.553 -3.169546804554361 standard app

(hoping it makes sense after posting)

Basically, of the six issues, there were two agreements in both 47xx/48xx and also 58xx. In both cases, the drivers made no difference, as both "valid" and 'invalid' results within the same series missed out for some other reason whilst using the same CAL runtime as one of the 'valid' wus.

Or to put it another way..... it may not be directly related to drivers, but the architecture within the actual cards.
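
Taking the six fitness values listed above at face value, a short check shows how they group under the 10e-11 tolerance quoted in the first post. This is just arithmetic on the posted numbers, with hypothetical helper names; it does not reproduce the validator itself.

```python
TOLERANCE = 10e-11  # as written in the thread; numerically this is 1e-10

# (card series as labelled in the post, reported fitness)
results = [
    ("47xx/47xx", -3.169546804554361),
    ("58xx",      -3.169546898027065),
    ("58xx",      -3.169546898027031),
    ("47xx/48xx", -3.169546804554371),
    ("58xx",      -3.169546898027031),
    ("47xx/48xx", -3.169546804554361),
]


def agreeing(results, i, tol=TOLERANCE):
    """Indices of all results (including i itself) within tol of result i."""
    return tuple(j for j, (_, f) in enumerate(results)
                 if abs(f - results[i][1]) <= tol)


# Distinct groups of mutually agreeing results.
groups = {agreeing(results, i) for i in range(len(results))}

# Gap between the two families of results.
gap = abs(results[0][1] - results[1][1])
```

Each series forms its own group of three, and the two groups sit roughly 9.3e-8 apart. A workunit with two groups of exactly three would be expected to fail a check that effectively required four agreeing results, which may be consistent with the quorum bug Travis describes above, though the thread does not confirm that this is why this particular job failed.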
____________
You should see the world from my eyes.

Profile magyarficko
Send message
Joined: 22 Jan 09
Posts: 35
Credit: 46,731,190
RAC: 0
Message 38098 - Posted: 5 Apr 2010 | 15:28:54 UTC - in response to Message 38053.
Last modified: 5 Apr 2010 | 15:30:14 UTC

82% valid tasks, is not going to work in the long run, obviously. But I'll hang around for the shakedown.


Well I'm out of here for the time being as 82% is not satisfactory for me! I realize that MilkyWay is still classed (as far as I know) as an Alpha project, but IMHO it is mature enough that they shouldn't be running tests in a production environment - at least some of these bugs (if not the majority of them) SHOULD have been caught in testing before releasing this new version validator into the wild. See y'all later.


Right now it looks like the problem isn't the validator but the (optimized?) GPU applications.


I'm not using an optimized app, I'm using what is given to me by the project. I have NEVER before had invalid results, after the "upgrade" I was getting many.
____________

Brian Silvers
Send message
Joined: 21 Aug 08
Posts: 625
Credit: 558,425
RAC: 0
Message 38100 - Posted: 5 Apr 2010 | 16:18:46 UTC
Last modified: 5 Apr 2010 | 16:19:58 UTC

Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...

name de_s11_3s_free_6_1544383_1270347650
application MilkyWay@Home
created 4 Apr 2010 2:20:50 UTC
minimum quorum 3
initial replication 4
max # of error/total/success tasks 3, 6, 1
errors Too many success results
Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
95656877 | 26452 | 4 Apr 2010 2:22:25 UTC | 5 Apr 2010 5:49:34 UTC | Completed, can't validate | 0.00 | 16,347.27 | 70.47 | 0.00 | Anonymous platform
96480761 | 26133 | 5 Apr 2010 5:50:54 UTC | 5 Apr 2010 7:17:50 UTC | Completed, can't validate | 216.20 | 212.47 | 1.11 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
96480762 | 141414 | 5 Apr 2010 5:50:38 UTC | 5 Apr 2010 6:27:01 UTC | Completed, can't validate | 89.52 | 87.25 | 0.61 | 0.00 | MilkyWay@Home v0.21 (ati13ati)

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38103 - Posted: 5 Apr 2010 | 16:27:53 UTC - in response to Message 38100.

Since I don't know if purges are running real quick, and I don't know what the major amount of noise is in this thread since I last read it, I'm going ahead and posting this. It will likely be formatted badly, and may already be covered by the numerous postings, but I just wanted to state that it's quite unfair to me to have an app that is known to be working fine and spend 4.5 hours on a task for zip, zap, zero... Oh, and in case the 4.5 hours didn't tell you which system is mine, it's the non-GPU system, the first one in the quorum...

name de_s11_3s_free_6_1544383_1270347650
application MilkyWay@Home
created 4 Apr 2010 2:20:50 UTC
minimum quorum 3
initial replication 4
max # of error/total/success tasks 3, 6, 1
errors Too many success results
Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
95656877 | 26452 | 4 Apr 2010 2:22:25 UTC | 5 Apr 2010 5:49:34 UTC | Completed, can't validate | 0.00 | 16,347.27 | 70.47 | 0.00 | Anonymous platform
96480761 | 26133 | 5 Apr 2010 5:50:54 UTC | 5 Apr 2010 7:17:50 UTC | Completed, can't validate | 216.20 | 212.47 | 1.11 | 0.00 | MilkyWay@Home v0.21 (ati13ati)
96480762 | 141414 | 5 Apr 2010 5:50:38 UTC | 5 Apr 2010 6:27:01 UTC | Completed, can't validate | 89.52 | 87.25 | 0.61 | 0.00 | MilkyWay@Home v0.21 (ati13ati)



This was one of the older WUs sent out with bad values for max error/total/success:


max # of error/total/success tasks 3, 6, 1


That issue shouldn't happen anymore. I've also loosened up the validation a little bit, which may keep some workunits from being flagged invalid. If we can't figure out a good solution to the 48xx vs 58xx ATI GPUs issue, I'll probably lower the validation to having fitness within 10e-10 (or 10e-9) to see if that helps.

The new application will be 10e-11 however.
____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38104 - Posted: 5 Apr 2010 | 16:29:48 UTC - in response to Message 38103.

On another note, does anyone know if the 48xx or the 58xx ATI GPU is the one validating correctly vs the stock application?
____________

Profile Kevint
Avatar
Send message
Joined: 22 Nov 07
Posts: 285
Credit: 1,076,786,368
RAC: 0
Message 38105 - Posted: 5 Apr 2010 | 16:30:19 UTC - in response to Message 38098.



I'm not using an optimized app, I'm using what is given to me by the project. I have NEVER before had invalid results, after the "upgrade" I was getting many.



Ditto,

I stopped using the opti apps (except for the CPU) several weeks ago..

I am getting hundreds of invalids.. none of my cards are overclocked everything stock.

So, how do you intend to correct this? If 1 or 2 hosts in the quorum are using stock apps but are returning invalid results, could there be something wrong with the stock app and not the validation?

____________
.

Profile Kevint
Avatar
Send message
Joined: 22 Nov 07
Posts: 285
Credit: 1,076,786,368
RAC: 0
Message 38107 - Posted: 5 Apr 2010 | 16:35:59 UTC - in response to Message 38104.

On another note, does anyone know if the 48xx or the 58xx ATI GPU is the one validating correctly vs the stock application?



Here is a host that I have switched from Stock to opti back to stock.. both apps seem to be having problems.
5870 - single card, no overclock.

http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=47682

____________
.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38108 - Posted: 5 Apr 2010 | 16:38:23 UTC - in response to Message 38107.

On another note, does anyone know if the 48xx or the 58xx ATI GPU is the one validating correctly vs the stock application?



Here is a host that I have switched from Stock to opti back to stock.. both apps seem to be having problems.
5870 - single card, no overclock.

http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=47682


What I'm trying to figure out is if say a stock application result and 2 48xx GPU results come back, do they validate to a quorum? (that would mean the issue is with the 58xx GPU application).

Otherwise, if a stock application result and 2 58xx GPU results make a quorum, that would mean the 48xx GPU application is the problem.
____________

Profile Simplex0
Avatar
Send message
Joined: 11 Nov 07
Posts: 232
Credit: 178,221,048
RAC: 0
Message 38113 - Posted: 5 Apr 2010 | 18:24:59 UTC

Seems to be a lot of wasted computer power.
Why do you keep on sending out new wu's?

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38114 - Posted: 5 Apr 2010 | 18:49:33 UTC - in response to Message 38113.

Seems to be a lot of wasted computer power.
Why do you keep on sending out new wu's?


It won't be as bad once we get the GPU issue sorted out and the hosts' error rates updated in the database.

We'll be moving to a quorum of 2 and ~10% validation for results that don't improve our searches (unless that host is known to be a repeat offender when it comes to errors); which will really be minimal duplicate work.

Right now I'm just trying to flush out bad clients that are running single precision GPU applications or scripts.
____________

Profile kashi
Send message
Joined: 30 Dec 07
Posts: 309
Credit: 148,432,104
RAC: 0
Message 38117 - Posted: 5 Apr 2010 | 19:03:16 UTC - in response to Message 38108.
Last modified: 5 Apr 2010 | 19:22:38 UTC

What I'm trying to figure out is if say a stock application result and 2 48xx GPU results come back, do they validate to a quorum? (that would mean the issue is with the 58xx GPU application).

Otherwise, if a stock application result and 2 58xx GPU results make a quorum, that would mean the 48xx GPU application is the problem.


Not all GPUs are the same brand or architecture class but neither are all CPUs.

So what is considered a stock application result, an unoptimised CPU application running on a certain model Intel CPU or an unoptimised CPU application running on a certain model AMD CPU? Or does the CPU architecture make no difference to the result?

I am just trying to make sure that the term "stock application result" is understood in the same way by all reading or contributing to the thread. Many believe that the ATI GPU application automatically sent by the server is the "stock" application and the one that is manually downloaded and installed by contributors is the "optimised" application, but they are often both the same application. In this sense all GPU applications are "optimised" and only the original CPU application automatically downloaded could be considered the stock application.

If unoptimised application CPU results are in between the results of 48xx and 58xx, then is it possible that they would not validate with either GPU class or with both? Hopefully we will find out soon, if there was a test application available I still couldn't work it out myself unless I also had a test validator.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38119 - Posted: 5 Apr 2010 | 19:12:16 UTC - in response to Message 38117.

What I'm trying to figure out is if say a stock application result and 2 48xx GPU results come back, do they validate to a quorum? (that would mean the issue is with the 58xx GPU application).

Otherwise, if a stock application result and 2 58xx GPU results make a quorum, that would mean the 48xx GPU application is the problem.


Not all GPUs are the same brand or architecture class but neither are all CPUs.

So what is considered a stock application result, an unoptimised CPU application running on a certain model Intel CPU or an unoptimised CPU application running on a certain model AMD CPU? Or does the CPU architecture make no difference to the result?

I am just trying to make sure that the term "stock application result" is understood in the same way by all reading or contributing to the thread. Many believe that the ATI GPU application automatically sent by the server is the "stock" application and the one that is manually downloaded and installed by contributors is the "optimised" application, but they are often both the same application. In this sense all GPU applications are "optimised" and only the original CPU application automatically downloaded could be considered the stock application.


As far as I know, all the applications we provide (the stock applications) provide results within 10e-13 of each other, regardless of them being on a CPU or GPU.
____________

Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 38120 - Posted: 5 Apr 2010 | 19:19:02 UTC - in response to Message 38119.

As far as I know, all the applications we provide (the stock applications) provide results within 10e-13 of each other, regardless of them being on a CPU or GPU.


Awhile back you said you only needed to the 8th place and would like to get to the 10th. That is what I thought all of these applications were based off of.

____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38124 - Posted: 5 Apr 2010 | 19:25:05 UTC - in response to Message 38120.

As far as I know, all the applications we provide (the stock applications) provide results within 10e-13 of each other, regardless of them being on a CPU or GPU.


Awhile back you said you only needed to the 8th place and would like to get to the 10th. That is what I thought all of these applications were based off of.


As far as I remember, we've always wanted 10+ degrees of precision. If we were content with 8 we could be using single precision applications.
____________

Profile Arif Mert Kapicioglu
Send message
Joined: 14 Dec 09
Posts: 158
Credit: 572,866,449
RAC: 2,430
Message 38126 - Posted: 5 Apr 2010 | 19:31:42 UTC

Double precision is accurate to the 12th decimal. What about these "completed, validation inconclusive" ones? Are they waiting for the other quorums to be validated?

Profile kashi
Send message
Joined: 30 Dec 07
Posts: 309
Credit: 148,432,104
RAC: 0
Message 38127 - Posted: 5 Apr 2010 | 19:32:22 UTC - in response to Message 38119.

As far as I know, all the applications we provide (the stock applications) provide results within 10e-13 of each other, regardless of them being on a CPU or GPU.

Thanks for the clarification of what stock means. I thought the GPU applications were compared to a CPU for accuracy when they were developed not compared to other GPU results.

Profile banditwolf
Avatar
Send message
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 38128 - Posted: 5 Apr 2010 | 19:34:39 UTC - in response to Message 38124.

As far as I know, all the applications we provide (the stock applications) provide results within 10e-13 of each other, regardless of them being on a CPU or GPU.


Awhile back you said you only needed to the 8th place and would like to get to the 10th. That is what I thought all of these applications were based off of.


As far as I remember, we've always wanted 10+ degrees of precision. If we were content with 8 we could be using single precision applications.


Could have been talking about single precision apps.
____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile kashi
Send message
Joined: 30 Dec 07
Posts: 309
Credit: 148,432,104
RAC: 0
Message 38140 - Posted: 5 Apr 2010 | 21:56:47 UTC

Two comparisons which include unoptimised CPU application in the quorum.

wuid=90266299
-3.16876629868666400000 48xx Anonymous v0.20b Invalid
-3.16876629868695500000 CPU Core2 Duo E6750 v0.19 Invalid
-3.16876639500725200000 58xx Anonymous v0.22 Valid
-3.16876639500725200000 58xx Anonymous v0.22 Valid
-3.16876639500713000000 58xx Anonymous v0.20b Valid

wuid=90249810
-3.17153809050878900000 48xx v0.21 (ati13ati) Valid
-3.17153809050878900000 48xx v0.21 (ati13ati) Valid
-3.17153818333960200000 58xx Anonymous v0.22 Invalid
-3.17153809050889600000 CPU Core2 Duo E6750 v0.19 Valid

48xx results agree with stock CPU application results to 12 decimal places but 58xx results only agree with stock CPU application results to 6 decimal places.
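
The decimal-place counts can be checked directly from the posted values. The sketch below compares the digit strings as written in the post (so no floating-point rounding is involved); `matching_decimals` is a hypothetical helper name.

```python
from decimal import Decimal


def matching_decimals(a: str, b: str) -> int:
    """Count leading decimal places on which two decimal strings agree."""
    int_a, frac_a = a.split(".")
    int_b, frac_b = b.split(".")
    if int_a != int_b:
        return 0
    count = 0
    for x, y in zip(frac_a, frac_b):
        if x != y:
            break
        count += 1
    return count


# Values from wuid=90266299, copied as posted.
cpu = "-3.16876629868695500000"
gpu_48xx = "-3.16876629868666400000"
gpu_58xx = "-3.16876639500725200000"

places_48xx = matching_decimals(gpu_48xx, cpu)  # 12
places_58xx = matching_decimals(gpu_58xx, cpu)  # 6

# Absolute differences, computed exactly with Decimal.
diff_48xx = abs(Decimal(gpu_48xx) - Decimal(cpu))  # 2.91E-13
diff_58xx = abs(Decimal(gpu_58xx) - Decimal(cpu))  # about 9.63E-8
```

The 48xx result differs from the CPU result by about 2.9e-13, well inside a 10e-11 tolerance, while the 58xx result is off by about 9.6e-8.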

I do not know if results from an unoptimised stock CPU application are considered to have the highest degree of accuracy, so I will not draw any conclusions, except to repeat my original belief that the differences seen are hardware based and not application based. It appears that Cypress based ATI cards give different results to other hardware classes.

[boinc.at] Nowi
Send message
Joined: 22 Mar 09
Posts: 89
Credit: 346,231,241
RAC: 416,187
Message 38142 - Posted: 5 Apr 2010 | 22:13:29 UTC

I'm wondering about the "new" observation that different architectures lead to different results even with the same application. I've been crunching CPDN for about five years and have hardly seen two identical or nearly identical results even with the same application (because the applications are closed). Maybe the goal of accuracy is too challenging.

In my ten years of DC I remember a project (but I don't know which) that sent WUs only to hosts with the same CPU, knowing that AMD and Intel produce different results. Of course remember the pentium bug on the first pentiums ;-)



Cluster Physik
Send message
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 38144 - Posted: 5 Apr 2010 | 22:17:33 UTC

I've posted some results including the optimized CPU and CUDA application in the other thread. Only the HD5800 series GPUs deviate substantially. All others are well within the bounds set by the project.

There is something fishy going on for sure. Bad thing is I have no idea what. I guess the best would be if the HD5800 users take a short break or support another project until we figure out what is going on here.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38145 - Posted: 5 Apr 2010 | 22:19:09 UTC - in response to Message 38144.

I've posted some results including the optimized CPU and CUDA application in the other thread. Only the HD5800 series GPUs deviate substantially. All others are well within the bounds set by the project.

There is something fishy going on for sure. Bad thing is I have no idea what. I guess the best would be if the HD5800 users take a short break or support another project until we figure out what is going on here.


Yeah, I've modified the validator on my end to get more information about what's going on, and the 5800s are the ones whose results are off (by about 9e-8).

4800s validate vs. stock and CUDA just fine.
____________

Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 38146 - Posted: 5 Apr 2010 | 22:23:04 UTC - in response to Message 38142.
Last modified: 5 Apr 2010 | 22:24:17 UTC

I'm wondering about the "new" observation that different architectures lead to different results even with the same application. I've been crunching CPDN for about five years and have hardly ever seen two identical or nearly identical results, even with the same application (because the applications are closed). Maybe the accuracy goal is too challenging.

In my ten years of DC I remember a project (I don't know which one) that sent WUs only to hosts with the same CPU, knowing that AMD and Intel produce different results. And of course, remember the Pentium bug on the first Pentiums ;-)

As shown in the other thread HD3800/4700/4800 GPUs return the exact same fitness value as CPUs with my versions. So at least here it is possible. I've told Anthony and Travis already some time ago, that I think the CUDA version could probably return the same values too.

The behaviour of the HD58x0 GPUs is definitely peculiar and probably comes down to a mess-up somewhere that simply needs to be found and fixed.

Profile Mr. Hankey
Joined: 9 Apr 09
Posts: 10
Credit: 117,667,871
RAC: 0
Message 38149 - Posted: 5 Apr 2010 | 22:29:23 UTC

Ok, well I just set all my 5850s to NNW. I have been using the optimized .20b application.... Let us know when it is safe to return :)

[boinc.at] Nowi
Joined: 22 Mar 09
Posts: 89
Credit: 346,231,241
RAC: 416,187
Message 38151 - Posted: 5 Apr 2010 | 22:42:07 UTC - in response to Message 38146.

HD3800/4700/4800 GPUs return the exact same fitness value as CPUs with my versions. So at least here it is possible. I've told Anthony and Travis already some time ago, that I think the CUDA version could probably return the same values too.


I'm only a CPU cruncher on MW, but my accurate results were errored out because four 58x0 cards overruled my result, as can be seen here: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90277453

Will I get my credits back?

I also have a workunit where a 4800, my CPU and a CUDA app were valid and the 5800 was marked as invalid (http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90148812). This supports Cluster Physik's statement.

Profile arkayn
Joined: 14 Feb 09
Posts: 915
Credit: 74,781,427
RAC: 225
Message 38155 - Posted: 5 Apr 2010 | 23:28:07 UTC - in response to Message 38104.

Is it possible for you to compile a test case for those of us who have both 48xx and 58xx cards so we can run them and see what is really doing the good work.

At Lunatics we have a bench suite that allows us to test new builds of the Astropulse apps.
____________

Profile krahulik
Joined: 7 Nov 08
Posts: 14
Credit: 179,303,710
RAC: 2
Message 38157 - Posted: 5 Apr 2010 | 23:45:10 UTC

This WU:
3x 58xx -> valid
1x 48xx/47xx -> invalid

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38158 - Posted: 5 Apr 2010 | 23:52:38 UTC - in response to Message 38155.

Is it possible for you to compile a test case for those of us who have both 48xx and 58xx cards so we can run them and see what is really doing the good work.

At Lunatics we have a bench suite that allows us to test new builds of the Astropulse apps.



I'll run some sample WUs standalone on my laptop tonight so I can be sure of the fitness. I'll put out the input files and the expected output when they're done.
____________

Profile Furlozza
Joined: 7 Feb 09
Posts: 9
Credit: 25,983,618
RAC: 0
Message 38169 - Posted: 6 Apr 2010 | 3:26:19 UTC - in response to Message 38158.
Last modified: 6 Apr 2010 | 3:30:28 UTC

Am running down MW wus until problem solved.

*putting on devil's advocate hat*

Could the problem not be that in the 58x0 series, an instruction has been included in cards that actually makes them more accurate?

*removing hat*

EDIT: make the DELETING wus
____________
You should see the world from my eyes.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38173 - Posted: 6 Apr 2010 | 4:20:43 UTC - in response to Message 38169.

Am running down MW wus until problem solved.

*putting on devil's advocate hat*

Could the problem not be that in the 58x0 series, an instruction has been included in cards that actually makes them more accurate?

*removing hat*

EDIT: make the DELETING wus


Well, unless the 58x0 series is more accurate than CPUs (which I doubt), they're the culprit.

From an email Anthony just sent me:


(this is a 2 stream workunit with sgr coordinates)

(uses hardcoded values in atSurveyGeometry.c)
-2.558875331749281 v0.19 CPU application (SSE3)
-2.558875331749119 v0.20 apps (ati 48xx)
-2.558875331749284 v0.18 optimized
-2.558875331749081 nvidia on boinc

-2.558875355118770 (not v0.20) ati on boinc 58xx

(uses computed values in atSurveyGeometry.c)
-2.558875329826787 cpu (from repository)
-2.558875329826697 nvidia (old version, circa oct 2009)
-2.558875329826689 nvidia (new unreleased version)

the 58x0 series just isn't matching up to anything we have.
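
To put numbers on that, here is a rough Python sketch that measures how far each of the hardcoded-geometry results above sits from the v0.19 CPU value. Treating the CPU result as the reference is an assumption made only for this illustration (the validator compares against the quorum results), and the 10e-11 tolerance is the one quoted in the announcement at the top of this thread. The function name `outliers` is hypothetical.

```python
from decimal import Decimal

# Tolerance quoted in the announcement at the top of this thread.
TOLERANCE = Decimal("10e-11")

# Assumed reference for this sketch: v0.19 CPU application (SSE3).
REFERENCE = Decimal("-2.558875331749281")

results = {
    "v0.20 ati 48xx": Decimal("-2.558875331749119"),
    "v0.18 optimized": Decimal("-2.558875331749284"),
    "nvidia on boinc": Decimal("-2.558875331749081"),
    "ati on boinc 58xx": Decimal("-2.558875355118770"),
}


def outliers(reference, values, tolerance):
    """Return the names whose value differs from the reference by more than tolerance."""
    return [name for name, value in values.items()
            if abs(value - reference) > tolerance]


for name, value in results.items():
    print(f"{name:20s} deviation = {abs(value - REFERENCE):.3E}")

print(outliers(REFERENCE, results, TOLERANCE))  # ['ati on boinc 58xx']
```

Only the 58xx value falls outside the tolerance; the others differ from the CPU value by a few parts in 1e13 at most.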

____________

Brian Priebe
Joined: 27 Nov 09
Posts: 98
Credit: 172,335,484
RAC: 80,913
Message 38176 - Posted: 6 Apr 2010 | 7:48:05 UTC - in response to Message 38169.
Last modified: 6 Apr 2010 | 7:51:33 UTC

<removed>

Profile Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,221,048
RAC: 0
Message 38291 - Posted: 7 Apr 2010 | 10:10:31 UTC

A few questions.

1. The 58xx cards have been around for 6 months. Does this mean that all the results during that time can have an error twice as big as it was supposed to be?

2. I am using 4870 cards to crunch and have something like 150 results that are marked as 'Invalid', caused by 58xx cards I assume. Will I get credit granted for those later?

3. If this has been going on for 6 months and you have only spotted the problem recently, you obviously have a serious problem with your validation method. Do you have a strategy now to prevent it from happening again?

Despite these problems I still think you guys at Milkyway@home are nr. 1 in BOINC

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38325 - Posted: 7 Apr 2010 | 21:25:44 UTC - in response to Message 38291.

A few questions.

1. The 58xx cards have been around for 6 months. Does this mean that all the results during that time can have an error twice as big as it was supposed to be?


I think this was maybe a more recent change? At any rate the results we've been getting for the searches have always been validated -- it's just that the issue didn't show up as much because we were not validating the vast majority of the workunits; we were just validating the ones which improved the searches we were doing. So while they had the error, it didn't affect our results very much at all. The reason it's been a big deal lately is that, in order to fix scripting and single precision app issues, we started validating most workunits (even those that didn't improve our searches). So before we were only validating 2-5% of WUs; now we're validating 50-75%.
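
As a rough illustration of that policy (not the actual validator code), a sketch in Python might look like the following. The function name, its parameters and the fixed sample rate are assumptions made for the example; BOINC's adaptive validation based on host error rates is more involved than a single probability.

```python
import random


def should_validate(improves_search: bool, sample_rate: float,
                    rng: random.Random) -> bool:
    """Decide whether a returned result is sent through validation.

    Hypothetical sketch: results that could improve a search are always
    validated; the rest are validated with probability sample_rate.
    """
    if improves_search:
        return True
    return rng.random() < sample_rate


rng = random.Random(42)
sampled = sum(should_validate(False, 0.5, rng) for _ in range(10_000))
print(f"validated {sampled} of 10000 non-improving results")
```

With a rate of 0.5 roughly half of the non-improving results get checked, which is why an application problem that was rarely visible before suddenly shows up everywhere.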


2. I am using 4870 cards to crunch and have something like 150 results that are marked as 'Invalid', caused by 58xx cards I assume. Will I get credit granted for those later?


Right now my focus is on trying to get the server running stably again and upgrading to the new application. I'm not sure if I'm going to have time to go through the database manually and fix everyone's lost credit. Most of these workunits have also been purged from the database by now, so there's really no good way to update and grant lost credit. I think it's just something everyone is going to have to live with and I apologize for that.


3. If this has been going on for 6 months and you have only spotted the problem recently, you obviously have a serious problem with your validation method. Do you have a strategy now to prevent it from happening again?


Well the real issue here was that we went from doing nearly no validation (we were only validating a minority of results which actually improved our search populations), to doing a lot more validation which made the problem really apparent -- so I guess the swap was a good thing :) On our end, we don't really need this extra validation because results which don't improve our search populations aren't particularly important, other than to weed out bad applications (which in this case we were unlucky enough to have one). But at any rate, I think with the more strict validation we have in place now, this kind of thing shouldn't happen again.


Despite these problems I still think you guys at Milkyway@home are nr. 1 in BOINC


Glad after all of this we aren't totally hated here :)
____________

Brian Priebe
Joined: 27 Nov 09
Posts: 98
Credit: 172,335,484
RAC: 80,913
Message 38336 - Posted: 7 Apr 2010 | 22:49:43 UTC - in response to Message 38325.
Last modified: 7 Apr 2010 | 22:50:52 UTC

But at any rate, I think with the more strict validation we have in place now, this kind of thing shouldn't happen again.
For a possible counter-example, check Workunit 90623954. Two anonymous platforms sporting versions 0.20b and 0.22 out-quorumed an HD5870 running version 0.23. All of the results were from ATI Cypress boards (HD5870 and HD5850 apparently).

Shouldn't 5xxx-series GPU results from applications prior to 0.23 be automatically discarded?

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38341 - Posted: 7 Apr 2010 | 23:14:50 UTC - in response to Message 38336.


Shouldn't 5xxx-series GPU results from applications prior to 0.23 be automatically discarded?


They should... but that takes a couple of extra database queries per workunit, and the server is crashing enough as it is. I had the check in there for a while and the server couldn't keep up with it.
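
For what it's worth, the check itself is simple once the GPU series and application version are in hand; the expensive part is looking them up. A hedged Python sketch of the rule, with made-up names (`parse_version`, `accept_result`) and an assumed way of labelling the GPU series, could look like this:

```python
import re

# Per this thread, 5xxx-series results are only trusted from v0.23 onwards.
MIN_5XXX_VERSION = (0, 23)


def parse_version(text: str) -> tuple:
    """Extract (major, minor) from strings such as '0.20b' or '0.23'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def accept_result(gpu_series: str, app_version: str) -> bool:
    """Hypothetical filter: reject 5xxx-series results from old applications.

    gpu_series is assumed to be a label like '58xx' or '48xx'.
    """
    if gpu_series.startswith("5"):
        return parse_version(app_version) >= MIN_5XXX_VERSION
    return True


print(accept_result("58xx", "0.20b"))  # False
print(accept_result("58xx", "0.23"))   # True
print(accept_result("48xx", "0.20b"))  # True
```

This only shows the decision logic in memory; it does nothing about the cost of the extra database queries, which is the real obstacle here.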
____________

Profile Gary Roberts
Joined: 1 Mar 09
Posts: 52
Credit: 1,085,070,246
RAC: 812,678
Message 38344 - Posted: 7 Apr 2010 | 23:31:31 UTC - in response to Message 38336.

Two anonymous platforms sporting versions 0.20b and 0.22 out-quorumed an HD5870 running version 0.23.

There are probably a significant number of people running AP and not paying close attention to the boards. Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well. Hopefully they are monitoring their email a bit more closely.

____________
Cheers,
Gary.

Brian Priebe
Joined: 27 Nov 09
Posts: 98
Credit: 172,335,484
RAC: 80,913
Message 38353 - Posted: 8 Apr 2010 | 0:58:36 UTC - in response to Message 38344.

Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well.
I personally think it would be more appropriate if RPI were sending out these e-mails but...

...I've notified 4 other owners as requested.

Profile banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 38356 - Posted: 8 Apr 2010 | 2:08:24 UTC - in response to Message 38353.

Anybody noticing cases of 5800 series cards still running the wrong app should send a PM to the owner (if possible) since that will give them an email as well.
I personally think it would be more appropriate if RPI were sending out these e-mails but...

Doing a mass email to everyone should be easy. Just the ones who don't want emails from the project would be left out, then individual emails.
____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Profile Simplex0
Joined: 11 Nov 07
Posts: 232
Credit: 178,221,048
RAC: 0
Message 38362 - Posted: 8 Apr 2010 | 9:00:39 UTC - in response to Message 38341.


Shouldn't 5xxx-series GPU results from applications prior to 0.23 be automatically discarded?


They should... but that takes a couple of extra database queries per workunit, and the server is crashing enough as it is. I had the check in there for a while and the server couldn't keep up with it.



Until that problem is sorted out I will run Folding@home instead.
Hope you will have this fixed soon.

Chris S
Joined: 20 Sep 08
Posts: 1357
Credit: 173,075,472
RAC: 7
Message 38627 - Posted: 12 Apr 2010 | 11:26:39 UTC

I have unfortunately had to swap 7 machines running various 3850, 4850, and 4870 cards on to another project as each one was producing 90% computation errors or work not validated. I'll check back in a few days and see if this is still continuing.
____________
Don't drink water, that's the stuff that rusts pipes

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38643 - Posted: 12 Apr 2010 | 19:14:31 UTC - in response to Message 38627.

I have unfortunately had to swap 7 machines running various 3850, 4850, and 4870 cards on to another project as each one was producing 90% computation errors or work not validated. I'll check back in a few days and see if this is still continuing.


Did your machines upgrade to the correct application (0.23) and are they running the right brook32/64.dll?

If they're giving that many errors it's probably because they're using the wrong application.
____________

Profile Tomasz R. Gwiazda
Joined: 23 Mar 09
Posts: 13
Credit: 100,032,796
RAC: 0
Message 38646 - Posted: 12 Apr 2010 | 19:37:08 UTC - in response to Message 38643.

hi Travis

I've got a question: my 3 hosts (2x5870, 5870, 4850) were disconnected from MW, and after you published the new app 0.23 they reconnected.
Everything works fine.
But from time to time I receive some WUs for app 0.19, even when I manually delete the app from the HDD.
Is that supposed to happen?
____________

Join us at www.boincatpoland.org

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38648 - Posted: 12 Apr 2010 | 20:00:30 UTC - in response to Message 38646.

hi Travis

I've got a question: my 3 hosts (2x5870, 5870, 4850) were disconnected from MW, and after you published the new app 0.23 they reconnected.
Everything works fine.
But from time to time I receive some WUs for app 0.19, even when I manually delete the app from the HDD.
Is that supposed to happen?


If you're running windows, for CPU the highest app version is still 0.19. So if it's running 0.19 on the CPU that's not a problem.
____________

Profile Tomasz R. Gwiazda
Joined: 23 Mar 09
Posts: 13
Credit: 100,032,796
RAC: 0
Message 38649 - Posted: 12 Apr 2010 | 20:08:01 UTC - in response to Message 38648.

Yes, I run Win7 and XP,
but I'm not using the CPU at all (never used it for MW).
Hmm, maybe it's the "fault" of the MW preferences:

Use CPU
(enforced by 6.10+ clients)
____________

Join us at www.boincatpoland.org

Chris S
Joined: 20 Sep 08
Posts: 1357
Credit: 173,075,472
RAC: 7
Message 38671 - Posted: 13 Apr 2010 | 9:40:00 UTC - in response to Message 38643.

Did your machines upgrade to the correct application (0.23) and are they running the right brook32/64.dll?

If they're giving that many errors it's probably because they're using the wrong application.


Thanks for the response. I've had a bit of a change round and they seem OK now, so they are back crunching for MW after many days of lost work! See my post here. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1679&nowrap=true#38670
____________
Don't drink water, that's the stuff that rusts pipes

Profile Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 358
Credit: 256,958,638
RAC: 2,724
Message 38677 - Posted: 13 Apr 2010 | 11:30:29 UTC - in response to Message 38648.
Last modified: 13 Apr 2010 | 11:32:58 UTC

hi Travis

I've got a question: my 3 hosts (2x5870, 5870, 4850) were disconnected from MW, and after you published the new app 0.23 they reconnected.
Everything works fine.
But from time to time I receive some WUs for app 0.19, even when I manually delete the app from the HDD.
Is that supposed to happen?


If you're running windows, for CPU the highest app version is still 0.19. So if it's running 0.19 on the CPU that's not a problem.


I think I know what he's talking about.
Yesterday, for some reason, I had the same thing happening here on two machines. Although they run on anonymous platform, which only has the GPU app specified, the server sent some tasks assigned to the CPU app (0.19), which was neither selected in the prefs (Don't use CPU) nor specified in the app_info.xml.

That shouldn't have happened at all.

The BOINC client started all of those tasks at once using the GPU app (I checked that in the task manager) and labeled them as CPU app in the BOINC manager.

So my 8 core machine had 2 GPU apps running (as specified in the app_info.xml) and another 8 active tasks showing up as CPU tasks in the manager, although it actually used the GPU app for them.

The result was that the V8 slowed down quite a bit and the other one locked up completely.

EDIT
Same thing has happened on Collatz -> http://boinc.thesonntags.com/collatz/forum_thread.php?id=370

So I think it's a server-side bug, where it honors neither the prefs nor the apps specified in an app_info.xml.
____________

Join BOINC United now!

Snagletooth
Joined: 18 Feb 08
Posts: 4
Credit: 83,890
RAC: 34
Message 38716 - Posted: 14 Apr 2010 | 13:45:39 UTC

My second WU to be marked invalid: Workunit 91877832
The cuda and the HD5800 validated and left my poor Mac's CPU out in the cold.

In case it's purged:

de_s222_3s_13_360570_1270920195
application MilkyWay@Home
created 10 Apr 2010 17:23:15 UTC
canonical result 99637572
granted credit 213.76
minimum quorum 2
initial replication 3
max # of error/total/success tasks 3, 9, 6
Task ID | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
99637572 | 121302 | 10 Apr 2010 17:27:11 UTC | 11 Apr 2010 21:10:34 UTC | Completed and validated | 813.94 | 24.32 | 0.18 | 213.76 | MilkyWay@Home v0.24 (cuda23)
99698813 | 42872 | 11 Apr 2010 23:32:02 UTC | 14 Apr 2010 11:42:03 UTC | Completed, marked as invalid | 27,759.43 | 25,764.73 | 195.20 | 0.00 | MilkyWay@Home v0.26
101340563 | 155415 | 14 Apr 2010 11:43:12 UTC | 14 Apr 2010 11:59:27 UTC | Completed and validated | 78.36 | 6.51 | 0.05 | 213.76 | MilkyWay@Home v0.23 (ati13ati)

Profile [TG-SET]ZI
Joined: 11 Sep 09
Posts: 5
Credit: 40,080,262
RAC: 0
Message 38764 - Posted: 15 Apr 2010 | 21:41:25 UTC - in response to Message 38716.

When is this crap going to work again? It can't be right that 50% of the work goes down the drain.

Profile Crunch3r
Volunteer developer
Joined: 17 Feb 08
Posts: 358
Credit: 256,958,638
RAC: 2,724
Message 38767 - Posted: 16 Apr 2010 | 1:14:26 UTC - in response to Message 38764.

When is this crap going to work again? It can't be right that 50% of the work goes down the drain.


The "crap" works just as it should. It is most likely down to your crappy overclocking... your crappy card apparently can't keep up and only produces wrong results. That's why you aren't getting any crappy credits any more...

What a load of crap, really ...


____________

Join BOINC United now!

Profile [TG-SET]ZI
Joined: 11 Sep 09
Posts: 5
Credit: 40,080,262
RAC: 0
Message 38774 - Posted: 16 Apr 2010 | 8:11:28 UTC - in response to Message 38767.

Overclocked or not, the result isn't pretty.

[boinc.at] Nowi
Joined: 22 Mar 09
Posts: 89
Credit: 346,231,241
RAC: 416,187
Message 38775 - Posted: 16 Apr 2010 | 8:38:07 UTC

Please,

we are all doing science here, so it can happen that things sometimes go wrong. I don't want to think back on how many experiments in my PhD thesis (organic chemistry) went wrong, and the black slime in the flask really did look like Sch...

So please keep your composure :-)

Profile [TG-SET]ZI
Joined: 11 Sep 09
Posts: 5
Credit: 40,080,262
RAC: 0
Message 38789 - Posted: 16 Apr 2010 | 22:46:42 UTC - in response to Message 38775.

If my card works on all other projects and only fails here, then the problem isn't my card.

Profile Cori
Joined: 27 Aug 07
Posts: 647
Credit: 27,592,547
RAC: 0
Message 38791 - Posted: 16 Apr 2010 | 23:00:37 UTC - in response to Message 38789.

You can't really say that in general. Maybe other projects simply tolerate larger deviations in the results. A lower error tolerance is not a "fault" of the project, though. :-P
____________
Lovely greetings, Cori

Profile [TG-SET]ZI
Joined: 11 Sep 09
Posts: 5
Credit: 40,080,262
RAC: 0
Message 38792 - Posted: 16 Apr 2010 | 23:05:25 UTC - in response to Message 38791.

It's not as if my card has never worked here. It used to work 100%.

Len LE/GE
Joined: 8 Feb 08
Posts: 232
Credit: 86,935,187
RAC: 37,658
Message 38799 - Posted: 17 Apr 2010 | 1:04:45 UTC - in response to Message 38792.

A few people have written that they had to reduce their overclock a bit with the new ATI app.

Different programs stress the GPU in different ways. If you are running at the limit of the card, even a new program version can push it over the line.
Suggestion: Set your card back to default speed for the moment. Wait until the new program (MW3) is out and stable, then see where your new OC limit is.

Andris Pavenis
Joined: 22 Mar 08
Posts: 1
Credit: 2,056,687
RAC: 0
Message 38810 - Posted: 17 Apr 2010 | 10:30:36 UTC

Maybe it would be a good idea to send the WUs for validation to different types of systems if possible (for example, no more than one to a 58X0, etc.), or no more than one to systems that have had many validation problems recently (58X0 would currently qualify for that, but in future that could of course change).
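
A very rough Python sketch of what I mean, assuming the scheduler has a list of candidate hosts with a hardware class attached. The function name, the host IDs and the class labels are invented for the example and have nothing to do with how the BOINC scheduler actually stores this.

```python
def pick_hosts(candidates, replicas):
    """Pick up to `replicas` hosts, at most one per hardware class.

    candidates: iterable of (host_id, hardware_class) pairs in the order
    the scheduler would otherwise have served them (hypothetical data).
    """
    chosen = []
    seen_classes = set()
    for host_id, hardware_class in candidates:
        if hardware_class in seen_classes:
            continue  # this class already holds one copy of the WU
        chosen.append(host_id)
        seen_classes.add(hardware_class)
        if len(chosen) == replicas:
            break
    return chosen


candidates = [(1, "ati 58xx"), (2, "ati 58xx"), (3, "ati 48xx"),
              (4, "cuda"), (5, "cpu")]
print(pick_hosts(candidates, 3))  # [1, 3, 4]
```

That way one faulty hardware class could never form a quorum on its own.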

Profile [TG-SET]ZI
Joined: 11 Sep 09
Posts: 5
Credit: 40,080,262
RAC: 0
Message 38811 - Posted: 17 Apr 2010 | 21:54:30 UTC - in response to Message 38810.

But if the error tolerance is so low that you can't even move the mouse any more, that's maybe a bit excessive. Sometimes you actually have things to do on the PC.

Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 38829 - Posted: 18 Apr 2010 | 13:38:43 UTC - in response to Message 38811.

But if the error tolerance is so low that you can't even move the mouse any more, that's maybe a bit excessive. Sometimes you actually have things to do on the PC.

If your cards are already spitting out faulty results when you merely move the mouse, I would be seriously concerned. For me it makes no difference at all. I also run MW on my normal work PC (I put a 3870X2 in it); I can do whatever I want there and it runs without invalid results.

It might help if you gave some details about your machine (GPU, clocks, OS, driver, MW version).

PS:
By the way, posting in English is recommended if possible.

Safari Hood
Joined: 6 Jul 09
Posts: 1
Credit: 1,714,690
RAC: 0
Message 38835 - Posted: 18 Apr 2010 | 16:25:51 UTC

So, am I the only one running nVidia that's still having troubles?

Host reference :
http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=161353

I've been keeping up w/the news about the validation problem. I'm using the stock applications from MW@Home, and I'm running nVidia (dual GTX 295).

I don't overclock the CPU or GPU because I don't want to void the warranty.

I uploaded 24 WUs a little bit ago, only to have 9 of them go Validation inconclusive.

Any help would be appreciated.

- Safari Hood

Microcruncher*
Joined: 1 Jul 09
Posts: 7
Credit: 1,734,500
RAC: 0
Message 38836 - Posted: 18 Apr 2010 | 16:53:51 UTC - in response to Message 38835.
Last modified: 18 Apr 2010 | 17:03:13 UTC

I uploaded 24 WUs a little bit ago, only to have 9 of them go Validation inconclusive.

For some obscure reason WUs are listed as "Completed, validation inconclusive" instead of "Completed, waiting for validation" even if there is no other deviating result that would make the validation inconclusive. Currently I see 8 of 10 of your WUs in the "validation inconclusive" state although there are no deviating results.

One WU is labeled as "Completed, waiting for validation" with one task "Completed, waiting for validation" and the other task "Completed, validation inconclusive".

The validator is still buggy in this regard.

paris
Joined: 26 Apr 08
Posts: 56
Credit: 3,609,134
RAC: 7,581
Message 38841 - Posted: 18 Apr 2010 | 17:53:38 UTC

I posted the following message earlier, but I may have stuck it in the wrong thread:

I don't know if these are the workunits that I am having problems with but I am using the stock app with a stock Mac mini core duo (plus a few on a core 2 duo) and more than half of the units I return are marked invalid. Many of the units start with a quorum of one and then after they are returned the quorum jumps to two. I usually end up paired with a GPU and then it will validate against a third machine (also running a GPU) leaving mine as odd man out. Shouldn't the stock app on a stock CPU return (mostly) valid results?

I like this project and I realize that I am not helping very much compared with the GPU folks but I would really like to be able to return valid units. Any help or insight would be appreciated. I have tried to keep up with the postings on the forum boards here, so if I missed something vital somewhere please point me in the right direction. Thank you.

____________

Plus SETI Classic = 21,082 WUs

Profile banditwolf
Joined: 12 Nov 07
Posts: 2425
Credit: 295,133
RAC: 0
Message 38842 - Posted: 18 Apr 2010 | 18:22:00 UTC - in response to Message 38836.

I uploaded 24 WUs a little bit ago, only to have 9 of them go Validation inconclusive.

For some obscure reason WUs are listed as "Completed, validation inconclusive" instead of "Completed, waiting for validation" even if there is no other deviating result that would make the validation inconclusive. Currently I see 8 of 10 of your WUs in the "validation inconclusive" state although there are no deviating results.

One WU is labeled as "Completed, waiting for validation" with one task "Completed, waiting for validation" and the other task "Completed, validation inconclusive".

The validator is still buggy in this regard.

"Completed, waiting for validation" = waiting on another task
"Completed, validation inconclusive" = wu wasn't quite on target, and another was sent out and waiting for it to come back.

____________
Doesn't expecting the unexpected make the unexpected the expected?
If it makes sense, DON'T do it.

Microcruncher*
Joined: 1 Jul 09
Posts: 7
Credit: 1,734,500
RAC: 0
Message 38844 - Posted: 18 Apr 2010 | 18:44:53 UTC - in response to Message 38842.
Last modified: 18 Apr 2010 | 18:46:07 UTC

I uploaded 24 WUs a little bit ago, only to have 9 of them go Validation inconclusive.

For some obscure reason WUs are listed as "Completed, validation inconclusive" instead of "Completed, waiting for validation" even if there is no other deviating result that would make the validation inconclusive. Currently I see 8 of 10 of your WUs in the "validation inconclusive" state although there are no deviating results.

One WU is labeled as "Completed, waiting for validation" with one task "Completed, waiting for validation" and the other task "Completed, validation inconclusive".

The validator is still buggy in this regard.

"Completed, waiting for validation" = waiting on another task
"Completed, validation inconclusive" = wu wasn't quite on target, and another was sent out and waiting for it to come back.


The entry "Completed, validation inconclusive" should only be set if there are deviating results and not if the result is the only one sent back. If the other WU is unsent or no result is reported back the entry should be set to "Completed, waiting for validation".

Just my two cents...

Brian Priebe
Joined: 27 Nov 09
Posts: 98
Credit: 172,335,484
RAC: 80,913
Message 38848 - Posted: 18 Apr 2010 | 21:48:47 UTC - in response to Message 38835.

I uploaded 24 WUs a little bit ago, only to have 9 of them go Validation inconclusive.
Offhand, except for Workunit 90594183, all of the pending WU's you submitted appear to be in the expected state based on what my own units do. All results require a confirmation from another client (quorum=2) and your work units are showing 'Validation inconclusive' because none of the other results have come back yet. In fact, some of those confirmation tasks haven't even been sent out yet.

You should post a complaint about 90594183 in the "Number Crunching: Waiting for Validation..." thread (http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=1678) to let Travis know there is a problem. A lot of us had problems with validation around the same time.

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38855 - Posted: 19 Apr 2010 | 2:42:30 UTC - in response to Message 38844.

I uploaded 24 WUs a little bit ago, only to have 9 of them go Validation inconclusive.

For some obscure reason WUs are listed as "Completed, validation inconclusive" instead of "Completed, waiting for validation" even if there is no other deviating result that would make the validation inconclusive. Currently I see 8 of 10 of your WUs in the "validation inconclusive" state although there are no deviating results.

One WU is labeled as "Completed, waiting for validation" with one task "Completed, waiting for validation" and the other task "Completed, validation inconclusive".

The validator is still buggy in this regard.

"Completed, waiting for validation" = waiting on another task
"Completed, validation inconclusive" = wu wasn't quite on target, and another was sent out and waiting for it to come back.


The entry "Completed, validation inconclusive" should only be set if there are deviating results and not if the result is the only one sent back. If the other WU is unsent or no result is reported back the entry should be set to "Completed, waiting for validation".

Just my two cents...



Sadly this is just how the BOINC server relays messages.

If your WU gets set to Completed, validation inconclusive, that just means the server has sent out more results and is waiting to complete validation.

Completed, waiting for validation means that the result hasn't gone through the validator at all yet.
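
Put as a toy model in Python (this is only a sketch of the two states described above, not BOINC server code; the function and its flags are invented for the illustration):

```python
def task_status(seen_by_validator: bool, validation_finished: bool,
                valid: bool = False) -> str:
    """Map a returned result to the status text shown on the task page.

    Hypothetical model: a result the validator has not looked at yet is
    'waiting'; one it has looked at without reaching a verdict (for
    example because more results were sent out) is 'inconclusive'.
    """
    if not seen_by_validator:
        return "Completed, waiting for validation"
    if not validation_finished:
        return "Completed, validation inconclusive"
    if valid:
        return "Completed and validated"
    return "Completed, marked as invalid"


print(task_status(False, False))       # Completed, waiting for validation
print(task_status(True, False))        # Completed, validation inconclusive
print(task_status(True, True, True))   # Completed and validated
```

So "inconclusive" does not by itself mean a deviating result was found; it only means the validator has seen the result and is still waiting on others.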
____________

Fix The USA Impeach Obama
Joined: 17 Mar 08
Posts: 165
Credit: 410,224,363
RAC: 0
Message 38857 - Posted: 19 Apr 2010 | 4:11:53 UTC

OOPS it would seem that the server is out of work..

Ouch.. :)


____________

Profile Travis
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 30 Aug 07
Posts: 1976
Credit: 26,480
RAC: 0
Message 38858 - Posted: 19 Apr 2010 | 4:38:13 UTC - in response to Message 38857.

OOPS it would seem that the server is out of work..

Ouch.. :)



Generating work right now (see the other news post). Things should be up and running...
____________


Copyright © 2013 AstroInformatics Group