Message boards : News : testing new validator
Author | Message |
---|---|
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
As stated in the corresponding thread, the 5xxx series' ability to work is very questionable right now, due to bugs in ATI's OpenCL SDK implementation. They promised to fix those in a new SDK release; we'll see... I recall GPUGRID saying that ATI OpenCL was completely unusable: it kept locking up the machine at random, and there were also major performance problems on 4xxx cards that rendered them useless for any purpose. We have an OpenCL version of the MW@Home GPU application... and it's about 10x slower on both NVIDIA and ATI cards. OpenCL still needs a lot of work, it seems... If someone with both cards could do some comparisons, the numbers would be very helpful. When I release the code for the new application I'll have some real-sized workunit examples and the output that will be required (it will have to agree to within at least 10e-11). Hopefully this will help us figure out the problem. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
The 1 max error is because our application really shouldn't error out. Chances are that if there's an error it was our fault (i.e., a badly generated or specified workunit), and we don't want to send out more bad WUs. I don't mind upping it to 3 if people would prefer that, however. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
It knows the time taken but doesn't use this for validation. I'm not quite sure how that would be helpful. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Good catch. There was a small bug in the check_set code for the validator. This shouldn't happen anymore.
The only thing used is the fitness value reported by the application. If the fitness returned is within 10e-11 of 2 other fitnesses for the quorum, it's valid. |
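The check Travis describes can be sketched as follows. This is a purely illustrative Python reconstruction (the function name and structure are assumptions; the actual validator is server-side and, per a later post in this thread, written in Java):

```python
TOLERANCE = 10e-11  # i.e. 1e-10, the tolerance quoted in this thread

def is_valid(candidate, other_fitnesses, tol=TOLERANCE):
    """Hypothetical reconstruction of the rule described above: a result
    is valid if its fitness is within tol of at least two other fitness
    values in the quorum."""
    agreeing = sum(1 for f in other_fitnesses if abs(candidate - f) <= tol)
    return agreeing >= 2
```

For example, a fitness of 1.0 validates against wingmen at 1.0 + 5e-11 and 1.0 - 5e-11, but not if one wingman is 5e-10 away.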
Send message Joined: 1 Jul 09 Posts: 8 Credit: 1,734,500 RAC: 0 |
Shouldn't it read: Completed, waiting for validation instead of Completed, validation inconclusive if one result is returned and no other results are reported? [EDIT]: Fixed a typo... |
Send message Joined: 29 Aug 07 Posts: 115 Credit: 501,600,397 RAC: 5,019 |
|
Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
If not, what exactly is used? Thanks very much for the reply. All we can see in the data returned is what's shown below. This is one of the invalids from the quorum I linked previously. I can't see any 'fitness' value in there, so can you advise whether it's possible to get that value from somewhere? I imagine you could trawl the slot directory and find it there for your own host before the result is uploaded, but that doesn't help with finding the fitness for each of your wingmen.

Device 0: ATI Radeon HD5800 series (Cypress)
1024 MB local RAM (remote 2047 MB cached + 2047 MB uncached)
GPU core clock: 850 MHz, memory clock: 1200 MHz
1600 shader units organized in 20 SIMDs with 16 VLIW units (5-issue), wavefront size 64 threads
supporting double precision
Starting WU on GPU 0
main integral, 640 iterations
predicted runtime per iteration is 123 ms (33.3333 ms are allowed), dividing each iteration in 4 parts
borders of the domains at 0 400 800 1200 1600
Calculated about 3.28897e+013 floatingpoint ops on GPU, 2.47165e+008 on FPU. Approximate GPU time 84.7168 seconds.
probability calculation (stars)
Calculated about 3.34818e+009 floatingpoint ops on FPU.
WU completed. CPU time: 3.04202 seconds, GPU time: 84.7168 seconds, wall clock time: 86.535 seconds, CPU frequency: 2.87056 GHz
</stderr_txt>

Cheers, Gary. |
Send message Joined: 1 Jul 09 Posts: 8 Credit: 1,734,500 RAC: 0 |
Here is another one: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=89444466 (errors: Too many success results). One 0.19 result on a CPU and two 0.20b results (one on an HD 47xx/48xx and one on an HD 58xx) led to this weird result. |
Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
Here is another one: Bug already fixed. Check the third post in this thread. Cheers, Gary. |
Send message Joined: 1 Jul 09 Posts: 8 Credit: 1,734,500 RAC: 0 |
Here is another one: Thank you. I think I need more coffee... |
Send message Joined: 22 Jan 09 Posts: 35 Credit: 46,731,190 RAC: 0 |
82% valid tasks is not going to work in the long run, obviously. But I'll hang around for the shakedown. Well, I'm out of here for the time being, as 82% is not satisfactory for me! I realize that MilkyWay is still classed (as far as I know) as an Alpha project, but IMHO it is mature enough that they shouldn't be running tests in a production environment. At least some of these bugs (if not the majority of them) SHOULD have been caught in testing before releasing this new validator into the wild. See y'all later. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
82% valid tasks is not going to work in the long run, obviously. But I'll hang around for the shakedown. Right now it looks like the problem isn't the validator but the (optimized?) GPU applications. I don't think it will take us too long to sort this out. And honestly, I put the new validator out tonight and only screwed up a few workunits; I don't think that's too bad :P There are a lot of things you just can't catch until you put that kind of thing out in the wild anyway. Like I mentioned in the previous post, I rewrote the assimilator/validator code from the ground up in Java. This is going to make debugging and testing a LOT easier (yay garbage collection, exceptions and no more segmentation faults), and the validator much more stable (no memory leaks, no writing to bad areas of memory). Oddly enough, it seems to be using significantly less CPU than the older version (which was C/C++). |
Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
The 1 max error is because our application really shouldn't error out. Chances are if there's an error it was our fault (i.e., a badly generated or specified workunit), and we don't want to send out more bad WUs. With an initial replication (IR) of 3, if the whole WU is bad, all 3 will be bad and you'll quickly hit the 3 error results limit. You shouldn't underestimate the ability of the average cruncher to trash the tasks even if the app itself really shouldn't error out :-). Also, it's very frustrating to the CPU crunchers to see many hours of work down the drain just because of a second error result in a quorum before the third success result has had a chance to come in. What problem is there in sending out an extra copy or two of the task to see if you can get a quorum? I don't mind upping it to 3 if people would prefer that, however. Well, at least make it 2 so as to give a bit more protection to those who have invested their resources (and put a memo on your monitor bezel to "Not send out any bad WUs" :-). Cheers, Gary. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
The 1 max error is because our application really shouldn't error out. Chances are if there's an error it was our fault (ie, a badly generated or specified workunit), and we don't want to send out more bad WUs. Good points. I upped the max error results to 3. This should be reflected in all the current (and new) workunits. |
Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
Here's another example of the 'Too many success results' bug. Note that one of the victims actually invested over a day of CPU time for no reward. I guess he won't be particularly impressed. I wonder why people still persist with slow CPUs on a project like this? And here's one that is rather more important that I've just noticed. It looks like Travis has set 3,6,6 for errors/total/success, and this quorum has failed with the error message "Too many total results". However, there are only 6 tasks listed in the quorum, one of which is a 'client detached'. Perhaps that triggered an attempt to send out a 7th copy, which junked the whole quorum. Because of the conflict between 48xx and 58xx results, there mustn't have been 3 agreeing results at the time the attempt was made to send out the 7th copy. Until things are sorted regarding validation, perhaps it should be 3,9,6 rather than 3,6,6 to prevent this problem. EDIT: If you think about it, it makes sense to have the 'total' equal to the sum of 'errors' and 'success' so that all bases are covered. Cheers, Gary. |
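The accounting Gary describes can be checked with a toy sketch (the copy counts are hypothetical numbers matching his scenario, not values read from the server):

```python
# Hypothetical quorum accounting for the 3,6,6 scenario described above.
max_errors, max_total, max_success = 3, 6, 6

# Six copies issued: one 'client detached' plus five successes that never
# produced three agreeing fitnesses (the 48xx vs 58xx split).
copies_issued = 6

# Issuing a seventh copy would exceed max_total, so the workunit errors
# out with "Too many total results".
assert copies_issued + 1 > max_total

# With total = errors + success (i.e. 3,9,6) the extra copy is allowed
# and the quorum can still be completed.
assert copies_issued + 1 <= 3 + 6
```

This is why setting total equal to errors plus success covers both budgets: the total cap can never fire before one of the other two limits.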
Send message Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
So it is possible that most of these results would be accurate to 10e-11 if compared only against an unoptimised CPU application, but the results from ATI 48xx, NVIDIA and optimised CPU applications are on one side of the required fitness value and the results from ATI 58xx and 5970 are on the other side. Therefore the difference between results from these two sets of hardware exceeds 10e-11, even though individual results compared against an unoptimised CPU application may still have the required accuracy. And is this the reason that some projects that validate results with a quorum need to use homogeneous redundancy to ensure accurate results on different types of hardware? |
Send message Joined: 7 Feb 09 Posts: 9 Credit: 25,983,618 RAC: 0 |
Is the "Canonical" result used in any way in determining the validity of results? I haven't checked too many, but have noted that the first result in sometimes determines validity or invalidity. Strangely enough, the main variance that I can see is that if the first one in is either a 48xx or a 57xx/58xx and made the Canonical result, then all WUs returned with that series of card are validated, whereas higher (or lower) cards are invalidated. All other data showing in the text file we get to see is usually the same. It is especially annoying when the seen calcs return 10e-13 on the GPU FPU, the required figure on the GPU, as with the canonical, whereas we are ruled invalid. Stars are usually at 10e-9. Possibly an invalid opinion, but sometimes a coincidence... |
Send message Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
Is the "Canonical" result used in any way in determining the validity of results? I haven't checked too many, but have noted that the first result in sometimes determines validity or invalidity. I think it's the other way around. The validator selects those results that agree (within specification) and one of them (perhaps the first one) is nominated as 'canonical'. Maybe it's the one whose answer is closest to the average of all valid results for that quorum. I guess it depends on how the validator has been written. .... Take a closer look. The numbers you are quoting are 'flops', not 'fitness', and they are e+013 and e+009 rather than 'minus'. Travis has already said that only 'fitness' is used for validation, but he hasn't answered (yet) about where we might be able to observe the actual 'fitness' values for results in a quorum. I suspect we can't access those values, which will make it rather unsatisfactory for anyone trying to understand why results are being deemed invalid. Seeing as the program is being modified at the moment, it might be a good opportunity to add some code to display on the website the fitness value returned by each successful task. Cheers, Gary. |
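For what it's worth, the canonical-selection behaviour Gary guesses at can be sketched like a BOINC-style check_set: scan the quorum for the first result that agrees with enough others and nominate it canonical. This Python version is purely illustrative (the function name, tie-breaking rule, and return convention are assumptions, not the project's actual code):

```python
def check_set(fitnesses, tol=10e-11, min_quorum=3):
    """Return the index of a hypothetical 'canonical' result: the first
    result whose fitness agrees (within tol) with at least min_quorum - 1
    others, or None if no consistent subset of that size exists."""
    for i, f in enumerate(fitnesses):
        agreeing = sum(1 for j, g in enumerate(fitnesses)
                       if j != i and abs(f - g) <= tol)
        if agreeing >= min_quorum - 1:
            return i
    return None
```

Under this rule, whichever agreeing result arrived first becomes canonical, which would explain why the first card family to report sometimes appears to "decide" the quorum.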
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
Oh well...NNT until the problems with the validator have been overcome. |
Send message Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0 |
I'm still gathering up a lot of "can't validate" messages. What does "check skipped" mean anyway? |
©2024 Astroinformatics Group