validator strictness
Author | Message |
---|---|
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
I've lowered the strictness of the validator from 10e-11 to 10e-10. I'm hoping this should significantly reduce the number of WUs flagged invalid. If the issue persists I might have to lower it further to 10e-9. The new application will have the strictness back at 10e-11, so keep that in mind if you're compiling your own versions. The issue we're having seems to be that the ATI 48xx GPUs and the ATI 58xx GPUs are returning different results, and if too many of either make it into the quorum they will invalidate the other results (including stock results). I'm still trying to determine if the 58xx GPU or the 48xx GPU is the one correctly validating against the stock application. I've also updated the validator so if you check your tasks they will show what fitness they reported, so you can compare vs other tasks for the same workunit. I'm hoping we should have this issue straightened out shortly, and thanks for your patience. |
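As a rough illustration of what the strictness setting does, the sketch below compares two reported fitness values against a tolerance. It assumes the validator uses a plain absolute difference, which the post does not state; the function name `fitness_matches` and the sample values are hypothetical and not taken from the project's validator.

```python
def fitness_matches(a: float, b: float, tolerance: float) -> bool:
    """Return True when two fitness values agree within the tolerance.

    Assumption: an absolute-difference check; the real validator may differ.
    """
    return abs(a - b) <= tolerance


# Thresholds written as in the post above. Note that the literal 10e-11
# denotes 1e-10, and 10e-10 denotes 1e-9.
OLD_TOLERANCE = 10e-11
NEW_TOLERANCE = 10e-10

# Hypothetical pair of results that differ by roughly 5e-10.
reference = -3.0
other = -3.0 + 5e-10

print(fitness_matches(reference, other, OLD_TOLERANCE))  # checked with the old setting
print(fitness_matches(reference, other, NEW_TOLERANCE))  # checked with the new setting
```

A pair like this would be rejected under the stricter setting and accepted under the relaxed one, which is the effect the change is meant to have on borderline results.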
Joined: 6 Mar 09 Posts: 41 Credit: 38,856,291 RAC: 0 |
I don't see any decrease in invalids on my 4870's... |
Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I think the services of Cluster Physik are needed! You need to add the ATI 38XX series as also being different to the 58XX series. |
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
"I think the services of Cluster Physik are needed!"
The HD38xx GPUs return the exact same results as the HD47xx/48xx GPUs (I have tested both series extensively before making the applications available), and those also validate against CPU and CUDA. Only the HD5800 series GPUs appear to deviate significantly. I have no idea why, as they execute the exact same code as the other GPUs. Maybe it's a driver/compiler hiccup of some sort. I don't have an HD5800 series GPU, so I have no way to test for it. But I found a WU that was calculated with several different applications. Fitness values:
-3.19087277379105500000 (CPU 0.20, SSE3, x64)
-3.19087277379105500000 (HD4870/HD38xx, 0.21)
-3.19087277379125100000 (CUDA 0.24)
-3.19087286725516700000 (HD5870 0.21)
As you see, the HD38xx/48xx GPUs deliver the exact same result as the CPU (all my versions starting with 0.20 return the exact same values). The CUDA application arrives at a slightly different one, but the deviation is in the 10^-13 range, completely acceptable (the stock CPU application typically deviates by about the same amount from the CUDA and my versions; I put some special stuff in to "guard" the calculations against the differences between the architectures). But what goes on with the HD5800 GPUs I've no idea right now. Can anyone check if this depends on the driver version? |
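The size of each deviation can be checked directly from the four fitness values quoted in the post. The sketch below takes the CPU result as the reference (a choice made here for illustration) and prints the absolute difference for each application.

```python
# Fitness values copied from the post above.
fitness = {
    "CPU 0.20 (SSE3, x64)": -3.19087277379105500000,
    "HD4870/HD38xx 0.21": -3.19087277379105500000,
    "CUDA 0.24": -3.19087277379125100000,
    "HD5870 0.21": -3.19087286725516700000,
}

# Assumption for this sketch: treat the CPU result as the reference value.
reference = fitness["CPU 0.20 (SSE3, x64)"]

# Absolute deviation of every application from the reference.
deviations = {name: abs(value - reference) for name, value in fitness.items()}

for name, deviation in deviations.items():
    print(f"{name:22s} deviation from CPU: {deviation:.3e}")
```

With these numbers the CUDA result sits around 2e-13 from the CPU value, while the HD5870 result is off by roughly 9e-8, which is larger than any of the validator tolerances mentioned in this thread.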
Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
Surely ATI wouldn't have played 'funny buggers' with the compiler and unless you utilise a particular flag that triggers DP during compiling that it compiles for SP? |
Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0 |
I'm shifting my 2x5970 from collatz to milkyway now. Hope it helps to fix the issue. |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
"The issue we're having seems to be that the ATI 48xx GPUs and the ATI 58xx GPUs are returning different results, and if too many of either make it into the quorum they will invalidate the other results (including stock results)."
Yes, precisely, and here is some info from WU 90332871, which I've just chosen at random and which shows a couple of things. Five tasks were needed to get a quorum and the three 5800 series ended up clobbering the two non-5800 cards. The two that were clobbered had precisely identical results and I doubt you would expect that if their results were tainted by overclocking or some other error-causing condition. Two of the 5800 results had no visible fitness value listed in the taskID output but since they all validated I guess they must have been pretty much identical to the one value that was visible. I've highlighted in red the non-agreement between the 5800 value and the 4800/3800 values so you can easily see that the mismatch was quite woeful (around the e-07 level). I've done this as a separate block below the main block since color tags don't seem to work inside code tags. The three that validated are marked with *.
Fitness value returned -- GPU series -- application used
========================================================
-3.16907880276272100000 - 4800 series - v0.21 (ati13ati)
-3.16907889603708600000 - 5800 series - anon v0.20b (Win64, CAL 1.4) by Gipsel*
-3.16907880276272100000 - 3800 series - v0.21 (ati13ati)
??????????????????????? - 5800 series - v0.21 (ati13ati)*
??????????????????????? - 5800 series - v0.21 (ati13ati)*

-3.16907880276272100000 - 4800 series
-3.16907889603708600000 - 5800 series
"I'm still trying to determine if the 58xx GPU or the 48xx GPU is the one correctly validating against the stock application."
Most of the above are the stock 0.21 app. I seem to recall that initially CP did the 0.21 version to take advantage of special features of 5800 series cards that gave a decent speedup. I also recall that he said it was OK for older cards to use this app: they would get the same answers but just not have the extra speedup of the 5800 series. I remember testing the app on my 4800 cards and finding that there was a minor speedup, so I did use 0.21 under AP until it was made the stock app. Now it would seem that 0.21 doesn't give the same answers on 4800/3800 series cards as it does on 5800 series. Maybe CP can throw some light on this.
"I've also updated the validator so if you check your tasks they will show what fitness they reported, so you can compare vs other tasks for the same workunit."
Thanks very much for that, it's really useful. Can you comment on why two of the above five didn't actually show a fitness value?
PS: I composed the above before I had seen CP's response, so now we have the answer that the 5800 series are giving the wrong answer. Even though the explanation has been given, I'm still going to post what I've been composing because it does highlight the major difference of the 5800 answers using a current 'in the wild' quorum. Until you can find out why this is happening and then rectify the problem, it would be rather unfair to keep penalising the 3800/4700/4800/CUDA/CPU owners who happen to get teamed up with three 5800 series crunchers. Maybe you should go back to single result validation until this gets sorted. After all, it looks like any 5800 results that get into the database might be rather useless anyway.
Cheers, Gary. |
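How three 5800 results can outvote two agreeing non-5800 results can be sketched with a toy quorum check. Everything structural here is an assumption made for illustration: the function `find_quorum` is hypothetical, the minimum quorum of 3 is a guess, results are compared by absolute difference, the two hidden 5800 values are taken to equal the visible one (as the post itself guesses), and results are assumed to arrive in the order listed.

```python
from typing import List, Optional, Set


def find_quorum(results: List[float], tolerance: float,
                min_quorum: int) -> Optional[Set[int]]:
    """Return the indices of the largest group of results lying within
    `tolerance` of one candidate result, or None if no group reaches
    `min_quorum`. Illustrative only; not the project's validator.
    """
    best: List[int] = []
    for candidate in results:
        group = [i for i, value in enumerate(results)
                 if abs(value - candidate) <= tolerance]
        if len(group) > len(best):
            best = group
    return set(best) if len(best) >= min_quorum else None


# Values from the post: 4800/3800 series versus 5800 series.
NON_5800 = -3.16907880276272100000
HD_5800 = -3.16907889603708600000

# Assumed arrival order, with the two hidden 5800 values filled in.
results = [NON_5800, HD_5800, NON_5800, HD_5800, HD_5800]

for count in range(3, len(results) + 1):
    print(count, find_quorum(results[:count], tolerance=10e-10, min_quorum=3))
```

Under these assumptions no group of three forms until the fifth result arrives, at which point the three 5800 results make up the quorum and the two identical non-5800 results are left out, matching the outcome described for this workunit.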
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
"Surely ATI wouldn't have played 'funny buggers' with the compiler and unless you utilise a particular flag that triggers DP during compiling that it compiles for SP?"
No, it doesn't work this way. All ATI cards use the exact same code (specifically using DP), but it gets JIT compiled for the specific GPU by the driver at runtime. Currently I suspect a bug somewhere in this step. Later today I will try to get a disassembly of the GPU ISA code and look for differences between HD3800, HD4700/4800 and HD5800 (as said, the code before compiling is exactly the same). There are some slight differences between the different GPU series, but they are not larger than between CPUs and GPUs, so I would not expect them to be causing this. Also, the CUDA application returns very similar results to the ATI version on HD3800 and HD4800 GPUs, and the architectural differences between them are actually larger than between HD4800 and HD5800 GPUs. So I don't think this is caused by these differences. Some strange bug somewhere is more likely. |
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
"Most of the above are the stock 0.21 app. I seem to recall that initially CP did the 0.21 version to take advantage of special features of 5800 series cards that gave a decent speedup. ..."
Actually, version 0.20b and the "stock" v0.21 for ATI are identical. Furthermore, no HD5000 specific code is used at all (that was Collatz, where a specific path for HD5000 series GPUs yielded a decent speedup). The (IL) code for all ATI GPUs is exactly the same (but gets compiled to slightly different GPU specific ISA code by the driver). For some reason I still have to figure out, it only returns different values on HD5800 GPUs. |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
"... no HD5000 specific code is used at all (that was Collatz, where a specific path for HD5000 series GPUs yielded a decent speedup)."
Yes, my apologies for the misinformation and my confusion between MW and Collatz. Having thought about it more carefully, I do remember the speedup on my HD4850s from around 17-18 mins to around 15 mins per task - obviously Collatz and not MW from the times. I wish you every success with your disassembly activities mentioned in your other response. It's really great that you are willing to spend time chasing the source of the problem. I'm sure all participants (let alone the Devs :-) ) must be (as I am) very grateful for your continuing support of this project. Cheers, Gary. |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"... no HD5000 specific code is used at all (that was Collatz, where a specific path for HD5000 series GPUs yielded a decent speedup)."
We're very thankful for everyone's help :) Anthony is also looking into the GPU issue right now. As soon as John gives me the okay with the new astronomy code I'll be making the source available for the new application as well. |
Joined: 12 Aug 09 Posts: 262 Credit: 92,631,041 RAC: 0 |
I am running an NVIDIA GTX 285 and have had a lot of invalids since 5 April. Before that it was crunching well. The RAC is dropping. What I see is that when mine are invalid, others running ATI cards are invalid as well. Perhaps you can use this information, Travis. Greetings from TJ |
Joined: 2 Jan 09 Posts: 34 Credit: 93,631,891 RAC: 0 |
I was browsing the ATI/AMD Developer KB and ran across this, might it have something to do with the apparent problems with 58xx series cards? |
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
"I was browsing the ATI/AMD Developer KB and ran across this, might it have something to do with the apparent problems with 58xx series cards?"
Nice catch! This could really be the culprit. I hope I can build a modified version later today (should be easy, without the need to search for other issues). |
Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
"I was browsing the ATI/AMD Developer KB and ran across this, might it have something to do with the apparent problems with 58xx series cards?"
Ahh...good to see I wasn't too far off the mark and it was to do with SP vs DP, but just at the coding level. Sounds like we have a winner! |
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
"I was browsing the ATI/AMD Developer KB and ran across this, might it have something to do with the apparent problems with 58xx series cards?"
In fact, it is a difference in how the texture units of the HD5000 GPUs work. The new GPUs have additional circuitry to ensure the values loaded from memory are valid encodings of numbers, and they normalize floating point numbers, for instance. That's why it matters in which format they are declared. With older GPUs no such checks were done and the format declaration was basically a placeholder. As the ATI application was developed with CAL 1.3, it still uses the buffer formats recommended back then, which simply leads to erratic behaviour on newer GPUs.
I'm astonished that Collatz doesn't suffer from this. I guess the reason is that here I use a texture sampler with point sampling (i.e. no filtering applied), which obviously enables the checks described above, whereas over at Collatz I simply load values from a texture (without a sampler). The difference is basically just the indexing: a texture sampler takes float values as coordinates, while a load instruction uses integer values to index into the texture array. Obviously the load also bypasses the checks. |
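Why a format declaration could change double-precision data at all can be illustrated with a purely hypothetical sketch. Assume, and this is not confirmed anywhere in the thread, that a 64-bit double sits in a buffer declared as two 32-bit floats and that the hardware flushes any 32-bit word that happens to look like a denormal float to zero on load. The function `flush_denormal_words` is an invented name that emulates that assumed behaviour on the CPU.

```python
import struct


def flush_denormal_words(value: float) -> float:
    """Emulate a hypothetical load path that views a double as two
    float32 words and flushes denormal-looking words to signed zero."""
    low, high = struct.unpack("<II", struct.pack("<d", value))

    def flush(word: int) -> int:
        exponent = (word >> 23) & 0xFF   # float32 exponent field
        mantissa = word & 0x7FFFFF       # float32 mantissa field
        if exponent == 0 and mantissa != 0:
            return word & 0x80000000     # keep only the sign bit
        return word

    return struct.unpack("<d", struct.pack("<II", flush(low), flush(high)))[0]


# A double whose low 32 bits look like a float32 denormal loses those bits.
damaged = flush_denormal_words(1.0 + 2.0 ** -52)
print(damaged == 1.0)

# A double whose low word is all zeros passes through unchanged.
print(flush_denormal_words(1.5) == 1.5)
```

In this toy model only some values are altered and each error is tiny, so whether it would add up to the deviation reported for the HD5800 series depends on details the thread does not give.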
Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
So should I take my 5970's down for a rest, or has this problem been fixed yet? I was browsing the ATI/AMD Developer KB and ran across this, might it have something to do with the apparent problems with 58xx series cards? |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"So should I take my 5970's down for a rest, or has this problem been fixed yet?"
I think the application is going to need to be updated before the problem gets fixed. I'll make a news post as soon as we have new applications for the 58x0 series. |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
"I think the application is going to need to be updated before the problem gets fixed. I'll make a news post as soon as we have new applications for the 58x0 series."
Just to clarify things a bit, if CP comes out with a corrected 'current generation' app before you release your new source code, that would provide an immediate solution to the 'invalids' problem if all 5800 series owners were to immediately adopt the new app. You have mentioned several times about 'releasing the new code' and 'allowing people to compile their own apps' but I don't think you actually spelled out exactly what precompiled apps you would be releasing as well. I might be wrong but I got the impression at one point that you might be building for CPU and CUDA but perhaps not for ATI? In other words, we would need to rely on the continuing services of CP or someone else to port the new code and build the appropriate ATI apps. Is this how things will work when you release the new code? Cheers, Gary. |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
"Just to clarify things a bit, if CP comes out with a corrected 'current generation' app before you release your new source code, that would provide an immediate solution to the 'invalids' problem if all 5800 series owners were to immediately adopt the new app."
Yeah, it seems like CP (and Anthony) are working on a new version of the 58x0 application, which will solve this problem. Hopefully it'll be out soon.
Right now I've compiled some OS X applications, and they're on the server as milkyway3 (technically what people are running right now is milkyway2, as before that it was just astronomy). I don't have the hardware to compile the Windows applications, so I'm going to need Anthony to do that. But for the time being I'm going to test it on OS X, which should affect the fewest of our users and which I have direct control over upgrading if there are any problems. |
©2024 Astroinformatics Group