Message boards : News : stock ATI 58x0 apps updated
Author | Message |
---|---|
Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
Processed two freshly issued no-quorum tasks with the new app, which was auto-downloaded about 4 hours ago. Both tasks completed, uploaded and reported, and remain in "Completed, waiting for validation" status. Installed the manually downloaded new version to run 2 tasks concurrently. After the server started working again I downloaded a further 48 tasks: 35 newly issued tasks with no quorum remain in "Completed, waiting for validation" status. 4 reissued no-quorum tasks validated successfully. 8 quorum-of-2 tasks have validated against HD 4700/4800, HD 3800 and CPU wingmen. 1 quorum-of-2 task is waiting on a wingman. All quorum-of-2 tasks received were reissues of tasks from 30 March or before. I did not receive any newly issued tasks with a quorum of 2. All 37 newly issued no-quorum tasks I received today remain in "Completed, waiting for validation" status. Excellent that my HD 5870 is now returning the required accuracy and is validating against other hardware classes. Well done and thank you to all involved in diagnosing the problem and coding and testing the new application. Hopefully the server can be adjusted soon so that it begins to validate newly issued tasks. |
Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
Now that the ATI applications have been updated, I'm curious - Cluster Physik mentioned that with a few changes, the CUDA applications could give exactly the same results as the ATI and CPU applications. They are currently close (well within tolerances) but not the same. Do you intend to update the CUDA applications based on his advice, or does the current work updating them with the new science and server communication code prevent this from being a priority? I would think that the implementation of the new model takes precedence. If all applications run well with the new code (it was just released as a beta version and is not final yet), one can think about getting the results of the CPU/ATI/CUDA versions even closer together. What I've done is basically only two things: a different method for the final summation of the integration steps (using two double values instead of a single one to hold the sum limits the precision loss for this step and greatly reduces the effect of the different summation order of the values between CPUs and GPUs), and a simple trick in the likelihood/fitness calculation which buys you one or two digits of precision there. As long as the individual mathematical operations used are precise to all but the very last bit (which is true for all GPUs), minor differences in the very last bit don't change the final likelihood result, as the limit for the precision is the summation of the individual values from all the integration points. |
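The "two double values" summation described above can be illustrated with a small sketch. This is not the project's application code, which the thread does not show; it is a generic compensated sum built on Knuth's two-sum step, and the function names `two_sum`, `sum_single` and `sum_pair` are made up for this illustration. It only shows why carrying the rounding error in a second double makes the total much less sensitive to summation order.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, err) with s = fl(a + b) and a + b == s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def sum_single(values):
    """Plain accumulation in a single double; the result depends on the order."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pair(values):
    """Accumulate in two doubles: hi holds the running sum, lo the rounding errors."""
    hi, lo = 0.0, 0.0
    for v in values:
        hi, err = two_sum(hi, v)
        lo += err
    return hi + lo


# The same four numbers in two different orders (exact sum is 2.0).
values = [1e16, 1.0, -1e16, 1.0]
reordered = [1.0, 1.0, 1e16, -1e16]

print(sum_single(values), sum_single(reordered))  # 1.0 2.0
print(sum_pair(values), sum_pair(reordered))      # 2.0 2.0
```

With a single accumulator the two orders disagree, which is the kind of CPU versus GPU difference discussed in the post; with the pair of doubles both orders give the same total in this example.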
Joined: 12 Aug 09 Posts: 172 Credit: 645,240,165 RAC: 0 |
A few early hiccups, reset all four boxes, and all is well so far. Thanks for all your work CP, et al. Kind Regards |
Joined: 19 Feb 10 Posts: 17 Credit: 7,573,117 RAC: 0 |
Good day, Same here, and I think it appears on all ATI cards, 58xx and 48xx. Is this normal with the new validator, or will it ever be fixed? |
Joined: 18 Nov 07 Posts: 280 Credit: 2,442,757 RAC: 0 |
I would think that the implementation of the new model takes precedence. If all applications run well with the new code (was just released as a beta version, it is not final yet), one can think about getting the results of CPU/ATI/CUDA versions even closer together. Sounds like it should be relatively simple to implement when the time comes. Thanks for the explanation :) |
Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
Excellent, of the 37 newly issued tasks with no quorum that were "Completed, waiting for validation" a second task was issued 4-7 hours later for about half of them changing them from minimum quorum of 1 to minimum quorum of 2. The other half have now validated as unchanged minimum quorum of 1. So now only 7 of the 37 remain as "Completed, validation inconclusive" and are waiting for wingmen to complete and return their tasks. No invalids so all is good, just had to get used to the strange way some no quorum tasks get converted to quorum tasks hours later and others stay as no quorum tasks and validate hours later. |
Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
No invalids so all is good, just had to get used to the strange way some no quorum tasks get converted to quorum tasks hours later and others stay as no quorum tasks and validate hours later. We're doing validation in a rather unique way (when I post our recent DAIS paper you can read all about it if you'd like). Since we're running evolutionary algorithms to optimize models to the observed data of the Milky Way galaxy, the server keeps a population of the best models found. When it generates work, it creates permutations of these models to send out to hosts. When you report work back, you give the fitness of the model you just evaluated. Before, if the model you sent back to us was a good one (it would improve the population the server knew about), we'd validate it and, if it validated, insert it into the population. If the model you just calculated wasn't better than what the server had in its population, we didn't validate it at all, because we were just going to discard it anyway. Because only 2-10% or so of the results that come back actually improve the population, this meant we could get away with really minimal validation. Right now, we're still validating all of the workunits that do improve the population, but also 50% of the workunits that don't, in order to catch bad applications (like single precision GPU apps) and other errors like the GPU thing we had this week. We're also going to be testing some new Milky Way models, so it's important that we have things to the appropriate degrees of accuracy. So anyway, to answer your question, all workunits get sent out initially with a quorum of 1. When the validator looks at them, it checks whether the result will improve the population, or whether random() < 0.5; if either of those is the case, it increases the quorum on the workunit to 2 and waits for it to be validated again. That's why some workunits start with a quorum of 1 and then later have it increased. |
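The quorum decision described above can be sketched roughly as follows. This is only an illustration of the rule as stated in the post, not the actual validator: the function name `quorum_after_first_result`, the representation of the population as a list of fitness values, and the assumption that a higher fitness is better and that "improves the population" means "beats the worst member" are all hypothetical.

```python
import random


def quorum_after_first_result(fitness, population_fitness,
                              rng=random.random, check_rate=0.5):
    """Return the quorum a workunit should have after its first result arrives.

    Assumptions (not from the project's code): higher fitness is better, and a
    result improves the population if it beats the worst member currently kept.
    """
    improves = fitness > min(population_fitness)
    # Results that would enter the population are always double-checked;
    # the rest are spot-checked at random (50% in the post).
    if improves or rng() < check_rate:
        return 2
    return 1


population = [-3.0, -3.2]
print(quorum_after_first_result(-2.9, population))  # 2: improves the population
print(quorum_after_first_result(-3.5, population))  # 1 or 2, chosen at random
```

Under this rule a workunit that stays at a quorum of 1 simply validates on its own, while one that is raised to 2 shows up as inconclusive until a wingman reports.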
Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
Thank you for the explanation of how the quorum system works. My 5870 is currently working on another relatively new ATI project. I only installed the fixed 0.23 application version and processed a batch to help test. I'll be back here later when you introduce your new application which uses a different model of the Milky Way galaxy and uses parameters from the command line. It will be very pleasing if this reduces server congestion because currently when the server cannot be contacted quickly it takes a long time before pending requests to the server time out in BOINC Manager and this causes delay to server requests and replies for other projects. This also happens on some other projects with busy servers and the combination of 2 or more slow project servers can create quite an inefficient and sluggish BOINC Manager if you are attached to multiple projects. Improved server performance will also reduce the occurrence of phantom tasks. All very good, I look forward to it. |
Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0 |
It still says validation inconclusive when it's compared with CUDA. Here are some links: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90669112 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90686852 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90669102 |
Joined: 26 Jan 10 Posts: 3 Credit: 1,966,092 RAC: 0 |
Same here with a Radeon HD4890: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90932417 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90932416 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90932410 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90932400 But others seem to have worked: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90932409 Regards pollux_9t |
Joined: 30 Dec 07 Posts: 311 Credit: 149,490,184 RAC: 0 |
That's because, for both of you, your quorum partner (wingman) has not yet completed and reported the task that was sent out. With a quorum of 2, two tasks are sent out to different contributors and the server compares the 2 results when they are completed and reported. If only one task has been completed and reported, there is nothing to compare yet ("Completed, validation inconclusive" status). You need to wait until the other task has been completed and reported; some time after that the server will compare the 2 results (your result and that of your wingman) and see if they agree to the required accuracy. If they agree, then both tasks will be validated and credit granted. If they do not agree, then another task will be sent out to see which one of the two is correct. At this stage both of the tasks that do not agree will also have a status of "Completed, validation inconclusive". All newly issued work units are sent out to just one contributor at first (quorum of 1). After they are completed and reported, the server then sends out another task for some of these, so the quorum changes to a quorum of 2. |
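The comparison step described above can be sketched as follows, assuming each host reports a single fitness value and that "agree to the required accuracy" means a relative tolerance. The function name `evaluate_quorum_of_two`, the host labels and the tolerance value are hypothetical; the real validator and its tolerance are not shown in this thread.

```python
import math
from itertools import combinations


def evaluate_quorum_of_two(reported, rel_tol=1e-10):
    """Model the quorum-of-2 stage for one workunit.

    reported maps a host label to the fitness that host returned.
    Returns (status, valid_hosts, needs_another_task).
    """
    if len(reported) < 2:
        # Only one result so far: nothing to compare against yet.
        return "Completed, validation inconclusive", set(), False
    for a, b in combinations(reported, 2):
        if math.isclose(reported[a], reported[b], rel_tol=rel_tol):
            # Two results agree: every result matching them is valid.
            valid = {h for h, v in reported.items()
                     if math.isclose(v, reported[a], rel_tol=rel_tol)}
            return "Validated", valid, False
    # No two results agree: a further task is needed to break the tie.
    return "Completed, validation inconclusive", set(), True


print(evaluate_quorum_of_two({"you": -3.0}))
print(evaluate_quorum_of_two({"you": -3.0, "wingman": -3.1}))
print(evaluate_quorum_of_two({"you": -3.0, "wingman": -3.1, "third": -3.0}))
```

In the last call the third result agrees with the first, so those two hosts are marked valid and the disagreeing result is left out.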
Joined: 14 Dec 09 Posts: 161 Credit: 589,318,064 RAC: 0 |
If they do not agree then another task will be sent out to see which one of the two is correct. At this stage both of the tasks that do not agree will also have a status of "Completed, validation inconclusive" What if, at this stage, the computer that the latest WU was sent to detaches? |
Joined: 1 Mar 09 Posts: 56 Credit: 1,984,937,499 RAC: 0 |
What if, at this stage, the computer that the latest WU was sent to detaches? If it was the proper BOINC 'detach' operation, the client would advise the server that the task was not going to be completed and the server would immediately send out a new copy to a 4th machine. A worse scenario would be if the owner of the 3rd machine happened to turn his machine off and go on holidays for a month: the server would not know and would have to wait for the deadline to expire before it could send out the 4th copy. The server can be infinitely patient and the quorum would eventually get completed. By that time it would be unlikely that such a delayed result would be of any use to the project and it would most likely be discarded. However, the hosts which were deemed to have supplied the agreeing results would still get credit even if the result was discarded. Cheers, Gary. |
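The two cases above (a clean detach versus a machine that silently disappears) amount to a simple reissue rule, sketched below. This is not BOINC server code; the function name `should_reissue` and the dictionary fields `reported`, `abandoned` and `deadline` are invented for this illustration.

```python
from datetime import datetime, timedelta


def should_reissue(task, now):
    """Decide whether the server may send a replacement copy of a task.

    task is a dict with hypothetical fields:
      reported  - True once the host has returned a result
      abandoned - True if the client told the server it will not finish
      deadline  - datetime after which the task counts as timed out
    """
    if task["reported"]:
        return False          # a result came back, nothing to replace
    if task["abandoned"]:
        return True           # proper detach: replace immediately
    return now > task["deadline"]  # silent host: wait for the deadline


now = datetime(2010, 4, 10, 12, 0)
detached = {"reported": False, "abandoned": True, "deadline": now + timedelta(days=3)}
on_holiday = {"reported": False, "abandoned": False, "deadline": now + timedelta(days=3)}

print(should_reissue(detached, now))                         # True
print(should_reissue(on_holiday, now))                       # False
print(should_reissue(on_holiday, now + timedelta(days=4)))   # True
```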
Joined: 26 Jan 10 Posts: 3 Credit: 1,966,092 RAC: 0 |
Thanks for the quick answer! Regards, pollux_9t |
Joined: 15 Jul 08 Posts: 383 Credit: 729,293,740 RAC: 0 |
Something still doesn't seem to be working correctly on the validator. Here's 5 new WUs in which in each case a 58xx card running v.23 and another 58xx running .20b validated against each other and invalidated good v.23 results from 48xx cards: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90934041 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90924553 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90902501 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90873943 http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90757381 As I understand it this should not be happening. Interesting coincidences that both 58xx machines have 4 GPUs and are from the same team & sub-team... Is the validator still broken or? Both of these 4 x 58xx machines running v.23 are putting out a massive number of invalid results. The above are just a few examples. Here's the 2 machines: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=123623&offset=0&show_names=0&state=4 http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=39218&offset=0&show_names=0&state=4 Edit: Here's another WU with the same scenario: http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=90970028 but with yet a 3rd 58xx with v.23 spewing out a massive number of bad results. This time a machine with only 2 x 58xx cards: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=112491&offset=0&show_names=0&state=4 |
Joined: 19 Feb 10 Posts: 17 Credit: 7,573,117 RAC: 0 |
Something still doesn't seem to be working correctly on the validator. Here's 5 new WUs in which in each case a 58xx card running v.23 and another 58xx running .20b validated against each other and invalidated good v.23 results from 48xx cards: I have random invalid WUs; I suspect some driver problems. Try updating the ATI driver to the newest version. I'll do the same. |
Joined: 15 Jul 08 Posts: 383 Credit: 729,293,740 RAC: 0 |
Something still doesn't seem to be working correctly on the validator. Here's 5 new WUs in which in each case a 58xx card running v.23 and another 58xx running .20b validated against each other and invalidated good v.23 results from 48xx cards: On some of the above WUs all machines have the latest drivers. The driver version is irrelevant. |
Joined: 4 Oct 08 Posts: 1734 Credit: 64,228,409 RAC: 0 |
I am finding that about 25% of the output of my HD5850 GPU is not validated immediately, and about 50% to 60% is in the same state very shortly after returning the results. A rather interesting situation. But, none of them seems to be rejected after the wingmen report. Much better than a few days ago. Go away, I was asleep |
Joined: 17 Feb 08 Posts: 363 Credit: 258,227,990 RAC: 0 |
That's what I'm seeing here too. I doubt that the validator does what it should. If it worked correctly it would NOT declare results from two 58xx cards running 0.20b and 0.23 a match and VALID while discarding the most likely correct one from a 48xx (marked as invalid) ... Support science! Join Team BOINC United now! |