Message boards :
Number crunching :
CUDA Application Updated
Message board moderation
Author | Message |
---|---|
Send message Joined: 16 Jun 09 Posts: 85 Credit: 172,476 RAC: 0 |
The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance. Thanks to Brent from NVIDIA for helping make the application run faster on NVIDIA hardware, and thanks also to Cluster Physik for providing further methods to increase performance. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance. So how long do those 53-credit WUs you just crunched really take? The two seconds one sees in your task list are just the CPU time. As an HD4870 at stock clocks needs about 48 seconds, and the NVIDIA guy has probably done what was possible, I guess the CUDA app is now approaching its theoretical ceiling. My guess is something between 2:30 and 3:00 minutes for a GTX285 at stock clocks for a 53-credit WU. |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance. Um, did the version number get an update? I just got 0.20 cuda23 ... Whatever the version, the ones I just downloaded ran on a GTX260 in about 202 seconds (3:22), which would be about the run time on a GTX295 core if my experience from GPU Grid transfers over (I have not tried it there yet) ... Changed the BOINC version to 6.10.7 just in case, but I will note that the Windows interface becomes almost unusable because of lag ... so ... not sure "we" have the balance quite right yet ... note I have no special settings set to control things, I pretty much run stock ... YMMV |
Send message Joined: 22 Feb 09 Posts: 20 Credit: 105,156,399 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. Impressive: 95% faster than before. |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. Not quite 95% faster... you may want to redo your maths. |
Send message Joined: 12 Aug 08 Posts: 253 Credit: 275,593,872 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. Edboard and I must have gone to the same school then. |
Send message Joined: 6 Apr 08 Posts: 2018 Credit: 100,142,856 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. I think you're both right - we should all be given 95% more credit immediately. |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. The reference point is the 390 seconds (previous), not the 200 seconds (current). But I agree with Ice. |
Send message Joined: 22 Feb 09 Posts: 20 Credit: 105,156,399 RAC: 0 |
I said "faster", so I'm comparing "speeds" not "durations". In other words, I'm comparing inverses of time (1/t) not times (t). |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
I said "faster", so I'm comparing "speeds", not "durations". In other words, I'm comparing inverses of time (1/t), not times (t). Quibble all they want, I still think it is about twice as fast ... which is an increase of about 100% ... Of course the downside is that I have seen more lag on the screen for some updates, e.g. changing the pane/tab in BOINC Manager ... but once the tab is up it seems to refresh OK ... so I don't quite understand all I know about that ... Still, as most of my machines are dedicated to BOINC, it is a little bit of "who cares" most of the time. I am still waiting for it to settle in so I can see if BOINC will allow MW on CUDA to play nice with GPU Grid ... or not ... so far it has been a little bit of "not" ... sigh ... |
Send message Joined: 22 Feb 09 Posts: 20 Credit: 105,156,399 RAC: 0 |
It's exactly the same to say "100% faster", "twice as fast", or "100% increase in speed". I chose the first because it was not exactly 100%. Maybe it would have been clearer if I had chosen the third one, "95% increase in speed", or had said "almost twice as fast". |
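For anyone following the maths: "95% faster" compares speeds (1/t), not run times. A quick sketch using the timings quoted above (390 s before, 200 s after):

```python
old_t = 390.0  # seconds per WU with the previous app (stock GTX280)
new_t = 200.0  # seconds per WU with the updated app

speedup = old_t / new_t               # ratio of speeds: (1/new_t) / (1/old_t)
faster_pct = (speedup - 1.0) * 100.0  # "X% faster"
print(f"{speedup:.2f}x -> {faster_pct:.0f}% faster")  # 1.95x -> 95% faster
```

Note the asymmetry: the new app is 95% faster, yet the run time shrank by only (390 - 200) / 390, i.e. about 49%.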
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
I think you're both right - we should all be given 95% more credit immediately. You're getting that automatically due to the speed increase, aren't you? Anyway, the new CUDA app looks quite nice: 200 s for a GTX280 is only 4 times slower than a 110€ ATI card. That's better than expected ;) MrS Scanning for our furry friends since Jan 2002 |
Send message Joined: 5 Aug 09 Posts: 9 Credit: 32,279,415 RAC: 0 |
down side is that I have seen more lag on the screen ... as most my machines are dedicated to BOINC ... What about setting the CUDA App "priority" down just a bit, to improve the GUI responsiveness for the other 95% of us who want to contribute to MW Science but need to perform work in the foreground to pay for our BOINC "contributions" ?? Can this new CUDA App be "detuned" sufficiently to improve GUI Responsiveness ?? |
Send message Joined: 12 Dec 07 Posts: 3 Credit: 15,796,608 RAC: 0 |
Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;) If I'm not mistaken, the theoretical DP performance of an HD4870 (is this the card you meant?) vs. a GTX280 is 240 GFLOPS vs. 78 GFLOPS, i.e. roughly 3:1. So a factor of 4 is nice, but we should be able to get to a factor of 3... :P I know it's just a comparison of theoretical numbers... |
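The two ratios being compared can be written out explicitly. The GFLOPS figures are the theoretical ones quoted above; the 48 s HD4870 timing is the one mentioned earlier in the thread:

```python
ati_dp = 240.0  # HD4870 theoretical DP GFLOPS
nv_dp = 78.0    # GTX280 theoretical DP GFLOPS
theoretical_ratio = ati_dp / nv_dp  # ~3.08

gtx280_t = 200.0  # seconds per WU on a GTX280
hd4870_t = 48.0   # seconds per WU on a stock HD4870
observed_ratio = gtx280_t / hd4870_t  # ~4.17

print(f"theoretical {theoretical_ratio:.2f}:1, observed {observed_ratio:.2f}:1")
```

So on these numbers the CUDA app would need to be roughly a third faster still to match the cards' theoretical ratio.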
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;) The fact that those are theoretical numbers is exactly why the 4:1 ratio is not so bad. It is another case of AMD vs. Intel and which is better or faster for a particular project. |
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
You're right, I was talking about an HD4870, and its maximum DP performance is indeed 240 GFLOPS at 750 MHz. Mine runs at 800 MHz (256 GFLOPS) and achieves ~190 GFLOPS at MW. That's a really, really good optimization done by CP, so even getting close to these numbers is challenging. @Seigell: there's still the option to choose "don't run CUDA when user is active". It's not ideal, but achieving good performance on the GPU while keeping the UI responsive is also rather challenging. Ideally the app would switch behaviour depending on what the user is doing (idle, normal work, graphics-intensive work / gaming). In the non-idle cases the GPU wouldn't have to stop completely, just crunch a little less intensively. MrS Scanning for our furry friends since Jan 2002 |
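As a rough efficiency check on the figures in that post (256 GFLOPS theoretical at 800 MHz, ~190 GFLOPS measured at MW):

```python
peak = 256.0      # HD4870 @ 800 MHz, theoretical DP GFLOPS
achieved = 190.0  # measured at MW
efficiency = achieved / peak
print(f"{efficiency:.0%} of theoretical DP peak")  # 74% of theoretical DP peak
```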
Send message Joined: 31 Mar 08 Posts: 61 Credit: 18,325,284 RAC: 0 |
You're right, I was talking about an HD4870, and its maximum DP performance is indeed 240 GFLOPS at 750 MHz. Mine runs at 800 MHz (256 GFLOPS) and achieves ~190 GFLOPS at MW. That's a really, really good optimization done by CP, so even getting close to these numbers is challenging. Which makes me disappointed when other projects whine about overpay at MilkyWay! I know it's already done, the credit lowering and all, and optimization will overcome or surpass the lowering, but BOINC and its projects are decentralized, so the MW admins should not bow to pressure from other projects! CUDA is not equal to Brook. ATI is not equal to NVIDIA. Intel is not equal to AMD. |
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
Thamir, let's not turn this thread into another credit flame war. But think about it for a moment: this 110€ ATI card could do about 100k RAC @ MW, even after the credit "lowering" / "adjustment to the SETI standard". In my team we've got about 90 active members. On my account alone we've got a good 30 CPU cores; additionally I know of at least 2 fast nVidias (GT200), a medium one (G92), a small one (G84), a medium ATI and a PS3. Recently we were doing 300k RAC, 150k RAC of which was achieved just by our 2 HD4870s, whereas the other half was done by all those CPUs, the GPUs I named and possibly others. Really, MW@ATI yields an insane amount of credit, even after the recent adjustments. I wouldn't want it to be even more extreme. The way it is now, it already almost completely discourages me from running anything else. I might even stop CPU crunching entirely if I didn't have an electricity flat rate. (*) That's why other projects are "whining" about the overpay at MW. For most of them it's just impossible to utilize the hardware to this extent. It's not just about lazy programmers - the algorithm and the problem itself don't allow it. They could never achieve the same Flop/s even if they made an ATI app. MrS (*) Of course you can always say "but it's for the science!" But that just doesn't work - it's not the science which I see, it's the credits. And I'm trying to help my team as much as I can, without spending an arm and a leg. Scanning for our furry friends since Jan 2002 |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
That's why other projects are "whining" about the overpay at MW. For most of them it's just impossible to utilize the hardware to this extent. It's not just about lazy programmers - the algorithm and the problem itself don't allow it. They could never achieve the same Flop/s even if they made an ATI app. That is very true. The MW algorithm is really perfect for a GPU. Take vast amounts of parallelism (millions of threads), no branching (unless you want to call a loop with a counter checked each iteration a branch), a very compute-intensive algorithm with only a few memory accesses, and minimal communication between the threads (the values are just added up at the end), and what you get is virtually the peak performance of a given GPU for the instruction mix of the algorithm. It's not all multiply-adds, so you won't get exactly peak performance. But v0.20 has cut all the overhead down to a minimum, so you really arrive within 10% of what is theoretically possible with the algorithm's instruction mix. That's better than any current CPU achieves, even relative to its peak performance. And that will continue to scale: the new ATI HD5870 should easily double the performance of an HD4890 at Milkyway. And when the next NVIDIA generation arrives, I'm quite sure it will do much more than double the DP performance of a GTX285 ;) |
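The structure described there (massive independent per-thread work, one reduction at the end) can be sketched in a few lines. The per-point function below is purely illustrative - it only mimics the shape of the computation, not the actual MW likelihood:

```python
import math

def evaluate_point(x):
    # stand-in for the heavy, branch-free per-thread arithmetic;
    # compute-intensive, no shared state, few memory accesses
    return math.exp(-x * x) * math.sqrt(abs(x) + 1.0)

# millions of points on the real GPU; each one is fully independent
points = [i * 0.001 for i in range(10_000)]
partials = [evaluate_point(p) for p in points]  # perfectly parallel step
result = sum(partials)  # the only inter-thread communication: one reduction
```

On a GPU each `evaluate_point` call would be its own thread, which is why this kind of algorithm sits so close to the hardware's theoretical peak.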
©2024 Astroinformatics Group