Message boards :
Number crunching :
CUDA Application Updated
Message board moderation
Author | Message |
---|---|
Send message Joined: 16 Jun 09 Posts: 85 Credit: 172,476 RAC: 0 |
The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance. Thanks to Brent from NVIDIA for helping make the application run faster on NVIDIA hardware, and thanks also to Cluster Physik for providing further methods to increase performance. |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance. So how long do those 53-credit WUs you just crunched really take? The two seconds one sees in your task list are just the CPU time. As an HD4870 at stock clocks needs about 48 seconds, and the NVIDIA guy has probably done what was possible, I guess the CUDA app is now approaching its theoretical ceiling. My guess is something between 2:30 and 3:00 minutes for a GTX285 at stock clocks for a 53-credit WU. |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance. Um, did the version number get an update? I just got 0.20 cuda23 ... Whatever the version, the ones I just downloaded ran on a GTX260 in about 202 seconds (3:22), which would be about the run time on a GTX295 core if my experience from GPU Grid transfers over (I have not tried it there yet) ... Changed the BOINC version to 6.10.7 just in case, but I will note that the Windows interface becomes almost unusable because of lag ... so ... not sure "we" have the balance quite right yet ... note I have no special settings set to control things, I pretty much run stock ... YMMV |
Send message Joined: 22 Feb 09 Posts: 20 Credit: 105,156,399 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. Impressive: 95% faster than before. |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. Not quite 95% faster... you may want to redo your maths. |
Send message Joined: 12 Aug 08 Posts: 253 Credit: 275,593,872 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. Edboard and I must have gone to the same school then. |
Send message Joined: 6 Apr 08 Posts: 2018 Credit: 100,142,856 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. I think you're both right - we should all be given 95% more credit immediately. |
Send message Joined: 24 Dec 07 Posts: 1947 Credit: 240,884,648 RAC: 0 |
I have crunched some units with the new app, and they last 3:20 (200 seconds) with a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one. The reference point is the 390 seconds (previous), not the 200 seconds (current). But I agree with Ice. |
Send message Joined: 22 Feb 09 Posts: 20 Credit: 105,156,399 RAC: 0 |
I said "faster", so I'm comparing "speeds" not "durations". In other words, I'm comparing inverses of time (1/t) not times (t). |
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
I said "faster", so I'm comparing "speeds", not "durations". In other words, I'm comparing inverses of time (1/t), not times (t). Quibble all they want, I still think it is about twice as fast ... which is an increase of about 100% ... Of course the downside is that I have seen more lag on the screen for some updates, e.g. changing the pane/tab in BOINC Manager ... but once the tab is up it seems to refresh OK ... so I don't quite understand all I know about that ... Still, as most of my machines are dedicated to BOINC, it is a little bit of "who cares" most of the time. I am still waiting for it to settle in so I can see if BOINC will allow MW on CUDA to play nice with GPU Grid ... or not ... so far it has been a little bit of "not" ... sigh ... |
Send message Joined: 22 Feb 09 Posts: 20 Credit: 105,156,399 RAC: 0 |
It's exactly the same to say "100% faster", "twice as fast", or "100% increase in speed". I chose the first because it was not exactly 100%. Maybe it would have been clearer if I had chosen the third one, "95% increase in speed", or had said "almost twice as fast". |
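For anyone following the maths: "95% faster" compares speeds (1/t), not run times. A quick sketch using the timings quoted above (390 s before, 200 s after):

```python
old_t = 390.0  # seconds per WU with the previous app (stock GTX280)
new_t = 200.0  # seconds per WU with the updated app

speedup = old_t / new_t               # ratio of speeds: (1/new_t) / (1/old_t)
faster_pct = (speedup - 1.0) * 100.0  # "X% faster"
print(f"{speedup:.2f}x -> {faster_pct:.0f}% faster")  # 1.95x -> 95% faster
```

Note the asymmetry: the new app is 95% faster, yet the run time shrank by only (390 - 200) / 390, i.e. about 49%.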
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
I think you're both right - we should all be given 95% more credit immediately. You're getting that automatically due to the speed increase, aren't you? Anyway, the new CUDA app looks quite nice: 200 s for a GTX280 is only 4 times slower than a 110€ ATI card. That's better than expected ;) MrS Scanning for our furry friends since Jan 2002 |
Send message Joined: 5 Aug 09 Posts: 9 Credit: 32,279,415 RAC: 0 |
down side is that I have seen more lag on the screen ... as most my machines are dedicated to BOINC ... What about setting the CUDA App "priority" down just a bit, to improve the GUI responsiveness for the other 95% of us who want to contribute to MW Science but need to perform work in the foreground to pay for our BOINC "contributions" ?? Can this new CUDA App be "detuned" sufficiently to improve GUI Responsiveness ?? |
Send message Joined: 12 Dec 07 Posts: 3 Credit: 15,796,608 RAC: 0 |
Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;) If I'm not mistaken, the theoretical DP performance of an HD4870 (is this the card you meant?) vs. a GTX280 is 240 GFLOPS vs. 78 GFLOPS, i.e. roughly 3:1. So a factor of 4 is nice, but we should be able to get to a factor of 3... :P I know it's just a comparison of theoretical numbers... |
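The two ratios being compared can be written out explicitly. The GFLOPS figures are the theoretical ones quoted above; the 48 s HD4870 timing is the one mentioned earlier in the thread:

```python
ati_dp = 240.0  # HD4870 theoretical DP GFLOPS
nv_dp = 78.0    # GTX280 theoretical DP GFLOPS
theoretical_ratio = ati_dp / nv_dp  # ~3.08

gtx280_t = 200.0  # seconds per WU on a GTX280
hd4870_t = 48.0   # seconds per WU on a stock HD4870
observed_ratio = gtx280_t / hd4870_t  # ~4.17

print(f"theoretical {theoretical_ratio:.2f}:1, observed {observed_ratio:.2f}:1")
```

So on these numbers the CUDA app would need to be roughly a third faster still to match the cards' theoretical ratio.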
Send message Joined: 12 Apr 08 Posts: 621 Credit: 161,934,067 RAC: 0 |
Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;) The fact that those are theoretical numbers is exactly why the 4:1 ratio is not so bad. It is another case of AMD vs. Intel and which is better or faster for a particular project. |
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
You're right, I was talking about an HD4870, and its maximum DP performance is indeed 240 GFLOPS at 750 MHz. Mine runs at 800 MHz (256 GFLOPS) and achieves ~190 GFLOPS at MW. That's a really, really good optimization done by CP, so even getting close to these numbers is challenging. @Seigell: there's still the option to choose "don't run CUDA when user is active". It's not ideal, but achieving good performance on the GPU while keeping the UI responsive is also rather challenging. Ideally the app would switch behaviour depending on what the user is doing (idle, normal work, graphics-intensive work / gaming). In the non-idle cases the GPU wouldn't have to stop completely, just crunch a little less intensively. MrS Scanning for our furry friends since Jan 2002 |
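As a rough efficiency check on the figures in that post (256 GFLOPS theoretical at 800 MHz, ~190 GFLOPS measured at MW):

```python
peak = 256.0      # HD4870 @ 800 MHz, theoretical DP GFLOPS
achieved = 190.0  # measured at MW
efficiency = achieved / peak
print(f"{efficiency:.0%} of theoretical DP peak")  # 74% of theoretical DP peak
```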
Send message Joined: 31 Mar 08 Posts: 61 Credit: 18,325,284 RAC: 0 |
You're right, I was talking about an HD4870, and its maximum DP performance is indeed 240 GFLOPS at 750 MHz. Mine runs at 800 MHz (256 GFLOPS) and achieves ~190 GFLOPS at MW. That's a really, really good optimization done by CP, so even getting close to these numbers is challenging. Which makes me disappointed when other projects whine about overpay at MilkyWay! I know it's already done, the credit lowering and all, and optimization will overcome or surpass the lowering, but BOINC and its projects are decentralized, so the MW admins should not bow to pressure from other projects! CUDA is not equal to Brook. ATI is not equal to NVIDIA. Intel is not equal to AMD. |
Send message Joined: 1 Sep 08 Posts: 204 Credit: 219,354,537 RAC: 0 |
Thamir, let's not turn this thread into another credit flame war. But think about it for a moment: this 110€ ATI card could do about 100k RAC @ MW, even after the credit "lowering" / "adjustment to the SETI standard". In my team we've got about 90 active members. On my account alone we've got a good 30 CPU cores; additionally I know of at least 2 fast nVidias (GT200), a medium one (G92), a small one (G84), a medium ATI and a PS3. Recently we were doing 300k RAC, 150k RAC of which was achieved just by our 2 HD4870s, whereas the other half was done by all those CPUs, the GPUs I named and possibly others. Really, MW@ATI yields an insane amount of credit, even after the recent adjustments. I wouldn't want it to be even more extreme. The way it is now, it already almost completely discourages me from running anything else. I might even stop CPU crunching entirely if I didn't have an electricity flat rate. (*) That's why other projects are "whining" about the overpay at MW. For most of them it's just impossible to utilize the hardware to this extent. It's not just about lazy programmers - the algorithm and the problem itself don't allow it. They could never achieve the same Flop/s even if they made an ATI app. MrS (*) Of course you can always say "but it's for the science!" But that just doesn't work - it's not the science which I see, it's the credits. And I'm trying to help my team as much as I can, without spending an arm and a leg. Scanning for our furry friends since Jan 2002 |
Send message Joined: 26 Jul 08 Posts: 627 Credit: 94,940,203 RAC: 0 |
That's why other projects are "whining" about the overpay at MW. For most of them it's just impossible to utilize the hardware to this extent. It's not just about lazy programmers - the algorithm and the problem itself don't allow it. They could never achieve the same Flop/s even if they made an ATI app. That is very true. The MW algorithm is really perfect for a GPU. Take vast amounts of parallelism (millions of threads), no branching (unless you want to call a loop with a counter checked each iteration a branch), a very compute-intensive algorithm with only a few memory accesses, and minimal communication between the threads (the values are just added up at the end), and what you get is virtually the peak performance of a given GPU for the instruction mix of the algorithm. It's not all multiply-adds, so you won't get exactly peak performance. But v0.20 has cut all the overhead down to a minimum, so you really arrive within 10% of what is theoretically possible with the algorithm's instruction mix. That's better than any current CPU achieves, even relative to its peak performance. And that will continue to scale: the new ATI HD5870 should easily double the performance of an HD4890 at Milkyway. And when the next NVIDIA generation arrives, I'm quite sure it will do much more than double the DP performance of a GTX285 ;) |
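The structure described there (massive independent per-thread work, one reduction at the end) can be sketched in a few lines. The per-point function below is purely illustrative - it only mimics the shape of the computation, not the actual MW likelihood:

```python
import math

def evaluate_point(x):
    # stand-in for the heavy, branch-free per-thread arithmetic;
    # compute-intensive, no shared state, few memory accesses
    return math.exp(-x * x) * math.sqrt(abs(x) + 1.0)

# millions of points on the real GPU; each one is fully independent
points = [i * 0.001 for i in range(10_000)]
partials = [evaluate_point(p) for p in points]  # perfectly parallel step
result = sum(partials)  # the only inter-thread communication: one reduction
```

On a GPU each `evaluate_point` call would be its own thread, which is why this kind of algorithm sits so close to the hardware's theoretical peak.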
©2024 Astroinformatics Group