CUDA Application Updated

Anthony Waters
Joined: 16 Jun 09
Posts: 85
Credit: 172,476
RAC: 0
Message 31548 - Posted: 26 Sep 2009, 3:33:54 UTC

The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance.

Thank you to Brent from NVIDIA for assisting with making the application run faster on NVIDIA's hardware, and thanks also to Cluster Physik for providing methods to further increase the performance.
ID: 31548
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31553 - Posted: 26 Sep 2009, 4:00:10 UTC - in response to Message 31548.  
Last modified: 26 Sep 2009, 4:41:33 UTC

The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance.

Thank you to Brent from NVIDIA for assisting with making the application run faster on NVIDIA's hardware, and thanks also to Cluster Physik for providing methods to further increase the performance.

So how long do those 53-credit WUs you just crunched really take? The two seconds one sees in your task list is just the CPU time.

As an HD4870 at stock clocks needs about 48 seconds, and the NVIDIA guy has probably done what was possible, I guess the CUDA app is now approaching its theoretical ceiling. My guess is something between 2:30 and 3:00 minutes for a GTX285 at stock clocks for a 53-credit WU.
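
A back-of-the-envelope way to sanity-check that guess is to scale the HD4870 time by the ratio of peak double-precision throughputs (published peak figures; treating the WU as purely DP-throughput-bound is an assumption, which is why the real app should land somewhat above this):

// Naive GTX285 estimate from peak DP throughput.
#include <cstdio>

int main() {
    const double hd4870_dp = 240.0;          // peak DP GFLOPS, HD4870 @ 750 MHz
    const double gtx285_dp = 30 * 2 * 1.476; // 30 SMs x 1 DP FMA x 1476 MHz = ~88.6
    const double hd4870_s  = 48.0;           // observed seconds per 53-credit WU

    int est = (int)(hd4870_s * hd4870_dp / gtx285_dp + 0.5);
    printf("naive GTX285 estimate: %d s (%d:%02d)\n", est, est / 60, est % 60);
    return 0;                                // ~130 s; real overhead pushes it higher
}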
ID: 31553
Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 31554 - Posted: 26 Sep 2009, 4:21:48 UTC - in response to Message 31548.  

The CUDA application for 32-bit Windows has been updated with speed improvements; users should notice a 2x increase in performance.

Thank you to Brent from NVIDIA for assisting with making the application run faster on NVIDIA's hardware, and thanks also to Cluster Physik for providing methods to further increase the performance.

Um, did the version number get an update? I just got 0.20 cuda23 ...

Whatever the version, the ones I just downloaded ran on a GTX260 in about 202 seconds (3:22), which would be about the run time on a GTX295 core, if my experience on GPU Grid transfers over (I have not tried it there yet) ...

Changed the version to 6.10.7 just in case, but I will note that the Windows interface becomes almost unusable because of lag ... so ... not sure "we" have the balance quite right yet ... note I have no special settings set to control things; I pretty much run stock ... YMMV
ID: 31554
Edboard
Joined: 22 Feb 09
Posts: 20
Credit: 105,156,399
RAC: 0
Message 31564 - Posted: 26 Sep 2009, 11:03:17 UTC
Last modified: 26 Sep 2009, 11:03:56 UTC

I have crunched some units with the new app and they last 3:20 (200 seconds) on a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one.

Impressive: 95% faster than before.
ID: 31564
The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 31591 - Posted: 26 Sep 2009, 20:27:42 UTC - in response to Message 31564.  

I have crunched some units with the new app and they last 3:20 (200 seconds) on a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one.

Impressive: 95% faster than before.

Not quite 95% faster...you may want to redo your maths.
ID: 31591
KWSN Checklist
Joined: 12 Aug 08
Posts: 253
Credit: 275,593,872
RAC: 0
Message 31594 - Posted: 26 Sep 2009, 20:51:03 UTC - in response to Message 31591.  
Last modified: 26 Sep 2009, 20:51:29 UTC

I have crunched some units with the new app and they last 3:20 (200 seconds) on a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one.

Impressive: 95% faster than before.

Not quite 95% faster...you may want to redo your maths.

Edboard and I must have gone to the same school then.

ID: 31594
GalaxyIce
Joined: 6 Apr 08
Posts: 2018
Credit: 100,142,856
RAC: 0
Message 31596 - Posted: 26 Sep 2009, 20:57:14 UTC - in response to Message 31594.

I have crunched some units with the new app and they last 3:20 (200 seconds) on a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one.

Impressive: 95% faster than before.

Not quite 95% faster...you may want to redo your maths.

Edboard and I must have gone to the same school then.

I think you're both right - we should all be given 95% more credit immediately.
ID: 31596
The Gas Giant
Joined: 24 Dec 07
Posts: 1947
Credit: 240,884,648
RAC: 0
Message 31599 - Posted: 26 Sep 2009, 22:17:02 UTC - in response to Message 31594.

I have crunched some units with the new app and they last 3:20 (200 seconds) on a stock GTX280. This graphics card took 6:30 (390 seconds) with the previous one.

Impressive: 95% faster than before.

Not quite 95% faster...you may want to redo your maths.

Edboard and I must have gone to the same school then.

The reference point is the 390 seconds (previous), not the 200 seconds (current).

But I agree with Ice.
ID: 31599
Edboard
Joined: 22 Feb 09
Posts: 20
Credit: 105,156,399
RAC: 0
Message 31610 - Posted: 27 Sep 2009, 7:13:38 UTC - in response to Message 31599.

I said "faster", so I'm comparing "speeds", not "durations". In other words, I'm comparing inverses of time (1/t), not times (t).
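
A minimal sketch of the two readings, using the 390 s -> 200 s GTX280 numbers from above:

// Same improvement, two readings: duration cut vs. speed gained.
#include <cstdio>

int main() {
    const double t_old = 390.0, t_new = 200.0;         // old app vs. new app

    double time_cut = 100.0 * (t_old - t_new) / t_old; // comparing times (t)
    double speed_up = 100.0 * (t_old / t_new - 1.0);   // comparing speeds (1/t)
    printf("duration reduced by %.0f%%\n", time_cut);  // ~49%
    printf("speed increased by %.0f%%\n", speed_up);   // ~95%
    return 0;
}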
ID: 31610
Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 31612 - Posted: 27 Sep 2009, 8:03:06 UTC - in response to Message 31610.

I said "faster", so I'm comparing "speeds", not "durations". In other words, I'm comparing inverses of time (1/t), not times (t).

Quibble all they want, I still think it is about twice as fast ... which is an increase of about 100% ...

Of course the downside is that I have seen more lag on the screen for some updates, e.g. changing the pane/tab in BOINC Manager ... but once the tab is up it seems to refresh OK ... so I don't quite understand all I know about that ...

Still, as most of my machines are dedicated to BOINC, it is a little bit of "who cares" most of the time. I am still waiting for it to settle in so I can see if BOINC will allow MW on CUDA to play nice with GPU Grid ... or not ... so far it has been a little bit of "not" ... sigh ...
ID: 31612
Edboard
Joined: 22 Feb 09
Posts: 20
Credit: 105,156,399
RAC: 0
Message 31613 - Posted: 27 Sep 2009, 8:57:17 UTC - in response to Message 31612.
Last modified: 27 Sep 2009, 9:01:35 UTC

It's exactly the same to say:

100% faster
twice as fast
100% increase in speed

I chose the first because it was not exactly 100%. Maybe it would have been clearer if I had chosen the third one, "95% increase in speed", or if I had said "almost twice as fast".
ID: 31613
ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 31621 - Posted: 27 Sep 2009, 11:00:08 UTC - in response to Message 31596.

I think you're both right - we should all be given 95% more credit immediately.

You're getting that automatically due to the speed increase, aren't you?

Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;)

MrS
Scanning for our furry friends since Jan 2002
ID: 31621
seigell
Joined: 5 Aug 09
Posts: 9
Credit: 32,279,415
RAC: 0
Message 31664 - Posted: 28 Sep 2009, 16:44:57 UTC - in response to Message 31612.

the downside is that I have seen more lag on the screen ... as most of my machines are dedicated to BOINC ...

What about setting the CUDA app's priority down just a bit, to improve GUI responsiveness for the other 95% of us who want to contribute to MW science but need to perform work in the foreground to pay for our BOINC "contributions"?

Can this new CUDA app be "detuned" sufficiently to improve GUI responsiveness?
ID: 31664
|MatMan|
Joined: 12 Dec 07
Posts: 3
Credit: 15,796,608
RAC: 0
Message 31665 - Posted: 28 Sep 2009, 17:16:58 UTC - in response to Message 31621.
Last modified: 28 Sep 2009, 17:18:10 UTC

Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;)

If I'm not wrong, the theoretical DP performance of a 4870 (is this the card you meant?) vs. a GTX280 is 240 GFLOPS vs. 78 GFLOPS = ~3 : 1.
So a factor of 4 is nice, but we should get to a factor of 3... :P

I know it's just a comparison of theoretical numbers...
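
Putting the theoretical ratio next to the observed one (peak figures as above; runtimes as reported earlier in this thread):

// Theoretical DP ratio vs. the ratio observed at MilkyWay.
#include <cstdio>

int main() {
    const double hd4870_dp = 240.0;  // peak DP GFLOPS, HD4870
    const double gtx280_dp = 78.0;   // peak DP GFLOPS, GTX280
    const double t_hd4870  = 48.0;   // seconds per 53-credit WU, stock HD4870
    const double t_gtx280  = 200.0;  // seconds per 53-credit WU, new CUDA app

    printf("theoretical: %.1f : 1\n", hd4870_dp / gtx280_dp); // ~3.1 : 1
    printf("observed:    %.1f : 1\n", t_gtx280 / t_hd4870);   // ~4.2 : 1
    return 0;
}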
ID: 31665
Paul D. Buck
Joined: 12 Apr 08
Posts: 621
Credit: 161,934,067
RAC: 0
Message 31668 - Posted: 28 Sep 2009, 19:43:36 UTC - in response to Message 31665.

Anyway, the new CUDA app looks quite nice: 200s for a GTX280 is only 4 times slower than a 110€ ATI. That's better than expected ;)

If I'm not wrong, the theoretical DP performance of a 4870 (is this the card you meant?) vs. a GTX280 is 240 GFLOPS vs. 78 GFLOPS = ~3 : 1.
So a factor of 4 is nice, but we should get to a factor of 3... :P

I know it's just a comparison of theoretical numbers...

The fact that those are theoretical numbers is exactly why the 4:1 ratio is not so bad.

It is another case of AMD vs. Intel and which is better or faster for a particular project.
ID: 31668
ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 31706 - Posted: 29 Sep 2009, 19:20:13 UTC - in response to Message 31665.

You're right, I was talking about a 4870, and its maximum DP performance is indeed 240 GFlops at 750 MHz. Mine runs at 800 MHz (256 GFlops) and achieves ~190 GFlops at MW. That's a really, really good optimization done by CP; even achieving something close to these numbers is challenging.

@Seigell: there's still the option to choose "don't run CUDA when user is active". It's not ideal, but achieving good performance on the GPU while keeping the UI responsive is also rather challenging.
Ideally the app would switch behaviour depending on what the user is doing (idle, normal work, graphics-intensive work / game). In the non-idle cases the GPU wouldn't have to stop completely, just crunch a little less intensively.
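
Purely as a hypothetical sketch of that idea (the is_user_active() check and the work-chunking are invented for illustration; nothing like this exists in the current app):

// Adaptive duty cycle: run the work as short kernel batches and insert
// pauses while the user is active, keeping the GUI responsive.
#include <chrono>
#include <thread>

// Stub standing in for one batch of the real crunching kernel.
__global__ void crunch_chunk(double *partials, int chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    partials[tid] += chunk * 1e-9;      // placeholder arithmetic
}

// Stub; a real app would query the OS input/idle state here.
static bool is_user_active() { return false; }

int main() {
    double *d;
    cudaMalloc(&d, 64 * 256 * sizeof(double));
    cudaMemset(d, 0, 64 * 256 * sizeof(double));
    for (int i = 0; i < 100; ++i) {     // many short batches, not one big one
        crunch_chunk<<<64, 256>>>(d, i);
        cudaDeviceSynchronize();        // bound each batch so the display updates
        if (is_user_active())           // back off while the user needs the GPU
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    cudaFree(d);
    return 0;
}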

MrS
Scanning for our furry friends since Jan 2002
ID: 31706
Thamir Ghaslan
Joined: 31 Mar 08
Posts: 61
Credit: 18,325,284
RAC: 0
Message 31707 - Posted: 29 Sep 2009, 19:42:50 UTC - in response to Message 31706.

You're right, I was talking about a 4870, and its maximum DP performance is indeed 240 GFlops at 750 MHz. Mine runs at 800 MHz (256 GFlops) and achieves ~190 GFlops at MW. That's a really, really good optimization done by CP; even achieving something close to these numbers is challenging.

Which makes me disappointed when other projects whine about overpay at MilkyWay!

I know it's already done, the credit lowering and all, and optimization will overcome or surpass the lowering, but BOINC and its projects are decentralized, so the MW admins should not bow to pressure from other projects!

CUDA is not equal to Brook. ATI is not equal to NVIDIA. Intel is not equal to AMD.
ID: 31707
verstapp
Joined: 26 Jan 09
Posts: 589
Credit: 497,834,261
RAC: 0
Message 31715 - Posted: 29 Sep 2009, 21:00:54 UTC

'All processors are equal, but some are more equal than others.' :D

Cheers,
PeterV
ID: 31715
ExtraTerrestrial Apes
Joined: 1 Sep 08
Posts: 204
Credit: 219,354,537
RAC: 0
Message 31718 - Posted: 29 Sep 2009, 21:38:22 UTC - in response to Message 31707.

Thamir, let's not turn this thread into another credit flame war.

But think about it for a moment: this 110€ ATI could do about 100k RAC @ MW, even after the credit "lowering" / "adjustment to the SETI standard". In my team we've got about 90 active members. On my account alone we've got a good 30 CPU cores; additionally I know of at least 2 fast nVidias (GT200), a medium one (G92), a small one (G84), a medium ATI and a PS3.
Recently we were doing 300k RAC, 150k of which was achieved just by our 2 HD4870s, whereas the other half was done by all those CPUs, the GPUs I named and possibly others.

Really, MW@ATI yields an insane amount of credit, even after the recent adjustments. I wouldn't want it to be even more extreme. The way it is now, it already almost completely discourages me from running anything else. I might even stop CPU crunching entirely if I didn't have an electricity flat rate. (*)

That's why other projects are "whining" about the overpay at MW. For most of them it's just impossible to utilize the hardware to this extent. It's not just about lazy programmers - the algorithm and the problem itself don't allow it. They could never achieve the same Flop/s even if they made an ATI app.

MrS

(*) Of course you can always say "but it's for the science!" But that just doesn't work - it's not the science which I see, it's the credits. And I'm trying to help my team as much as I can without spending an arm and a leg.
Scanning for our furry friends since Jan 2002
ID: 31718
Cluster Physik
Joined: 26 Jul 08
Posts: 627
Credit: 94,940,203
RAC: 0
Message 31720 - Posted: 29 Sep 2009, 22:16:21 UTC - in response to Message 31718.
Last modified: 29 Sep 2009, 22:17:18 UTC

That's why other projects are "whining" about the overpay at MW. For most of them it's just impossible to utilize the hardware to this extent. It's not just about lazy programmers - the algorithm and the problem itself don't allow it. They could never achieve the same Flop/s even if they made an ATI app.

MrS

That is very true.
The MW algorithm is really perfect for a GPU. Take vast amounts of parallelism (millions of threads), no branching (unless you want to call a loop with a counter checked each iteration a branch), a very compute-intense algorithm with only a few memory accesses, and minimal communication between threads (the values are just added up at the end), and what you get is virtually the peak performance of a given GPU for the instruction mix of the algorithm. It's not all multiply-adds, so you won't get exactly peak performance. But v0.20 has cut all the overhead down to a minimum, so you really arrive within 10% of what is theoretically possible with the algorithm's instruction mix. That's better than any current CPU achieves, even relative to its peak performance.
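Not the actual Milkyway kernel, just a sketch of the pattern being described (the integrand and sizes are placeholders):

// Millions of independent threads, a branch-free, arithmetic-dense body
// with few memory accesses, and communication only at the very end.
#include <cmath>
#include <cstdio>

__global__ void evaluate(const double *in, double *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;               // bounds check aside, no branching
    double x = in[tid], acc = 0.0;
    for (int k = 1; k <= 1000; ++k)     // the loop counter is the only "branch"
        acc += exp(-x * x * k);         // compute-dense placeholder integrand
    out[tid] = acc;                     // one write; no inter-thread traffic
}

int main() {
    const int n = 1 << 20;              // about a million threads
    double *in, *out;
    cudaMallocManaged(&in,  n * sizeof(double));
    cudaMallocManaged(&out, n * sizeof(double));
    for (int i = 0; i < n; ++i) in[i] = i * 1e-6;

    evaluate<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    double sum = 0.0;                   // the values are just added in the end
    for (int i = 0; i < n; ++i) sum += out[i];
    printf("sum = %g\n", sum);
    cudaFree(in); cudaFree(out);
    return 0;
}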

And that will continue to scale: the new ATI HD5870 should easily double the performance of an HD4890 at Milkyway. And when the next NVIDIA generation arrives, I'm quite sure it will do much more than double the DP performance of a GTX285 ;)
ID: 31720