GTX 480

Author	Message
[B^S] thierry@home Joined: 28 Aug 07 Posts: 20 Credit: 5,558,437 RAC: 0	Message 43075 - Posted: 21 Oct 2010, 21:21:34 UTC
I have just installed a GTX 480 and all the WUs die immediately with a calculation error. Do I need to install something? Thanks.
Tired to crunch alone? Join BOINC Synergy, the most exciting team in the galaxy. Join now!

[B^S] thierry@home Joined: 28 Aug 07 Posts: 20 Credit: 5,558,437 RAC: 0	Message 43077 - Posted: 21 Oct 2010, 21:56:37 UTC
Found the opti app. It works fine :-)

kepan Joined: 28 Oct 10 Posts: 8 Credit: 80,345,093 RAC: 0	Message 43471 - Posted: 4 Nov 2010, 18:27:28 UTC - in response to Message 43077.
What is 'the opti app'? I have just installed an ASUS GTX 480 and all work ends in a calculation error. BOINC shows:
NVIDIA GPU 0: GeForce GTX 480 (driver version 26099, CUDA version 3020, compute capability 2.0, 1503MB, 778 GFLOPS peak)
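The BOINC line above reports compute capability 2.0 and CUDA version 3020. As a rough sketch of why such a card can be turned away by an app built for older hardware, the C snippet below compares two eligibility checks. The function names are hypothetical, and `eligible_stock` is only an assumption about how the stock 0.24 app behaved (its source is not shown in this thread); `eligible_patched` mirrors the condition Crunch3r posts further down. The version decoding assumes the usual CUDA encoding of 1000 * major + 10 * minor.

```c
#include <stdbool.h>

/* ASSUMPTION: the stock app accepted only compute capability 1.3.
   Hypothetical helper, not taken from the project source. */
static bool eligible_stock(int major, int minor)
{
    return major == 1 && minor == 3;
}

/* Condition as posted later in this thread: 1.3, or any 2.x device. */
static bool eligible_patched(int major, int minor)
{
    return (major == 1 && minor == 3) || major == 2;
}

/* Assumes the CUDA encoding 1000 * major + 10 * minor (3020 -> 3.2). */
static void decode_cuda_version(int version, int *major, int *minor)
{
    *major = version / 1000;
    *minor = (version % 1000) / 10;
}
```

Under these assumptions a GTX 480 (2.0) fails the first check and passes the second, which would match the behaviour reported here.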

Sutaru Tsureku Joined: 30 Apr 09 Posts: 101 Credit: 29,871,931 RAC: 386	Message 43476 - Posted: 4 Nov 2010, 20:14:23 UTC - in response to Message 43471.
Number crunching : 2 x GTX470 worked with milkyway 0.24 (cuda23)

arkayn Joined: 14 Feb 09 Posts: 999 Credit: 74,932,619 RAC: 0	Message 43483 - Posted: 4 Nov 2010, 22:51:35 UTC Last modified: 4 Nov 2010, 22:51:58 UTC
Also on my site. http://www.arkayn.us/milkyway

Crunch3r Volunteer developer Joined: 17 Feb 08 Posts: 363 Credit: 258,227,990 RAC: 0	Message 43531 - Posted: 5 Nov 2010, 22:46:32 UTC - in response to Message 43471. Last modified: 5 Nov 2010, 22:56:30 UTC
[quote]What is 'the opti app.'? I have just installed an ASUS GTX480 and all work ends up in calculation error. Boinc shows: NVIDIA GPU 0: GeForce GTX 480 (driver version 26099, CUDA version 3020, compute capability 2.0, 1503MB, 778 GFLOPS peak)[/quote]
FWIW, it's not an "optimized" app. It's just a recompile of the stock 0.24 CUDA app code with a minor change to allow processing on GT4XX cards.
I only did this because the project "devs" were unable or unwilling to recompile the app to add support for the GT4XX cards...
It's just a few lines of code that need to be modded...

    int choose_cuda_13() {
        // check for and find a CUDA 1.3 (double precision) capable card
        int device_count;
        cutilSafeCall(cudaGetDeviceCount(&device_count));
        fprintf(stderr, "Found %d CUDA cards\n", device_count);
        if (device_count < 1) {
            fprintf(stderr, "No CUDA cards found, you cannot run the GPU version\n");
            exit(1);
        }

        int *eligable_devices = (int*)malloc(sizeof(int) * device_count);
        int eligable_device_idx = 0;

        int max_gflops = 0;
        int device;
        char *chosen_device = 0;
        for (int idx = 0; idx < device_count; ++idx) {
            cudaDeviceProp deviceProp;
            cutilSafeCall(cudaGetDeviceProperties(&deviceProp, idx));
            fprintf(stderr, "Found a %s\n", deviceProp.name);
            // accept compute capability 1.3 or any 2.x (GT4XX) device
            if ((deviceProp.major == 1 && deviceProp.minor == 3) || deviceProp.major == 2) {
                eligable_devices[eligable_device_idx++] = idx;
                fprintf(stderr, "Device can be used it has double precision support\n");
                // check how many gflops it has
                int gflops = deviceProp.multiProcessorCount * deviceProp.clockRate;
                if (gflops >= max_gflops) {
                    max_gflops = gflops;
                    device = idx;
                    if (chosen_device)
                        free(chosen_device);
                    chosen_device = (char*)malloc(sizeof(char) * strlen(deviceProp.name) + 1);
                    strncpy(chosen_device, deviceProp.name, strlen(deviceProp.name));
                    chosen_device[strlen(deviceProp.name)] = '\0';
                }
            } else {
                fprintf(stderr, "Device cannot be used, it does not have double precision support\n");
            }
        }
        free(eligable_devices);
        if (eligable_device_idx < 1) {
            fprintf(stderr, "No compute capability 1.3 or 2.x cards have been found, exiting...\n");
            free(chosen_device);
            return -1;
        } else {
            fprintf(stderr, "Chose device %s\n", chosen_device);
            cutilSafeCall(cudaSetDevice(device));
            free(chosen_device);
            return device;
        }
    }

Anyway, from looking at the code of the 0.24 app (and especially its somewhat faulty successor, v0.25), there's still room for improving the code dramatically. The approximation of the exp function in 0.25 is a good start, but it's quite slow that way...
There are options to approximate exp using some modified Remez algorithm. For SSE2-capable CPUs, it looks like this...

    static inline __m128d _mm_fexp_poly_pd(__m128d x1) // precise, but slower than _mm_exp_pd
    {
        /* remez11_0_log2_sse */
        __m128i k1;
        __m128d p1, a1;
        __m128d xmm0, xmm1;
        const __m128i offset = _mm_setr_epi32(1023, 1023, 0, 0);

        x1 = _mm_min_pd(x1, _mm_set1_pd( 129.00000));
        x1 = _mm_max_pd(x1, _mm_set1_pd(-126.99999));

        xmm0 = _mm_load_pd(log2e);
        xmm1 = _mm_setzero_pd();
        a1 = _mm_mul_pd(x1, xmm0);

        /* k = (int)floor(a); p = (float)k; */
        p1 = _mm_cmplt_pd(a1, xmm1);
        p1 = _mm_and_pd(p1, DONE);
        a1 = _mm_sub_pd(a1, p1);
        k1 = _mm_cvttpd_epi32(a1); // ipart
        p1 = _mm_cvtepi32_pd(k1);

        /* x -= p * log2; */
        xmm0 = _mm_load_pd(c1);
        xmm1 = _mm_load_pd(c2);
        a1 = _mm_mul_pd(p1, xmm0);
        x1 = _mm_sub_pd(x1, a1);
        a1 = _mm_mul_pd(p1, xmm1);
        x1 = _mm_sub_pd(x1, a1);

        /* Compute e^x using a polynomial approximation. */
        xmm0 = _mm_load_pd(w11);
        xmm1 = _mm_load_pd(w10);
        a1 = _mm_mul_pd(x1, xmm0);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w9);
        xmm1 = _mm_load_pd(w8);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w7);
        xmm1 = _mm_load_pd(w6);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w5);
        xmm1 = _mm_load_pd(w4);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w3);
        xmm1 = _mm_load_pd(w2);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w1);
        xmm1 = _mm_load_pd(w0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        /* p = 2^k; */
        k1 = _mm_add_epi32(k1, offset);
        k1 = _mm_slli_epi32(k1, 20);
        k1 = _mm_shuffle_epi32(k1, _MM_SHUFFLE(1,3,0,2));
        p1 = _mm_castsi128_pd(k1);

        /* a = 2^k. */
        return _mm_mul_pd(a1, p1);
    }

This can be modified for CUDA as well...
Anyway, I might do that when I have the time, or modify Gipsel's approach using a LUT, which is quite a bit faster. My implementation of it beats Gipsel's CPU SSE3 version by 20% in performance, but it's not precise enough... Something I might work on when I have the time.
Basically it's just fun beating the crap out of the stock apps, having a good laugh at the stock code and, of course, waiting for the OpenCL ATI app... that'll be the icing on the cake, and history will repeat itself....
Join Support science! Join Team BOINC United now!

[AF>EDLS] Polynesia Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0	Message 43532 - Posted: 5 Nov 2010, 23:27:28 UTC - in response to Message 43483.
[quote]Also on my site. http://www.arkayn.us/milkyway[/quote]
The application there is for 0.21, but the current version is 0.45 ...
Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470

arkayn Joined: 14 Feb 09 Posts: 999 Credit: 74,932,619 RAC: 0	Message 43535 - Posted: 6 Nov 2010, 2:28:50 UTC - in response to Message 43532.
I do not have the CPU apps linked right now, only the GPU apps.

[AF>EDLS] Polynesia Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0	Message 43546 - Posted: 6 Nov 2010, 11:18:41 UTC - in response to Message 43535.
[quote]I do not have the CPU apps linked right now, only the GPU apps.[/quote]
Hello, when will there be an optimized application for Nvidia? Thank you.