GTX 480
Author | Message |
---|---|
Send message Joined: 28 Aug 07 Posts: 20 Credit: 5,558,437 RAC: 0 |
I have just installed a GTX 480 and all the WUs die immediately with a calculation error. Do I need to install something? Thanks. Tired of crunching alone? Join BOINC Synergy, the most exciting team in the galaxy. Join now! |
Send message Joined: 28 Aug 07 Posts: 20 Credit: 5,558,437 RAC: 0 |
Found the opti app. It works fine :-) Tired of crunching alone? Join BOINC Synergy, the most exciting team in the galaxy. Join now! |
Send message Joined: 28 Oct 10 Posts: 8 Credit: 80,345,093 RAC: 0 |
What is 'the opti app.'? I have just installed an ASUS GTX480 and all work ends in a calculation error. BOINC shows: NVIDIA GPU 0: GeForce GTX 480 (driver version 26099, CUDA version 3020, compute capability 2.0, 1503MB, 778 GFLOPS peak) |
Send message Joined: 30 Apr 09 Posts: 101 Credit: 29,871,931 RAC: 386 |
|
Send message Joined: 14 Feb 09 Posts: 999 Credit: 74,932,619 RAC: 0 |
|
Send message Joined: 17 Feb 08 Posts: 363 Credit: 258,227,990 RAC: 0 |
[Quote] What is 'the opti app.'? I have just installed an ASUS GTX480 and all work ends up in calculation error. [/Quote]

FWIW, it's not an "optimized" app. It's just a recompile of the stock 0.24 CUDA app code with a minor change to allow processing on GT4XX cards. I only did this because the project "devs" were unable or unwilling to recompile the app to add support for the GT4XX cards... It's just a few lines of code that need to be modded:

    int choose_cuda_13() {
        // check for and find a CUDA 1.3 (double precision)
        // capable card; compute capability 2.x cards are accepted as well
        int device_count;
        cutilSafeCall(cudaGetDeviceCount(&device_count));
        fprintf(stderr, "Found %d CUDA cards\n", device_count);
        if (device_count < 1) {
            fprintf(stderr, "No CUDA cards found, you cannot run the GPU version\n");
            exit(1);
        }
        int *eligable_devices = (int*)malloc(sizeof(int) * device_count);
        int eligable_device_idx = 0;
        int max_gflops = 0;
        int device = -1;  // only used once an eligible card has been found
        char *chosen_device = 0;
        for (int idx = 0; idx < device_count; ++idx) {
            cudaDeviceProp deviceProp;
            cutilSafeCall(cudaGetDeviceProperties(&deviceProp, idx));
            fprintf(stderr, "Found a %s\n", deviceProp.name);
            // this condition is the mod: the "|| deviceProp.major == 2" part lets GT4XX cards through
            if ((deviceProp.major == 1 && deviceProp.minor == 3) || deviceProp.major == 2) {
                eligable_devices[eligable_device_idx++] = idx;
                fprintf(stderr, "Device can be used it has double precision support\n");
                // check how many gflops it has (multiprocessors * clock as a rough proxy)
                int gflops = deviceProp.multiProcessorCount * deviceProp.clockRate;
                if (gflops >= max_gflops) {
                    max_gflops = gflops;
                    device = idx;
                    if (chosen_device)
                        free(chosen_device);
                    chosen_device = (char*) malloc(sizeof(char) * (strlen(deviceProp.name) + 1));
                    strncpy(chosen_device, deviceProp.name, strlen(deviceProp.name));
                    chosen_device[strlen(deviceProp.name)] = '\0';
                }
            } else {
                fprintf(stderr, "Device cannot be used, it does not have double precision support\n");
            }
        }
        free(eligable_devices);
        if (eligable_device_idx < 1) {
            fprintf(stderr, "No compute capability 1.3 or 2.x cards have been found, exiting...\n");
            free(chosen_device);
            return -1;
        } else {
            fprintf(stderr, "Chose device %s\n", chosen_device);
            cutilSafeCall(cudaSetDevice(device));
            free(chosen_device);
            return device;
        }
    }

Anyway, from looking at the code of the 0.24 app (and especially its somewhat faulty successor, v0.25), there's still room to improve the code dramatically... The approximation of the exp function in 0.25 is a good start, but it's quite slow that way... There are options to approximate exp using a modified Remez algorithm... For SSE2-capable CPUs, it looks like this (log2e, c1, c2, w0..w11 and DONE are constants that are not shown here):

    static inline __m128d _mm_fexp_poly_pd(__m128d x1) // precise, but slower than _mm_exp_pd
    {
        /* remez11_0_log2_sse */
        __m128i k1;
        __m128d p1, a1;
        __m128d xmm0, xmm1;
        const __m128i offset = _mm_setr_epi32(1023, 1023, 0, 0);

        x1 = _mm_min_pd(x1, _mm_set1_pd( 129.00000));
        x1 = _mm_max_pd(x1, _mm_set1_pd(-126.99999));

        xmm0 = _mm_load_pd(log2e);
        xmm1 = _mm_setzero_pd();
        a1 = _mm_mul_pd(x1, xmm0);

        /* k = (int)floor(a); p = (float)k; */
        p1 = _mm_cmplt_pd(a1, xmm1);
        p1 = _mm_and_pd(p1, DONE);
        a1 = _mm_sub_pd(a1, p1);
        k1 = _mm_cvttpd_epi32(a1); // ipart
        p1 = _mm_cvtepi32_pd(k1);

        /* x -= p * log2; */
        xmm0 = _mm_load_pd(c1);
        xmm1 = _mm_load_pd(c2);
        a1 = _mm_mul_pd(p1, xmm0);
        x1 = _mm_sub_pd(x1, a1);
        a1 = _mm_mul_pd(p1, xmm1);
        x1 = _mm_sub_pd(x1, a1);

        /* Compute e^x using a polynomial approximation. */
        xmm0 = _mm_load_pd(w11);
        xmm1 = _mm_load_pd(w10);
        a1 = _mm_mul_pd(x1, xmm0);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w9);
        xmm1 = _mm_load_pd(w8);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w7);
        xmm1 = _mm_load_pd(w6);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w5);
        xmm1 = _mm_load_pd(w4);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w3);
        xmm1 = _mm_load_pd(w2);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        xmm0 = _mm_load_pd(w1);
        xmm1 = _mm_load_pd(w0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm0);
        a1 = _mm_mul_pd(a1, x1);
        a1 = _mm_add_pd(a1, xmm1);

        /* p = 2^k; */
        k1 = _mm_add_epi32(k1, offset);
        k1 = _mm_slli_epi32(k1, 20);
        k1 = _mm_shuffle_epi32(k1, _MM_SHUFFLE(1,3,0,2));
        p1 = _mm_castsi128_pd(k1);

        /* a *= 2^k. */
        return _mm_mul_pd(a1, p1);
    }

This can be modified for CUDA as well... Anyway, I might do that when I have the time, or modify Gipsel's approach using a LUT, which is quite a bit faster. My implementation of it beats Gipsel's CPU SSE3 version by 20% in performance, but it's not precise enough... Something I might work on when I have the time...

Basically it's just fun beating the crap out of the stock apps, having a good laugh at the stock code and, of course, waiting for the OpenCL ATI app... That'll be the icing on the cake, and history will repeat itself...

Support science! Join Team BOINC United now! |
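For anyone who wants to follow the steps of the SSE2 routine above without the intrinsics (reduce the range, evaluate a polynomial, then scale by 2^k), here is a minimal scalar C sketch. It is not the poster's code. The function name `fexp_poly_scalar` and the ±700 clamp are assumptions made for illustration, and the coefficients are plain Taylor terms 1/n! standing in for the Remez-fitted `w0`..`w11`, which the post does not list. The two-part split of ln 2 is a commonly used pair of constants and is only assumed to play the role of `c1` and `c2`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the range-reduction scheme; all names are hypothetical. */
static double fexp_poly_scalar(double x)
{
    /* Assumed constants: log2(e), and ln(2) split into a high and a low part. */
    const double LOG2E  = 1.4426950408889634;
    const double LN2_HI = 6.93145751953125e-1;
    const double LN2_LO = 1.42860682030941723212e-6;

    /* Taylor coefficients 1/n! for n = 0..11, stand-ins for w0..w11. */
    static const double w[12] = {
        1.0, 1.0, 1.0 / 2.0, 1.0 / 6.0, 1.0 / 24.0, 1.0 / 120.0,
        1.0 / 720.0, 1.0 / 5040.0, 1.0 / 40320.0, 1.0 / 362880.0,
        1.0 / 3628800.0, 1.0 / 39916800.0
    };

    double a, r, poly, scale;
    uint64_t bits;
    int k, i;

    /* The exponent bit trick below relies on a 64-bit double. */
    assert(sizeof(double) == sizeof(uint64_t));

    /* Clamp so that 2^k stays a normal, finite double (assumed limits). */
    if (x > 700.0)  x = 700.0;
    if (x < -700.0) x = -700.0;

    /* k = floor(x * log2(e)) */
    a = x * LOG2E;
    k = (int)a;                 /* truncates toward zero */
    if ((double)k > a) k--;     /* correct the truncation for negative a */

    /* r = x - k * ln(2), subtracted in two steps to limit rounding error */
    r = x - (double)k * LN2_HI;
    r -= (double)k * LN2_LO;

    /* Evaluate the degree-11 polynomial in r with Horner's scheme. */
    poly = w[11];
    for (i = 10; i >= 0; --i)
        poly = poly * r + w[i];

    /* Build 2^k by writing the biased exponent into bits 52..62. */
    bits = (uint64_t)(k + 1023) << 52;
    memcpy(&scale, &bits, sizeof scale);

    return poly * scale;        /* e^x = e^r * 2^k */
}
```

Calling `fexp_poly_scalar(1.0)` should return a value close to e. Because the coefficients here are Taylor terms rather than fitted ones, accuracy and speed will differ from the SSE2 version in the post.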
Send message Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0 |
Also on my site, the application is for 0.21 and it is at 0.45 ... Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470 |
Send message Joined: 14 Feb 09 Posts: 999 Credit: 74,932,619 RAC: 0 |
I do not have the CPU apps linked right now, only the GPU apps. |
Send message Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0 |
[Quote] I do not have the CPU apps linked right now, only the GPU apps. [/Quote] Hello, when will there be an application optimized for Nvidia? Thank you. Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470 |
©2024 Astroinformatics Group