OpenCL for Nvidia available for testing

Author	Message
Zeddicus Joined: 30 May 10 Posts: 2 Credit: 2,351 RAC: 0	Message 44678 - Posted: 4 Dec 2010, 18:56:24 UTC - in response to Message 44676. Last modified: 4 Dec 2010, 18:56:53 UTC "By the way: GeForce 8800 GTS (driver 26099, CUDA version 3020, compute capability 1.0)." "That GPU doesn't have doubles and won't work. It needs at least compute capability 1.3." Thanks for the information! But why did Milkyway send me any WUs then?

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44681 - Posted: 4 Dec 2010, 20:31:43 UTC - in response to Message 44678. "By the way: GeForce 8800 GTS (driver 26099, CUDA version 3020, compute capability 1.0)." "That GPU doesn't have doubles and won't work. It needs at least compute capability 1.3." "Thanks for the information! But why did Milkyway send me any WUs then?" If you tried manually installing this, the server will try sending the workunits to it. Normally you shouldn't be sent the Nvidia applications, since your GPU doesn't have doubles.
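The double-precision requirement discussed above can be sketched as a small check. This is only an illustration: the 1.3 threshold and the two example cards come from this thread, while the function names are hypothetical and are not part of BOINC or the Milkyway application.

```python
# Minimum CUDA compute capability with double-precision support,
# as stated in the thread ("at least compute capability 1.3").
MIN_DOUBLE_CAPABILITY = (1, 3)


def parse_capability(text):
    """Turn a string such as "1.3" into a (major, minor) tuple of ints."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def supports_doubles(capability_text):
    """Compare as integer tuples so that "1.10" would rank above "1.3"."""
    return parse_capability(capability_text) >= MIN_DOUBLE_CAPABILITY


# Example cards and capabilities as reported in this thread.
for name, cap in [("GeForce 8800 GTS", "1.0"), ("GeForce GTX 460", "2.1")]:
    verdict = "has doubles" if supports_doubles(cap) else "no doubles"
    print(f"{name} (compute capability {cap}): {verdict}")
```

Comparing tuples of integers rather than raw strings or floats avoids ordering mistakes once minor versions reach two digits.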

Werkstatt Joined: 19 Feb 08 Posts: 350 Credit: 141,284,369 RAC: 0	Message 44682 - Posted: 4 Dec 2010, 22:19:57 UTC - in response to Message 44617. "Is anyone running it in a system with both ATI and Nvidia drivers installed? I just realized a hypothetical problem that might happen." I now have a system running with a GTX460 and an HD4850, running both MW and Collatz on both GPUs. No problems seen in the last 12 hours. http://milkyway.cs.rpi.edu/milkyway/show_host_detail.php?hostid=168415

cenit Joined: 16 Mar 09 Posts: 58 Credit: 1,129,612 RAC: 0	Message 44683 - Posted: 4 Dec 2010, 22:29:31 UTC - in response to Message 44681. Last modified: 4 Dec 2010, 22:29:53 UTC "By the way: GeForce 8800 GTS (driver 26099, CUDA version 3020, compute capability 1.0)." "That GPU doesn't have doubles and won't work. It needs at least compute capability 1.3." "Thanks for the information! But why did Milkyway send me any WUs then?" "If you tried manually installing this, it will try sending the workunits to it. You shouldn't get sent the Nvidia applications since you don't have doubles." To be honest, this is not the way BOINC should be set up to work. On your server, you should analyze the host requesting work, and even if the user has installed the app manually, the server should not send work if the host doesn't satisfy all the requirements. If you leave it this way, it is really easy to trick your server into doing really bad things!

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44684 - Posted: 4 Dec 2010, 23:12:14 UTC - in response to Message 44683. "To be honest, this is not the way BOINC should be set up to work. On your server, you should analyze the host requesting work, and even if the user has installed the app manually, the server should not send work if the host doesn't satisfy all the requirements. If you leave it this way, it is really easy to trick your server into doing really bad things!" This is related to the problem I think is most annoying in BOINC. I think everything involving the version and capability management in BOINC should be better; pretty much everything about app_info.xml and the scheduling isn't user friendly and is inflexible. The plan class system is inflexible; adding versions with different system requirements involves modifying the server code, and it isn't composable for different features. app_info.xml doesn't handle updates and requires far too much manual intervention for what most people want. You might only want to run GPU workunits, or only N-body on the CPU and separation on the GPU, but right now you can't do that. You can say no CPU or no GPU, but not on a per-application basis. You have to manually download files, put the same information in several places in app_info.xml, and then you're stuck with whatever version you happened to install unless you go out of your way to update it. There should be a finer-grained way of telling the server what capabilities different applications and possibly workunits require, and on the client there should be a way of specifying capabilities you want to use with an actual interface of some sort beyond just stating that manually installed application X should be used in an XML file, and it should handle updates automatically.
I kind of want to work on a replacement that's more intelligent, but I don't really have the time.
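As a rough illustration of the kind of capability matching described in the post above, the following sketch combines a server-side requirement check with a per-application client preference. Everything here is hypothetical: the class names, the capability label "fp64", and the preference layout are assumptions made for the example and do not correspond to any existing BOINC data structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AppVersion:
    app: str                      # e.g. "separation" or "nbody"
    device: str                   # "cpu" or "gpu"
    requires: frozenset = field(default_factory=frozenset)


@dataclass
class Host:
    capabilities: set             # what the hardware reports, e.g. {"gpu", "fp64"}
    allowed: dict                 # user preference: app name -> set of devices


def eligible(version, host):
    """Send work only if the host meets every requirement AND the user opted in."""
    has_capabilities = version.requires <= host.capabilities
    user_wants_it = version.device in host.allowed.get(version.app, set())
    return has_capabilities and user_wants_it


# The preference from the post: N-body on the CPU, separation on the GPU.
prefs = {"nbody": {"cpu"}, "separation": {"gpu"}}
separation_gpu = AppVersion("separation", "gpu", frozenset({"gpu", "fp64"}))
nbody_cpu = AppVersion("nbody", "cpu")

no_doubles_host = Host({"gpu"}, prefs)          # GPU without double precision
doubles_host = Host({"gpu", "fp64"}, prefs)     # GPU with double precision

print(eligible(separation_gpu, no_doubles_host))
print(eligible(separation_gpu, doubles_host))
print(eligible(nbody_cpu, no_doubles_host))
```

Because requirements are plain sets, a new feature only adds a label to `requires` instead of needing a new hard-coded class, which is one way to get the composability the post asks for.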

bill Joined: 15 Jul 09 Posts: 12 Credit: 45,145,989 RAC: 0	Message 44686 - Posted: 5 Dec 2010, 1:28:43 UTC - in response to Message 44588. OK Matt, if you're still interested, your new app takes a little over 17 minutes to complete one OpenCL WU. This is with Seti@Home working on both CPUs (E6600 @ 2.4GHz) and Folding@home working as well. Windows XP32, Nvidia driver 206.63. Palit GeForce GTS 450 (Fermi) Sonic Platinum 1GB

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44690 - Posted: 5 Dec 2010, 3:10:01 UTC - in response to Message 44588. The update I've posted should give better errors when the GPU doesn't have doubles, and should help with system responsiveness on lower-end GPUs.

[AF>EDLS] Polynesia Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0	Message 44716 - Posted: 5 Dec 2010, 22:35:48 UTC Last modified: 5 Dec 2010, 22:51:30 UTC What updates? An updated application? Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44717 - Posted: 5 Dec 2010, 22:50:35 UTC - in response to Message 44716. "What updates? An updated application?" The second set of links I added to the original post.

[AF>EDLS] Polynesia Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0	Message 44718 - Posted: 5 Dec 2010, 22:56:44 UTC OK, thank you, but I did not understand: what are the minor updates to the application? Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470

Astromancer. Joined: 21 Nov 09 Posts: 49 Credit: 20,942,758 RAC: 0	Message 44749 - Posted: 6 Dec 2010, 22:34:27 UTC Last modified: 6 Dec 2010, 22:38:15 UTC With 48.0 I found that the runtimes are over 2m longer than with the CUDA app on my GTX260: 15:40 - 16:00 for OpenCL and 13:22 - 13:31 for CUDA. The memory bandwidth used on the card is up to about 50% as well, which is a HUGE jump from CUDA (0%). It also has a larger system memory footprint. I just found that interesting; it doesn't particularly concern me. And it also uses more CPU time than the CUDA app, though I didn't notice the CPU being used while I was watching the task manager. Does it use most / all of the cycles at the start or end of the WU or something?
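The run times quoted in the post above can be turned into a rough gap estimate. The sketch below only does arithmetic on the four reported times; it assumes they are minutes:seconds and says nothing about whether the workunits were of comparable size.

```python
def to_seconds(mmss):
    """Convert a "minutes:seconds" string such as "15:40" into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Ranges as reported in the post (GTX260, application version 48.0).
opencl = [to_seconds(t) for t in ("15:40", "16:00")]
cuda = [to_seconds(t) for t in ("13:22", "13:31")]

smallest_gap = min(opencl) - max(cuda)   # fastest OpenCL run vs slowest CUDA run
largest_gap = max(opencl) - min(cuda)    # slowest OpenCL run vs fastest CUDA run

print(f"gap: {smallest_gap} s to {largest_gap} s")
print(f"slowdown: {smallest_gap / max(cuda):.1%} to {largest_gap / min(cuda):.1%}")
```

On these figures the OpenCL runs are between 129 and 158 seconds longer, which is roughly 16% to 20% slower and consistent with the "over 2m" description.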

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44754 - Posted: 6 Dec 2010, 23:38:54 UTC - in response to Message 44749. "With 48.0 I found that the runtimes are over 2m longer than with the CUDA app on my GTX260. 15:40 - 16:00 for OpenCL and 13:22 - 13:31 for CUDA." Are you sure you're comparing the same things? Some of the newer workunits are quite a bit larger than they have been in the past. If it's not that, there might be some problem I haven't quite figured out, where there's a mysterious drop in performance at some points with how I break the problem up to keep the system responsive. There seemed to be strange peaks in run time at some points I tried, and I haven't quite figured out a good rule for different GPUs. I think I had something, but I haven't actually played with it on slower GPUs. For me on the 285, it seems to be about 2% faster than the CUDA one. "The memory bandwidth used on the card is up to about 50% as well which is a HUGE jump from CUDA (0%). As well as having a larger system memory footprint. Just found that interesting, it doesn't particularly concern me." I don't see why that would happen. There's basically no transfer done except at the beginning / end. "And it also uses more CPU time than the CUDA app, though I didn't notice the cpu being used while I was watching the task manager. Does it use most / all of the cycles at the start or end of the WU or something?" Quite likely. I think the old CUDA one did the final likelihood calculation on the GPU. The new one does it on the CPU, which takes a few seconds at the end, but it isn't actually worth the effort to do it on the GPU.
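The idea of breaking one large integration into several shorter kernel launches, so the GPU can still service the display in between, might be sketched as follows. This is a generic illustration under stated assumptions, not the rule the application actually uses: the function name is hypothetical, and padding each chunk up to a multiple of some launch granularity is an assumption.

```python
def split_into_chunks(area, num_chunks, multiple):
    """Split `area` work items into `num_chunks` equal launches.

    Each launch is rounded up to a multiple of `multiple` (for example a
    work group size), so the total may be padded with extra idle items.
    Returns (items_per_chunk, added_items).
    """
    per_chunk = -(-area // num_chunks)                 # ceiling division
    per_chunk = -(-per_chunk // multiple) * multiple   # round up to the multiple
    added = per_chunk * num_chunks - area
    return per_chunk, added


# Hypothetical sizes: 1000 work items, 4 launches, granularity of 64.
per_chunk, added = split_into_chunks(1000, 4, 64)
print(per_chunk, added)
```

More chunks mean shorter launches and a more responsive desktop, at the cost of launch overhead and padding; picking the chunk count per GPU is the tuning problem the post describes.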

Astromancer. Joined: 21 Nov 09 Posts: 49 Credit: 20,942,758 RAC: 0	Message 44755 - Posted: 6 Dec 2010, 23:46:48 UTC - in response to Message 44754. Matt, I downloaded the WUs all today and within about an hour of each other. I did 4 OpenCL then 3 CUDA before posting and another few CUDA after posting (with the same type of run time seen). The memory bandwidth usage struck me as odd as well, which is why I posted it. I was running GPU-z to watch what was going on a bit, to make sure it wasn't, say, using 50% of the GPU core or something, and noticed that the "Memory Controller Load" was reading at 50% or over. I've only ever seen that with SETI before. I'll give 48.1 a try and see if anything different happens. If there is some kind of test WU I can run through the command line or the like to help you out, I'd be more than willing to do it.

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44757 - Posted: 6 Dec 2010, 23:59:31 UTC - in response to Message 44755. "I downloaded the WUs all today and within about an hour of each other. I did 4 OpenCL then 3 CUDA before posting and another few CUDA after posting (with the same type of run time seen)." The current workunits aren't uniformly larger. There is a mixture of different sizes out right now, so it's hard to tell if this means anything without knowing which workunits you ran on each.

Astromancer. Joined: 21 Nov 09 Posts: 49 Credit: 20,942,758 RAC: 0	Message 44761 - Posted: 7 Dec 2010, 2:36:23 UTC - in response to Message 44757. After I posted the last one I figured you would need more details, since the deleter is deleting things right away. (Looks like I was right.) So I went about getting them for you. One other thing I noticed on my system is that the OpenCL tasks sit at 100% with the clock still going for about 30s. I'll PM you with all the info so I don't make a huge post full of data useless to anyone but you and Travis.

europa Joined: 29 Oct 10 Posts: 89 Credit: 39,246,947 RAC: 0	Message 44805 - Posted: 7 Dec 2010, 22:39:51 UTC - in response to Message 44676.
Matt, I wanted to give you an update on my use of the OpenCL app. To recap, I'm running 64-bit Ubuntu 10.10 on an AMD quad-core with 8GB of DDR2 RAM and a GTX460 Fermi card with the latest Nvidia driver from their website.
1. Simply extracting the tar to the top-level MW folder did not work, since the executable that it put there was for cuda23, not OpenCL. The WUs continued to be retrieved as cuda23 WUs and the completed ones continued to fail validation.
2. Having noticed that the app.xml and the correct executable had been extracted into a sub-folder in the MW main folder, I first moved them to another non-MW folder and then copied the contents into the main MW folder with the rest of the extracted files.
3. I then deleted the cuda23 executable.
4. Based on discussion at Collatz, I copied libcudart32_23.so into the MW folder.
5. I suspended the WUs and exited Boinc Manager.
6. I then opened a terminal window as root and typed "service boinc-client restart" [Enter] and closed the terminal window.
7. I restarted Boinc Manager and the apps and quickly saw work units identified as "open_cl" work units, along with the notation that they were using 0.5 CPU and 1.0 GPU. The WUs so far have taken about 15-20 min. to process vs. a few hours before.
8. However, at first, the WUs were coming back with the message "completed, validation inconclusive". Since then, it has changed to "Successful", so I guess I'm OK.
Here is the stderr_text for one of them.
Task 263496737 Name de_separation_17_3s_fix_1_1719797_1291478684_1 Workunit 198514068 Created 4 Dec 2010 16:09:29 UTC Sent 4 Dec 2010 16:10:24 UTC Received 4 Dec 2010 19:06:06 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 228451 Report deadline 12 Dec 2010 16:10:24 UTC Run time 1519.988412 CPU time 8.83 stderr out <core_client_version>6.10.58</core_client_version> <![CDATA[ <stderr_txt> <search_application> milkywayathome separation 0.48 Linux x86_64 double OpenCL </search_application> Found 1 platforms Platform 0 information: Platform name: NVIDIA CUDA Platform version: OpenCL 1.0 CUDA 3.2.1 Platform vendor: Platform profile: Platform extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll Using device 0 on platform 0 Found 1 CL devices Device GeForce GTX 460 (NVIDIA Corporation:0x10de) Type: CL_DEVICE_TYPE_GPU Driver version: 260.19.21 Version: OpenCL 1.0 CUDA Compute capability: 2.1 Little endian: CL_TRUE Error correction: CL_FALSE Image support: CL_TRUE Address bits: 32 Max compute units: 7 Clock frequency: 1502 Mhz Global mem size: 804454400 Max mem alloc: 201113600 Global mem cache: 114688 Cacheline size: 128 Local mem type: CL_LOCAL Local mem size: 49152 Max const args: 9 Max const buf size: 65536 Max parameter size: 4352 Max work group size: 1024 Max work item dim: 3 Max work item sizes: { 1024, 1024, 64 } Mem base addr align: 4096 Min type align size: 128 Timer resolution: 1000 ns Double extension: MW_CL_KHR_FP64 Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 Compiler flags: -cl-mad-enable -cl-no-signed-zeros -cl-strict-aliasing -cl-finite-math-only -DUSE_CL_MATH_TYPES=0 -DUSE_MAD=0 
-DUSE_FMA=0 -cl-nv-verbose -DDOUBLEPREC=1 -DMILKYWAY_MATH_COMPILATION -DNSTREAM=3 -DFAST_H_PROB=1 -DAUX_BG_PROFILE=0 -DUSE_IMAGES=1 -DI_DONT_KNOW_WHY_THIS_DOESNT_WORK_HERE=1 Build status: CL_BUILD_SUCCESS Build log: : Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_4' Kernel work group info: Work group size = 512 Kernel local mem size = 0 Compile work group size = { 0, 0, 0 } Lower n solution: n = 40, x = 0 Higher n solution: n = 40, x = 0 Using solution: n = 40, x = 0 Range: { nu_steps = 640, mu_steps = 1600, r_steps = 1400 } Iteration area: 2240000 Chunk estimate: 40 Num chunks: 40 Added area: 0 Effective area: 2240000 Integration time: 795.240523 s. Average time per iteration = 1242.563318 ms Kernel work group info: Work group size = 512 Kernel local mem size = 0 Compile work group size = { 0, 0, 0 } Lower n solution: n = 38, x = 1792 Higher n solution: n = 50, x = 0 Using solution: n = 38, x = 1792 Range: { nu_steps = 640, mu_steps = 400, r_steps = 1400 } Iteration area: 560000 Chunk estimate: 40 Num chunks: 38 Added area: 1792 Effective area: 561792 Integration time: 343.485032 s. Average time per iteration = 536.695363 ms Kernel work group info: Work group size = 512 Kernel local mem size = 0 Compile work group size = { 0, 0, 0 } Lower n solution: n = 38, x = 1792 Higher n solution: n = 50, x = 0 Using solution: n = 38, x = 1792 Range: { nu_steps = 640, mu_steps = 400, r_steps = 1400 } Iteration area: 560000 Chunk estimate: 40 Num chunks: 38 Added area: 1792 Effective area: 561792 Integration time: 369.818321 s. 
Average time per iteration = 577.841126 ms <background_integral> 0.00049448328945061429 </background_integral> <stream_integrals> 98.00489805211894633885 736.88664944943548107403 0.00661912358812184829 </stream_integrals> <background_only_likelihood> -3.28428541304979715321 </background_only_likelihood> <stream_only_likelihood> -35.67127293699287093887 -4.01220139598113423318 -231.34968260617702640047 </stream_only_likelihood> <search_likelihood> -3.04403874542270713732 </search_likelihood> 12:51:02 (3471): called boinc_finish </stderr_txt> ]]> Validate state Checked, but no consensus yet Claimed credit 0.0784035175711632 Granted credit 0 application version Anonymous platform
I assume, since it's not failing, that it is using the Fermi at the double-precision level. Sorry for the delay in posting this; the computer quit and it took me a couple of days to fix it. Thanks for getting me back in the game. Regards, Steve
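The figures in the stderr output above are internally consistent, which the following sketch checks. The relationships it uses (iteration area = mu_steps × r_steps, effective area = iteration area + added area, one iteration per nu step) are inferred from the numbers in the log, not taken from the application source, so treat them as assumptions.

```python
# One row per integral, copied from the log:
# (nu_steps, mu_steps, r_steps, added_area, integration_time_s, reported_ms_per_iteration)
integrals = [
    (640, 1600, 1400, 0, 795.240523, 1242.563318),
    (640, 400, 1400, 1792, 343.485032, 536.695363),
    (640, 400, 1400, 1792, 369.818321, 577.841126),
]


def summarize(nu, mu, r, added, seconds):
    """Return (iteration_area, effective_area, ms_per_iteration)."""
    area = mu * r
    return area, area + added, seconds / nu * 1000.0


results = [summarize(nu, mu, r, added, secs)
           for nu, mu, r, added, secs, _ in integrals]
total_seconds = sum(row[4] for row in integrals)

for (area, effective, ms), row in zip(results, integrals):
    print(f"area={area} effective={effective} "
          f"ms/iteration={ms:.3f} (log: {row[5]:.3f})")
print(f"total integration time: {total_seconds:.1f} s")
```

The three integrals add up to about 1508.5 s, which accounts for nearly all of the reported run time of 1519.988412 s.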

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44812 - Posted: 7 Dec 2010, 23:48:17 UTC - in response to Message 44805. Thanks to everyone who posted information. I've looked at the pieces, and I think I've pieced together why I made the slower GPUs slower; I half missed something obvious.

Matt Arsenault Volunteer moderator Project developer Project tester Project scientist Joined: 8 May 10 Posts: 576 Credit: 15,979,383 RAC: 0	Message 44820 - Posted: 8 Dec 2010, 7:21:52 UTC - in response to Message 44588. I've posted another minor update which should hopefully fix being slower on some GPUs.

[AF>EDLS] Polynesia Joined: 5 Apr 09 Posts: 71 Credit: 6,120,786 RAC: 0	Message 44853 - Posted: 9 Dec 2010, 18:40:21 UTC Last modified: 9 Dec 2010, 18:55:06 UTC Big trouble!! I'm now trying 0.48.2 and my PC is having slowdowns that I did not have with 0.48.1 ... In addition, the unit does not and this calculation is boosting the temperature of my card ... GPU load is at 99% ... but 0% CPU I do not think so in this case recalculate these units 0.48.2 ... I still do not think that just my version of boinc: 6.12.8? Team Alliance francophone, boinc: 7.0.18 GA-P55-UD5, i7 860, Win 7 64 bits, 8g DDR3, GTX 470

europa Joined: 29 Oct 10 Posts: 89 Credit: 39,246,947 RAC: 0	Message 44859 - Posted: 9 Dec 2010, 22:57:17 UTC - in response to Message 44820. Matt, the OpenCL setup continues to work like a charm on the first machine. I've assembled a second machine that is virtually identical. Aside from being an AMD 6-core vs. AMD quad-core and having DDR3 RAM vs. DDR2, they are the same. I even made a point of getting the same model graphics card (MSI Twin-Frozr GTX-460). Both run 64-bit Ubuntu 10.10 and also have the backward-compatibility 32-bit libraries. I updated the Nvidia driver and matched the permissions with those on the working machine, BUT no matter what I do, I cannot get milkyway_separation_0.48_x86_64-pc-linux-gnu__cuda_opencl to run on this machine. The cuda23 variant keeps reappearing in the folder and executing. I've tried multiple bulk deletes of the MW folder contents and re-extractions of the OpenCL tar, but I always end up with the cuda23 executable reappearing in the folder and taking over. When I start up Boinc it sees the Fermi card. As I said earlier, things continue to run like a charm on the first machine and this one is virtually identical. I can't figure out the problem. Any suggestions? Thanks for any help that you can provide. Regards, Steve
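One thing that could be checked when comparing the two machines is what actually sits in the project folder. The sketch below is a hypothetical helper, not a BOINC tool: it assumes the manual-install description file is named app_info.xml (as mentioned earlier in the thread) and that executables can be told apart by "opencl" or "cuda23" in their file names. The demo runs against a temporary folder with made-up contents, and "example_cuda23_binary" is an invented name.

```python
import tempfile
from pathlib import Path


def inspect_project_dir(path):
    """Report whether app_info.xml exists and which executables are present."""
    names = sorted(p.name for p in Path(path).iterdir() if p.is_file())
    return {
        "has_app_info": "app_info.xml" in names,
        "opencl_files": [n for n in names if "opencl" in n],
        "cuda23_files": [n for n in names if "cuda23" in n],
    }


# Demo on a throwaway folder; point the function at the real project
# folder on each machine to compare the results.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "app_info.xml",
        "milkyway_separation_0.48_x86_64-pc-linux-gnu__cuda_opencl",
        "example_cuda23_binary",  # hypothetical file name for illustration
    ):
        (Path(tmp) / name).touch()
    report = inspect_project_dir(tmp)

print(report)
```

Running the same check on the working and the failing machine would at least show whether the two folders differ, without claiming that this is the cause of the problem.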