Message boards :
Number crunching :
opencl_nvidia_101 on RTX 3060 Ti
Author | Message |
---|---|
Send message Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
I changed the graphics card in my rig from an old GTX 660 Ti to an RTX 3060 Ti. From the old setup, I still have many WUs marked "opencl_nvidia_101", and to me it seems the new card isn't much faster than the old one. Are there other WU types around that make use of the advanced features of a modern graphics card? My rig: AMD Ryzen 9 5950X, RTX 3060 Ti, Asus Prime B550M-WiFi, 32 GB RAM. Controlled by an app_config.xml, I have 4 GPU tasks running with 1 CPU each and another 8 CPU-only tasks. According to HWiNFO64, this keeps the average CPU (Tdie) at around 90 °C, which is still within the temperature limits of the CPU. There seems to be no difference between running GPU tasks with 0.25 GPU and 1 CPU or with just 0.5 CPU. Running MW@home on all 16 cores results in temps up to 94 °C, but still no thermal throttling (HTC) kicks in. However, I don't like having the Ryzen loaded at max all the time. |
Send message Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Nominally, the RTX 3060 Ti should be about 2.3 times faster than the GTX 660 Ti in terms of FP64 FLOPS (253 GFLOPS vs. 110 GFLOPS). That should be a noticeable difference, something like 8 simultaneous WUs vs. 4 simultaneous WUs completed in a similar time. |
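The ratio above can be checked directly from the GFLOPS figures quoted in the post (the FLOPS values themselves are the poster's, not independently verified here):

```python
# FP64 throughput figures quoted in the post, in GFLOPS
rtx_3060_ti_fp64 = 253
gtx_660_ti_fp64 = 110

ratio = rtx_3060_ti_fp64 / gtx_660_ti_fp64
print(f"FP64 ratio: {ratio:.1f}x")  # FP64 ratio: 2.3x
```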
Send message Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
Well, the RTX 3060 Ti completes a WU in about 9:55 minutes. This is almost the same as with the GTX 660 Ti, except that I can now run 4 GPU tasks in parallel; with the GTX 660 Ti it was only 2. The 660 Ti uses OpenCL 3.0 and CUDA compute capability 3.0, while the 3060 Ti still runs OpenCL 3.0 but compute capability 8.6. MW@home has been running Separation 1.46 WUs for 3-4 years now? And they are still using the old OpenCL 3.0 standard? |
Send message Joined: 24 Jan 11 Posts: 715 Credit: 555,441,634 RAC: 38,678 |
Well, MW used to use the OpenCL 1.2 standard. And OpenCL 3.0 is in fact the latest standard, and only recently supported by the card vendors' drivers. https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/OpenCL_API.pdf |
Send message Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
My bad. Depending on the source, the 660 Ti uses OpenCL 1.1 or 1.2, not 3.0. |
Send message Joined: 1 Feb 09 Posts: 4 Credit: 101,666,161 RAC: 0 |
I have a 3900X/3060 Ti doing Separation (GPU) tasks under Linux using the nvidia 510 drivers; it currently does 1 work unit in about 2:40. Is your 9:55 work unit time from running 4 at once? I had a lot of invalids to start out, but I updated my drivers from 470 to 510, disabled MilkyWay CPU work, and backed off on the CPU cores loaded with other BOINC projects, and one of these seems to have greatly lowered the invalid rate. I am pretty hesitant to push multiple tasks; what do you all think, would it be worth it for my card? |
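Whether 9:55 for four at once beats 2:40 one at a time comes down to effective throughput. A quick check, using only the times quoted in this thread:

```python
# Times quoted in this thread
four_at_once_wall = 9 * 60 + 55   # 595 s wall clock while 4 WUs run concurrently
single_task = 2 * 60 + 40         # 160 s for one WU running alone

# Effective wall-clock seconds spent per WU when running 4 concurrently
effective_per_wu = four_at_once_wall / 4
print(effective_per_wu, single_task)  # 148.75 160
```

So if the 9:55 figure is for four concurrent tasks, the batched setup comes out slightly ahead per WU (148.75 s vs. 160 s).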
Send message Joined: 13 Oct 21 Posts: 44 Credit: 226,966,300 RAC: 4,484 |
I have an HP Omen (running Windows 10, using WSL2 for Linux tasks) that came with a 3060 Ti. I upgraded the CPU to a 5900X (12C/24T) and the RAM to 64 GB. MW Separation tasks take ~2:45 per task running one at a time. I tested running multiple at a time a while back and found that the time per task slowed down too much; you get more done per unit of time by running one task at a time. I also run Einstein GPU tasks and found that FGRP1G tasks are best run one at a time, but O3AS ones 3 at a time. I very rarely get errors or invalids and run my PC at full load (CPU & GPU) 24/7, even with demanding loads such as 24 Rosettas or 24 LHC ATLAS tasks at a time, which can use almost all of the 64 GB of RAM. ace_quaker, besides drivers, check whether your firmware/VBIOS is up to date, and maybe also the motherboard drivers/firmware. I wouldn't think the load you place on your CPU should affect the error rate of GPU tasks. From my experience, MW Separation on a 3060 Ti is best run one task at a time. GolfSierra, I undervolted my CPU using Ryzen Master and my GPU using MSI Afterburner. This did wonders for cooling and noise with no noticeable performance hit: CPU temperatures rarely go over 70 °C (usually 60s) and the GPU is usually in the 50s, even under the heavy loads mentioned above. Since I run my PC at full load 24/7, I eventually optimized some more and tuned the CPU down to 3.7 GHz, as that significantly decreased power usage and I didn't notice any performance hit compared to running it at 4+ GHz. You might consider doing something similar, so you can use all, or at least more, of the 32 threads available on the 5950X, and at lower temperatures. |
Send message Joined: 13 Dec 12 Posts: 101 Credit: 1,782,758,310 RAC: 0 |
I found the following with my 3070 Ti: 1 × WU = 148 sec, 2 × WU = 120 sec, 3 × WU = 114 sec, 4 × WU = 112.5 sec. Any way you cut it, these cards SUCK at MilkyWay. (I bought this card for PrimeGrid, where they do very well.) For comparison, my 7990 = roughly 46 sec running 3 WUs; Radeon VII is about 45 sec running 4 WUs. |
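Reading those figures as effective time per WU (wall-clock time divided by the number of concurrent tasks, which is the only interpretation under which they are physically consistent), the throughput gain from batching on the 3070 Ti works out as modest:

```python
# Effective seconds per WU at each concurrency level, from the post above
per_wu = {1: 148, 2: 120, 3: 114, 4: 112.5}

# Throughput relative to running a single WU at a time
speedup = {n: round(per_wu[1] / t, 2) for n, t in per_wu.items()}
print(speedup)  # {1: 1.0, 2: 1.23, 3: 1.3, 4: 1.32}
```

So going from 1 to 4 concurrent WUs buys roughly a third more throughput, with most of the gain already captured at 2 concurrent tasks.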
Send message Joined: 24 Jan 11 Posts: 715 Credit: 555,441,634 RAC: 38,678 |
I find the following with my GTX 1080 Ti's: 1X = 101 seconds, 2X = 80 seconds. My 3080's do no better: 2X = 75-82 seconds. This Separation app is not very well optimized. When I was injecting an extra OpenCL-optimized library into the OS environment for the benefit of Einstein Gamma-Ray tasks, I discovered an unintentional beneficial side effect on MilkyWay Separation tasks that knocked 30 seconds off the runtimes compared to not injecting the optimized library. When we switched to a newer optimized GR application that brought the optimization code inside the application and got rid of the injected library, I lost that side-effect benefit on my MW tasks, and they reverted to the original runtimes of the stock application. |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
I find the following with my GTX 1080 Ti's - As I said somewhere else, they need Petri here too, if you guys can spare him!! :-) |
Send message Joined: 24 Jan 11 Posts: 715 Credit: 555,441,634 RAC: 38,678 |
I agree. I hope I can turn his attention to the Separation app as soon as he is finished with a final revision of the GR app. |
Send message Joined: 16 Mar 10 Posts: 213 Credit: 108,362,077 RAC: 4,510 |
I have a feeling that improving the efficiency of the existing program, without also finding a way to put many more sub-tasks into each work unit aimed at GPUs, will probably lead to more server overloading because of increased database activity :-( There are several issues with packing more into an individual work unit. The biggest is likely that the current parameter mechanism uses the work unit's command line, so it may be constrained by command-line length limits (modern BOINC clients on Linux can handle command lines of more than 1024 bytes; is that also true on Windows???). So it might be necessary to rewrite the wrapper to get the parameters from a file instead, and that would entail reworking the work-unit generator as well. Also, packing more into a work unit will mean large increases in run time for users of the CPU version, or the need for separate versions for CPU and GPU, with separate validators. Of course, some of the people with powerful GPUs will say something along the lines of "But you don't need CPU tasks for this project!" -- witness some of the posts about OpenPandemics at WCG! That may or may not be true, but it is unfair to folks without a GPU :-) Yes, making the application more efficient will have benefits, especially for users with [lots of] powerful GPUs. However, the possible consequences should also be considered -- need I remind the ex-SETI@home folks of the issues there as more and more people ran the "Special Sauce" application and the servers often got very bogged down... Cheers - Al. |
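The "parameters from a file" idea above could look roughly like this. This is a hypothetical sketch in Python purely for illustration (the real wrapper is native code, and the `--param-file` flag and file format here are invented, not anything the project actually uses):

```python
import sys
import shlex
from pathlib import Path

def load_parameters(argv):
    """Prefer a parameter file over the raw command line, so work units
    are not constrained by platform command-line length limits.

    Hypothetical convention: the client passes '--param-file PATH' and
    the file holds the same whitespace-separated tokens that would
    otherwise have appeared on the command line."""
    if len(argv) >= 3 and argv[1] == "--param-file":
        text = Path(argv[2]).read_text()
        return shlex.split(text)   # tokenize, honoring shell-style quoting
    return argv[1:]                # fall back to the old command-line behavior

if __name__ == "__main__":
    print(load_parameters(sys.argv))
```

Keeping the command-line path as a fallback would let old-style work units keep validating while the work-unit generator is migrated.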
Send message Joined: 24 Jan 11 Posts: 715 Credit: 555,441,634 RAC: 38,678 |
You make a very valid point, Al. We could improve the application speed to the point that the project's servers can't keep up or cope. In the end, IF we get Petri to improve the speed of the application and submit it to the project managers, it is still up to them to decide whether to release it to the general public. With the very low MW participation of GPUUG team members, I doubt the couple of us could impact the servers too much. The improvement in the Einstein GR app that Petri developed was incorporated into the Einstein stock application; again, that was a decision by the project managers that their servers could handle it. But now would not be the time to incorporate an improved application here, because things certainly have not returned to normal after the disk and database rebuild. |
Send message Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
I have a 3900x/3060Ti doing Separation (GPU) tasks under linux using nvidia 510 drivers currently does 1 work unit in about 2:40. Is your 9:55 work unit time running 4 at once? Yes, I'm controlling MW@home through an app_config.xml: 4 GPU tasks in parallel, with 0.25 GPU and 0.5 CPU per task. I've had no invalid results so far. You should give it a try. Besides the GPU tasks, my computer runs another 8 CPU-only tasks at 1 CPU each (of 16 CPUs) in parallel. I didn't enable all CPUs for number crunching, to keep the average CPU temps lower. This is my app_config.xml:

```xml
<app_config>
  <app>
    <name>milkyway</name>
    <max_concurrent>12</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>milkyway_nbody</name>
    <max_concurrent>12</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```
Send message Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
I find the following with my GTX 1080 Ti's - I fully agree. Video cards have made huge progress in capability; that's why GPUs, not CPUs, are used for mining. However, distributed computing does not yet make full use of this potential. |
©2024 Astroinformatics Group