Run Multiple WU's on Your GPU
Author | Message |
---|---|
Send message Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0 |
Not that I know of. I always have a crap load as well, and so do most people, I think. Nothing to worry about, methinks ;) |
Send message Joined: 24 Jan 11 Posts: 715 Credit: 557,042,037 RAC: 42,489 |
Quick question..... ATM I have about 186 WU's which are validation inconclusive. Could that be a result of running multiple WU's? Chooka, your app_info.xml looks fine for elements and syntax. Your inconclusives are less than 10% of your valid tasks, which matches my own inconclusive percentage. I see nothing to worry about. I didn't see any glaringly wrong numbers in the stderr.txt output of several of your inconclusive and valid tasks. Just keep motorin' along. ;^} |
Send message Joined: 13 Dec 12 Posts: 101 Credit: 1,782,758,310 RAC: 0 |
Hi guys, I'm back again. Bought a new video card and am getting multiple computational errors. Does this file still look correct? I seriously can't get my head around this stuff every time it comes up. Quite depressing really :(
<app_info>
  <app>
    <name>milkyway</name>
  </app>
  <file_info>
    <name>milkyway_1.43_windows_x86_64__opencl_ati_101.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>143</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>0.5</avg_ncpus>
    <max_ncpus>0.567833</max_ncpus>
    <plan_class>opencl_ati_101</plan_class>
    <cmdline></cmdline>
    <coproc>
      <type>ATI</type>
      <count>0.5</count>
    </coproc>
    <file_ref>
      <file_name>milkyway_1.43_windows_x86_64__opencl_ati_101.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
  <app>
    <name>milkyway_separation__modified_fit</name>
    <user_friendly_name>Milkyway Sep. (Mod. Fit)</user_friendly_name>
  </app>
  <file_info>
    <name>milkyway_separation__modified_fit_1.43_windows_x86_64.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>milkyway_separation__modified_fit</app_name>
    <version_num>143</version_num>
    <platform>windows_x86_64</platform>
    <file_ref>
      <file_name>milkyway_separation__modified_fit_1.43_windows_x86_64.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
  <app>
    <name>milkyway_separation__modified_fit</name>
  </app>
  <file_info>
    <name>milkyway_separation__modified_fit_1.43_windows_x86_64__opencl_ati_101.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>milkyway_separation__modified_fit</app_name>
    <version_num>143</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>0.05000</avg_ncpus>
    <max_ncpus>0.0567833</max_ncpus>
    <plan_class>opencl_ati_101</plan_class>
    <cmdline></cmdline>
    <coproc>
      <type>ATI</type>
      <count>0.5</count>
    </coproc>
    <file_ref>
      <file_name>milkyway_separation__modified_fit_1.43_windows_x86_64__opencl_ati_101.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
Is this an old...um...file name? 1.43?
If so, where do I download a newer version? Or would I just change the 1.43 to 1.46? Thank you once again. |
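For anyone who wants to sanity-check a file like the one above before restarting the client, here is a minimal sketch in Python. It assumes the usual reading of `<coproc><count>`: the value is the fraction of one GPU each task claims, so 0.5 should mean two tasks per GPU. `SAMPLE` is a trimmed excerpt of the posted file, and `gpu_share_summary` is a hypothetical helper name, not part of BOINC.

```python
import math
import xml.etree.ElementTree as ET

# Trimmed excerpt of the app_info.xml posted above (one GPU entry, one CPU-only entry).
SAMPLE = """<app_info>
  <app_version>
    <app_name>milkyway</app_name>
    <version_num>143</version_num>
    <plan_class>opencl_ati_101</plan_class>
    <avg_ncpus>0.5</avg_ncpus>
    <coproc><type>ATI</type><count>0.5</count></coproc>
  </app_version>
  <app_version>
    <app_name>milkyway_separation__modified_fit</app_name>
    <version_num>143</version_num>
  </app_version>
</app_info>"""


def gpu_share_summary(xml_text):
    """Return (app_name, version_num, gpu_share, tasks_per_gpu) for each GPU app_version."""
    root = ET.fromstring(xml_text)
    rows = []
    for app_version in root.iter("app_version"):
        count_text = app_version.findtext("coproc/count")
        if count_text is None:
            continue  # no <coproc> block: CPU-only entry, nothing to report
        share = float(count_text)
        if share <= 0:
            raise ValueError("coproc count must be positive")
        # Assumption: a share of 0.5 lets two tasks fit on one GPU, 0.25 lets four fit, etc.
        tasks_per_gpu = math.floor(1 / share + 1e-9)
        rows.append((app_version.findtext("app_name"),
                     app_version.findtext("version_num"),
                     share,
                     tasks_per_gpu))
    return rows


for name, version, share, tasks in gpu_share_summary(SAMPLE):
    print(f"{name} v{version}: {share} GPU per task, {tasks} task(s) per GPU")
```

To check a real file, read it with `open(path).read()` and pass the text in. A malformed file makes `ET.fromstring` raise an error, which doubles as a quick syntax check.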
Send message Joined: 13 Dec 12 Posts: 101 Credit: 1,782,758,310 RAC: 0 |
Sorry all, I got it sorted. The errors are being reported by other users too, so it's not my card. |
Send message Joined: 21 Mar 15 Posts: 3 Credit: 47,175,569 RAC: 0 |
I'm currently using an AMD Radeon RX Vega 64. I'm using the same settings from message 65387 for the AMD Radeon R9 Fury X, but instead of 3 WUs per GPU I'm running 4 WUs per GPU. Before the change, a single WU took about 90 seconds. After the change, four WUs run simultaneously on the GPU and all finish together in about 3 minutes. I may be able to go higher, but the time per batch may increase a lot more. |
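To put the numbers in the post above side by side, here is a small Python sketch of the arithmetic. It assumes every WU in a batch starts and finishes together, as described. `effective_seconds_per_wu` is a hypothetical helper name used only for illustration.

```python
def effective_seconds_per_wu(batch_seconds, concurrent):
    """Wall-clock seconds per completed WU when `concurrent` WUs finish in `batch_seconds`."""
    if concurrent < 1:
        raise ValueError("need at least one WU per batch")
    return batch_seconds / concurrent


# Figures from the post: 1 WU in ~90 s, versus 4 WUs together in ~180 s.
single = effective_seconds_per_wu(90, 1)
four_up = effective_seconds_per_wu(180, 4)

print(f"1 at a time: {single:.1f} s per WU")
print(f"4 at a time: {four_up:.1f} s per WU")
print(f"throughput gain: {single / four_up:.1f}x")
```

Each WU takes longer on the clock (180 s instead of 90 s), but four finish in that time, so the effective cost drops to about 45 s per WU. That is the figure to compare when trying 5 or more.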
Send message Joined: 24 Oct 11 Posts: 1 Credit: 100,090,867 RAC: 0 |
I've monitored the task flow on my client (ATI HD5870) over the last two days. Based on the current statistics, there are now three kinds of MW tasks: -- de_modfit_fast_XX_... -- reports an error; -- de_modfit_XX_... -- OK; -- de_modfit_fast_SimXX_... -- OK. The erroneous tasks fail across the entire quorum, regardless of video card type. Moreover, tasks running on the CPU report an error too. |
Send message Joined: 21 Mar 15 Posts: 3 Credit: 47,175,569 RAC: 0 |
Are you referring to http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4172? If so, there are two WU's that are causing all sorts of problems, they are: De_modfit_fast_18_3s_146_bundle5_ModfitConstraintsWithDisk_Bouncy and de_modfit_fast_20_3s_146_bundle5_ModfitConstraintsWithDisk_Bouncy. |
Send message Joined: 13 Mar 18 Posts: 9 Credit: 66,232,294 RAC: 0 |
I have observed that running too many WUs per GPU can cause the WUs to error out with OOMs. Is there any guidance on how much is too much? For instance, I have a GPU with 12GB RAM, what's the maximum number of WUs I can put on this GPU? In practice, 8 leads to errors and 4 does not. That said, the GPU is not crunching at 100% utilization with only 4 WUs scheduled at a time. |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
I have observed that running too many WUs per GPU can cause the WUs to error out with OOMs. Is there any guidance on how much is too much? For instance, I have a GPU with 12GB RAM, what's the maximum number of WUs I can put on this GPU? In practice, 8 leads to errors and 4 does not. That said, the GPU is not crunching at 100% utilization with only 4 WUs scheduled at a time. You have run into a Boinc software limitation, not a gpu limitation, Boinc itself can't see 12gb of ram on the gpu, it will in time but not now, so running that many workunits that each take that much memory will be a problem. |
Send message Joined: 13 Mar 18 Posts: 9 Credit: 66,232,294 RAC: 0 |
You have run into a Boinc software limitation, not a gpu limitation, Boinc itself can't see 12gb of ram on the gpu, it will in time but not now, so running that many workunits that each take that much memory will be a problem. How could this be a BOINC limitation? Do you have a citation on this? Or a link to the bug in the source code? It seems to me that if I ask BOINC to schedule 8 tasks per GPU, that BOINC will do that without trying to determine if the GPU has enough RAM. Additionally, the errors I am seeing are coming from the Milkyway WUs. The errors are intermittent too. The computer can successfully handle 8 WUs per GPU most of the time, but even a 5% error rate is too high. |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
You have run into a Boinc software limitation, not a gpu limitation, Boinc itself can't see 12gb of ram on the gpu, it will in time but not now, so running that many workunits that each take that much memory will be a problem. No I don't have it in front of me but the Boinc Server side software isn't capable of seeing all the memory the newer gpu's have. It's not a "bug in the source code" either, it's older programming code that hasn't caught up yet. Boinc is written by a bunch of volunteers right now and even though they are very dedicated they all have "real jobs" too and they are mostly just fixing bugs and things in the Boinc software, both the Client and Server side. The money ran out a while ago and things aren't being done as quickly as they used to be. |
Send message Joined: 13 Mar 18 Posts: 9 Credit: 66,232,294 RAC: 0 |
No I don't have it in front of me but the Boinc Server side software isn't capable of seeing all the memory the newer gpu's have. It's not a "bug in the source code" either, it's older programming code that hasn't caught up yet. Boinc is written by a bunch of volunteers right now and even though they are very dedicated they all have "real jobs" too and they are mostly just fixing bugs and things in the Boinc software, both the Client and Server side. The money ran out a while ago and things aren't being done as quickly as they used to be. Very interesting. I understand the challenge of maintaining an open-source project without the support of a full-time staff. I will take a look at the source and see if I can identify where the issue may be. If you have any recommendations on where to begin, that would be much appreciated! :) |
Send message Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0 |
Nice! I would do 3 WUs per GPU then. Try and stagger them. |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
No I don't have it in front of me but the Boinc Server side software isn't capable of seeing all the memory the newer gpu's have. It's not a "bug in the source code" either, it's older programming code that hasn't caught up yet. Boinc is written by a bunch of volunteers right now and even though they are very dedicated they all have "real jobs" too and they are mostly just fixing bugs and things in the Boinc software, both the Client and Server side. The money ran out a while ago and things aren't being done as quickly as they used to be. You can try here http://boinc.berkeley.edu/dl/ but I'm not sure that's what you want. Like you I'm just a user, unlike you I have no clue what to even look for. I do know there is an email list about the software but I don't have the link to it right now, it's mostly about the Alpha and Beta versions of the client software though. |
Send message Joined: 13 Mar 18 Posts: 9 Credit: 66,232,294 RAC: 0 |
I think this is a probable lead https://github.com/BOINC/boinc/issues/1773 |
Send message Joined: 26 Mar 18 Posts: 24 Credit: 102,912,937 RAC: 0 |
You have run into a Boinc software limitation, not a gpu limitation, Boinc itself can't see 12gb of ram on the gpu, it will in time but not now, so running that many workunits that each take that much memory will be a problem. So this info is incorrect. There is no issue with BOINC and 12 GB of RAM on a graphics card. The issue is that the application running the WU doesn't know to throttle back if it runs out of memory. So with 12 GB of GPU RAM and 8 WUs going, you can go past the 12 GB of available RAM; it will error out and, I think, kill all running WUs (or at least the one that ran out of memory). This is not a BOINC limitation but a limitation of the application crunching the WU. I recently tested a Tesla V100 with 16 GB of GPU RAM. I ran 10 WUs at a time and peaked at 14.5 GB of RAM used. It didn't error out...worked fine. This was running BOINC 7.6.31 on Ubuntu 16.04. If I pushed 12 WUs, depending on how they ran (RAM usage ramps up as a WU processes), they would error out because I ran out of GPU RAM. In general a Milkyway WU will peak at the end at around 1800 MB. Doing the math:
6 WU @ 1800 MB = 10.8 GB
8 WU @ 1800 MB = 14.4 GB
That's why you are erroring out at 8 WUs....you are randomly running out of GPU RAM. I say randomly because your WUs are all starting and ending at random times, and it's rare for all of them to finish at once (and hit peak memory usage). You could probably get away with 7, but some will still fail randomly. Here is a V100 running 7 WUs. Notice the GPU memory usage:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 000094A8:00:00.0 Off |                    0 |
| N/A   60C    P0   199W / 250W |   8915MiB / 16160MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     92457      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1838MiB |
|    0     92476      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1480MiB |
|    0     92484      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1838MiB |
|    0     92500      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1444MiB |
|    0     92523      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101  1480MiB |
|    0     92685      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   406MiB |
|    0     92693      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_101   358MiB |
+-----------------------------------------------------------------------------+
The ones at the top of the list have been running and are about to finish up. The ones at the bottom (higher PID) have just started. There is a command-line version of BOINC. If I were you I'd open up a DOS prompt and go to c:\Program Files\BOINC or wherever you have it installed. Run "boinccmd --get_project_status". Record the current time, the number of WUs you have, and the elapsed time. That's your baseline. Let it run for a couple of hours. Get the stats again. Calculate the difference (new time - old time and new total - old total). Do the math to find out how long you were taking per WU. Now divide that by 6. There is your approximate average per WU when running 6 at a time. Change it to 5 WUs, do it again. Change it to 4, do it again. Change it to 7, do it again (and watch for errors). You now have a number lower than the others. Stick with that many WUs. Also, in case anyone is wondering, after lots of playing a Tesla V100 seems optimal at 7 WUs, using 0.142 for the GPU setting; I used 0.5 for the CPU.
It seemed to give the best average WU time with quantity taken into account...37 s per WU with 7 at a time, or an average of one WU per 5.3 seconds. I also tested a P100 which, despite its price tag being 75% of a V100's, is almost half the speed. The best I could get out of it was 54.9 s per WU with 6 at a time, or an average of one WU per 9.14 seconds. 4 or 5 WUs were just about the same (9.32); below or above were slower on average. |
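The memory math in the post above can be sketched in a few lines of Python. This is a rough worst-case estimate, not anything BOINC or the Milkyway app computes. It assumes every WU hits the roughly 1800 MiB peak reported above at the same moment, which the post says is rare. `worst_case_mib`, `max_safe_concurrent`, and the `reserve_mib` parameter are hypothetical names for illustration.

```python
def worst_case_mib(concurrent, peak_per_wu_mib):
    """GPU memory needed if every running WU hits its peak at the same time."""
    return concurrent * peak_per_wu_mib


def max_safe_concurrent(gpu_mem_mib, peak_per_wu_mib, reserve_mib=0):
    """Largest WU count whose worst-case memory still fits on the card.

    reserve_mib is an assumed allowance for the display or other programs.
    """
    usable = gpu_mem_mib - reserve_mib
    return max(0, usable // peak_per_wu_mib)


PEAK_MIB = 1800  # approximate end-of-task peak per WU, taken from the post above

# 12 GB card, taken here as 12288 MiB: 8 WUs can need 14400 MiB, 6 WUs at most 10800 MiB.
print(worst_case_mib(8, PEAK_MIB), worst_case_mib(6, PEAK_MIB))
print(max_safe_concurrent(12288, PEAK_MIB))

# V100 from the nvidia-smi output above: 16160 MiB total.
print(max_safe_concurrent(16160, PEAK_MIB))
```

The estimate is deliberately conservative. The post reports 10 WUs running cleanly on the V100 because the tasks were staggered, so treat the result as the count that should never run out of memory, then test one or two above it while watching for errors.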
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
You have run into a Boinc software limitation, not a gpu limitation, Boinc itself can't see 12gb of ram on the gpu, it will in time but not now, so running that many workunits that each take that much memory will be a problem. Thank you very much, I learned something new today!! |
Send message Joined: 7 May 14 Posts: 57 Credit: 206,540,646 RAC: 0 |
Update: Radeon VII, 3 instances. I have uploaded a video to YouTube, including the config for three instances: https://youtu.be/4xKy9wGKmz4 |
Send message Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0 |
Update: Radeon VII, 3 instances. I have uploaded a video to YouTube, including the config for three instances. Nice! We could use someone like you on our Team :) |
Send message Joined: 16 Nov 09 Posts: 1 Credit: 99,314,406 RAC: 0 |
Thanks for the optimizing tips for the Radeon VII. Now I have another problem ..... is it possible to get more than 300 workunits at a time? :P Every time I have crunched 300 WU's and ask for more, it takes more than 5 minutes to get another batch of 300. |
©2024 Astroinformatics Group