30 Workunit Limit Per Request - Fix Implemented
Joined: 2 Aug 11 Posts: 13 Credit: 44,453,057 RAC: 0
Well - now I'm hitting a bug where I get no new WUs until the queue is completely empty (or I do a manual request/update), which I have never had before. Only one thing has changed on my side: I set the <exclude_gpu> option in cc_config. I have 2 GPUs in the computer and tried to dedicate each one to a project - one GPU runs MW@H exclusively while the second runs E@H exclusively, since mixing MW and Einstein tasks on the same GPU performs pretty badly. Two <exclude_gpu> entries in cc_config did the trick, but immediately afterwards the problem of not getting enough MW WUs popped up. So it may not be a server bug, but a bug in the BOINC client work scheduler when <exclude_gpu> is used. That is, BOINC tries to load WUs for the GPU that is excluded from the project while "forgetting" it has another GPU which is allowed to run this project. Debug entries like

    11/04/2019 18:18:06 | Milkyway@Home | [sched_op] NVIDIA GPU work request: 127884.51 seconds; 0.00 devices

reported above also suggest this: BOINC wants a lot of work for the GPU but is silly enough to report it has no GPU available, so the server does not send any work for such a strange request. So I'm curious - do other people with this problem also use <exclude_gpu> or <ignore_XXX_dev>?
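For reference, the kind of split being described would look roughly like this in cc_config.xml - a minimal sketch only, assuming device 0 is the GPU dedicated to MW@H and device 1 the one dedicated to E@H (the device numbers and the Einstein URL are assumptions; check your own event log for the right values):

    <cc_config>
      <options>
        <!-- keep Einstein@Home off GPU 0, leaving it free for Milkyway -->
        <exclude_gpu>
          <url>http://einstein.phys.uwm.edu/</url>
          <device_num>0</device_num>
        </exclude_gpu>
        <!-- keep Milkyway@Home off GPU 1, leaving it free for Einstein -->
        <exclude_gpu>
          <url>http://milkyway.cs.rpi.edu/milkyway/</url>
          <device_num>1</device_num>
        </exclude_gpu>
      </options>
    </cc_config>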
Joined: 2 Aug 11 Posts: 13 Credit: 44,453,057 RAC: 0
Yep, my BOINC client also thinks it has no GPU when it requests work for the GPU. Of course it gets nothing.

    23/04/2019 07:31:24 | Milkyway@Home | [sched_op] Starting scheduler request
    23/04/2019 07:31:24 | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) AMD/ATI GPU (60773.27 sec, 0.00 inst)
    23/04/2019 07:31:24 | Milkyway@Home | Sending scheduler request: To fetch work.
    23/04/2019 07:31:24 | Milkyway@Home | Reporting 1 completed tasks
    23/04/2019 07:31:24 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
    23/04/2019 07:31:24 | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
    23/04/2019 07:31:24 | Milkyway@Home | [sched_op] AMD/ATI GPU work request: 60773.27 seconds; 0.00 devices
    23/04/2019 07:31:26 | Milkyway@Home | Scheduler request completed: got 0 new tasks
    23/04/2019 07:31:26 | Milkyway@Home | [sched_op] Server version 713
    23/04/2019 07:31:26 | Milkyway@Home | Project requested delay of 91 seconds
    23/04/2019 07:31:26 | Milkyway@Home | [sched_op] handle_scheduler_reply(): got ack for task de_modfit_80_bundle4_4s_south4s_0_1555431910_2294680_0
    23/04/2019 07:31:26 | Milkyway@Home | [work_fetch] backing off AMD/ATI GPU 524 sec

Note the "0.00 inst" / "0.00 devices" in the GPU work request. I will try updating to the latest BOINC client to check whether this is already fixed or not (probably not)...
Joined: 2 Aug 11 Posts: 13 Credit: 44,453,057 RAC: 0
As expected - the latest BOINC client did not change anything. The problem is still here. More logs: when the queue is almost empty, BOINC starts counting the GPU fractionally (0.75 devices) and still does not get any work:

    23/04/2019 09:45:45 | Milkyway@Home | [work_fetch] set_request() for AMD/ATI GPU: ninst 2 nused_total 0.25 nidle_now 0.75 fetch share 1.00 req_inst 0.75 req_secs 74882.47
    23/04/2019 09:45:45 | Milkyway@Home | [sched_op] Starting scheduler request
    23/04/2019 09:45:45 | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) AMD/ATI GPU (74882.47 sec, 0.75 inst)
    23/04/2019 09:45:46 | Milkyway@Home | Sending scheduler request: To fetch work.
    23/04/2019 09:45:46 | Milkyway@Home | Reporting 2 completed tasks
    23/04/2019 09:45:46 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
    23/04/2019 09:45:46 | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
    23/04/2019 09:45:46 | Milkyway@Home | [sched_op] AMD/ATI GPU work request: 74882.47 seconds; 0.75 devices
    23/04/2019 09:45:48 | Milkyway@Home | Scheduler request completed: got 0 new tasks
    23/04/2019 09:45:48 | Milkyway@Home | [sched_op] Server version 713
    23/04/2019 09:45:48 | Milkyway@Home | Project requested delay of 91 seconds
    23/04/2019 09:45:48 | Milkyway@Home | [work_fetch] backing off AMD/ATI GPU 884 sec
    23/04/2019 09:45:48 | Milkyway@Home | [sched_op] Deferring communication for 00:01:31

But after the work queue is totally empty it asks for work for a whole GPU (1.00 devices) and finally gets a lot of work - 68 tasks in this example:

    23/04/2019 09:57:40 | Milkyway@Home | [work_fetch] set_request() for AMD/ATI GPU: ninst 2 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 1.00 req_secs 75450.52
    23/04/2019 09:57:40 | Milkyway@Home | [sched_op] Starting scheduler request
    23/04/2019 09:57:40 | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) AMD/ATI GPU (75450.52 sec, 1.00 inst)
    23/04/2019 09:57:40 | Milkyway@Home | Sending scheduler request: To fetch work.
    23/04/2019 09:57:40 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
    23/04/2019 09:57:40 | Milkyway@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
    23/04/2019 09:57:40 | Milkyway@Home | [sched_op] AMD/ATI GPU work request: 75450.52 seconds; 1.00 devices
    23/04/2019 09:57:42 | Milkyway@Home | Scheduler request completed: got 68 new tasks
    23/04/2019 09:57:42 | Milkyway@Home | [sched_op] Server version 713
    23/04/2019 09:57:42 | Milkyway@Home | Project requested delay of 91 seconds
    23/04/2019 09:57:42 | Milkyway@Home | [sched_op] estimated total CPU task duration: 0 seconds
    23/04/2019 09:57:42 | Milkyway@Home | [sched_op] estimated total AMD/ATI GPU task duration: 50593 seconds
    23/04/2019 09:57:42 | Milkyway@Home | [sched_op] Deferring communication for 00:01:31
    23/04/2019 09:57:42 | Milkyway@Home | [sched_op] Reason: requested by project
    23/04/2019 09:57:42 |  | [work_fetch] Request work fetch: RPC complete
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0
I have the same problem on both my S9000 (2 boards, 4 concurrent tasks) and RX 570 (3 boards, 1 task) systems. I can have hundreds of WUs queued up; 4-5 complete at a time and get reported, but "got 0 new tasks" shows up. Eventually the system is idle, I notice the problem, and a manual update fixes it. Then anywhere from 200-400 get downloaded instantly.

My log is not as detailed as the ones I read here; there must be some diagnostic setting I am not using. I looked at BoincTasks to see if its "rules" support making an automatic update, but there is no "Work Units Remaining" type in the rule selection. An expedient workaround would be to use boinccmd.exe to do an update every x minutes, but it would have to go on each of my systems and would be a PITA.

Question: Is this a problem that fixes itself after some time passes? If so, about how many minutes of idle time at most?

I have been playing with a wattmeter and the S9000 outperforms the RX 570 5 to 2 (credits per second) with only slightly more wattage. I will probably add risers for additional S9000s and switch the RX 570 to Einstein, which consumes far fewer watts than Milkyway.
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0
I have the same problem on both my S9000 (2 boards, 4 concurrent tasks) and RX 570 (3 boards, 1 task) systems. I can have hundreds of WUs queued up; 4-5 complete at a time and get reported, but "got 0 new tasks" shows up. Eventually the system is idle, I notice the problem, and a manual update fixes it. Then anywhere from 200-400 get downloaded instantly.

No, it seems to be more of a server-side setting, as EVERYONE, or almost everyone, is seeing the exact same thing. MW just switched to the latest version of the BOINC server software, so there is probably something obscure causing the problem. Admins at other projects have said that some of the settings are not where you think they might be, as it's all interconnected.
Joined: 24 Jan 11 Posts: 715 Credit: 555,456,799 RAC: 38,796
If and when #3076 makes it into the client master, the problems with work fetch should clear up. Richard and I signed off on it. Now we just have to have the code reviewers give it their blessing and have the PR merged into master.
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0
Hey Keith, Thanks for chiming in. Can you just clear up for me: is this a problem with one of our server settings, or is this a client bug? Thanks, Jake
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0
I reviewed the history log in BoincTasks looking for the largest gap between Milkyway completions and found a 10.25-minute gap, as shown HERE (I have a C# program that does this). This system completes a work unit every 15 seconds on average, so typically I see close to 600 WUs downloaded, then the queue goes to 0, and usually it is just a minute or 2 until the next batch, but occasionally the delay is as long as 10 minutes it seems. I can live with this.
Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0
The limit of 200 WUs per GPU, up to 600 WUs total per system, could be doubled until this bug gets worked out, for less idle time between fill-ups :)
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0
The limit of 200 WUs per GPU, up to 600 WUs total per system, could be doubled until this bug gets worked out, for less idle time between fill-ups :)

Actually, it can be worse. During the 10 or so minutes between fill-ups another project can sneak in and play havoc. My priority projects are science related, but if Milkyway (100%) runs out, then Einstein (50%) gets a boatload. If both go down my fallback is SETI, also at 50%. One of these days Asteroids@home will get an ATI app, but I am not holding my breath. I was testing this system, playing with risers, and only allowed Milkyway tasks, so the GPUs sat fully idle during those 10 minutes.
Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0
Using risers for the S9000, eh :) How are you cooling those?
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0
The S9000 has almost the same form factor as the HD 7950, so I bought a few "parts only" HD 7950s and use their coolers. The cooler fits fine but cannot be used in a case, as the molding extends too far to the rear. If you look at the photo you can see I had to offset the mounting bracket. This system is running 5 concurrent work units using three S9000s. The S9100 does not have the same form factor as the S9000 and a copper shim would be needed, which would be impractical. Third-party blowers are available for the S9100 that fit on the back of the card. I may try one in this system, but I suspect the Z400 will have problems with a 4th board. Power runs 540 watts at full load and about 140 watts at no load. The S9000 and S9100 take only a single 8-pin power connector, unlike the 7950 which has 8 + 6 connectors. I have had problems posting images; some show fine in preview but not after posting. If there is a problem, add "www" to the URL.
Joined: 24 Jan 11 Posts: 715 Credit: 555,456,799 RAC: 38,796
Hey Keith,

Hi Jake, it is a problem with the client code since my #2918 was merged into master around Mar 21. My bug was squashed, but the fix introduced a rather unpleasant result in work fetch: https://github.com/BOINC/boinc/issues/3065

The developers then persuaded DA to fix the work_fetch problem in #3076, which he has done. Richard and I both ran simulations in the BOINC Emulator and found that all projects could pull requested work properly, even when run dry with CPU cores idle. DA put in a lot of new routines that do a better job of rr_simulation to figure out work deficits and who should get work and when.

Also, I am not sure which version of the server codebase you are running now; I know you updated recently. There was a change in #3001 (fix possible overflow in peak FLOPS), also merged and included in server release 1.0.4, that changed work fetch to accommodate the correct GFLOPS calculations on AMD cards, based on the current REC (Recent Estimated Credit) mechanism, which calculates how much work to request for each project. So there could be issues with both the client and the server software interacting with each other currently.

I don't really know all that much about the server software; I have only investigated the client and manager branches. I doubt there is really anything wrong with your settings, but I am willing to accept that there might be something to change. I would defer to someone more knowledgeable about the server settings.
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0
Thank you Keith, Glad to know it's likely an issue with the client and likely not with us. We are running server 1.0.4. Best, Jake
Joined: 24 Jan 11 Posts: 715 Credit: 555,456,799 RAC: 38,796
One possible temporary solution, for those having issues with getting the client to request tasks when they have run dry, is to backlevel to BOINC 7.12.1. That client does not have the recent changes that disrupted work fetch in BOINC 7.14.2.
Joined: 2 Aug 11 Posts: 13 Credit: 44,453,057 RAC: 0
My log is not as detailed as the ones I read here. There must be some diagnostic setting I am not using.

Yes. I have used additional debug options in cc_config.xml to get more details in the log about work requests. Here they are:

    <cc_config>
      <log_flags>
        <work_fetch_debug>1</work_fetch_debug>
        <sched_op_debug>1</sched_op_debug>
      </log_flags>
    </cc_config>

They spam the log a LOT, so they are turned off by default.

Question: Is this a problem that fixes itself after some time passes? If so, about how many minutes maximum of idle time?

Yes, it fixes itself, but only after the queue is completely empty. Sometimes that is immediately after the last task from the queue completes; sometimes up to 10 minutes pass after the last task finishes. It depends on the backoff timer. The very first request after the last task will be successful, but if a backoff timer is counting down (due to previous failed requests), BOINC will wait before requesting new work again, even with an empty work queue.
Joined: 2 Aug 11 Posts: 13 Credit: 44,453,057 RAC: 0
If and when #3076 makes it into the client master, the problems with work fetch should clear up. Richard and I signed off on it. Now we just have to have the code reviewers give it their blessing and have the PR merged into master.

No. The fix you mention is only for users who use the max concurrent option to limit how many tasks from one project can run simultaneously. There is a bug (or at least non-optimal behavior) where the BOINC client does not request new work if it already has enough to fill this limit. E.g. a user sets the max concurrent option to 6 (6 tasks max can run simultaneously) and a work buffer of 1 day (say 100 tasks), but BOINC does not fill the buffer if it already has 6 or more tasks, whereas the correct amount would be 100 (or whatever the equivalent of 1 day of work is). That patch should fix this.

But right now on MW we have a different issue: BOINC cannot get new work regardless of the "max concurrent" setting:
- it hits users who do not use this option at all
- it gets new work only after the work buffer is completely empty, or if the user makes a manual update - not when the work buffer falls below "max concurrent" as described in the patch description and discussion

One possible temporary solution, for those having issues with getting the client to request tasks when they have run dry, is to backlevel to BOINC 7.12.1. That client does not have the recent changes that disrupted work fetch in BOINC 7.14.2.

You are wrong again here. This problem with getting work from MW exists with very old BOINC clients too. I saw it even in v7.6.22 - it behaves exactly the same as the latest stable 7.14.2. So there are two different, uncorrelated problems with work fetch/scheduling, and the patch you point to will not fix this one. Rolling back to a previous client version will not help either. There are a lot of bugs and very strange logic in the BOINC work scheduler - it has been known to be the weakest, buggiest part of the code for years.
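For anyone following along who has not used it, the "max concurrent" option discussed here is set in a per-project app_config.xml. A minimal sketch, with the value 6 taken from the example above and a hypothetical app name (check client_state.xml for the real one):

    <app_config>
      <!-- cap the whole project at 6 running tasks -->
      <project_max_concurrent>6</project_max_concurrent>
      <app>
        <!-- hypothetical app name, for illustration only -->
        <name>milkyway</name>
        <max_concurrent>6</max_concurrent>
      </app>
    </app_config>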
Joined: 24 Jan 11 Posts: 715 Credit: 555,456,799 RAC: 38,796
I've never had issues getting work from MW in all the time I have been attached - it's my one and only set-and-forget project. I have used many different BOINC versions and have never experienced the problem you describe or what Beemer Biker is dealing with.

#3076 does have fixes to deal with work fetch and the use of max_concurrent - I should know, because the impetus for 3076 was my #2918, which was just the starting point for it. Have you walked the code in work_fetch.cpp? I have. DA put in a ton of new structures for rr_simulation to do a better job of deciding how to request work for hosts attached to multiple projects. I ran simulations with the 3076 client using the BOINC emulator, as well as actually running it for over a day attached to my four projects, MW included, on my test host. No project ever ran out of work unless I let tasks run dry on purpose to try to trip up work fetch for a dry project while still maintaining a work commitment for the other competing projects.

I am just suggesting you give 3076 a try. Either get the AppVeyor artifact for Windows or wait for #3076 to be merged into master.
Joined: 26 Mar 18 Posts: 24 Credit: 102,912,937 RAC: 0
Not sure if there is a definitive fix for this, but I'm sitting idle for 10-15 minutes at a time waiting on work. If I manually update I get it, but that's not feasible in the middle of the night. Long story short, I have access to an AI learning machine with a Tesla V100 for 2-3 days per month. Once I'm done with my actual work I run this project on it. Currently I can churn through WUs at a 5.2-second-per-WU average, so I kill all 300 in about 26 minutes. Then the machine sits idle for 10-14 minutes, gets another batch of 300, and repeats. You can even see it in the logs for the CPU. That 12 minutes of average idle time every 26 minutes of work is the equivalent of 220 WUs not being done per hour, or over 5,000 a day. Granted, I only run the machine for a couple of days a month, but if there is a solution for this I'd appreciate it. Not sure if the WU cache limit can be increased based on how quickly a host is turning things back in. Obviously I would be happy if I got 1,000 WUs at a time.
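Spelling that estimate out with the numbers given above (5.2 s per WU, about 26 minutes of work followed by roughly 12 minutes idle per cycle):

\[
\frac{12}{26+12}\times\frac{3600}{5.2}\approx 0.32\times 692\approx 220~\text{WU/hour}\approx 5{,}200~\text{WU/day}
\]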
Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0
That 12 minutes of average idle time every 26 minutes of work is the equivalent of 220 WUs not being done per hour, or over 5,000 a day. Granted, I only run the machine for a couple of days a month, but if there is a solution for this I'd appreciate it.

This has been on a wish list for a long time. It affects only the higher-performance GPUs. There are several ways to work around it, and all rely on issuing an update after waiting about 2.5 minutes minimum; the delay is to allow the "your request is too soon" timeout to expire. For example, a Windows batch file:

    :start
    ping -n 150 127.0.0.1>nul
    boinccmd --project http://milkyway.cs.rpi.edu/milkyway/ update
    goto start

Alternately, you can create a task in Task Scheduler to run boinccmd every 3 or so minutes. I have not actually tried the above, as I have remote systems that run both Ubuntu and Windows, and for me it is easier to use a BoincTasks "rule" and issue a remote procedure call to do an update when the number of work units drops to zero. This project seems to be the only one with the problem, but that is probably because the throughput is very high for the S9000, S9100, etc., like your Tesla. Here is another view of the "lost" idle time (the white space between completions). My 4 GPUs run out of data after about 2.25 hours and average 8-minute gaps (tall blue bar) before getting more data.