30 Workunit Limit Per Request - Fix Implemented
Joined: 25 Feb 13 | Posts: 580 | Credit: 94,200,158 | RAC: 0
Hey Everyone,

Some users were only able to request a maximum of 30 workunits at a time. That is far too low to keep GPUs fully loaded. We have increased the feeder's shared memory size, which should raise the number of workunits available to request at any given time. Please let me know here if this has helped.

Best,
Jake
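For reference, the feeder cache Jake describes is governed by standard BOINC server options in the project's config.xml (documented on the ProjectOptions wiki page linked later in this thread). A minimal sketch of the relevant settings, with purely illustrative values rather than MilkyWay's actual configuration:

    <boinc>
      <config>
        <!-- Number of jobs the feeder keeps in shared memory; raising this
             increases how many workunits are on hand to satisfy requests.
             Value is illustrative, not MilkyWay's real setting. -->
        <shmem_work_items>1000</shmem_work_items>
        <!-- How many results the feeder's database enumeration fetches
             per query to refill that shared-memory array. -->
        <feeder_query_size>2000</feeder_query_size>
      </config>
    </boinc>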
Joined: 18 Jul 10 | Posts: 76 | Credit: 638,216,593 | RAC: 52,848
Jake - Just tried a "user update" and got another 30 tasks. And still, every time I complete a task and try to "replace" it, no tasks are downloaded.
Joined: 25 Feb 13 | Posts: 580 | Credit: 94,200,158 | RAC: 0
Hey wb8ili,

Can you give it another try for me and let me know what you see?

Best,
Jake
Joined: 22 Apr 09 | Posts: 95 | Credit: 4,808,181,963 | RAC: 0
Getting 200 tasks per request now - well done, Jake. I was only getting 40-50 of them before this fix.
Joined: 2 Oct 16 | Posts: 167 | Credit: 1,008,062,758 | RAC: 2,245
Jake - This is the issue.
Joined: 18 Jul 10 | Posts: 76 | Credit: 638,216,593 | RAC: 52,848
Jake - I just did a "user request for tasks" and received 51 new ones. As I have written before, every 4 minutes or so I complete a task, report it, request new tasks, and get nothing. I have about 200 tasks now from my user requests, so I will have to wait until tomorrow and see whether my stockpile bleeds down.
Joined: 2 Oct 16 | Posts: 167 | Credit: 1,008,062,758 | RAC: 2,245
Jake - I've run out of tasks several times overnight. BOINCTasks history shows a lot of MW tasks, then one task from Moo! or Collatz (my 0% resource share backup projects), then a lot more MW tasks. There were several cycles of this.
Joined: 18 Jul 10 | Posts: 76 | Credit: 638,216,593 | RAC: 52,848
Jake -

27 Mar 2019 20:44:37 - Received 51 tasks via a user-requested update. Now have approx. 200 tasks on hand.

28 Mar 2019 10:26:00 - Have completed 190 tasks since then. Each completed task was automatically reported, with a request for new tasks each time. Received none. Now have 10 tasks on hand. Will probably run out of work in less than an hour. Will report back what happens then.
Joined: 25 Feb 13 | Posts: 580 | Credit: 94,200,158 | RAC: 0
Hey guys,

The current setup allows each computer up to 200 workunits per GPU and another 40 workunits per CPU, with a maximum of 600 workunits in total. On the server, we try to keep a cache of 10,000 workunits; when a lot of people request work at the same time, this cache can run low.

All of the numbers I have listed are tunable. What changes would you recommend?

Jake
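For reference, these per-host limits map onto standard BOINC scheduler options in config.xml. A sketch of settings consistent with the numbers Jake quotes - the option names come from the BOINC ProjectOptions wiki, and the values are assumed from his description, not read from MilkyWay's actual server:

    <boinc>
      <config>
        <!-- In-progress limits per host resource, matching the numbers
             quoted above (assumed values, not confirmed). -->
        <max_wus_in_progress>40</max_wus_in_progress>
        <max_wus_in_progress_gpu>200</max_wus_in_progress_gpu>
        <!-- Upper bound on jobs handed out in one scheduler reply. -->
        <max_wus_to_send>200</max_wus_to_send>
      </config>
    </boinc>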
Joined: 18 Jul 10 | Posts: 76 | Credit: 638,216,593 | RAC: 52,848
Here is the rest of my story -

28 Mar 2019 11:06:29 - I finished my last task and was out of work.

28 Mar 2019 11:13:08 - The BOINC Manager (I assume) sent a request for new tasks. Got 200. I am good for the next 15 hours.

I was out of work for 6.5 minutes with no user intervention.

Jake - something has changed in the recent past (new server?) such that MY queue isn't being maintained. I still think it has something to do with requesting tasks too frequently, and I still think around 6 minutes is the cutoff. However, I don't get a "too recent since last request" message or similar.

Edit: I checked the log on another of my computers, one that takes 14 minutes to complete a task, and it shows that machine getting a new task pretty much every time it finishes one.
Joined: 22 Apr 09 | Posts: 95 | Credit: 4,808,181,963 | RAC: 0
I have been closely monitoring my BOINC machines as well, and I can confirm that, despite downloading 200 tasks per request, the app is still running out of work. As wb8ili said, the queue is not maintained and the client goes through those 200 tasks without downloading new ones. The Event Log repeatedly shows something like this:

    28/03/2019 20:06:26 | Milkyway@Home | Sending scheduler request: To fetch work.
    28/03/2019 20:06:26 | Milkyway@Home | Reporting 8 completed tasks
    28/03/2019 20:06:26 | Milkyway@Home | Requesting new tasks for NVIDIA GPU
    28/03/2019 20:06:29 | Milkyway@Home | Scheduler request completed: got 0 new tasks

Eventually, after the queue is completely cleared and there are no more tasks to crunch, 200 new tasks are downloaded - but there is always a period during which the client sits out of work before the queue is refilled.
Joined: 25 Feb 13 | Posts: 580 | Credit: 94,200,158 | RAC: 0
I'm not sure how to force the client to download new work when it returns workunits to the scheduler. I've been searching the BOINC documentation for flags to try, but I can't find anything. Anyone have any suggestions?

Jake
Joined: 18 Jul 10 | Posts: 76 | Credit: 638,216,593 | RAC: 52,848
Jake - It is a timing thing. Somewhere between 4.5 minutes and 14 minutes (6 minutes?), there is a "timer" in Milkyway that blocks new task downloads unless more time than the "timer" allows has passed since the last request. Every other project I am familiar with has this feature; in CPDN it is one hour. And every project I use resets the timer to its maximum if you request tasks before it has expired. I have 6 machines running Milkyway GPU tasks. Only my fastest has this issue (running out of work).

Edit: I just checked Einstein, and its "timer" looks to be about 1 minute.
Joined: 2 Oct 16 | Posts: 167 | Credit: 1,008,062,758 | RAC: 2,245
Hey guys,

It's not any of these settings. When the server allows work, we get work. But there is a timeout to prevent users from spamming projects with frequent requests. These tasks are so quick that completed tasks are constantly uploading - about every 30-35 seconds for me. So we keep requesting too frequently until all the tasks are done, the delay passes, and only then can we get more work.

I still have a PC that has not contacted the server since the upgrade. Its old sched_reply_milkyway.cs.rpi.edu_milkyway.xml file does not have this line at all:

    <next_rpc_delay>600.000000</next_rpc_delay>

https://boinc.berkeley.edu/trac/wiki/ProjectOptions#client-control

For reference, here is the entire old version of the file, minus some user info:

    <scheduler_reply>
        <scheduler_version>707</scheduler_version>
        <dont_use_dcf/>
        <master_url>http://milkyway.cs.rpi.edu/milkyway/</master_url>
        <request_delay>91.000000</request_delay>
        <project_name>Milkyway@Home</project_name>
        <project_preferences>
            <resource_share>10</resource_share>
            <no_cpu>1</no_cpu>
            <no_ati>0</no_ati>
            <no_cuda>0</no_cuda>
            <project_specific>
                <max_gfx_cpu_pct>20</max_gfx_cpu_pct>
                <gpu_target_frequency>60</gpu_target_frequency>
                <nbody_graphics_poll_period>30</nbody_graphics_poll_period>
                <nbody_graphics_float_speed>5</nbody_graphics_float_speed>
                <nbody_graphics_textured_point_size>250</nbody_graphics_textured_point_size>
                <nbody_graphics_point_point_size>40</nbody_graphics_point_point_size>
            </project_specific>
            <venue name="home">
                <resource_share>50</resource_share>
                <no_cpu>0</no_cpu>
                <no_ati>1</no_ati>
                <no_cuda>1</no_cuda>
                <project_specific>
                    <max_gfx_cpu_pct>20</max_gfx_cpu_pct>
                    <gpu_target_frequency>60</gpu_target_frequency>
                    <nbody_graphics_poll_period>30</nbody_graphics_poll_period>
                    <nbody_graphics_float_speed>5</nbody_graphics_float_speed>
                    <nbody_graphics_textured_point_size>250</nbody_graphics_textured_point_size>
                    <nbody_graphics_point_point_size>40</nbody_graphics_point_point_size>
                </project_specific>
            </venue>
        </project_preferences>
        <result_ack>
            <name>de_modfit_sim19fixed_bundle4_4s_NoContraintsWithDisk260_3_1533467104_9447502_1</name>
        </result_ack>
        <result_ack>
            <name>de_modfit_sim19fixed_bundle4_4s_NoContraintsWithDisk260_1_1533467104_9241919_1</name>
        </result_ack>
    </scheduler_reply>
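For reference, next_rpc_delay is set server-side in the project's config.xml (per the ProjectOptions page linked above). A minimal sketch of where the observed values would come from - the 600 matches the new scheduler replies, and the min_sendwork_interval value is an assumption inferred from the 91-second request_delay seen above:

    <boinc>
      <config>
        <!-- Tells clients not to contact the server again for this many
             seconds; 600 matches the value seen in the new replies. -->
        <next_rpc_delay>600</next_rpc_delay>
        <!-- Minimum spacing between requests that can receive work;
             surfaces in replies as <request_delay>. A setting near 90
             is assumed here, inferred from the 91.000000 above. -->
        <min_sendwork_interval>90</min_sendwork_interval>
      </config>
    </boinc>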
Joined: 24 Jan 11 | Posts: 715 | Credit: 555,471,277 | RAC: 38,543
I think mmonnin has found the problem for fast clients. The rpc_delay is much too long for fast-turnaround clients: they can exhaust all 200 tasks within the ten-minute span before the next connection. This is a server-side parameter that the project staff can alter. Seti has a 303-second rpc_delay, which is borderline too long for fast clients as well.
Joined: 2 Oct 16 | Posts: 167 | Credit: 1,008,062,758 | RAC: 2,245
Well, for my 280X it takes over 2 hours, but there are quite a few requests within the 10 minutes. I wish it took only 30 min. :) I'd guess a PC would need to take over 10 min per task to avoid the issue. These 4 lines come every 2-4 task completions. Tasks complete, none are downloaded, and the queue runs dry. Moo/Collatz take over 10 min per task, and then a new set of MW tasks arrives.

    362062 Milkyway@Home 3/28/2019 11:01:04 PM Sending scheduler request: To fetch work.
    362063 Milkyway@Home 3/28/2019 11:01:04 PM Reporting 2 completed tasks
    362064 Milkyway@Home 3/28/2019 11:01:04 PM Requesting new tasks for AMD/ATI GPU
    362065 Milkyway@Home 3/28/2019 11:01:06 PM Scheduler request completed: got 0 new tasks

Can the server distinguish between automatic updates like those above and user-initiated updates, so that the former could have a lower limit than possible user spam? The client log mentions a user update, but does the server know?
Joined: 24 Jul 12 | Posts: 40 | Credit: 7,123,301,054 | RAC: 0
I also think mmonnin has found the problem for fast clients. On my fastest system I complete the 200 WUs in 40 minutes, getting no new work from the request made each time completed WUs are reported. After all the WUs are completed, my BOINC client does not request any new work for 10 minutes. So 40 minutes of computing followed by 10 minutes sitting idle is not very efficient!! I would certainly like to see the next_rpc_delay reduced.

Joe
Joined: 7 May 14 | Posts: 57 | Credit: 206,540,646 | RAC: 5
Hi, I need at least 8 days' worth of WUs stored; I need offline WUs for at least 2 days at a time. I'm doing 19s per WU. Cheers
Joined: 8 May 09 | Posts: 3339 | Credit: 524,010,781 | RAC: 0
"Hi, I need at least 8 days' worth of WUs stored; I need offline WUs for at least 2 days at a time. I'm doing 19s per WU. Cheers"

Unless you have a very slow GPU, that won't happen here at MilkyWay; they don't want to tie up that many workunits on one machine at one time. The key here is to return a WU so you can get another WU. Recently that has been a minor problem, but the Admin has been working very hard to get it back up and running like it did under the old server-side software version.
Joined: 2 Aug 11 | Posts: 13 | Credit: 44,453,057 | RAC: 0
"I think mmonnin has found the problem for fast clients. The rpc_delay is much too long for fast turnaround clients. They can exhaust all 200 tasks in the ten minute span before next connection."

200 tasks computed in less than 10 minutes? That is not possible even for the fastest machines. The very best computers, with a few modern, powerful GPUs working in parallel and dedicated solely to MW, can do 200 tasks "only" in ~20-40 minutes.

The real problem is not rpc_delay by itself but the mismatch between <next_rpc_delay> = 600 sec and <request_delay> = 91 sec. The server currently asks the client to wait at least 91 sec before the next request, but "bans" it (gives no new tasks and resets the countdown back to 600 sec) if less than 600 sec has passed since the previous request. Fast computers report completed tasks and request new ones every few minutes - they think this is fine, since the server asks them to wait only 91 sec, so sending a new request after 120 sec, for example, looks OK - and they get "banned" every time because less than 600 sec has passed since the latest request.

With <request_delay> greater than <next_rpc_delay> there would be no such problem. So an INCREASE of <request_delay> (shown as the "communication deferred" countdown timer in the BOINC client's project status) to more than 600 sec should fix this; it would also reduce server load significantly. Alternatively, decrease the server-side timeout. Either way, the client-side delay should be greater than, or at least equal to, the server-side delay/timeout, to avoid falsely flagging fast clients as "misbehaving" ones.
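In config.xml terms (per the ProjectOptions page linked earlier in the thread), the suggested fix amounts to keeping the client-side delay at or above the server-side window. A minimal sketch under this thread's reading of the two options - the option names are from the BOINC documentation, the values illustrate only the constraint, not a recommended configuration:

    <boinc>
      <config>
        <!-- Server-side window within which, per this thread's reading,
             a repeat request gets no work; lowering it is one fix. -->
        <next_rpc_delay>600</next_rpc_delay>
        <!-- Governs the <request_delay> clients are told to honor;
             raising it to at least next_rpc_delay is the other fix. -->
        <min_sendwork_interval>600</min_sendwork_interval>
      </config>
    </boinc>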