Welcome to MilkyWay@home

scheduler request timeout when reporting more than 1 result at once

Message boards : Number crunching : scheduler request timeout when reporting more than 1 result at once

xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 76999 - Posted: 2 Apr 2024, 20:03:06 UTC
Last modified: 2 Apr 2024, 20:06:53 UTC

I ran MilkyWay on a couple of computers until March 20, then diverted them to other projects. I don't recall any issues with MilkyWay before March 20.

This weekend I switched one of the computers (a fast and large one) back to MilkyWay, and it soon got into a situation in which all scheduler requests to report results (and to request new work) timed out.

I noticed that I had <max_tasks_reported>40</max_tasks_reported> in cc_config.xml (which isn't a lot) and have now decreased it. It turns out that reports of 2, 3, or 4 tasks sometimes get through without a timeout, but not always. None of my attempts to report more than 4 tasks at once succeeded; they all timed out. The attempts with 2, 3, or 4 at once seem to have a better chance of succeeding if the client does not request new work in the same transaction, but I am not sure about that.
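For anyone who wants to experiment with the same setting: it goes into the <options> section of cc_config.xml in the BOINC data directory. A minimal fragment (the value 4 here is just my current experiment, not a recommendation):

```xml
<cc_config>
  <options>
    <!-- cap on how many finished results the client reports per scheduler request -->
    <max_tasks_reported>4</max_tasks_reported>
  </options>
</cc_config>
```

The client picks this up after "Read config files" in the Manager or a client restart.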

However, those attempts still time out very often, so I have temporarily set "no new work" and <max_tasks_reported>1</max_tasks_reported>. That seems never to time out. The one big downside is that I have accumulated ~1,800 results to report by now. I am forcing a scheduler request every 15 seconds in the hope of clearing this backlog overnight. :-(

BTW, the median task duration on this computer is currently about 6 minutes, and it runs 32 MilkyWay tasks concurrently. That is, in steady state it has a mean time between task completions of maybe 11 seconds. (This depends on the rolling average of the task sizes, obviously.) With MilkyWay's server-side request_backoff of 91 seconds, i.e. the client normally issuing requests at most every ~96 seconds by itself, this client would normally need to report 8 or more results within a single scheduler request if it were to run MilkyWay full time with the current small tasks.
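The arithmetic above can be sketched in a few lines (all numbers are the ones from this post):

```python
import math

median_task_s = 6 * 60   # ~6 min median task duration
concurrent = 32          # MilkyWay tasks running at once
request_gap_s = 96       # 91 s server backoff + a few seconds of client overhead

# mean time between task completions in steady state
completion_gap_s = median_task_s / concurrent
print(completion_gap_s)  # 11.25 s, the "maybe 11 seconds" above

# results that pile up between two consecutive scheduler requests
print(math.ceil(request_gap_s / completion_gap_s))  # 9, i.e. "8 or more"
```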
Keith Myers
Joined: 24 Jan 11
Posts: 708
Credit: 544,737,330
RAC: 91,532
Message 77000 - Posted: 2 Apr 2024, 21:06:35 UTC
Last modified: 2 Apr 2024, 21:09:08 UTC

This is an OLD problem that has existed at MW since its inception. It is impossible to report a task and receive a replacement task in the same scheduler connection; it can't be done without employing custom clients and workarounds.

It takes two scheduler connections, first to report a completed task and then to receive the replacement task(s) on the next scheduler connection.

And when you have depleted your cache and reported the last task, MW forces the client into a mandatory 10 minute backoff before allowing a new scheduler connection to replenish your cache.

This was always the main complaint of high production clients when the Separation work was available.

Nothing has changed in the scheduler code but the longer running N-body tasks mostly eliminate the problem because no tasks finish faster than the default 91 second scheduler connection interval.

But your high production host has duplicated exactly the same issue that the Separation tasks caused.
xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 77001 - Posted: 3 Apr 2024, 6:24:52 UTC
Last modified: 3 Apr 2024, 6:30:42 UTC

My trouble was scheduler transaction failures due to timeouts, though, not regular responses which did not grant work and mandated a large backoff period. BTW, there is a non-negligible possibility that these timeouts were caused by my lowlife internet service provider, not by MilkyWay's project server.

--------

Now this morning, as my stash of results ready to report was almost depleted, I switched the client back to <max_tasks_reported>40</max_tasks_reported>. It issued two final requests of this size, and both of them succeeded, i.e. did not time out this time.

I then let it request and receive new work and will leave the client to do its thing for now. I did change <report_results_immediately> from 0 to 1, though; maybe this will help reduce the buildup of large quantities of results ready to report. Alas, my internet link is not very reliable (which is why I prefer to run BOINC with a work buffer deeper than half a day, covering the times when I am not at home to fix things), so it could happen again at any time that the client needs to report several hours' worth of work.

During the hour before I have to leave home, the client has been reporting >10 results per request on average (peaking at 31 results), and all of these scheduler transactions succeeded. Fingers crossed that it continues to work while I'm away.

None of this morning's successful requests combined the reporting of results with the requesting of new work, though. The first such combined request of the day is going to happen in my absence...
xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 77002 - Posted: 3 Apr 2024, 18:59:46 UTC

Everything has been working nicely today.

According to the logs, the client has been reporting results at the maximum request rate the server admits. As expected, there were 3–11 results in each scheduler request. About once an hour, the buffer of runnable tasks dipped below 1,000; the client requested more work and got it. No scheduler request timeouts today, as far as I can see.

I will keep <report_results_immediately>1</report_results_immediately> on this host for as long as I have it active at MilkyWay and the tasks are as small as they currently are.
xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 77005 - Posted: 4 Apr 2024, 6:33:34 UTC
Last modified: 4 Apr 2024, 6:44:20 UTC

The problem reappeared.

When I woke up today, my first look was at the host's power meter, and I could tell right away that it was running NFS@home instead of MilkyWay. (I have NFS configured as a backup project with 0% resource share.) It turned out that the client had stopped reporting results at some point between 4 and 5 AM, had by now accumulated >600 good results and >500 error results ("process exited with code 13 (0xd, -243)", see thread 5097), and had chosen to back off from further scheduler requests for >1 day.

Obviously, the computation errors made the client stop reporting the results. But that's not the issue here. Rather:

When I started clearing this reporting backlog with manually forced project updates, scheduler requests with 40 results to report plus requests for new work succeeded at first. But after the host reached a certain number of tasks in progress (that is, the sum of downloading/downloaded/uploading/uploaded/error tasks) of ~1,600 or so, scheduler requests with 40 as well as with 30 results to report at once failed with timeout. This failure happened regardless of whether or not new work was requested in the same request.

Reporting 20 results at once still worked. So I forced such requests every 10 seconds until all results were reported.

I will now leave the client alone again, with <max_tasks_reported>20</max_tasks_reported> and <report_results_immediately>1</report_results_immediately>, plus I am running a monitoring script which checks every 91+5 seconds whether the number of results ready to report exceeds 20 and, if so, forces a project update.
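The monitoring script itself is nothing special. A minimal sketch of the idea in Python, assuming the stock boinccmd tool and its usual --get_tasks output format with its "ready to report: yes" lines (the threshold, interval, and project URL are my values):

```python
import subprocess
import time

PROJECT_URL = "http://milkyway.cs.rpi.edu/milkyway/"
THRESHOLD = 20       # force an update once more than this many results are waiting
INTERVAL_S = 91 + 5  # server-side request backoff plus a little slack

def count_ready(task_dump: str) -> int:
    """Count tasks flagged 'ready to report: yes' in `boinccmd --get_tasks` output."""
    return sum(1 for line in task_dump.splitlines()
               if line.strip() == "ready to report: yes")

def watch() -> None:
    """Poll the client and force a project update when the backlog grows too large."""
    while True:
        dump = subprocess.run(["boinccmd", "--get_tasks"],
                              capture_output=True, text=True).stdout
        if count_ready(dump) > THRESHOLD:
            subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"])
        time.sleep(INTERVAL_S)
```

Call watch() to run it; add host/password arguments to boinccmd if your client requires RPC authentication.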
xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 77083 - Posted: 22 Apr 2024, 5:01:11 UTC
Last modified: 22 Apr 2024, 5:17:46 UTC

I didn't run MilkyWay@Home for a week but restarted it half a day ago.

And the problem reappeared today at 00:00 UTC.
This happened:
– At that point, the host had 1003 tasks in progress, reported 7 results, and requested more work.
– The server assigned 1009 (!) tasks to the host, which then had 2005 tasks in progress.
– From then on, almost all scheduler requests to report results (with max_tasks_reported=20) failed with timeout.
– I discovered the situation at 04:30 UTC, at which time the host still had 1752 tasks in progress, even though a background script had been forcing a scheduler request every 97 seconds whenever more than 20 results ready to report had piled up.

To recover, I switched temporarily to max_tasks_reported=1, set the project to no new work, and engaged a script to report a result every ten seconds. Most of these requests succeed, but occasionally even these single-result reports time out.

I will let the computer run another project while I am away from home today. After that I will perhaps set up multiple client instances on this computer in order to be able to maintain about half a day work buffer depth on it without having so many tasks in progress per client.
mikey
Joined: 8 May 09
Posts: 3323
Credit: 521,026,825
RAC: 40,478
Message 77085 - Posted: 22 Apr 2024, 9:40:27 UTC - in response to Message 77083.  

xii5ku wrote:
I didn't run MilkyWay@Home for a week but restarted it half a day ago.

And the problem reappeared today at 00:00 UTC.
This happened:
– At that point, the host had 1003 tasks in progress, reported 7 results, and requested more work.
– The server assigned 1009 (!) tasks to the host, which then had 2005 tasks in progress.
– From then on, almost all scheduler requests to report results (with max_tasks_reported=20) failed with timeout.
– I discovered the situation at 04:30 UTC, at which time the host still had 1752 tasks in progress, even though a background script had been forcing a scheduler request every 97 seconds whenever more than 20 results ready to report had piled up.

To recover, I switched temporarily to max_tasks_reported=1, set the project to no new work, and engaged a script to report a result every ten seconds. Most of these requests succeed, but occasionally even these single-result reports time out.

I will let the computer run another project while I am away from home today. After that I will perhaps set up multiple client instances on this computer in order to be able to maintain about half a day work buffer depth on it without having so many tasks in progress per client.


Your location is hidden, but do you think it could be the internet itself that's not cooperating between you and the MilkyWay servers in Wisconsin? There are a lot of reports on the various news sites about internet cables being physically down, meaning routing options are limited at times. My thought is: can you do an IP scan of the MilkyWay server when it's not cooperating and see whether it's up and running normally or slow too?
xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 77091 - Posted: 22 Apr 2024, 20:56:21 UTC - in response to Message 77085.  

mikey wrote:
Your location is hidden but do you think it could be the internet itself that's not cooperating between you and the MilkyWay Servers in Wisconsin?
I am based in Germany, and I do have internet service of terrible quality via coaxial cable, which is why I prefer >0.5 days deep work buffers. However:

First: during these situations with scheduler request timeouts between this host and MilkyWay, the MW web site keeps responding very promptly, including dynamic web pages with database access such as the result tables. Second: if I reduce max_tasks_reported as far down as necessary, MW's scheduler responds very quickly to the client again too (to most but not all requests). And third: when I manage to gradually bring the number of tasks in progress down again during recovery, I can gradually increase max_tasks_reported again too.

The correlation between likelihood of scheduler request timeout, max_tasks_reported, and number of tasks in progress is really evident in my observations. (The three occasions reported here were with ~1,800, ~1,600, and ~2,000 tasks in progress when — or a while after — the described condition started.)

Obviously, only large hosts with a large work buffer setting will ever receive so many tasks from the server. I, like almost everyone else, am running a stock client, which does not request more work once there are 1,000 runnable tasks buffered.
mikey
Joined: 8 May 09
Posts: 3323
Credit: 521,026,825
RAC: 40,478
Message 77093 - Posted: 23 Apr 2024, 11:09:59 UTC - in response to Message 77091.  

xii5ku wrote:
mikey wrote:
Your location is hidden but do you think it could be the internet itself that's not cooperating between you and the MilkyWay Servers in Wisconsin?
I am based in Germany, and I do have internet service of terrible quality via coaxial cable, which is why I prefer >0.5 days deep work buffers. However:

First: during these situations with scheduler request timeouts between this host and MilkyWay, the MW web site keeps responding very promptly, including dynamic web pages with database access such as the result tables. Second: if I reduce max_tasks_reported as far down as necessary, MW's scheduler responds very quickly to the client again too (to most but not all requests). And third: when I manage to gradually bring the number of tasks in progress down again during recovery, I can gradually increase max_tasks_reported again too.

The correlation between likelihood of scheduler request timeout, max_tasks_reported, and number of tasks in progress is really evident in my observations. (The three occasions reported here were with ~1,800, ~1,600, and ~2,000 tasks in progress when — or a while after — the described condition started.)

Obviously, only large hosts with a large work buffer setting will ever receive so many tasks from the server. I, like almost everyone else, am running a stock client, which does not request more work once there are 1,000 runnable tasks buffered.


I wonder if the server has a "move on to the next client" setting that kicks in after a connection has lasted more than x seconds? That way most people can get and return their tasks within those x seconds, but you, because of your large cache of tasks both requested and returned, take longer and get disconnected. Whatever the problem, it sounds like it's on the server side, as my 17 desktop PCs can get and return tasks just fine.
xii5ku
Joined: 1 Jan 17
Posts: 36
Credit: 101,863,131
RAC: 209,675
Message 77111 - Posted: 28 Apr 2024, 10:12:01 UTC
Last modified: 28 Apr 2024, 10:57:50 UTC

Each scheduler request to report results and/or to request more work includes a bunch of other client-side information. Among it is a list of all tasks which the client currently has in its work buffer.

Therefore, if a client has a large number of tasks in progress, the data payload of the scheduler request grows correspondingly (by about 190 bytes per task in progress, and more than that per result being reported).

That is, my host, having unusually many tasks in progress, sent scheduler requests of unusually large size. And beyond a certain size, the likelihood of request timeouts jumped up. (I am not sure precisely at which size; it seems to be in the ballpark of ~300 kBytes.)
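Under these assumptions (both the ~190 bytes per task and the ~300 kByte threshold are only my rough observations, not documented figures), the numbers line up suspiciously well with the task counts at which the trouble started:

```python
BYTES_PER_TASK = 190       # rough size of one in-progress-task entry in the request
TIMEOUT_BYTES = 300_000    # ballpark request size at which timeouts seem to set in

# tasks in progress needed to push a request past the threshold
print(TIMEOUT_BYTES // BYTES_PER_TASK)  # 1578, within the ~1600-2000 range seen above
```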

Now what caused these timeouts? The MilkyWay@Home server kept responding very quickly to other requests, as I mentioned. That is, at least server-side database performance does not appear to be an issue.

It is quite possible that the way my Internet Service Provider throttles the speed of the upstream channels of my coaxial cable connection is at fault. It seems as if my ISP changed something for the worse in this regard during the course of 2023. Or it could be related to the transcontinental routing problems you mentioned in #77085. But if so, only large upstream transfers are affected, not small ones, nor downstream transfers of any size. In general, I was able to upload large result files of other projects in early 2023, but have had great difficulties with it since later in 2023, though only to project servers outside of Europe. I can work around this for result file uploads by setting an upload speed limit in the BOINC client, but that evidently does not help with scheduler requests carrying a large payload.

So far, MilkyWay@Home is the only project at which I have noticed this problem with large scheduler requests. This may or may not be specific to MilkyWay@Home, as I don't know whether I ever had this many tasks in progress on a single host in any other project (probably not) since my trouble with upstream transfers to overseas servers began.

Edit:
@Kevin Roux, does the MilkyWay@Home HTTP server have a size limit for HTTP POST request body size configured to something at the order of magnitude of the mentioned ~300 kBytes? (I don't know HTTP and its implementations well enough to say whether or not this can be a problem. I would think that the server responds with an error status if the client issues a too large POST request, e.g. 413 Payload Too Large, rather than not responding at all. But I may be wrong.)
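For illustration only: if such a cap existed in Apache httpd, it would typically come from the LimitRequestBody directive (0, the historical default, means unlimited). This is not a claim about the actual MilkyWay@home configuration, just a hypothetical example of what a ~300 kByte cap would look like:

```apache
# hypothetical httpd.conf fragment -- NOT the actual MilkyWay@home setup
<Directory "/var/www/boinc">
    # reject request bodies larger than 300 KiB with "413 Request Entity Too Large"
    LimitRequestBody 307200
</Directory>
```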

Note to self:
Reproduce the problem with <http_debug> enabled in the client.
Kevin Roux
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Aug 22
Posts: 74
Credit: 1,299,765
RAC: 8,488
Message 77118 - Posted: 30 Apr 2024, 13:43:57 UTC - in response to Message 77111.  


xii5ku wrote:
@Kevin Roux, does the MilkyWay@Home HTTP server have a size limit for HTTP POST request body size configured to something at the order of magnitude of the mentioned ~300 kBytes? (I don't know HTTP and its implementations well enough to say whether or not this can be a problem. I would think that the server responds with an error status if the client issues a too large POST request, e.g. 413 Payload Too Large, rather than not responding at all. But I may be wrong.)


I will ask the IT team that set up the HTTP settings for the server and get back to you when I have more information.
Kevin Roux
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Aug 22
Posts: 74
Credit: 1,299,765
RAC: 8,488
Message 77122 - Posted: 3 May 2024, 19:25:47 UTC - in response to Message 77118.  


xii5ku wrote:
@Kevin Roux, does the MilkyWay@Home HTTP server have a size limit for HTTP POST request body size configured to something at the order of magnitude of the mentioned ~300 kBytes? (I don't know HTTP and its implementations well enough to say whether or not this can be a problem. I would think that the server responds with an error status if the client issues a too large POST request, e.g. 413 Payload Too Large, rather than not responding at all. But I may be wrong.)


I will ask the IT team that set up the HTTP settings for the server and get back to you when I have more information.


The configuration for this has been left at its defaults. There is no limit set in the httpd.conf for MilkyWay@home, and PHP's post_max_size is at its default of 8 MB, so nothing on the order of ~300 kBytes as far as I can tell.
I am not exactly sure what the problem is. Is this something that is still happening?


©2024 Astroinformatics Group