1 of my 6 nodes keeps timing out

Author	Message
Peter Dragon Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0	Message 73076 - Posted: 22 Apr 2022, 20:41:38 UTC

Hey guys,

A greenhorn here, and my first post. I run a 6-node Docker Swarm dedicated to just Milkyway@home. All 6 nodes used to run tasks without issue until sometime last month. I have seen some of the challenges described in other threads, so I'm uncertain whether my issue is related. But around that time, 1 of the 6 nodes stopped completing tasks and started throwing timeout errors like those on the results page below.

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=924870

If I destroy the Docker service and rebuild it, once tasks are assigned I get the same issue, only on a different node each time: 1 of the 6 throws the timeout errors. As far as I can tell, it doesn't "appear" to be an issue on my side, but I wanted to get some feedback and thoughts from the community. This is not a pressing matter; I'm just looking to better understand the behavior.

CPU info per node: Intel Xeon E312xx (Sandy Bridge, IBRS update) [Family 6 Model 42 Stepping 1] (2 processors)

Thanks again!
-PD

HRFMguy Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0	Message 73077 - Posted: 23 Apr 2022, 0:33:13 UTC - in response to Message 73076.

I don't think it's an issue worth worrying about. I have quite a few of those myself, but since no CPU time was expended, it's no harm, no foul at your end. I think you are fine. (From one noob to another.)

.clair. Joined: 3 Mar 13 Posts: 84 Credit: 779,527,712 RAC: 0	Message 73078 - Posted: 23 Apr 2022, 10:04:44 UTC

Looking at the time the server gave you to complete the tasks, you do return them quickly. The difference between `sent` and `deadline` was only 90 minutes, a crazy short deadline with no chance to crunch them. Most tasks have a deadline of several days.

S984s5KN6muKjYePgfqf7F37RiXw5f... Joined: 8 May 09 Posts: 3339 Credit: 524,356,366 RAC: 16,807	Message 73079 - Posted: 23 Apr 2022, 12:03:37 UTC - in response to Message 73078.

> Looking at the time the server gave you to complete the tasks, you do return them quickly. The difference between `sent` and `deadline` was only 90 minutes, a crazy short deadline with no chance to crunch them. Most tasks have a deadline of several days.

I think you are looking at the wrong lines:

Sent 10 Apr 2022, 2:51:56 UTC
Report deadline 22 Apr 2022, 2:51:56 UTC

He has 12 days to return them BUT returned them:

Received 10 Apr 2022, 6:43:39 UTC

That just means he has a very small cache size and can zoom through the tasks very quickly and then get more tasks.
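The three timestamps in the reply above can be checked with a little date arithmetic. This is only a minimal sketch: the timestamps come from the post, while the variable names and the format string are illustrative assumptions, not anything the project itself provides.

```python
from datetime import datetime

# Assumed format of the timestamps as quoted in the post (the "UTC" suffix is dropped).
FMT = "%d %b %Y, %H:%M:%S"

sent = datetime.strptime("10 Apr 2022, 2:51:56", FMT)
deadline = datetime.strptime("22 Apr 2022, 2:51:56", FMT)
received = datetime.strptime("10 Apr 2022, 6:43:39", FMT)

allowed = deadline - sent      # how long the server allowed for the task
turnaround = received - sent   # how long until the result was reported

print(allowed)     # 12 days, 0:00:00
print(turnaround)  # 3:51:43
```

So the deadline was 12 days out, and the result came back a little under four hours after the task was sent.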

S984s5KN6muKjYePgfqf7F37RiXw5f... Joined: 8 May 09 Posts: 3339 Credit: 524,356,366 RAC: 16,807	Message 73080 - Posted: 23 Apr 2022, 12:15:57 UTC - in response to Message 73076.

> Hey guys,
>
> A greenhorn here, and my first post. I run a 6-node Docker Swarm dedicated to just Milkyway@home. All 6 nodes used to run tasks without issue until sometime last month. I have seen some of the challenges described in other threads, so I'm uncertain whether my issue is related. But around that time, 1 of the 6 nodes stopped completing tasks and started throwing timeout errors like those on the results page below.
>
> https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=924870
>
> If I destroy the Docker service and rebuild it, once tasks are assigned I get the same issue, only on a different node each time: 1 of the 6 throws the timeout errors. As far as I can tell, it doesn't "appear" to be an issue on my side, but I wanted to get some feedback and thoughts from the community. This is not a pressing matter; I'm just looking to better understand the behavior.
>
> CPU info per node: Intel Xeon E312xx (Sandy Bridge, IBRS update) [Family 6 Model 42 Stepping 1] (2 processors)
>
> Thanks again!
> -PD

When I look at the task you linked to, I see this:

Validate state Checked, but no consensus yet

That just means you are waiting for a wingman to get and finish the same task so the project can see whether your machine crunched the task accurately or not. MilkyWay uses the term 'Validation Inconclusive' to indicate that a wingman is needed to validate your task. Everyone has them, and some people have THOUSANDS of them due to the server problems of a few weeks ago that are continuing even today. The problem for you is that the wingman's task gets added to the end of the current queue, and the project makes over 10k tasks per day.

If you look at the end of the task's name you will see a _0, which means it's the initial task. A _1 means it's the first wingman task, while successive numbers mean even more wingmen are needed to figure out any discrepancies in the results. For example, your machine could be just fine while the _1 person's machine is flaky, so the task could be sent to a 3rd or 4th or more people to try and see whether the task is the problem or the computers are. Some people overclock their PCs to wild extremes thinking faster is better, and it can be for gaming, but for crunching it can throw in errors.
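The _0 / _1 naming described above can be read off a task name mechanically. A minimal sketch, assuming only that the copy number is whatever follows the last underscore; the function names and the example task names are hypothetical and not taken from the results page.

```python
def replica_index(task_name: str) -> int:
    """Return the number after the last underscore of a task name."""
    _, _, suffix = task_name.rpartition("_")
    return int(suffix)


def describe(task_name: str) -> str:
    """Label a task as the initial copy or as the n-th wingman copy."""
    n = replica_index(task_name)
    return "initial task" if n == 0 else f"wingman task #{n}"


# Hypothetical task names, for illustration only.
print(describe("example_workunit_0"))  # initial task
print(describe("example_workunit_1"))  # wingman task #1
print(describe("example_workunit_3"))  # wingman task #3
```

A suffix higher than _1 would then suggest that earlier copies disagreed, errored out, or were never returned.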

Peter Dragon Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0	Message 73097 - Posted: 24 Apr 2022, 20:05:20 UTC - in response to Message 73080.

Thanks for the responses, and the explanations!

S984s5KN6muKjYePgfqf7F37RiXw5f... Joined: 8 May 09 Posts: 3339 Credit: 524,356,366 RAC: 16,807	Message 73102 - Posted: 25 Apr 2022, 13:22:05 UTC - in response to Message 73097.

> Thanks for the responses, and the explanations!

+1

Peter Dragon Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0	Message 73103 - Posted: 25 Apr 2022, 14:06:21 UTC - in response to Message 73102. Last modified: 25 Apr 2022, 14:08:23 UTC

So a quick update, as I think I found the cause. I run these Docker images pretty tight, as they are running in a virtual environment on vIRT/KVM. I noticed that the one node in question only had 989 MB free, as opposed to the 1.2 GB it normally has. So this past weekend I did a quick cleanup of its installer cache (apt clean), and it now looks like it's functioning as expected. In the past I would see no CPU time spent, but now after the cleanup I see the CPU churning away. So it hit me: I've been getting hit by the compute preference for "disk" that reads "Leave at least 1 GB free". So lesson learned, that's on me lol!
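The comparison behind that preference can be sketched in a few lines. This is not the BOINC client's own code, just an illustration of the check under stated assumptions: the function names are hypothetical, and a gigabyte is taken as 10^9 bytes (989 MB falls below the limit and 1.2 GB above it whether 1 GB means 10^9 or 2^30 bytes).

```python
import shutil

GB = 10**9  # assumption: decimal gigabytes; the client may count this differently


def below_free_limit(free_bytes: int, min_free_gb: float = 1.0) -> bool:
    """True if free space is under a 'leave at least N GB free' limit."""
    return free_bytes < min_free_gb * GB


def path_below_free_limit(path: str = "/", min_free_gb: float = 1.0) -> bool:
    """Same check against the real free space of the filesystem holding path."""
    return below_free_limit(shutil.disk_usage(path).free, min_free_gb)


# The two figures from the post: 989 MB on the failing node, 1.2 GB normally.
print(below_free_limit(989 * 10**6))   # True
print(below_free_limit(1200 * 10**6))  # False
```

Running `path_below_free_limit()` inside each container would be one way to spot a node that is about to cross the limit before tasks start timing out.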

S984s5KN6muKjYePgfqf7F37RiXw5f... Joined: 8 May 09 Posts: 3339 Credit: 524,356,366 RAC: 16,807	Message 73127 - Posted: 26 Apr 2022, 11:10:10 UTC - in response to Message 73103.

> So a quick update, as I think I found the cause. I run these Docker images pretty tight, as they are running in a virtual environment on vIRT/KVM. I noticed that the one node in question only had 989 MB free, as opposed to the 1.2 GB it normally has. So this past weekend I did a quick cleanup of its installer cache (apt clean), and it now looks like it's functioning as expected. In the past I would see no CPU time spent, but now after the cleanup I see the CPU churning away. So it hit me: I've been getting hit by the compute preference for "disk" that reads "Leave at least 1 GB free". So lesson learned, that's on me lol!

That's pretty cool that you found and were able to fix that, congratulations! Not a lot of people would have come to that as quickly, and certainly none of us thought of it.