1 of my 6 nodes, keeps timing out.
Author | Message |
---|---|
Send message Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0 |
Hey guys, a greenhorn here, and my first post. I run a 6-node Docker Swarm dedicated to just Milkyway@home. All 6 nodes used to run tasks without issue until sometime last month. I have seen some of the challenges described in other threads, so I'm uncertain whether my issue is related. But around that time, 1 out of the 6 nodes stopped completing tasks and started throwing timeout errors like those seen in the results page below. https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=924870 If I destroy the Docker service and rebuild, once tasks are assigned I appear to get the same issue, only on a different node each time: 1 out of the 6 throws the timeout errors. As far as I can tell, it doesn't "appear" to be an issue on my side. But I wanted to get some feedback from the community and your thoughts. This is not a pressing matter, just looking to better understand the behavior. CPU Info Per Node: Intel Xeon E312xx (Sandy Bridge, IBRS update) [Family 6 Model 42 Stepping 1] (2 processors) Thanks again! -PD |
Send message Joined: 12 Nov 21 Posts: 236 Credit: 575,037,238 RAC: 35,415 |
Myself, I don't think it's an issue worth worrying about. I have quite a few of those myself, but since no CPU time was expended, no harm no foul at your end. I think you are fine. (from one noob to another) |
Send message Joined: 3 Mar 13 Posts: 84 Credit: 779,527,603 RAC: 22,637 |
Looking at the time the server gave you to complete the tasks, you do return them quickly. The difference between `sent` and `deadline` was only 90 minutes, a crazy short deadline with no chance to crunch them; most tasks have a deadline of several days. |
Send message Joined: 8 May 09 Posts: 3315 Credit: 519,951,988 RAC: 21,328 |
"Looking at the time the server gave you to complete the tasks, you do return them quickly," I think you are looking at the wrong lines: Sent 10 Apr 2022, 2:51:56 UTC; Report deadline 22 Apr 2022, 2:51:56 UTC. He has 12 days to return them BUT returned them: Received 10 Apr 2022, 6:43:39 UTC. That just means he has a very small cache size and can zoom through the tasks very quickly and then get more tasks. |
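To make the arithmetic in the post above concrete, here is a minimal Python sketch that works out the allowed time and the actual turnaround from the three timestamps quoted there. The helper name `parse_utc` and the format string are assumptions made for illustration; they are not part of BOINC or the MilkyWay@home site.

```python
from datetime import datetime, timezone

# Assumed format of the timestamps as they appear on the results page.
FMT = "%d %b %Y, %H:%M:%S"

def parse_utc(stamp: str) -> datetime:
    """Parse a results-page timestamp such as '10 Apr 2022, 2:51:56 UTC'."""
    naive = datetime.strptime(stamp.replace(" UTC", ""), FMT)
    return naive.replace(tzinfo=timezone.utc)

sent = parse_utc("10 Apr 2022, 2:51:56 UTC")
deadline = parse_utc("22 Apr 2022, 2:51:56 UTC")
received = parse_utc("10 Apr 2022, 6:43:39 UTC")

allowed = deadline - sent      # time the server allows for the task
turnaround = received - sent   # time the host actually took to report

print(f"Allowed:    {allowed}")
print(f"Turnaround: {turnaround}")
```

The deadline is 12 days after the send time, while the task was reported a little under four hours after it was sent, so the deadline itself was not the problem here.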
Send message Joined: 8 May 09 Posts: 3315 Credit: 519,951,988 RAC: 21,328 |
"Hey guys," When I look at the task you linked to I see this: Validate state: Checked, but no consensus yet. That just means you are waiting for a wingman to get and finish the same task so they can see if your machine crunched the task accurately or not. MilkyWay uses the term 'Validation Inconclusive' to indicate a wingman is needed to validate your task. Everyone has them, and some people have THOUSANDS of them due to the server problems of a few weeks ago that are continuing even today. The problem for you is that the wingman's task gets added to the end of the current queue, and the project makes over 10k tasks per day. If you look at the end of the task's name you will see a _0, which means it's the initial task; a _1 means it's the first wingman task, while successive numbers mean even more wingmen are needed to figure out any discrepancies in the results. I.e., your machine could be just fine while the _1 person's machine is flaky, so they could send it to a 3rd or 4th or more people to try and see if the task is the problem or the computers. Some people overclock their PCs to wild extremes thinking faster is better, and it can be for gaming, but for crunching it can throw in errors. |
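The naming convention described in the post above can be sketched in a few lines of Python. This only follows the post's description (a trailing `_0` for the initial task, `_1` for the first wingman, higher numbers for further wingmen); the function names and the example task name are hypothetical, not real MilkyWay@home identifiers.

```python
import re

def replica_index(task_name: str) -> int:
    """Return the trailing _N number from a task name."""
    match = re.search(r"_(\d+)$", task_name)
    if match is None:
        raise ValueError(f"no trailing _N suffix in {task_name!r}")
    return int(match.group(1))

def describe(task_name: str) -> str:
    """Label a task the way the post above describes it."""
    index = replica_index(task_name)
    if index == 0:
        return "initial task"
    if index == 1:
        return "first wingman task"
    return f"additional wingman task (copy {index})"

# Hypothetical task name, used only to show the suffix handling.
print(describe("example_workunit_12345_0"))
print(describe("example_workunit_12345_2"))
```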
Send message Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0 |
Thanks for the responses, and the explanations! |
Send message Joined: 8 May 09 Posts: 3315 Credit: 519,951,988 RAC: 21,328 |
Thanks for the responses, and the explanations! +1 |
Send message Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0 |
So a quick update, as I think I found the cause. These Docker images I run pretty tight, as they are running in a virtual environment on vIRT/KVM. And the one node in question, I noticed, only had 989 MB free, as opposed to the 1.2 GB it normally has. So this past weekend I did a quick cleanup of its installer cache (apt clean) and it now looks like it's functioning as expected. In the past I would see no CPU time spent as the behavior, but now after the cleanup I see the CPU churning away. So it hit me: I've been getting hit by the compute preference for "disk" that reads "Leave at least 1 GB free". So lesson learned, that's on me lol! |
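For anyone wanting to catch the same condition before tasks start failing, here is a rough Python sketch that compares free disk space against the "Leave at least 1 GB free" preference quoted above. It assumes the limit is 1 decimal GB and that the BOINC data lives on the filesystem mounted at `/`; both are assumptions, and the names `MIN_FREE_GB`, `free_gb`, and `below_limit` are made up for this example.

```python
import shutil

# Assumed to mirror the "Leave at least 1 GB free" disk preference.
MIN_FREE_GB = 1.0

def free_gb(path: str = "/") -> float:
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9

def below_limit(free: float, min_free: float = MIN_FREE_GB) -> bool:
    """True when free space is under the configured minimum."""
    return free < min_free

if __name__ == "__main__":
    free = free_gb("/")  # point this at the BOINC data volume if it differs
    status = "BELOW" if below_limit(free) else "above"
    print(f"{free:.2f} GB free, {status} the {MIN_FREE_GB} GB limit")
```

With the numbers from the post, 0.989 GB free falls under the limit while the usual 1.2 GB does not, which matches the behavior described.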
Send message Joined: 8 May 09 Posts: 3315 Credit: 519,951,988 RAC: 21,328 |
"So a quick update, as I think I found the cause. These Docker images I run pretty tight, as they are running in a virtual environment on vIRT/KVM. And the one node in question, I noticed, only had 989 MB free, as opposed to the 1.2 GB it normally has. So this past weekend I did a quick cleanup of its installer cache (apt clean) and it now looks like it's functioning as expected." That's pretty cool that you found and were able to fix that, congratulations! Not a lot of people would have come to that as quickly, and certainly none of us thought of it. |