Welcome to MilkyWay@home

1 of my 6 nodes, keeps timing out.

Message boards : Number crunching : 1 of my 6 nodes, keeps timing out.

Peter Dragon
Joined: 27 Feb 22
Posts: 18
Credit: 2,967,695
RAC: 0
Message 73076 - Posted: 22 Apr 2022, 20:41:38 UTC

Hey guys,

A greenhorn here, and my first post. I run a 6-node Docker Swarm dedicated to just MilkyWay@home. All 6 nodes used to run tasks without issue, until sometime last month. I have seen some of the challenges described in other threads, so I'm uncertain whether my issue is related.

But around that time, 1 of the 6 nodes stopped completing tasks and started throwing timeout errors like those on the results page below.

https://milkyway.cs.rpi.edu/milkyway/results.php?hostid=924870

If I destroy the Docker service and rebuild, once tasks are assigned I appear to get the same issue, only on a different node each time: 1 of the 6 throws the timeout errors. As far as I can tell, it doesn't "appear" to be an issue on my side. But I wanted to get some feedback from the community and your thoughts.

This is not a pressing matter, just looking to better understand the behavior.

CPU Info Per Node:
Intel Xeon E312xx (Sandy Bridge, IBRS update) [Family 6 Model 42 Stepping 1]
(2 processors)

Thanks again!
-PD
ID: 73076 · Rating: 0
HRFMguy
Joined: 12 Nov 21
Posts: 236
Credit: 575,037,238
RAC: 35,415
Message 73077 - Posted: 23 Apr 2022, 0:33:13 UTC - in response to Message 73076.  

Myself, I don't think it's an issue worth worrying about. I have quite a few of those myself, but since no CPU time was expended, no harm, no foul at your end. I think you are fine. (From one noob to another.)
ID: 73077 · Rating: 0
.clair.
Joined: 3 Mar 13
Posts: 84
Credit: 779,527,603
RAC: 22,637
Message 73078 - Posted: 23 Apr 2022, 10:04:44 UTC

Looking at the time the server gave you to complete the tasks, you do return them quickly.
The difference between `sent` and `deadline` was only 90 minutes, a crazy short deadline with no chance to crunch them.
Most tasks have a deadline of several days.
ID: 73078 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,951,988
RAC: 21,328
Message 73079 - Posted: 23 Apr 2022, 12:03:37 UTC - in response to Message 73078.  

I think you are looking at the wrong lines:
Sent 10 Apr 2022, 2:51:56 UTC
Report deadline 22 Apr 2022, 2:51:56 UTC

He has 12 days to return them BUT returned them:
Received 10 Apr 2022, 6:43:39 UTC

That just means he has a very small cache size, so he can zoom through the tasks very quickly and then get more tasks.
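
For anyone who wants to double-check those numbers, here is a minimal Python sketch of the arithmetic. The three timestamps are the ones quoted above; the names `parse_stamp`, `allowed`, and `turnaround` are only illustrative, and the format string assumes English month abbreviations as printed on the task page.

```python
from datetime import datetime

# Timestamp layout as printed on the task page, e.g. "10 Apr 2022, 2:51:56 UTC".
# Assumes English month abbreviations (the default C locale).
STAMP_FORMAT = "%d %b %Y, %H:%M:%S"


def parse_stamp(stamp):
    """Parse a task-page timestamp after dropping the trailing ' UTC' label."""
    return datetime.strptime(stamp.replace(" UTC", ""), STAMP_FORMAT)


sent = parse_stamp("10 Apr 2022, 2:51:56 UTC")
deadline = parse_stamp("22 Apr 2022, 2:51:56 UTC")
received = parse_stamp("10 Apr 2022, 6:43:39 UTC")

allowed = deadline - sent      # how long the server allowed
turnaround = received - sent   # how long the host actually took

print("allowed:   ", allowed)
print("turnaround:", turnaround)
```

That works out to a 12-day window and a turnaround of 3 hours, 51 minutes, 43 seconds, so the deadline itself was nowhere near 90 minutes.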
ID: 73079 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,951,988
RAC: 21,328
Message 73080 - Posted: 23 Apr 2022, 12:15:57 UTC - in response to Message 73076.  

When I look at the task you linked to I see this:
Validate state Checked, but no consensus yet

That just means you are waiting for a wingman to get and finish the same task so they can see if your machine crunched the task accurately or not.

MilkyWay uses the term 'Validation Inconclusive' to indicate that a wingman is needed to validate your task. Everyone has them, and some people have THOUSANDS of them due to the server problems of a few weeks ago that are continuing even today. The problem for you is that the wingman's task gets added to the end of the current queue, and the project makes over 10k tasks per day. If you look at the end of the task's name you will see a _0, which means it's the initial task; a _1 means it's the first wingman task, while successive numbers mean even more wingmen are needed to figure out any discrepancies in the results. I.e., your machine could be just fine while the _1 person's machine is flaky, so they could send it to a 3rd or 4th or more people to try to see whether the task is the problem or the computers are. Some people overclock their PCs to wild extremes thinking faster is better, and it can be for gaming, but for crunching it can introduce errors.
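
To make the _0 / _1 naming concrete, here is a small Python sketch that splits a task name into its base name and the trailing copy number. The function name `split_task_name` and the sample names are made up for illustration; real task names are longer, and the only assumption is the trailing `_N` described above.

```python
def split_task_name(name):
    """Split a task name into (base name, copy number) using the trailing _N."""
    base, sep, suffix = name.rpartition("_")
    if not sep or not suffix.isdigit():
        raise ValueError(f"no trailing _N suffix in {name!r}")
    return base, int(suffix)


def describe_copy(copy_number):
    """Describe the copy number in the terms used in this thread."""
    if copy_number == 0:
        return "initial task"
    if copy_number == 1:
        return "first wingman task"
    return "extra wingman task, sent to settle a discrepancy"


# Hypothetical names, for illustration only.
for sample in ("example_task_1234_0", "example_task_1234_1", "example_task_1234_3"):
    base, copy_number = split_task_name(sample)
    print(base, copy_number, describe_copy(copy_number))
```

All copies that share the same base name are the same piece of work, so comparing base names is one way to spot which of your pending tasks are still waiting on a wingman.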
ID: 73080 · Rating: 0
Peter Dragon
Joined: 27 Feb 22
Posts: 18
Credit: 2,967,695
RAC: 0
Message 73097 - Posted: 24 Apr 2022, 20:05:20 UTC - in response to Message 73080.  

Thanks for the responses, and the explanations!
ID: 73097 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,951,988
RAC: 21,328
Message 73102 - Posted: 25 Apr 2022, 13:22:05 UTC - in response to Message 73097.  

+1
ID: 73102 · Rating: 0
Peter Dragon
Joined: 27 Feb 22
Posts: 18
Credit: 2,967,695
RAC: 0
Message 73103 - Posted: 25 Apr 2022, 14:06:21 UTC - in response to Message 73102.  
Last modified: 25 Apr 2022, 14:08:23 UTC

So a quick update, as I think I found the cause. I run these Docker images pretty tight, as they are running in a virtual environment on vIRT/KVM. I noticed that the one node in question only had 989 MB free, as opposed to the 1.2 GB it normally has. So this past weekend I did a quick cleanup of its installer cache (apt clean), and it now looks like it's functioning as expected.

In the past the behavior was that I would see no CPU time spent, but now after the cleanup I see the CPU churning away. So it hit me: I've been getting hit by the compute preference for "disk" that reads "Leave at least 1 GB free".

So lesson learned, that's on me lol!
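
In case it helps anyone else running tight images, here is a rough Python sketch of the check I could have run on each node. It is not how BOINC itself does it; the names `below_disk_floor` and `free_bytes` are mine, the path "/" is an assumption about where the data lives, and I am assuming the "1 GB" in the preference is counted in binary units (1024^3 bytes).

```python
import shutil

MIB = 1024 ** 2
GIB = 1024 ** 3  # assumption: the preference's "GB" is binary


def below_disk_floor(free, min_free_gb=1.0):
    """Return True if `free` bytes is under the 'leave at least N GB free' floor."""
    return free < min_free_gb * GIB


def free_bytes(path="/"):
    """Free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


# The two figures from this post: 989 MB on the bad node, 1.2 GB normally.
print("989 MB node blocked:", below_disk_floor(989 * MIB))
print("1.2 GB node blocked:", below_disk_floor(int(1.2 * GIB)))

# Check the machine this script runs on.
print("this host blocked:  ", below_disk_floor(free_bytes("/")))
```

With those assumptions the 989 MB node falls under the floor and the 1.2 GB nodes do not, which matches what I saw.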
ID: 73103 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,951,988
RAC: 21,328
Message 73127 - Posted: 26 Apr 2022, 11:10:10 UTC - in response to Message 73103.  

That's pretty cool that you found and were able to fix that. Congratulations, not a lot of people would have come to that as quickly, and certainly none of us thought of it.
ID: 73127 · Rating: 0

©2024 Astroinformatics Group