Validation inconclusive
Author | Message |
---|---|
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I'm looking into things. Not sure why so much is going to validation pending. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
There were a couple server processes that had been querying the DB for a couple hundred hours. I've killed those and hopefully things will start moving again. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Thanks for looking… |
Joined: 24 Jan 11 Posts: 715 Credit: 556,513,213 RAC: 50,935 |
Most of the server processes are not running. None of the work is processing on any of my hosts. All hosts in multi-hour backoffs. |
Joined: 16 Mar 10 Posts: 213 Credit: 108,654,404 RAC: 20,578 |
Quote: "There were a couple server processes that had been querying the DB for a couple hundred hours. I've killed those and hopefully things will start moving again." Tom, attempts to access the server to report work are being met by "Feeder not running" (and the Server Status page is still stuck showing most server processes not running and other statistics from around 13:40 UTC)... [Edit: I see Keith Myers has flagged this up whilst I was researching this message - hey, ho...] Cheers - Al. P.S. Do we know what the offending processes were? |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Are you still getting the "feeder not running" issue? |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I'm turning things off and back on. In the past when people had the feeder error, this has fixed it. |
Joined: 16 Mar 10 Posts: 213 Credit: 108,654,404 RAC: 20,578 |
Quote: "Are you still getting the 'feeder not running' issue?" Nope -- and some folks had actually acknowledged yesterday's "fix" in the "Feeder Not Running" thread. Sorry; one of us (which probably means me) should've also posted an acknowledgment here... There seems to be a different problem now, doesn't there? It'll be interesting to see if a full shut-down/restart will have helped solve the delays in validator activity (even N-Body tasks have been sat in "Pending validation" for up to and occasionally beyond an hour...). Also, not necessarily connected to anything in particular, I noticed that the task data part of the status page had only updated once per 12 hours since 13:37 UTC on the 17th -- once at 01:24 and again at 12:57 on the 18th. Perhaps a restart will change whatever is stopping it updating the task data cache... And, if I might, an enquiry: is N-Body supposed to be using adaptive replication? If so, it doesn't seem to be working :-) Cheers - Al. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Yeah, hopefully restarting the project processes will do something for that problem. I can always try a full system restart too. I have no idea where to even start to look if it's a validator problem, because the validator is running fine and the logs don't show any issues. Quote: "is N-Body supposed to be using adaptive replication? If so, it doesn't seem to be working :-)" I think everyone's adaptive replication scores got ruined when I cancelled millions of jobs after the disk failure, and it will take a while for the system to recognize that you are returning good WUs again. Until that happens, I think it will just always check every WU twice, which sucks. I suppose I could always tinker with this in the DB, but there's no way of knowing how that will impact the algorithm going forwards. |
Joined: 16 Mar 10 Posts: 213 Credit: 108,654,404 RAC: 20,578 |
Quote: "Yeah, hopefully restarting the project processes will do something for that problem. I can always try a full system restart too. I have no idea where to even start to look if it's a validator problem, because the validator is running fine and the logs don't show any issues." Tom - thanks for the reply. In theory, the count of good WUs should be maintained as tasks get validated, and not be influenced by prior activity; unless there's something strange going on, there's no real reason the count should be depleted! However, there are several possibilities for "strange" that could result from something having broken in the application configuration (perhaps when the WU generator ran wild?). I did a [partial] code dive a while back to see how adaptive replication was implemented, and if I recall correctly there were at least three reasons it might not actually invoke adaptive replication, but I didn't follow up on that at the time. If I get the chance, I might have another go :-) Again, thanks for the reply. Cheers - Al. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Is anything actually getting validated? There now seems to be a delay in Separation tasks getting validated. I have given up on my Simulation backlog, which is well over a month old; nothing has changed there at all. |
Joined: 14 May 22 Posts: 7 Credit: 8,077,321 RAC: 0 |
I do not know if anything is getting validated. This is my oldest. There are 184 others. Workunit 458806994; name: de_modfit_86_bundle5_3s_south_pt2_2_1652888500_3384632; application: Milkyway@home Separation; created: 21 May 2022, 6:48:25 UTC; minimum quorum: 1; initial replication: 1; max # of error/total/success tasks: 2, 9, 6; validation: Pending. Task 279025918; computer: 928280; sent: 22 May 2022, 8:24:29 UTC; time reported: 22 May 2022, 23:13:27 UTC; status: Completed, waiting for validation; run time: 2,791.94; CPU time: 2,777.26; credit: pending; application: Milkyway@home Separation v1.46 x86_64-pc-linux-gnu. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Quote: "I do not know if anything is getting validated. This is my oldest. There are 184 others." My oldest is from the 27th of April and there are 853 others just like it. |
Joined: 18 Feb 10 Posts: 57 Credit: 222,589,579 RAC: 4,769 |
It's almost 2 days since I had any task validated. There are over 1.5m workunits waiting for validation; I don't think it's going well. |
Joined: 27 Feb 22 Posts: 18 Credit: 2,967,695 RAC: 0 |
Quote: "It's almost 2 days since I've got any task validated. There's over 1.5m workunits waiting for validation, don't think it's going good." My task validations have been completing as of 24 hours ago. Example: sent 23 May 2022, 4:07:46 UTC; reported 23 May 2022, 15:43:21 UTC; status: Completed and validated. |
Joined: 18 Feb 10 Posts: 57 Credit: 222,589,579 RAC: 4,769 |
Yes, I also have some validated tasks from May 23 now; they weren't there a few hours ago. Btw, the number of workunits waiting for validation is now almost 1.8m. |
Joined: 16 Mar 10 Posts: 213 Credit: 108,654,404 RAC: 20,578 |
Some time around 17:00 to 17:30 UTC today the transitioner backlog that has been plaguing the system for several days finally cleared; the peaking of the "waiting for validation" figure probably reflects the last of the backlog getting cleared out! Now that the transitioner isn't taking up as much system time (in particular, database accesses!), the Separation validator should be able to catch up (though it's anyone's guess how quickly it will do so!). The N-Body validator didn't seem to be as badly backed up at any time, and now second results for N-Body seem to go Valid almost as soon as they hit the server! We now need to hope that, with the number of unsent tasks still falling at a reasonable rate, there won't be a reason for another substantial transitioner backlog to occur; otherwise, we could see another bout of excess work unit creation :-( Cheers - Al. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
I notice my backlog of Nbody Simulations has started reducing albeit slowly, but good to see. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Quote: "I notice my backlog of Nbody Simulations has started reducing albeit slowly, but good to see." Mine hasn't changed in a couple of days so I sure hope that's right. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Quote: "I notice my backlog of Nbody Simulations has started reducing albeit slowly, but good to see." It has gone down by about 50-60, and all of today’s WUs have been validated. |
©2024 Astroinformatics Group