Welcome to MilkyWay@home

Validation inconclusive

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73562 - Posted: 17 May 2022, 15:06:18 UTC

I'm looking into things. Not sure why so much is going to validation pending.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73563 - Posted: 17 May 2022, 15:09:45 UTC

There were a couple server processes that had been querying the DB for a couple hundred hours. I've killed those and hopefully things will start moving again.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,900,464
RAC: 0
Message 73564 - Posted: 17 May 2022, 15:51:34 UTC - in response to Message 73563.  

Thanks for looking…

Keith Myers
Joined: 24 Jan 11
Posts: 715
Credit: 557,048,127
RAC: 42,886
Message 73565 - Posted: 17 May 2022, 16:31:37 UTC
Last modified: 17 May 2022, 16:32:18 UTC

Most of the server processes are not running. None of the work is processing on any of my hosts. All hosts in multi-hour backoffs.

alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 109,122,597
RAC: 29,833
Message 73567 - Posted: 17 May 2022, 16:59:40 UTC - in response to Message 73563.  
Last modified: 17 May 2022, 17:01:15 UTC

> There were a couple server processes that had been querying the DB for a couple hundred hours. I've killed those and hopefully things will start moving again.

Tom,

Attempts to access the server to report work are being met by "Feeder not running" (and the Server Status page is still stuck showing most server processes not running and other statistics from around 13:40 UTC)...

[Edit. I see Keith Myers has flagged this up whilst I was researching this message - hey, ho...]

Cheers - Al.

P.S. do we know what the offending processes were?

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73587 - Posted: 18 May 2022, 15:24:28 UTC

Are you still getting the "feeder not running" issue?

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73588 - Posted: 18 May 2022, 15:25:42 UTC

I'm turning things off and back on. In the past when people had the feeder error, this has fixed it.

alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 109,122,597
RAC: 29,833
Message 73589 - Posted: 18 May 2022, 15:51:27 UTC - in response to Message 73587.  

> Are you still getting the "feeder not running" issue?

Nope -- and some folks had actually acknowledged yesterday's "fix" in the "Feeder Not Running" thread. Sorry; one of us (which probably means me) should've also posted an acknowledgment here...

There seems to be a different problem now, doesn't there? It'll be interesting to see if a full shut-down/restart will have helped solve the delays in validator activity (even N-Body tasks have been sat in "Pending validation" for up to and occasionally beyond an hour...).

Also, not necessarily connected to anything in particular, I noticed that the task data part of the status page had only updated once per 12 hours since 13:37 UTC on the 17th -- once at 01:24 and again at 12:57 on 18th. Perhaps a restart will change whatever is stopping it updating the task data cache...

And, if I might, an enquiry: is N-Body supposed to be using adaptive replication? If so, it doesn't seem to be working :-)

Cheers - Al.

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 73590 - Posted: 18 May 2022, 16:09:53 UTC
Last modified: 18 May 2022, 16:10:08 UTC

Yeah, hopefully restarting the project processes will do something for that problem. I can always try a full system restart too. I have no idea where to even start to look if it's a validator problem, because the validator is running fine and the logs don't show any issues.

> is N-Body supposed to be using adaptive replication? If so, it doesn't seem to be working :-)

I think everyone's adaptive replication scores got ruined when I cancelled millions of jobs after the disk failure, and it will take a while for the system to recognize that you are returning good WUs again. Until that happens, I think it will just always check every WU twice, which sucks. I suppose I could always tinker with this in the DB but there's no way of knowing how that will impact the algorithm going forwards.

alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 109,122,597
RAC: 29,833
Message 73591 - Posted: 18 May 2022, 16:27:30 UTC - in response to Message 73590.  

> Yeah, hopefully restarting the project processes will do something for that problem. I can always try a full system restart too. I have no idea where to even start to look if it's a validator problem, because the validator is running fine and the logs don't show any issues.
>
> > is N-Body supposed to be using adaptive replication? If so, it doesn't seem to be working :-)
>
> I think everyone's adaptive replication scores got ruined when I cancelled millions of jobs after the disk failure, and it will take a while for the system to recognize that you are returning good WUs again. Until that happens, I think it will just always check every WU twice, which sucks. I suppose I could always tinker with this in the DB but there's no way of knowing how that will impact the algorithm going forwards.

Tom - thanks for the reply.

In theory, the count of good WUs should be maintained as tasks get validated, and not be influenced by prior activity; unless there's something strange going on there's no real reason the count should be depleted! However, there are several possibilities for "strange" that could result from something having broken in the application configuration (perhaps when the WU generator ran wild?)

I did a [partial] code dive a while back to see how adaptive replication was implemented, and if I recall correctly there were at least three reasons it might not actually invoke adaptive replication, but I didn't follow up on that at the time. If I get the chance, I might have another go :-)

Again, thanks for the reply.

Cheers - Al.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,900,464
RAC: 0
Message 73673 - Posted: 24 May 2022, 7:51:22 UTC - in response to Message 73591.  

Is anything actually getting validated? There seems to be a delay now in Separation tasks getting validated. I have given up on my Simulation backlog, which is well over a month old; nothing has changed there at all.

Jean-David Beyer
Joined: 14 May 22
Posts: 7
Credit: 8,077,321
RAC: 0
Message 73674 - Posted: 24 May 2022, 10:28:19 UTC - in response to Message 73673.  

I do not know if anything is getting validated. This is my oldest. There are 184 others.

Workunit 458806994
name: de_modfit_86_bundle5_3s_south_pt2_2_1652888500_3384632
application: Milkyway@home Separation
created: 21 May 2022, 6:48:25 UTC
minimum quorum: 1
initial replication: 1
max # of error/total/success tasks: 2, 9, 6
validation: Pending

Task 279025918
Computer: 928280
Sent: 22 May 2022, 8:24:29 UTC
Time reported: 22 May 2022, 23:13:27 UTC
Status: Completed, waiting for validation
Run time: 2,791.94
CPU time: 2,777.26
Credit: pending
Application: Milkyway@home Separation v1.46 x86_64-pc-linux-gnu


mikey
Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 0
Message 73678 - Posted: 24 May 2022, 11:54:23 UTC - in response to Message 73674.  

> I do not know if anything is getting validated. This is my oldest. There are 184 others.
>
> Application
> 279025918 928280 22 May 2022, 8:24:29 UTC 22 May 2022, 23:13:27 UTC Completed, waiting for validation 2,791.94 2,777.26 pending
> Milkyway@home Separation v1.46
> x86_64-pc-linux-gnu


My oldest is from the 27th of April and there are 853 others just like it.

JohnDK
Joined: 18 Feb 10
Posts: 57
Credit: 222,683,194
RAC: 6,447
Message 73681 - Posted: 24 May 2022, 16:53:56 UTC

It's almost 2 days since I've got any task validated. There's over 1.5m workunits waiting for validation, don't think it's going good.

Peter Dragon
Joined: 27 Feb 22
Posts: 18
Credit: 2,967,695
RAC: 0
Message 73682 - Posted: 24 May 2022, 19:23:13 UTC - in response to Message 73681.  
Last modified: 24 May 2022, 19:26:29 UTC

> It's almost 2 days since I've got any task validated. There's over 1.5m workunits waiting for validation, don't think it's going good.


My task validations have been completing as of 24 hours ago.

23 May 2022, 4:07:46 UTC 23 May 2022, 15:43:21 UTC Completed and validated

JohnDK
Joined: 18 Feb 10
Posts: 57
Credit: 222,683,194
RAC: 6,447
Message 73683 - Posted: 24 May 2022, 20:28:07 UTC

Yes I also have some validated tasks from May 23 now, they weren't there a few hours ago.

btw workunits waiting for validation is now almost 1.8m.

alanb1951
Joined: 16 Mar 10
Posts: 213
Credit: 109,122,597
RAC: 29,833
Message 73684 - Posted: 24 May 2022, 22:54:49 UTC

Some time around 17:00 to 17:30 UTC today the transitioner backlog that has been plaguing the system for several days finally cleared; the peaking of the "waiting for validation" figure probably reflects the last of the backlog getting cleared out!

Now that the transitioner isn't taking up as much system time (in particular, database accesses!), the Separation validator should be able to catch up (though it's anyone's guess how quickly it will do so!). The N-Body validator didn't seem to be as badly backed up at any time, and now second results for N-Body seem to go Valid almost as soon as they hit the server!

We now need to hope that with the number of unsent tasks still falling at a reasonable rate there won't be another reason for another substantial transitioner backlog to occur; otherwise, we could see another bout of excess work unit creation :-(

Cheers - Al.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,900,464
RAC: 0
Message 73757 - Posted: 30 May 2022, 7:07:37 UTC - in response to Message 73684.  

I notice my backlog of Nbody Simulations has started reducing albeit slowly, but good to see.

mikey
Joined: 8 May 09
Posts: 3339
Credit: 524,010,781
RAC: 0
Message 73758 - Posted: 30 May 2022, 10:55:02 UTC - in response to Message 73757.  

> I notice my backlog of Nbody Simulations has started reducing albeit slowly, but good to see.


Mine hasn't changed in a couple of days so I sure hope that's right.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,900,464
RAC: 0
Message 73760 - Posted: 30 May 2022, 15:46:50 UTC - in response to Message 73758.  

> > I notice my backlog of Nbody Simulations has started reducing albeit slowly, but good to see.
>
> Mine hasn't changed in a couple of days so I sure hope that's right.


Gone down by about 50-60; all of today's WUs have been validated.

©2024 Astroinformatics Group