Welcome to MilkyWay@home

Validation inconclusive

Message boards : Number crunching : Validation inconclusive

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,276
RAC: 22,449
Message 72509 - Posted: 5 Apr 2022, 11:14:50 UTC - in response to Message 72508.  

.....
He has already been told this many times, but he does not seem to be willing to understand!
Excuse my bluntness - I just had to say it ...
Have a nice day


Look here old chap,

Your bluntness cannot be excused.
I understand fairly well how MW@H should work (also thanks to the useful comments from other contributors here). However, it's pretty obvious that the project does not work as intended at the moment, does it?
Now, some of these inconclusive WUs started to appear in the 'error' pile (several hundred of these already) as 'timed out - no response'. They are not going to be 'validated at some point', are they? That's what I call a pure waste of my time and resources.
In my opinion, it's useful when there is a problem to point that out. Whether the admins want or can solve it is another matter.
From my 'non-understanding' point of view, the admins should have canceled (deleted, purged... as a more knowledgeable person you can insert the right term here) all these compromised tasks and started with completely new and fresh WUs. Instead, they are circulating millions of these WUs from a month ago again and again through the hosts resulting in the current waste of time and money.

So, I'd suggest that, if you don't have anything useful to contribute, just keep your comments to yourself.


If you also read the News forum, you will see that the server lost a hard drive a while back. Replacing it took time, and getting it back up to speed is not going very well BECAUSE we keep using the remaining hard drives to fetch and return work. They have already taken a day here and a day there to try to make things go faster, but it just isn't working. And THEN we have people complaining that their own tasks aren't going to be validated in time, so the admins capitulate and turn the server back on again, which slows down the new drive becoming part of the whole RAID group again.

In short, if it bothers you that much, try crunching something at Einstein or Collatz until MilkyWay has caught up again. Unfortunately they cannot afford a full-time admin, so the admin also teaches at the university, and that involves more than just classroom work, as any teacher can tell you.

San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72511 - Posted: 5 Apr 2022, 11:28:09 UTC - in response to Message 72509.  

@Mikey: +1 Thanks - he still does not understand the situation ...

@Max: -1 Have a couple of beers, relax and enjoy life ...

Max_Pirx
Joined: 13 Dec 17
Posts: 46
Credit: 2,421,362,376
RAC: 0
Message 72513 - Posted: 5 Apr 2022, 14:00:07 UTC - in response to Message 72509.  


.................................
If you also read the News forum, you will see that the server lost a hard drive a while back. Replacing it took time, and getting it back up to speed is not going very well BECAUSE we keep using the remaining hard drives to fetch and return work. They have already taken a day here and a day there to try to make things go faster, but it just isn't working. And THEN we have people complaining that their own tasks aren't going to be validated in time, so the admins capitulate and turn the server back on again, which slows down the new drive becoming part of the whole RAID group again.

In short, if it bothers you that much, try crunching something at Einstein or Collatz until MilkyWay has caught up again. Unfortunately they cannot afford a full-time admin, so the admin also teaches at the university, and that involves more than just classroom work, as any teacher can tell you.


Well, I am well aware of all that. I was merely pointing out that there is still a problem with the 'pending' tasks, in the hope that the admins might want to do something about it (if they are so inclined). Also, most of these are WUs that I completed weeks ago (some even a month ago), so I don't see how it would help if I switched to Collatz full time now (or even a week or two ago).

In any case, I don't see this topic as a place for other contributors to comment on my understanding of the inner workings of MW@H, hence my previous post.

Kiska
Joined: 31 Mar 12
Posts: 94
Credit: 151,909,396
RAC: 12,496
Message 72515 - Posted: 5 Apr 2022, 15:39:55 UTC - in response to Message 72502.  

I have a unit #2146934216 which is dated 3 Feb 2021 but is still in status "Completed, waiting for validation". Funny, it seems to have been forgotten.


Can't seem to find that unit, so I guess check the number and try again?

https://www.youtube.com/watch?v=EfDCHMn77cc

Kiska
Joined: 31 Mar 12
Posts: 94
Credit: 151,909,396
RAC: 12,496
Message 72516 - Posted: 5 Apr 2022, 16:01:59 UTC - in response to Message 72508.  

.....
He has already been told this many times, but he does not seem to be willing to understand!
Excuse my bluntness - I just had to say it ...
Have a nice day


Look here old chap,

Your bluntness cannot be excused.
I understand fairly well how MW@H should work (also thanks to the useful comments from other contributors here). However, it's pretty obvious that the project does not work as intended at the moment, does it?
Now, some of these inconclusive WUs started to appear in the 'error' pile (several hundred of these already) as 'timed out - no response'. They are not going to be 'validated at some point', are they? That's what I call a pure waste of my time and resources.
In my opinion, it's useful when there is a problem to point that out. Whether the admins want or can solve it is another matter.
From my 'non-understanding' point of view, the admins should have canceled (deleted, purged... as a more knowledgeable person you can insert the right term here) all these compromised tasks and started with completely new and fresh WUs. Instead, they are circulating millions of these WUs from a month ago again and again through the hosts resulting in the current waste of time and money.

So, I'd suggest that, if you don't have anything useful to contribute, just keep your comments to yourself.


"Timed out - no response" is an error recorded against the volunteer. Since we had some trouble with the server, what likely happened is that your client simply never received the tasks in question. As such, your machines never wasted time or compute resources; all that happens is that the tasks time out naturally and are eventually processed by someone else. Also, this issue isn't isolated to MW@H. SETI@home has a forum topic on this subject: https://setiathome.berkeley.edu/forum_thread.php?id=84176

As for a technical response:
nginx/Apache/whatever web server is in use has a default timeout (I think it's 60 seconds), so the scheduler has that amount of time to do its "stuff" and respond to your request. If it doesn't finish in that time, the web server kills the process, and this can have negative consequences such as lost tasks.
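
To illustrate the timeout described above, here is a minimal Python sketch. It is a toy model, not BOINC's actual scheduler code: `handle_request`, `WEB_SERVER_TIMEOUT` and the in-memory `database` are hypothetical names, and the 60-second figure is only the guess from the post. It shows how tasks can end up recorded as sent on the server even though the reply that would have delivered them never reaches the client.

```python
# Toy model of a scheduler request that outlives the web server's timeout.
# All names and numbers are illustrative assumptions, not BOINC internals.

WEB_SERVER_TIMEOUT = 60.0  # seconds; the post's guess at a typical default


def handle_request(host_id, task_ids, seconds_per_task, database):
    """Assign tasks one by one; return the reply, or None if the timeout hits."""
    elapsed = 0.0
    assigned = []
    for task_id in task_ids:
        elapsed += seconds_per_task  # a struggling disk makes each step slow
        if elapsed > WEB_SERVER_TIMEOUT:
            # The web server kills the process: no reply reaches the client,
            # but the assignments already written to the database remain.
            return None
        database[task_id] = host_id  # server now believes this task was sent
        assigned.append(task_id)
    return assigned


database = {}
reply = handle_request("host-A", ["t1", "t2", "t3", "t4"], 25.0, database)
ghosts = sorted(database) if reply is None else []
print("reply:", reply)         # reply: None
print("ghost tasks:", ghosts)  # ghost tasks: ['t1', 't2']
```

In this model the two "ghost" tasks stay assigned to the host until their deadline passes, which is when they would show up as "Timed out - no response".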

As for "compromised" tasks, they aren't: someone else will get the replacement and the workunit will be validated eventually. As for your suggestion of "cancelling", a workunit may have one or more tasks, and if any of those tasks has already been completed by another volunteer, cancelling the whole WU would waste the compute resources already spent on it.

E.g. this workunit has 2 returned results (at the time of this message): https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=391325693. Unfortunately they didn't validate against each other, but cancelling this workunit would still have wasted the resources already spent on it.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,441,644
RAC: 36,812
Message 72517 - Posted: 5 Apr 2022, 17:08:17 UTC

"Timed out - no response" is an error recorded against the volunteer. Since we had some trouble with the server, what likely happened is that your client simply never received the tasks in question. As such, your machines never wasted time or compute resources; all that happens is that the tasks time out naturally and are eventually processed by someone else. Also, this issue isn't isolated to MW@H. SETI@home has a forum topic on this subject: https://setiathome.berkeley.edu/forum_thread.php?id=84176

Although "Timed out - no response" can also indicate that work was sent but the recipient never replied (turned off machine and never re-connected?), the quote from Kiska's message above highlights the reason that folks are seeing this at MW@Home... I'll bet that if you look at your "Timed out" errors there'll be a large number (if not all) of them where the sent date was 20th March, as that seems to be when it was at its worst....

There are posts in various threads here referring to these "ghost" or "orphaned" tasks, which would first have manifested as the web site claiming a [much] higher number of tasks in progress than the user could see in BOINC Manager (or equivalent). Eventually, that discrepancy in numbers should have turned into an equal count of Errors (timed out!)

Cheers - Al.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,441,644
RAC: 36,812
Message 72518 - Posted: 5 Apr 2022, 17:22:51 UTC - in response to Message 72502.  

I have a unit #2146934216 which is dated 3 Feb 2021 but is still in status "Completed, waiting for validation". Funny, it seems to have been forgotten.
In late January/early February 2021 the MW project had problems because the work unit numbers had grown so large that the database could not store them in a 32-bit integer any longer! Plenty of messages on the topic back then, but newcomers might be unaware...

The solution involved taking the project offline and renumbering everything; however, some results of some work units got isolated for one reason or another, so the system did not clean out the result records properly. Those records are likely to linger until someone with a lot of patience goes through and sifts them out (messy and time-consuming) or the database can be completely reconstructed from scratch (don't know how feasible that is if it actually contains live data...)

So yes, your record is "forgotten" -- just ignore it.

Cheers - Al.

P.S. I have over 100 old result records in that state :-)
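
As a small worked example of the 32-bit limit mentioned in this post, the Python sketch below checks how close the quoted unit number is to the largest signed 32-bit value. Whether the project's ID column was actually signed is an assumption on my part; `fits_in_int32` is a hypothetical helper written for this illustration.

```python
# How close the quoted unit number is to the signed 32-bit limit.
# Whether the project's ID column was signed is an assumption here.

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

unit_id = 2146934216  # the unit number quoted in this thread
headroom = INT32_MAX - unit_id


def fits_in_int32(value):
    """Return True if value is representable as a signed 32-bit integer."""
    return -(2**31) <= value <= INT32_MAX


print(INT32_MAX)                     # 2147483647
print(headroom)                      # 549431
print(fits_in_int32(unit_id))        # True
print(fits_in_int32(INT32_MAX + 1))  # False
```

Under that assumption, the unit from 3 Feb 2021 was issued with only about half a million IDs left before the counter ran out, which fits the timing of the renumbering described above.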

Wrend
Joined: 4 Nov 12
Posts: 96
Credit: 251,528,484
RAC: 0
Message 72528 - Posted: 6 Apr 2022, 2:16:46 UTC

My main concern regarding this matter is that my computer has had invalid work units for this project in the past, when my GPUs were overheating and throttling themselves. I'm currently using a dynamic clock rate on them based on load to help keep them cooler overall. So far nothing seems to have had an error or been listed as invalid, so I guess I'm in the clear.

GolfSierra
Joined: 11 Mar 22
Posts: 42
Credit: 21,902,543
RAC: 0
Message 72532 - Posted: 6 Apr 2022, 7:25:01 UTC

The number of tasks listed as "validation inconclusive" has risen to 1922 out of 4604 total. This is a pretty bad proportion.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,276
RAC: 22,449
Message 72533 - Posted: 6 Apr 2022, 11:35:34 UTC - in response to Message 72517.  

"Timed out - no response" is an error recorded against the volunteer. Since we had some trouble with the server, what likely happened is that your client simply never received the tasks in question. As such, your machines never wasted time or compute resources; all that happens is that the tasks time out naturally and are eventually processed by someone else. Also, this issue isn't isolated to MW@H. SETI@home has a forum topic on this subject: https://setiathome.berkeley.edu/forum_thread.php?id=84176

Although "Timed out - no response" can also indicate that work was sent but the recipient never replied (turned off machine and never re-connected?), the quote from Kiska's message above highlights the reason that folks are seeing this at MW@Home... I'll bet that if you look at your "Timed out" errors there'll be a large number (if not all) of them where the sent date was 20th March, as that seems to be when it was at its worst....

There are posts in various threads here referring to these "ghost" or "orphaned" tasks, which would first have manifested as the web site claiming a [much] higher number of tasks in progress than the user could see in BOINC Manager (or equivalent). Eventually, that discrepancy in numbers should have turned into an equal count of Errors (timed out!)

Cheers - Al.


I have also had tasks "time out" when a machine crashed and I couldn't get it back up and running in time, or when the hard drive crashed and there was no way to get them back.

Walker Chu
Joined: 16 Aug 09
Posts: 2
Credit: 1,181,606
RAC: 0
Message 72536 - Posted: 6 Apr 2022, 14:17:47 UTC - in response to Message 72518.  

I have a unit #2146934216 which is dated 3 Feb 2021 but is still in status "Completed, waiting for validation". Funny, it seems to have been forgotten.
In late January/early February 2021 the MW project had problems because the work unit numbers had grown so large that the database could not store them in a 32-bit integer any longer! Plenty of messages on the topic back then, but newcomers might be unaware...

The solution involved taking the project offline and renumbering everything; however, some results of some work units got isolated for one reason or another, so the system did not clean out the result records properly. Those records are likely to linger until someone with a lot of patience goes through and sifts them out (messy and time-consuming) or the database can be completely reconstructed from scratch (don't know how feasible that is if it actually contains live data...)

So yes, your record is "forgotten" -- just ignore it.

Cheers - Al.

P.S. I have over 100 old result records in that state :-)


Thanks for the information. I am not complaining, just wondering why the unit has been sitting there for a year. OK, I'll ignore it.

Kiska
Joined: 31 Mar 12
Posts: 94
Credit: 151,909,396
RAC: 12,496
Message 72537 - Posted: 6 Apr 2022, 16:11:02 UTC - in response to Message 72532.  

The number of tasks listed as "validation inconclusive" has risen to 1922 out of 4604 total. This is a pretty bad proportion.


You think you have it bad? Mine is 14687 out of 29648 total tasks. But here's the thing: I trust that the server software will sort itself out in due time by sending replacement tasks.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 72656 - Posted: 10 Apr 2022, 11:37:48 UTC

Personally, I think there is something wrong with Nbody Simulation processing. All the ones I did went straight to Validation Inconclusive, but the second task is shown as unsent in every case. The Nbody Simulation numbers are not really changing by much: the count is always around 13 million regardless of the amount processed in a day, whereas the Separation volumes usually get processed within a day.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,276
RAC: 22,449
Message 72659 - Posted: 10 Apr 2022, 12:49:39 UTC - in response to Message 72656.  

Personally I think there is something wrong with Nbody Simulation processing. All the ones I did went straight to Validation Inconclusive but the second task is shown as unsent in every case. Nbody simulation numbers are not really altering by that much, it is always around 13 Million regardless of the amount processed in a day whereas the Separation volumes get processed within a day usually.


YES, MilkyWay lost a hard drive, and it's been a more than week-long ordeal to get the RAID setup to see and send data to the new drive. Some people have said theirs has taken more than 30 days, and they weren't running a BOINC project at the same time on the remaining drives. I did it once and it took more than a week, and I was just playing around with how RAID works. All that means that EVERYTHING else is suffering as MilkyWay tries to keep the project going.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 72660 - Posted: 10 Apr 2022, 14:14:44 UTC - in response to Message 72659.  

Personally I think there is something wrong with Nbody Simulation processing. All the ones I did went straight to Validation Inconclusive but the second task is shown as unsent in every case. Nbody simulation numbers are not really altering by that much, it is always around 13 Million regardless of the amount processed in a day whereas the Separation volumes get processed within a day usually.


YES, MilkyWay lost a hard drive, and it's been a more than week-long ordeal to get the RAID setup to see and send data to the new drive. Some people have said theirs has taken more than 30 days, and they weren't running a BOINC project at the same time on the remaining drives. I did it once and it took more than a week, and I was just playing around with how RAID works. All that means that EVERYTHING else is suffering as MilkyWay tries to keep the project going.


Thanks for that. I know about the disk issue. My observations are around Nbody Simulation, where tasks marked as validation inconclusive are shown as unsent. At what point will they get sent, if ever? At around 13 million tasks, we will be lucky if the backlog is cleared by Christmas.

San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72661 - Posted: 10 Apr 2022, 15:18:21 UTC - in response to Message 72660.  


... My observations are around Nbody Simulation where tasks marked as validation inconclusive are shown as unsent. At what point will they get sent, if ever ...

I am somewhat confused about your "validation inconclusive" definition of a task (maybe I read you wrong?).

As I understand it, the task has finished OK, but is waiting for a second (wingman) task to be sent.
This second task has nothing to do with the first one, except that the first one is waiting for the second one in order to then be confirmed as valid or not.
So, as long as the second task has not been generated and/or sent, one has to wait until this happens.

I read your post as saying that the finished, but inconclusive, task has NOT yet been sent.
This is not the case. It has been sent and is just waiting for a second or even a third one whose results it can be compared with.
The task naming convention is _0 for the first task, _1 for the second one and so on.

The "minimum quorum" means the number of tasks that need the "same" results to qualify for validity.
The initial replication, I guess, means the maximum number of tasks that will try the same calculation before the whole work unit is completely "thrown" away.

Now, why the wingman tasks have not yet been sent, I have no idea.
That may be an error in the sending of further tasks that the admin has not yet discovered.
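
The quorum idea in the post above can be sketched in a few lines of Python. This is a toy check under stated assumptions, not the project's validator: `workunit_state`, `task_name`, the tolerance and the sample numbers are all hypothetical. One caveat on terminology: as far as I know, in BOINC's workunit settings "initial replication" is the number of tasks created up front, while a separate limit caps the total number of tasks, so the post's guess may not match the server's meaning exactly.

```python
# Toy quorum check. Names, the tolerance and the numbers are assumptions;
# a real project validator is application-specific.

def workunit_state(results, min_quorum, tolerance=1e-9):
    """Return 'valid' once min_quorum returned results agree, else 'inconclusive'."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tolerance]
        if len(agreeing) >= min_quorum:
            return "valid"
    return "inconclusive"


def task_name(workunit_name, index):
    """Tasks of one work unit are numbered _0, _1, ... as described in the post."""
    return f"{workunit_name}_{index}"


# One returned result cannot reach a quorum of 2, so it stays inconclusive.
print(task_name("wu_example", 0), workunit_state([3.14], min_quorum=2))
# A matching wingman result completes the quorum.
print(task_name("wu_example", 1), workunit_state([3.14, 3.14], min_quorum=2))
# Two results that disagree need a third task to break the tie.
print(workunit_state([3.14, 2.71], min_quorum=2))
print(task_name("wu_example", 2), workunit_state([3.14, 2.71, 3.14], min_quorum=2))
```

With a minimum quorum of 2, a single finished task can never be more than "inconclusive" on its own, which is the state being discussed in this thread.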

unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 817,008
RAC: 0
Message 72663 - Posted: 10 Apr 2022, 15:22:23 UTC - in response to Message 72660.  

The problem is, as stated, that the queue is sooooo huge for nbody. The resends get put at the end of the queue, so we need to work through the many, many new WUs before we get to the resends. There were 18 million in this queue, and it is now down to 13 million, and some portion of that is resends, so it is hard to say when we will get to the resend portion of the queue.

Just keep moving forward. Your Validation Inconclusives will rise, but this isn't a problem (it's a feature). Once we get to the portion of the queue with the resends, then it will rapidly fall.
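
The rise-then-fall pattern described here can be reproduced with a toy first-in, first-out queue. The Python sketch below is only an illustration under simplifying assumptions: the function `simulate` is hypothetical, the queue sizes are tiny stand-ins for the millions mentioned in the thread, every work unit needs exactly one wingman, and every wingman result is assumed to agree.

```python
from collections import deque

# Toy FIFO work queue: every processed first task puts its wingman resend
# at the back. Sizes are tiny stand-ins for the millions in the thread.

def simulate(new_workunits, tasks_per_day):
    """Return the number of inconclusive results at the end of each day."""
    queue = deque(("new", wu) for wu in range(new_workunits))
    inconclusive = 0
    history = []
    while queue:
        for _ in range(min(tasks_per_day, len(queue))):
            kind, wu = queue.popleft()
            if kind == "new":
                inconclusive += 1             # first result waits for a wingman
                queue.append(("resend", wu))  # resend joins the end of the queue
            else:
                inconclusive -= 1             # wingman returned and is assumed to agree
        history.append(inconclusive)
    return history


print(simulate(new_workunits=6, tasks_per_day=2))  # [2, 4, 6, 4, 2, 0]
```

The inconclusive count climbs for as long as new work units sit ahead of the resends, then drops once the resend portion of the queue is reached.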

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 72664 - Posted: 10 Apr 2022, 15:23:52 UTC - in response to Message 72661.  


... My observations are around Nbody Simulation where tasks marked as validation inconclusive are shown as unsent. At what point will they get sent, if ever ...

I am somewhat confused about your "validation inconclusive" definition of a task (maybe I read you wrong?).

As I understand it, the task has finished OK, but is waiting for a second (wingman) task to be sent.
This second task has nothing to do with the first one, except that the first one is waiting for the second one in order to then be confirmed as valid or not.
So, as long as the second task has not been generated and/or sent, one has to wait until this happens.

I read your post as saying that the finished, but inconclusive, task has NOT yet been sent.
This is not the case. It has been sent and is just waiting for a second or even a third one whose results it can be compared with.
The task naming convention is _0 for the first task, _1 for the second one and so on.

The "minimum quorum" means the number of tasks that need the "same" results to qualify for validity.
The initial replication, I guess, means the maximum number of tasks that will try the same calculation before the whole work unit is completely "thrown" away.

Now, why the wingman tasks have not yet been sent, I have no idea.
That may be an error in the sending of further tasks that the admin has not yet discovered.


Thanks, that's my point: the wingman tasks are all shown as unsent.

San-Fernando-Valley
Joined: 13 Apr 17
Posts: 256
Credit: 604,411,638
RAC: 0
Message 72665 - Posted: 10 Apr 2022, 15:24:01 UTC - in response to Message 72663.  

+1

AndreyOR
Joined: 13 Oct 21
Posts: 43
Credit: 225,018,835
RAC: 9,849
Message 72669 - Posted: 11 Apr 2022, 0:21:39 UTC

Second tasks of N-Body work units are being sent out, but at a very slow pace. I've recently gotten some tasks for WUs that had the first task completed almost a month ago. I think that, as has been mentioned before, the 13+ million queue needs to be processed before things can go back to normal. It'll take time, though, since N-Body is CPU only, so tasks take longer and not as many people process them. Over the last 24 hours there were 5 times as many users for Separation as for N-Body.

What may help is recruiting some of those BOINC teams that do various marathons and sprints focusing on one project at a time. Getting a few hundred or maybe thousand volunteers to focus on N-Body for a period of time would clear the large queue quickly. One uncertainty is whether the server will be able to handle the high jump in traffic.

©2024 Astroinformatics Group