Validation inconclusive
Author | Message |
---|---|
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 1 |
..... If you also read the News forum you will see that the server lost a hard drive a while back. Replacing it took time, and getting it back up to speed is not going very well BECAUSE we keep using the remaining hard drives to get and return work. They have already taken a day here and a day there to try to make things go faster, but it just isn't working. And THEN we have people complaining that their own tasks aren't going to be validated in time, so the admins capitulate and turn the server back on again, which slows down the new drive's integration into the RAID group. In short, if it bothers you that much, try crunching something at Einstein or Collatz until MilkyWay has caught up again. Unfortunately they cannot afford a full-time admin, so the admin also teaches at the university, and that involves more than just classroom work, as any teacher can tell you. |
Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
@Mikey: +1 Thanks - he still does not understand the situation ... @Max: -1 Have a couple of beers, relax and enjoy life ... |
Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Well, I am well aware of all that. I was merely pointing out that there is still a problem with the 'pending' tasks, with the idea that the admins might want to do something about it (if they are so inclined). Also, most of these are WUs that I completed weeks ago (even a month ago), so I don't see how it would help if I decided to switch to Collatz full time now (or even a week ago, or a couple of weeks ago). In any case, I don't see this topic as a place for other contributors to comment on my understanding of the inner workings of MW@H, hence my previous post. |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,177 RAC: 37 |
"I have a unit #2146934216 which is dated 3 Feb 2021 but still has the status: Completed, waiting for validation. Funny, it seems to have been forgotten." I can't seem to find that unit, so I guess check the number and try again? https://www.youtube.com/watch?v=EfDCHMn77cc |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,177 RAC: 37 |
..... Timed out - no response is an error for the volunteer. Since we had some trouble with the server, what likely happened is that your client simply never received the tasks in question. As such, your machines never wasted time or compute resources; the tasks will simply time out naturally and be processed by someone else eventually. This issue also isn't isolated to MW@H. Seti@home has a forum topic on the subject: https://setiathome.berkeley.edu/forum_thread.php?id=84176 As for a technical response: nginx/Apache/whatever web server is in use has a default timeout (I think it's 60 seconds), so the scheduler has that amount of time to do "stuff" and respond to your request. If it doesn't finish in that time, the web server kills the process, and this can have negative consequences such as lost tasks. As for "compromised" tasks, they aren't: someone else will get the replacement and the workunit will be validated eventually. As for your suggestion of "cancelling": a workunit may have one or more tasks, and if one or more of those tasks has already been completed by another volunteer, cancelling the whole WU would waste compute resources. E.g. this workunit has 2 returned results (at the time of this message) https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=391325693 which unfortunately didn't validate against each other; cancelling it would have wasted those resources. |
Joined: 16 Mar 10 Posts: 211 Credit: 108,149,897 RAC: 4,885 |
"Timed out - no response is an error for the volunteer. Since we had some trouble with the server, what likely happened is that your client simply never received the tasks in question. As such, your machines never wasted time or compute resources; the tasks will simply time out naturally and be processed by someone else eventually. This issue also isn't isolated to MW@H. Seti@home has a forum topic on the subject: https://setiathome.berkeley.edu/forum_thread.php?id=84176" Although "Timed out - no response" can also indicate that work was sent but the recipient never replied (turned off the machine and never reconnected?), the quote from Kiska's message above highlights the reason that folks are seeing this at MW@Home... I'll bet that if you look at your "Timed out" errors, a large number of them (if not all) will have a sent date of 20th March, as that seems to be when it was at its worst.... There are posts in various threads here referring to these "ghost" or "orphaned" tasks, which would first have manifested as the web site claiming a [much] higher number of tasks in progress than the user could see in BOINC Manager (or equivalent). Eventually, that discrepancy in numbers should have turned into an equal count of Errors (timed out!) Cheers - Al. |
Joined: 16 Mar 10 Posts: 211 Credit: 108,149,897 RAC: 4,885 |
"I have a unit #2146934216 which is dated 3 Feb 2021 but still has the status: Completed, waiting for validation. Funny, it seems to have been forgotten." In late January/early February 2021 the MW project had problems because the work unit numbers had grown so large that the database could no longer store them in a 32-bit integer! There were plenty of messages on the topic back then, but newcomers might be unaware... The solution involved taking the project offline and renumbering everything; however, some results of some work units got isolated for one reason or another, so the system did not clean out the result records properly. Those records are likely to linger until someone with a lot of patience goes through and sifts them out (messy and time-consuming) or the database can be completely reconstructed from scratch (I don't know how feasible that is if it actually contains live data...) So yes, your record is "forgotten" -- just ignore it. Cheers - Al. P.S. I have over 100 old result records in that state :-) |
Joined: 4 Nov 12 Posts: 96 Credit: 251,528,484 RAC: 0 |
My main concern regarding this matter is that my computer has had invalid work units for this project in the past, when my GPUs were overheating and throttling themselves. I'm currently using a dynamic, load-based clock rate on them to help keep them cooler overall. So far nothing seems to have had an error or been listed as invalid, so I guess I'm in the clear. |
Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
The number of tasks listed as "validation inconclusive" has risen to 1922 out of 4604 total. This is a pretty bad proportion. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 1 |
"Timed out - no response is an error for the volunteer. Since we had some trouble with the server, what likely happened is that your client simply never received the tasks in question. As such, your machines never wasted time or compute resources; the tasks will simply time out naturally and be processed by someone else eventually. This issue also isn't isolated to MW@H. Seti@home has a forum topic on the subject: https://setiathome.berkeley.edu/forum_thread.php?id=84176" I have also had tasks that "timed out" when a machine crashed and I couldn't get it back up and running in time, or even when the hard drive crashed and there was no way to get them back. |
Joined: 16 Aug 09 Posts: 2 Credit: 1,198,788 RAC: 623 |
"'I have a unit #2146934216 which is dated 3 Feb 2021 but still has the status: Completed, waiting for validation. Funny, it seems to have been forgotten.' In late January/early February 2021 the MW project had problems because the work unit numbers had grown so large that the database could no longer store them in a 32-bit integer! There were plenty of messages on the topic back then, but newcomers might be unaware..." Thanks for the information. I am not complaining, just wondering why the unit has been sitting there for a year. OK, I will ignore it. |
Joined: 31 Mar 12 Posts: 96 Credit: 152,502,177 RAC: 37 |
"The number of tasks listed as 'validation inconclusive' has risen to 1922 out of 4604 total. This is a pretty bad proportion." You think you have it bad? Mine is 14687 out of 29648 total tasks. But here's the thing: I trust that the server software will sort itself out in due time by sending replacement tasks. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Personally I think there is something wrong with Nbody Simulation processing. All the ones I did went straight to Validation Inconclusive, but the second task is shown as unsent in every case. The Nbody Simulation numbers are not really changing by much; they are always around 13 million regardless of the amount processed in a day, whereas the Separation volumes usually get processed within a day. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 1 |
"Personally I think there is something wrong with Nbody Simulation processing. All the ones I did went straight to Validation Inconclusive, but the second task is shown as unsent in every case. The Nbody Simulation numbers are not really changing by much; they are always around 13 million regardless of the amount processed in a day, whereas the Separation volumes usually get processed within a day." YES, MilkyWay lost a hard drive, and it's been an ordeal of more than a week to get the RAID setup to see and send data to the new drive. Some people have said theirs has taken more than 30 days, and they weren't running a BOINC project on the remaining drives at the same time. I did it once and it took more than a week, and I was just playing around with how RAID works. All that means that EVERYTHING else is suffering as MilkyWay tries to keep the project going. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
"Personally I think there is something wrong with Nbody Simulation processing. All the ones I did went straight to Validation Inconclusive, but the second task is shown as unsent in every case. The Nbody Simulation numbers are not really changing by much; they are always around 13 million regardless of the amount processed in a day, whereas the Separation volumes usually get processed within a day." Thanks for that. I know about the disk issue. My observations are around Nbody Simulation, where tasks marked as validation inconclusive are shown as unsent. At what point will they get sent, if ever? At around 13 million tasks we will be lucky if the backlog is cleared by Christmas. |
Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
I am somewhat confused about your definition of a "validation inconclusive" task (maybe I read you wrong?). As I understand it, the task has finished OK but is waiting for a second (wingman) task to be sent. This second task has nothing to do with the first one, except that the first one is waiting for the second one in order to be confirmed as valid or not. So, as long as the second task has not been generated and/or sent, one has to wait until this happens. I understand your post as saying that the finished, but inconclusive, task has NOT yet been sent. This is not the case. It has been sent and is just waiting for a second or even a third one to be compared with (the results, I mean). The task naming convention is _0 for the first task, _1 for the second one and so on. The "minimum quorum" means the number of tasks that need the "same" results to qualify for validity. The initial replication, I guess, means the maximum number of tasks that will try the same calculation before the whole work unit is completely "thrown" away. Now, why the wingman tasks have not yet been sent, I have no idea. That may be an error in the sending of further tasks which the admin has not yet discovered. |
Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0 |
The problem, as stated, is that the queue for nbody is sooooo huge. The resends get put at the end of the queue, so we need to work through the many, many new WUs before we get to the resends. There were 18 million in this queue and it is now down to 13 million, and some portion of that is resends, so it is hard to say when we will get to the resend portion of the queue. Just keep moving forward. Your Validation Inconclusives will rise, but this isn't a problem (it's a feature). Once we get to the portion of the queue with the resends, the count will fall rapidly. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Thanks, that’s my point: the wingman tasks are all shown as unsent. |
Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0 |
+1 |
Joined: 13 Oct 21 Posts: 44 Credit: 226,639,318 RAC: 19,186 |
Second tasks of N-Body work units are being sent out, but at a very slow pace. I've recently gotten some tasks for WUs that had the first task completed almost a month ago. I think that, as has been mentioned before, the 13+ million queue needs to be processed before things can go back to normal. It'll take time though, since N-Body is CPU only, so tasks take longer and not as many people process them. There were 5 times as many users in the last 24 hours for Separation as for N-Body. What may help is recruiting some of those BOINC teams that do various marathons and sprints focusing on one project at a time. Getting a few hundred or maybe a thousand volunteers to focus on N-Body for a period of time would clear the large queue quickly. One uncertainty is whether the server will be able to handle the sharp jump in traffic. |
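
The scheduler timeout and the resulting "ghost" tasks described in the thread can be pictured with a toy model. This is only a sketch under stated assumptions: the 60-second default is the poster's own guess, `scheduler_request` and `Outcome` are hypothetical names that are not part of BOINC, and the model assumes the server records an assignment before its reply gets cut off.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    server_in_progress: int  # tasks the web site counts as "in progress"
    client_received: int     # tasks that actually reached the client

    @property
    def ghosts(self) -> int:
        """Tasks the server thinks were sent but the client never got."""
        return self.server_in_progress - self.client_received


def scheduler_request(tasks: int, seconds_needed: float,
                      timeout: float = 60.0) -> Outcome:
    """Toy model of one scheduler request (not real BOINC code).

    Assumption: the assignment is written to the database first, so a
    reply killed by the web server timeout leaves the tasks marked as sent.
    """
    if seconds_needed <= timeout:
        return Outcome(server_in_progress=tasks, client_received=tasks)
    return Outcome(server_in_progress=tasks, client_received=0)


slow = scheduler_request(tasks=10, seconds_needed=90.0)
fast = scheduler_request(tasks=10, seconds_needed=5.0)
print(slow.ghosts)  # 10
print(fast.ghosts)  # 0
```

In this model the slow request leaves 10 tasks that show on the web site but never appear in BOINC Manager, which matches the discrepancy described above; they would later surface as "Timed out - no response" errors.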
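
The 32-bit limit mentioned in the thread can be checked with a little arithmetic. A minimal sketch, assuming the IDs were held in a signed 32-bit integer (the thread only says "32-bit integer", so the signedness is an assumption) and using a hypothetical helper name:

```python
# Assumption: IDs were stored as signed 32-bit integers.
INT32_MAX = 2**31 - 1  # 2,147,483,647


def headroom(unit_id: int, limit: int = INT32_MAX) -> int:
    """Return how many more IDs fit below the limit after unit_id."""
    return limit - unit_id


stuck_unit = 2146934216  # the unit number quoted in the thread
print(headroom(stuck_unit))  # 549431
```

Under that assumption, the unit dated 3 Feb 2021 sits only about 549 thousand IDs below the ceiling, which fits the account that the numbering ran out around late January/early February 2021.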
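
The wingman explanation in the thread (tasks _0, _1, ... and a "minimum quorum") can be sketched as a small state function. This is an illustration only: `workunit_state` is a hypothetical name, a minimum quorum of 2 is assumed, and exact equality stands in for whatever comparison the project's validator really uses, which the thread does not describe.

```python
from collections import Counter
from typing import Optional, Sequence


def workunit_state(results: Sequence[Optional[float]],
                   min_quorum: int = 2) -> str:
    """Classify a work unit from its task results.

    results[i] is the value returned by task _i, or None if that task
    is still unsent or in progress.
    """
    returned = [r for r in results if r is not None]
    if not returned:
        return "in progress"
    # Count how many returned results agree with the most common value.
    _, votes = Counter(returned).most_common(1)[0]
    if votes >= min_quorum:
        return "valid"
    return "validation inconclusive"


print(workunit_state([1.25, None]))       # validation inconclusive
print(workunit_state([1.25, 1.25]))       # valid
print(workunit_state([1.25, 1.5, None]))  # validation inconclusive
```

The first call is the situation reported above: task _0 has finished, but its wingman _1 is still unsent, so the result cannot be confirmed yet.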
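
The point that resends go to the end of the queue can be shown with a tiny first-in, first-out simulation. A sketch only, assuming strict FIFO dispatch as the poster describes; the task names and `dispatch_until` are hypothetical, and the queue here holds 5 new tasks instead of millions.

```python
from collections import deque


def dispatch_until(queue: deque, target: str) -> int:
    """Send tasks from the front until `target` goes out.

    Returns how many other tasks were sent first.
    """
    sent = 0
    while queue:
        task = queue.popleft()
        if task == target:
            return sent
        sent += 1
    raise ValueError(f"{target} was not in the queue")


# Five brand-new first tasks are already waiting...
queue = deque(f"new_wu_{i}_0" for i in range(5))
# ...when a wingman (resend) task for an older work unit joins the back.
queue.append("old_wu_1")

print(dispatch_until(queue, "old_wu_1"))  # 5
```

Every task that was already queued has to go out before the wingman does, so with a backlog in the millions the "validation inconclusive" count keeps rising until the front of the queue reaches the resends.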
©2024 Astroinformatics Group