Huge number of 'Validation inconclusive' WUs
Author | Message |
---|---|
Send message Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Hello, at present I have more than 2000 'validation inconclusive' WUs (MilkyWay@Home v1.46 (opencl_ati_101)), all of the 'Unsent' variety, across my three machines: https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=764666&offset=0&show_names=0&state=3&appid= https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=765378&offset=0&show_names=0&state=3&appid= https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=763440&offset=0&show_names=0&state=3&appid= Any idea what's going on? Many thanks, max |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,322,259 RAC: 20,816 |
Hello, I'm over 600 myself, going to put a stop to my crunching here until it comes back down again!! |
Send message Joined: 16 Nov 14 Posts: 16 Credit: 335,683,507 RAC: 0 |
Over 6300 here. |
Send message Joined: 24 Aug 17 Posts: 8 Credit: 223,957,930 RAC: 0 |
I've got almost 2000 here. There are 1.4M unsent tasks according to the server status here: https://milkyway.cs.rpi.edu/milkyway/server_status.php Maybe wait a few days to see what happens next? |
Send message Joined: 2 Oct 16 Posts: 167 Credit: 1,005,838,225 RAC: 49,078 |
My count has gone down since last night. From 2.2k to 1.7k. |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,322,259 RAC: 20,816 |
Quote: "My count has gone down since last night. From 2.2k to 1.7k." Mine came down almost 40%, but on my list the first dozen are still "unsent" and I stopped crunching yesterday after I posted. Either a ton of us stopped crunching or there are some significant problems with the way things work. I'm not creating new "inconclusives" but one would think the existing workunits should have been sent out long before now. |
Send message Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Slight decrease in my inconclusive WUs, but I still have around 2000 of them. I just started more detailed monitoring: I picked a few of them and will watch to see when (if ever) they are sent out again. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey Everyone, Looks like the workunit generator went a little overboard over the weekend and clogged things up. The validation inconclusives should clear out of the queue over the next two or three days as the server works through the backlog. On another note, the new RAM for the server should be here tomorrow and that should help prevent this from happening again. Jake |
Send message Joined: 24 Aug 17 Posts: 8 Credit: 223,957,930 RAC: 0 |
I'm speculating that it may take some time, since the task number for the second result in the quorum is not sequential with, or even close to, the first task number; the two are spread very far apart. See one example here for one of my WUs: https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1579238967 For this WU, my task number is 2,264,897,549 and my soon-to-be wingman's task number for the same WU is 2,266,161,449. That's a difference of about 1,263,900, which is roughly the number of unsent tasks currently reported by the server status. I also suspect that no one has started crunching task numbers 2,266,xxx,xxx yet? |
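As a quick sanity check on the numbers in the post above, the gap between the two task IDs can be computed directly and compared with the backlog. This is only a sketch: the two task IDs are the ones quoted in the post, the 1.4M unsent figure is the rounded value reported earlier in the thread, and the idea that task IDs are handed out sequentially as tasks are created is the poster's speculation, not confirmed server behaviour.

```python
# Task IDs quoted in the post for workunit 1579238967.
my_task_id = 2_264_897_549
wingman_task_id = 2_266_161_449

# Rounded "unsent" count from the server status page, as reported earlier
# in the thread (approximate, and it changes over time).
unsent_tasks_reported = 1_400_000

# Number of task IDs issued between the first result and its wingman copy.
id_gap = wingman_task_id - my_task_id

# ASSUMPTION: IDs are issued sequentially, so the gap approximates how many
# tasks were created between the two copies of this workunit.
fraction_of_backlog = id_gap / unsent_tasks_reported

print(f"ID gap: {id_gap:,}")
print(f"Gap as a fraction of reported unsent tasks: {fraction_of_backlog:.0%}")
```

If the gap stays close to the unsent count over several days, that would be consistent with the wingman copies sitting at the back of the same queue, though it would not prove it.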
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,322,259 RAC: 20,816 |
Hey Everyone, I sure hope so!! Thanks for working on this AND letting us know what's going on!!! |
Send message Joined: 2 Oct 16 Posts: 167 Credit: 1,005,838,225 RAC: 49,078 |
I saw the site was down earlier today, a couple of hours ago. Count is now 2.5k so even higher than last night. Just a single 280x so pretty much most of them. |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,322,259 RAC: 20,816 |
Quote: "I saw the site was down earlier today, a couple of hours ago. Count is now 2.5k so even higher than last night. Just a single 280x so pretty much most of them." I now have 211 inconclusive but 69 of those haven't even been sent to another computer yet!! Until that catches up my PCs will stay elsewhere and the units I have in my cache won't be crunched! Sorry guys but I don't like the idea of MW making NEW workunits and sending them out BEFORE the existing workunits are sent out!! I have 211 workunits that I have already completed and that are waiting for validation, and 69 of those haven't even been SENT to another computer yet!!! I haven't even gotten a new workunit in almost 2 days now!! |
Send message Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Quote: "I saw the site was down earlier today, a couple of hours ago. Count is now 2.5k so even higher than last night. Just a single 280x so pretty much most of them." The same with me... mine are steadily going up, 3500+ at the moment. Pretty much all of my completed WUs end up as inconclusive (Unsent). So, I am moving my machines away from MW, don't see any point in keeping them crunching here. |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,322,259 RAC: 20,816 |
Quote: "I saw the site was down earlier today, a couple of hours ago. Count is now 2.5k so even higher than last night. Just a single 280x so pretty much most of them." I will bring mine back, I always seem to anyway, but they need to fix the problems so I'm not just spinning my wheels!! |
Send message Joined: 26 Mar 18 Posts: 24 Credit: 102,912,937 RAC: 0 |
I'm confused. I mean they will eventually be validated so what does it matter, correct? |
Send message Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
Hello, The problem is a bug in the program. Looking at the set of errors, the first user I looked at had 3 Titans but almost 20,000 errors. If all tasks error out, the number of "pending" will rise to the total number of work units. I thought only 80 were allowed per day. Even with 3 Titans my math suggests it should have taken a week at 80 per 24 hours. All 19,736 were from May 7, 5 am to May 8, 1300. |
Send message Joined: 8 May 09 Posts: 3319 Credit: 520,322,259 RAC: 20,816 |
Hello, No it's 80 per gpu, but if you buzz right thru them at 2 seconds each you can zip thru thousands of them per day, all errors of course!!! Jake wants us to send him the link to computers like the one you found so he can put it on the 'suspicious' list. Unfortunately when he did it automatically a while back LOTS of people couldn't bring new gpu's on here to crunch so it's now a manual process. |
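The figures in the two posts above can be turned into a rough rate. The sketch below uses only the quoted numbers (19,736 errored tasks, 3 GPUs, May 7 05:00 to May 8 13:00). Treating "80 per GPU" as a hard daily cap is an assumption made here for comparison; the thread does not establish how the server actually applies that limit.

```python
# Figures quoted in the posts above.
errored_tasks = 19_736
gpus = 3

# May 7, 05:00 to May 8, 13:00: 19 hours on the first day plus 13 on the second.
elapsed_hours = (24 - 5) + 13

# Observed rate for the whole host.
tasks_per_hour = errored_tasks / elapsed_hours
seconds_per_task = elapsed_hours * 3600 / errored_tasks

# ASSUMPTION: read "80 per GPU" as a hard cap on tasks per GPU per day.
assumed_daily_cap = 80 * gpus
days_at_cap = errored_tasks / assumed_daily_cap

print(f"Elapsed: {elapsed_hours} h")
print(f"Observed: {tasks_per_hour:.0f} tasks/hour, {seconds_per_task:.1f} s per task")
print(f"At {assumed_daily_cap} tasks/day this volume would take {days_at_cap:.0f} days")
```

Under that reading the host returned errors far faster than a daily cap would allow, which fits the description of tasks failing within seconds of starting.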
Send message Joined: 26 Mar 18 Posts: 24 Credit: 102,912,937 RAC: 0 |
I was crunching on 3 Tesla v100s at the same time. A WU took around 35 seconds while running 6 at a time. So that averages a little under 6 seconds per WU per card, and with 3 cards it comes to under 2 seconds per WU. That's over 43,000 a day. Now I only ran like this for a couple of hours (testing) but my inconclusive total was over 2,000 in those couple of hours. Out of those I threw 6 errors and 2 invalids, I have 134 still at inconclusive, and all the rest validated. I still don't know what the issue is that you guys are posting about..... |
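The throughput estimate in the post above is simple arithmetic and can be reproduced as follows. This is a minimal sketch using only the poster's figures (35 seconds wall time, 6 concurrent WUs per card, 3 cards); the function names are illustrative and not part of any BOINC tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def effective_seconds_per_wu(wall_seconds: float, concurrent: int, cards: int) -> float:
    """Average time between completed WUs when `concurrent` WUs run
    side by side on each of `cards` identical cards."""
    return wall_seconds / (concurrent * cards)


def wus_per_day(wall_seconds: float, concurrent: int, cards: int) -> float:
    """Completed WUs per day at a steady rate."""
    return SECONDS_PER_DAY / effective_seconds_per_wu(wall_seconds, concurrent, cards)


per_card = effective_seconds_per_wu(35, concurrent=6, cards=1)
all_cards = effective_seconds_per_wu(35, concurrent=6, cards=3)
daily = wus_per_day(35, concurrent=6, cards=3)

print(f"Per card:  {per_card:.2f} s per WU")
print(f"All cards: {all_cards:.2f} s per WU")
print(f"Per day:   {daily:,.0f} WUs")
```

Rounding the combined figure up to 2 seconds per WU gives the 43,200 per day behind the "over 43,000" estimate; the unrounded rate is slightly higher.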
Send message Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
Quote: "I was crunching on 3 Tesla v100's at the same time. A WU took around 35 seconds while running 6 at a time. So averaging a little under 6 seconds per WU x 3 cards which averages out to under 2 seconds per WU. That's over 43,000 a day. Now I only ran like this for a couple hours (testing) but my inconclusive total was over 2,000 in those couple hours. Out of those I threw 6 errors and 2 invalids, I have 134 still at inconclusive, and all the rest validated." The original post was about inconclusive validations, and you are correct that if you wait them out all will eventually be validated. OTOH, the post I made was to point out that the first system listed had 3 Titans and 19,000 of its work units errored, which is not the same as inconclusive validations. Looking at the task details, one observes that OpenCL was unable to find any nVidia devices although the system had 3 Titans. My S9100 is nowhere near as fast as your Tesla. It did cost me under $300 and uses only a single 8-pin power connector, which is a plus. Your computers are hidden, but you might want to run my program at http://new.stateson.net/HostProjectStats to get an accurate measurement of completion time. |
Send message Joined: 26 Mar 18 Posts: 24 Credit: 102,912,937 RAC: 0 |
Oh...they are not "my" Teslas. I wish. I got permission to play with the machines they were in. That's also why they are hidden...didn't want the host names to get out. And I no longer have access to them but I might get another chance in the future. Here is the output from your program for a v100 16Gb SXM2 interface:

Run Time  CPU Time  Credit
(sec)     (sec)
32.6      28.1      231.2
33.6      29.5      227.3
36.6      32.0      229.3
24.5      21.1      228.5
35.6      30.8      230.8
29.6      26.3      227.6
29.6      24.4      227.6
32.7      28.5      227.7
30.5      26.5      227.6
30.5      26.6      227.7
33.7      29.3      227.7
30.6      27.3      231.2
31.6      28.4      227.6
40.6      35.8      229.3
34.6      30.4      229.7
32.5      26.3      227.6
25.5      22.1      227.6
33.7      30.2      227.7
34.5      31.2      229.4
33.7      28.9      229.4
----------------------------------
AVG: 32.3 28.2      228.6
STD: 3.5  3.3       1.3

Keep in mind I'm running 6 WUs at a time in the above. Here is the output from a p100 16Gb PCIe interface:

Run Time  CPU Time  Credit
(sec)     (sec)
53.4      51.4      227.3
61.5      59.4      229.1
53.4      51.5      227.3
59.5      57.6      227.3
48.4      46.9      227.2
54.4      52.7      227.3
65.5      63.8      229.1
57.4      55.8      227.3
49.3      46.3      227.2
57.4      54.6      227.3
60.4      58.3      227.3
50.4      48.8      227.3
54.4      52.4      228.1
55.4      53.6      227.3
53.4      51.3      227.2
53.4      51.9      228.1
56.4      54.7      229.3
57.4      55.3      227.3
57.6      49.3      227.3
55.6      48.1      227.2
----------------------------------
AVG: 55.7 53.2      227.6
STD: 4.0  4.3       0.7

Also running 6 WUs at a time. I do not have 20 consecutive valid results running 1 WU at a time, but I might be able to borrow a v100 for 20 minutes to get them if you think it would make a big difference. I know that overall 1 WU at a time is slower since the card is barely loaded with 1. |
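The AVG and STD rows in the v100 output can be checked against the 20 run times listed in the post. This is a sketch using Python's standard `statistics` module, not the HostProjectStats program itself. Using the population standard deviation (`pstdev`) is an assumption about how that program computes STD, chosen because it matches the posted value to one decimal place.

```python
import statistics

# v100 run times in seconds, copied from the post (6 WUs running at once).
v100_run_times = [
    32.6, 33.6, 36.6, 24.5, 35.6, 29.6, 29.6, 32.7, 30.5, 30.5,
    33.7, 30.6, 31.6, 40.6, 34.6, 32.5, 25.5, 33.7, 34.5, 33.7,
]
concurrent_wus = 6

mean_run = statistics.mean(v100_run_times)

# ASSUMPTION: population standard deviation (divide by n, not n - 1).
std_run = statistics.pstdev(v100_run_times)

# Effective seconds per completed WU on one card at this concurrency.
effective_per_wu = mean_run / concurrent_wus

print(f"AVG run time: {mean_run:.1f} s")
print(f"STD run time: {std_run:.1f} s")
print(f"Effective time per WU: {effective_per_wu:.1f} s")
```

The effective per-WU time lines up with the earlier "a little under 6 seconds per WU" estimate for a single card.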
©2024 Astroinformatics Group