Welcome to MilkyWay@home

Huge number of 'Validation inconclusive' WUs

Message boards : Number crunching : Huge number of 'Validation inconclusive' WUs

Max_Pirx
Joined: 13 Dec 17
Posts: 46
Credit: 2,421,362,376
RAC: 0
Message 67109 - Posted: 19 Feb 2018, 21:42:10 UTC
Last modified: 19 Feb 2018, 21:44:17 UTC

Hello,
At present I have more than 2000 'validation inconclusive' WUs (MilkyWay@Home v1.46 (opencl_ati_101)), all of the 'Unsent' variety, across my three machines:

https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=764666&offset=0&show_names=0&state=3&appid=
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=765378&offset=0&show_names=0&state=3&appid=
https://milkyway.cs.rpi.edu/milkyway//results.php?hostid=763440&offset=0&show_names=0&state=3&appid=

Any idea what's going on?

Many thanks,
max
ID: 67109 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,492
RAC: 22,331
Message 67110 - Posted: 19 Feb 2018, 21:59:24 UTC - in response to Message 67109.  

I'm over 600 myself, going to put a stop to my crunching here until it comes back down again!!
ID: 67110 · Rating: 0
iwajabitw
Joined: 16 Nov 14
Posts: 16
Credit: 335,683,507
RAC: 0
Message 67111 - Posted: 20 Feb 2018, 0:59:39 UTC

6300+ here.
ID: 67111 · Rating: 0
pututu
Joined: 24 Aug 17
Posts: 8
Credit: 223,957,930
RAC: 0
Message 67112 - Posted: 20 Feb 2018, 1:11:34 UTC - in response to Message 67111.  

I've got almost 2000 here. There are 1.4M unsent tasks according to the server status here: https://milkyway.cs.rpi.edu/milkyway/server_status.php

Maybe wait a few days to see what happens next?
ID: 67112 · Rating: 0
mmonnin
Joined: 2 Oct 16
Posts: 162
Credit: 1,004,380,693
RAC: 17,392
Message 67113 - Posted: 20 Feb 2018, 10:40:53 UTC

My count has gone down since last night. From 2.2k to 1.7k.
ID: 67113 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,492
RAC: 22,331
Message 67114 - Posted: 20 Feb 2018, 11:22:55 UTC - in response to Message 67113.  

Mine came down almost 40%, but on my list the first dozen are still "unsent" and I stopped crunching yesterday after I posted. Either a ton of us stopped crunching or there are some significant problems with the way things work. I'm not creating new "inconclusives", but one would think the existing workunits should have been sent out long before now.
ID: 67114 · Rating: 0
Max_Pirx
Joined: 13 Dec 17
Posts: 46
Credit: 2,421,362,376
RAC: 0
Message 67115 - Posted: 20 Feb 2018, 12:16:40 UTC

Slight decrease in my inconclusive WUs, but I still have around 2000 of them.
I just started more detailed monitoring: I picked a few of those and will watch them to see when (if ever) they are sent out again.
ID: 67115 · Rating: 0
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67116 - Posted: 20 Feb 2018, 15:16:10 UTC

Hey Everyone,

Looks like the workunit generator went a little overboard over the weekend and clogged things up. The validation inconclusives should clear out of the queue over the next two or three days as the server works through the backlog.

On another note, the new RAM for the server should be here tomorrow and that should help prevent this from happening again.

Jake
ID: 67116 · Rating: 0
pututu
Joined: 24 Aug 17
Posts: 8
Credit: 223,957,930
RAC: 0
Message 67117 - Posted: 20 Feb 2018, 15:19:18 UTC

I'm speculating that it may take some time because the task number for the second member of the quorum is not sequential with, or even close to, the first task number; the two are spread very far apart. See one example here for one of my WUs:
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1579238967

For this WU, my task number is 2,264,897,549 and my soon-to-be wingman's task number for the same WU is 2,266,161,449. That's a difference of 1,263,900, which is close to the current number of unsent tasks reported by the server status. I also suspect that no one has started crunching task numbers 2,266,xxx,xxx yet?
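As a rough check of the numbers in this post, the gap between the two task IDs can be compared with the unsent backlog. This is only a sketch: the two IDs come from the post, the 1.4M backlog is the approximate figure quoted earlier in the thread from the server status page, and the idea that task IDs are handed out in sending order is the poster's speculation, not confirmed project behaviour. The helper name `id_gap` is made up for illustration.

```python
# Task IDs quoted in the post for workunit 1579238967.
my_task_id = 2_264_897_549
wingman_task_id = 2_266_161_449

# Approximate unsent backlog quoted earlier in the thread (about 1.4M tasks).
unsent_backlog = 1_400_000


def id_gap(first_id: int, second_id: int) -> int:
    """Number of task IDs separating two results of the same workunit."""
    return abs(second_id - first_id)


gap = id_gap(my_task_id, wingman_task_id)
print(f"ID gap between the two tasks: {gap:,}")
print(f"Gap as a share of the unsent backlog: {gap / unsent_backlog:.0%}")
```

If the speculation holds, a gap of roughly 1.26 million IDs against a backlog of about 1.4 million would mean the second copy sits near the back of the queue, which fits the multi-day wait described above.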
ID: 67117 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,492
RAC: 22,331
Message 67118 - Posted: 20 Feb 2018, 16:01:54 UTC - in response to Message 67116.  

I sure hope so!! Thanks for working on this AND letting us know what's going on!!!
ID: 67118 · Rating: 0
mmonnin
Joined: 2 Oct 16
Posts: 162
Credit: 1,004,380,693
RAC: 17,392
Message 67119 - Posted: 20 Feb 2018, 22:09:50 UTC

I saw the site was down earlier today, a couple of hours ago. Count is now 2.5k so even higher than last night. Just a single 280x so pretty much most of them.
ID: 67119 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,492
RAC: 22,331
Message 67120 - Posted: 21 Feb 2018, 9:44:33 UTC - in response to Message 67119.  
Last modified: 21 Feb 2018, 10:13:47 UTC

I now have 211 inconclusive, but 69 of those haven't even been sent to another computer yet!! Until that catches up my PCs will stay elsewhere and the units I have in my cache won't be crunched! Sorry guys, but I don't like the idea of MW making NEW workunits and sending them out BEFORE the existing workunits are sent out!! I have 211 workunits that I have already completed and that are waiting for validation, and 69 of those haven't even been SENT to another computer yet!!! I haven't even gotten a new workunit in almost 2 days now!!
ID: 67120 · Rating: 0
Max_Pirx
Joined: 13 Dec 17
Posts: 46
Credit: 2,421,362,376
RAC: 0
Message 67121 - Posted: 21 Feb 2018, 10:11:30 UTC - in response to Message 67119.  

The same with me.... mine are steadily going up, 3500+ at the moment. Pretty much all of my completed WUs end up as inconclusive (Unsent). So, I am moving my machines away from MW, don't see any point in keeping them crunching here.
ID: 67121 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,492
RAC: 22,331
Message 67122 - Posted: 21 Feb 2018, 10:15:10 UTC - in response to Message 67121.  

I will bring mine back, I always seem to anyway, but they need to fix the problems so I'm not just spinning my wheels!!
ID: 67122 · Rating: 0
vseven
Joined: 26 Mar 18
Posts: 24
Credit: 102,912,937
RAC: 0
Message 67431 - Posted: 8 May 2018, 12:34:51 UTC

I'm confused. I mean they will eventually be validated so what does it matter, correct?
ID: 67431 · Rating: 0
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 67434 - Posted: 8 May 2018, 15:09:09 UTC - in response to Message 67109.  

The problem is a bug in the program. Looking at the set of errors, the first user I looked at had 3 Titans but almost 20,000 errors. If all tasks error out, the number of "pending" will rise to the total number of work units.

I thought only 80 were allowed per day. Even with 3 Titans my math suggests it should have taken a week at 80 per 24 hours. All 19,736 were from May 7, 5am to May 8, 1300.
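A quick sketch of the quota arithmetic, under the assumption that the 80-per-day figure is a fixed cap (the thread does not say how the server actually applies it, and the function name `days_to_return` is invented for illustration). It works through two readings of the limit: 80 per host and 80 per GPU.

```python
def days_to_return(tasks: int, quota_per_gpu_per_day: int, gpus: int = 1) -> float:
    """Days needed to return `tasks` results under a fixed daily cap."""
    return tasks / (quota_per_gpu_per_day * gpus)


errored = 19_736  # error count quoted in the post

# Reading 1: 80 tasks per day for the whole host.
per_host = days_to_return(errored, 80)
# Reading 2: 80 tasks per day per GPU, on a host with 3 GPUs.
per_gpu = days_to_return(errored, 80, gpus=3)

print(f"80/day per host:         {per_host:.1f} days")
print(f"80/day per GPU (3 GPUs): {per_gpu:.1f} days")
```

Under either reading the figure comes out in months rather than days, while the errors were reported within roughly 32 hours, so a fixed cap of this size does not seem to have been limiting that host.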
ID: 67434 · Rating: 0
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,946,492
RAC: 22,331
Message 67436 - Posted: 9 May 2018, 12:21:11 UTC - in response to Message 67434.  

No, it's 80 per GPU, but if you buzz right through them at 2 seconds each you can zip through thousands of them per day, all errors of course!!! Jake wants us to send him links to computers like the one you found so he can put them on the 'suspicious' list. Unfortunately, when he did it automatically a while back, LOTS of people couldn't bring new GPUs on here to crunch, so it's now a manual process.
ID: 67436 · Rating: 0
vseven
Joined: 26 Mar 18
Posts: 24
Credit: 102,912,937
RAC: 0
Message 67440 - Posted: 9 May 2018, 13:03:34 UTC - in response to Message 67436.  

I was crunching on 3 Tesla V100s at the same time. A WU took around 35 seconds while running 6 at a time. That averages a little under 6 seconds per WU per card, and with 3 cards it comes out to under 2 seconds per WU. That's over 43,000 a day. Now, I only ran like this for a couple of hours (testing), but my inconclusive total was over 2,000 in those couple of hours. Out of those I threw 6 errors and 2 invalids, I have 134 still at inconclusive, and all the rest validated.


I still don't know what the issue is that you guys are posting about.....
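The throughput estimate above can be reproduced with a few lines. This is just a sketch of the arithmetic using the figures from the post (35 seconds per batch, 6 concurrent WUs, 3 cards); the function names are made up for illustration, and it assumes all cards run continuously at the same rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def effective_seconds_per_wu(batch_seconds: float, concurrent: int, cards: int) -> float:
    """Average wall-clock seconds per finished WU across all cards."""
    return batch_seconds / concurrent / cards


def wus_per_day(batch_seconds: float, concurrent: int, cards: int) -> float:
    """Finished WUs per day if the host ran non-stop at this rate."""
    return SECONDS_PER_DAY / effective_seconds_per_wu(batch_seconds, concurrent, cards)


per_wu = effective_seconds_per_wu(35, concurrent=6, cards=3)
print(f"Effective time per WU: {per_wu:.2f} s")
print(f"Projected WUs per day: {wus_per_day(35, 6, 3):,.0f}")
```

At exactly 2 seconds per WU the daily total would be 43,200, so the "over 43,000" figure is consistent with an effective time slightly under 2 seconds.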
ID: 67440 · Rating: 0
Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 67447 - Posted: 9 May 2018, 22:35:30 UTC - in response to Message 67440.  
Last modified: 9 May 2018, 22:48:59 UTC

The original post was about inconclusive validations and you are correct in that if you wait them out all will eventually be validated.

OTOH, the post I made was to point out that the first system listed had 3 Titans and 19,000 of its work units errored, which is not the same as inconclusive validations.

Looking at task details one observes that OpenCL was unable to find any nVidia devices although the system had 3 titans.

My S9100 is nowhere near as fast as your Tesla. It did cost me under $300 and uses only a single 8-pin power connector, which is a plus.

Your computers are hidden, but you might want to run my program at
http://new.stateson.net/HostProjectStats to get an accurate measurement of completion time.
ID: 67447 · Rating: 0
vseven
Joined: 26 Mar 18
Posts: 24
Credit: 102,912,937
RAC: 0
Message 67449 - Posted: 10 May 2018, 20:31:55 UTC - in response to Message 67447.  
Last modified: 10 May 2018, 20:32:45 UTC

Oh...they are not "my" Teslas. I wish. I got permission to play with the machines they were in. That's also why they are hidden...didn't want the host names to get out. And I no longer have access to them, but I might get another chance in the future.

Here is the output from your program for a v100 16Gb SXM2 interface:
        Run Time     CPU Time     Credit
         (sec)         (sec)
            32.6          28.1        231.2
            33.6          29.5        227.3
            36.6          32.0        229.3
            24.5          21.1        228.5
            35.6          30.8        230.8
            29.6          26.3        227.6
            29.6          24.4        227.6
            32.7          28.5        227.7
            30.5          26.5        227.6
            30.5          26.6        227.7
            33.7          29.3        227.7
            30.6          27.3        231.2
            31.6          28.4        227.6
            40.6          35.8        229.3
            34.6          30.4        229.7
            32.5          26.3        227.6
            25.5          22.1        227.6
            33.7          30.2        227.7
            34.5          31.2        229.4
            33.7          28.9        229.4
         ----------------------------------
AVG:        32.3          28.2        228.6
STD:         3.5           3.3          1.3

Keep in mind I'm running 6 WU at a time in the above.
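For anyone wanting to check the AVG and STD rows by hand, here is a minimal sketch using the run-time column of the V100 table above. How the HostProjectStats program computes its statistics is not stated in the thread; using the population standard deviation is an assumption, chosen because it rounds to the 3.5 shown, whereas the sample form would round to 3.6 for these values.

```python
import statistics

# Run times (seconds) copied from the V100 table above, 6 WUs at a time.
run_times = [
    32.6, 33.6, 36.6, 24.5, 35.6, 29.6, 29.6, 32.7, 30.5, 30.5,
    33.7, 30.6, 31.6, 40.6, 34.6, 32.5, 25.5, 33.7, 34.5, 33.7,
]

avg = statistics.mean(run_times)
# Population standard deviation (divides by n); an assumption about the tool.
std = statistics.pstdev(run_times)

print(f"AVG: {avg:.1f} s")
print(f"STD: {std:.1f} s")

# With 6 WUs running concurrently, the per-WU throughput time on one card
# is roughly the average run time divided by 6.
print(f"Approx. seconds per WU per card: {avg / 6:.1f}")
```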

Here is from a p100 16Gb PCIe interface:

        Run Time     CPU Time     Credit
         (sec)         (sec)
            53.4          51.4        227.3
            61.5          59.4        229.1
            53.4          51.5        227.3
            59.5          57.6        227.3
            48.4          46.9        227.2
            54.4          52.7        227.3
            65.5          63.8        229.1
            57.4          55.8        227.3
            49.3          46.3        227.2
            57.4          54.6        227.3
            60.4          58.3        227.3
            50.4          48.8        227.3
            54.4          52.4        228.1
            55.4          53.6        227.3
            53.4          51.3        227.2
            53.4          51.9        228.1
            56.4          54.7        229.3
            57.4          55.3        227.3
            57.6          49.3        227.3
            55.6          48.1        227.2
         ----------------------------------
AVG:        55.7          53.2        227.6
STD:         4.0           4.3          0.7


Also running 6 WU at a time.

I do not have 20 consecutive valid results from running 1 WU at a time, but I might be able to borrow a V100 for 20 minutes to get them if you think it would make a big difference. I know that overall 1 WU at a time is slower, since the card is barely loaded with 1.
ID: 67449 · Rating: 0

©2024 Astroinformatics Group