Welcome to MilkyWay@home

Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)


Kiska
Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 64,881
100 million credit badge · 10 year member badge
Message 72460 - Posted: 2 Apr 2022, 20:23:52 UTC - in response to Message 72458.  
Last modified: 2 Apr 2022, 20:35:51 UTC

I get the feeling that the server thinks there are 17M jobs ready to send out, so it doesn't make more jobs. However, I cancelled all of those jobs in order to try to clear the validation backlog. I'm not sure where the jobs are stuck, but I will turn things off and reset their transition times, and see if that clears them.

The jobs should get removed from the DB once they are cancelled, but they just might not have been transitioned yet because they haven't gone out to volunteers (because they were cancelled)


I would say don't touch the database at all; I am going to spin up a few machines to clear this issue.

EDIT: There really are 17M tasks ready to send. There is no validation backlog.

EDIT2: If there weren't 17M tasks, the following graph wouldn't be decreasing or showing any trend:

Time is in UTC

EDIT3: Here is the last 7 days of the same graph; the graph above shows the last 3 days.
ID: 72460 · Rating: 0
Bill F
Joined: 4 Jul 09
Posts: 53
Credit: 13,459,758
RAC: 12,870
10 million credit badge · 13 year member badge
Message 72461 - Posted: 3 Apr 2022, 0:38:21 UTC

All of the credits stagger out of the system when they are ready... I have been credited with over 140K in the last 3 days.

Bill F
ID: 72461 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 64,881
100 million credit badge · 10 year member badge
Message 72462 - Posted: 3 Apr 2022, 3:25:13 UTC
Last modified: 3 Apr 2022, 4:20:17 UTC

Congratulations on having the transitioner remake at least 120K tasks:
ID: 72462 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,467,381
RAC: 102,007
300 million credit badge · 11 year member badge
Message 72463 - Posted: 3 Apr 2022, 3:30:38 UTC - in response to Message 72462.  

Congratulations on having the transitioner remake at least 120K tasks:
I get the feeling you understand BOINC servers. Perhaps you could remotely control the MW server?
ID: 72463 · Rating: 0
BillK
Joined: 14 Mar 21
Posts: 3
Credit: 683,951
RAC: 5,572
500 thousand credit badge · 1 year member badge
Message 72464 - Posted: 3 Apr 2022, 5:35:49 UTC - in response to Message 72461.  

I have 450 "validation inconclusive" tasks on 4/3. No valid, no invalid. Is there any hope?

Bill K
ID: 72464 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 214
Credit: 131,234,744
RAC: 17,912
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72465 - Posted: 3 Apr 2022, 6:51:43 UTC - in response to Message 72464.  

I have 450 "validation inconclusive" tasks on 4/3. No valid, no invalid. Is there any hope?

Bill K

What do you mean by 4/3?

Check out the workunit; there you can see your "wingmen" and deduce the reason for the status and what is still needed to fulfil the quorum.
You can also see whether another copy is on its way or still to be scheduled.
ID: 72465 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 214
Credit: 131,234,744
RAC: 17,912
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72467 - Posted: 3 Apr 2022, 14:12:28 UTC - in response to Message 72300.  

It says 600 in progress:
State: All (82650) · In progress (600) · Validation pending (28154) · Validation inconclusive (8569) · Valid (45223) · Invalid (0) · Error (104)
Application: All (82650) · Milkyway@home N-Body Simulation (0) · Milkyway@home Separation (82650)
I wish I could find out where they are.
None of my rigs show any tasks.

Stopped BOINC on all rigs (aka PC).
Restarted my rigs.
Started BOINC on all rigs.
NOTHING!!

After requesting work, the Event Log says either that there is no work available or just that no work was received.

Hope Tom will get to his office early.

Something is wrong.

OK, now these 600 "lost" tasks, which I couldn't find anywhere, are erroring out with "Timed out - no response".
They never started.

They are all from 22 March 2022.
Nothing lost for me, except my error rate is going up.
Just glad the "problem" is solved in a "harmless" manner.
ID: 72467 · Rating: 0
Septimus
Joined: 8 Nov 11
Posts: 186
Credit: 2,375,109
RAC: 5,136
2 million credit badge · 11 year member badge
Message 72468 - Posted: 3 Apr 2022, 15:26:08 UTC

I had 2 WUs from 8th March validated.
ID: 72468 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,467,381
RAC: 102,007
300 million credit badge · 11 year member badge
Message 72469 - Posted: 3 Apr 2022, 19:21:50 UTC - in response to Message 72467.  

OK, now these 600 "lost" tasks, which I couldn't find anywhere, are erroring out with "Timed out - no response".
They never started.

They are all from 22 March 2022.
Nothing lost for me, except my error rate is going up.
Just glad the "problem" is solved in a "harmless" manner.
Hopefully Tom can scrounge as much data as he can from things that went upside down so our processing was meaningful. If not, we'll just have to do it again. Shit happens.
ID: 72469 · Rating: 0
poppinfresh99
Joined: 28 Feb 22
Posts: 14
Credit: 1,577,132
RAC: 10,087
1 million credit badge
Message 72470 - Posted: 3 Apr 2022, 19:34:21 UTC - in response to Message 72465.  
Last modified: 3 Apr 2022, 19:35:04 UTC

I have 450 "validation inconclusive" tasks on 4/3. No valid, no invalid. Is there any hope?

Bill K

What do you mean by 4/3?

Check out the workunit; there you can see your "wingmen" and deduce the reason for the status and what is still needed to fulfil the quorum.
You can also see whether another copy is on its way or still to be scheduled.


I assume 4/3 is April 3...
https://en.as.com/en/2022/01/01/latest_news/1641063320_406325.html

I also am not getting valid tasks (I only run N-Body Simulation). Here are my tasks...
State: All (2591) · In progress (72) · Validation pending (0) · Validation inconclusive (2518) · Valid (1) · Invalid (0) · Error (0)
Application: All (2591) · Milkyway@home N-Body Simulation (2591) · Milkyway@home Separation (0)
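
For anyone who wants to keep an eye on these numbers over time, here is a minimal Python sketch that parses a state summary line like the one above and works out the share of inconclusive tasks. The function name `parse_counts` is my own invention, and the line format is simply assumed from the text pasted in this thread, not taken from any documented BOINC interface.

```python
import re

def parse_counts(line):
    """Parse 'Label: A (1) · B (2)' into {'A': 1, 'B': 2}.

    The format is assumed from the summary lines pasted in this thread.
    """
    _, _, body = line.partition(":")          # drop the leading 'State' label
    counts = {}
    for part in body.split("·"):              # entries are separated by a middle dot
        m = re.fullmatch(r"\s*(.+?)\s*\((\d+)\)\s*", part)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

state_line = (
    "State: All (2591) · In progress (72) · Validation pending (0) · "
    "Validation inconclusive (2518) · Valid (1) · Invalid (0) · Error (0)"
)
counts = parse_counts(state_line)

# Sanity check: the individual states should add up to the 'All' figure.
total_of_states = sum(v for k, v in counts.items() if k != "All")

inconclusive_share = counts["Validation inconclusive"] / counts["All"]
print(f"{inconclusive_share:.1%} of tasks are inconclusive")
```

For the line above, the individual states add up to the "All" total, and roughly 97% of the tasks are sitting in "Validation inconclusive".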


The workunits all look like the following one (Workunit 403393471)...
minimum quorum 	1
initial replication 	2

175677525 	921221 	3 Apr 2022, 4:06:43 UTC 	3 Apr 2022, 13:44:29 UTC 	Completed, validation inconclusive 	491.58 	1,527.09 	pending 	Milkyway@home N-Body Simulation v1.82 (mt) windows_x86_64

200923667 	--- 	--- 	--- 	Unsent 	--- 	--- 	--- 	---


The single valid task I have is when *I* was the wingman (on April 1).
ID: 72470 · Rating: 0
Max_Pirx
Joined: 13 Dec 17
Posts: 46
Credit: 1,958,205,364
RAC: 2,608,306
1 billion credit badge · 4 year member badge
Message 72471 - Posted: 3 Apr 2022, 19:34:48 UTC

Most of my current and past work went to the 'validation inconclusive' pile. Quite disappointing. The WUs are duplicated, but for some reason both results are unsatisfactory. The third copies of the WUs are just 'unsent'. Such a waste of time and resources.
ID: 72471 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,467,381
RAC: 102,007
300 million credit badge · 11 year member badge
Message 72472 - Posted: 3 Apr 2022, 20:16:44 UTC - in response to Message 72470.  

What do you mean by 4/3?
I assume 4/3 is April 3...
Or more correctly March 4th. Day, month, year: increasing order. Month, day, year: pure ludicrousness.
ID: 72472 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,467,381
RAC: 102,007
300 million credit badge · 11 year member badge
Message 72473 - Posted: 3 Apr 2022, 20:17:55 UTC - in response to Message 72471.  

Most of my current and past work went to the 'validation inconclusive' pile. Quite disappointing. The WUs are duplicated, but for some reason both results are unsatisfactory. The third copies of the WUs are just 'unsent'. Such a waste of time and resources.
I have suggested we all chip in and buy some up to date hardware. Quite how Tom came up with 10 grand just for SSDs I don't know.
ID: 72473 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 88
Credit: 149,784,912
RAC: 64,881
100 million credit badge · 10 year member badge
Message 72475 - Posted: 3 Apr 2022, 21:23:38 UTC - in response to Message 72473.  

Most of my current and past work went to the 'validation inconclusive' pile. Quite disappointing. The WUs are duplicated, but for some reason both results are unsatisfactory. The third copies of the WUs are just 'unsent'. Such a waste of time and resources.
I have suggested we all chip in and buy some up to date hardware. Quite how Tom came up with 10 grand just for SSDs I don't know.


Or you could rent some cloud resources and attach them to the project to help speed up clearing the ready to send queue?

I am renting some Azure and AWS offerings to help speed this up a bit.
ID: 72475 · Rating: 0
unixchick
Joined: 21 Feb 22
Posts: 66
Credit: 641,540
RAC: 6,541
500 thousand credit badge
Message 72477 - Posted: 3 Apr 2022, 21:54:45 UTC

Having an inconclusive WU isn't a waste. It just means that the result is hard to confirm and another copy of the WU will be sent out; usually, with the 3rd result, it becomes valid (at least that is the pattern for me). The valid result will be found, and credit will be issued.

The separation queue of WUs being sent is just under 3 million. I'm currently getting resends from March 30, so hopefully the resends generated from your WUs yesterday will be sent out in a day or two.
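
The pattern described above (inconclusive until another copy comes back, then valid) can be pictured with a toy model. This is only a sketch under assumptions: it treats each result as a single number, requires two exactly matching results, and uses made-up names (`check_quorum`, `agree_needed`). The real BOINC validator is project-specific and is not reproduced here.

```python
from collections import Counter

def check_quorum(results, agree_needed=2):
    """Toy model of validation: a workunit is 'valid' once enough results agree.

    `results` is a list of returned values; `agree_needed` is a hypothetical
    threshold. A real validator would compare within a tolerance, not exactly.
    """
    if not results:
        return "pending", None
    value, count = Counter(results).most_common(1)[0]
    if count >= agree_needed:
        return "valid", value
    # Not enough agreement yet: another copy would have to be sent out.
    return "inconclusive", None

# Two results that disagree: still inconclusive.
after_two = check_quorum([1.0, 1.1])
# A third result matching the first one settles it.
after_three = check_quorum([1.0, 1.1, 1.0])
print(after_two, after_three)
```

In this model nothing is wasted by an inconclusive result: it stays in the list and counts towards the quorum once a matching result arrives.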
ID: 72477 · Rating: 0
alanb1951
Joined: 16 Mar 10
Posts: 151
Credit: 85,323,318
RAC: 55,174
50 million credit badge · 12 year member badge · extraordinary contributions badge
Message 72478 - Posted: 3 Apr 2022, 21:56:56 UTC - in response to Message 72473.  

I have suggested we all chip in and buy some up to date hardware. Quite how Tom came up with 10 grand just for SSDs I don't know.

Peter,

Enterprise SSDs are designed to meet much more stressful usage situations than the sort of SSD that we might have in a PC or laptop... There are typically fewer bits stored per cell, a far higher level of under-provisioning to allow for the eventual failure of memory cells, and lots more error-detection and correction logic; also there needs to be some sort of mechanism for protection against unexpected power loss. All of those push up the price!

Assuming one doesn't just buy the cheapest items labelled "Enterprise SSD", typical UK prices seem to be about £250 per Terabyte; if those prices are truly representative of what might be available and usable by the MW server's RAID system (without needing to replace that as well) £10,000 would get about 40 Terabytes -- how much user storage that would provide would depend on the RAID version, number of redundant drives, and so on...

Obviously, I can't know what prices might be available in the USA, or what the Computing technical people at RPI will be willing to acquire, so the above sizing is merely indicative... :-)

Cheers - Al.

P.S. If I/O bandwidth doesn't slow things down, there's a fair chance that in a single server BOINC environment memory bandwidth will become a problem instead; dividing work between multiple servers can help with that. More expense :-)
ID: 72478 · Rating: 0
Peter Hucker
Joined: 5 Jul 11
Posts: 741
Credit: 334,467,381
RAC: 102,007
300 million credit badge · 11 year member badge
Message 72479 - Posted: 3 Apr 2022, 22:46:10 UTC - in response to Message 72478.  

Enterprise SSDs are designed to meet much more stressful usage situations than the sort of SSD that we might have in a PC or laptop... There are typically fewer bits stored per cell, a far higher level of under-provisioning to allow for the eventual failure of memory cells, and lots more error-detection and correction logic; also there needs to be some sort of mechanism for protection against unexpected power loss. All of those push up the price!

Assuming one doesn't just buy the cheapest items labelled "Enterprise SSD", typical UK prices seem to be about £250 per Terabyte; if those prices are truly representative of what might be available and usable by the MW server's RAID system (without needing to replace that as well) £10,000 would get about 40 Terabytes -- how much user storage that would provide would depend on the RAID version, number of redundant drives, and so on...

Obviously, I can't know what prices might be available in the USA, or what the Computing technical people at RPI will be willing to acquire, so the above sizing is merely indicative... :-)

Cheers - Al.

P.S. If I/O bandwidth doesn't slow things down, there's a fair chance that in a single server BOINC environment memory bandwidth will become a problem instead; dividing work between multiple servers can help with that. More expense :-)
£250 a TB seems about right to me: £100 a TB for desktop drives, Enterprise starting at £150, so £250 for a decent one sounds OK. The missing variable here is how much storage they need; I don't know what that is. It's $10,000 Tom quoted, which is £7,600, which would be about 30TB. At the moment I think they use 3 disks, 1 redundant, so 30TB of SSD would provide 20TB of storage. The work units they send out are pretty small, but there are millions of them, and we don't know how big the source data is or how much needs to be stored afterwards. But perhaps only the user-facing bit of storage needs to be SSD? Long-term storage of a few months of collected data can go on slow disks. At any rate, many of us chipping in can raise a lot of money. He did say he was going to put the donations page on the homepage; I didn't even know they took donations.
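
The sizing above can be checked with a few lines of Python. This is a rough sketch, not a statement about the actual MW server: the function name `raid_capacity_tb` is made up, the exchange rate is just the one implied by "$10,000 ... is £7,600", and the array is assumed to use equally sized disks with whole disks' worth of redundancy.

```python
def raid_capacity_tb(budget_gbp, price_per_tb_gbp, disks, redundant):
    """Return (raw_tb, usable_tb) for a simple array of equally sized disks.

    Assumes `redundant` whole disks' worth of capacity is lost to redundancy,
    which is a simplification of real RAID layouts.
    """
    if disks <= redundant:
        raise ValueError("need more disks than redundant disks")
    raw_tb = budget_gbp / price_per_tb_gbp
    usable_tb = raw_tb * (disks - redundant) / disks
    return raw_tb, usable_tb

USD_TO_GBP = 0.76  # implied by the figures in the post; an assumption, not a live rate
budget_gbp = 10_000 * USD_TO_GBP

# 3 disks with 1 redundant, at £250 per TB, as guessed in the post.
raw, usable = raid_capacity_tb(budget_gbp, 250, disks=3, redundant=1)
print(f"raw {raw:.1f} TB, usable {usable:.1f} TB")
```

With these figures the result is roughly 30TB raw and 20TB usable, which matches the estimate in the post. Changing `disks` and `redundant` shows how much the usable figure depends on the RAID layout.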
ID: 72479 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 214
Credit: 131,234,744
RAC: 17,912
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72484 - Posted: 4 Apr 2022, 8:34:28 UTC

Donations can be happily made here:

https://securelb.imodules.com/s/1225/giving/index.aspx?sid=1225&gid=1&pgid=3676

Don't forget to put a checkmark in the box near the bottom, so that the contributions only go directly to Milkyway!
ID: 72484 · Rating: 0
BillK
Joined: 14 Mar 21
Posts: 3
Credit: 683,951
RAC: 5,572
500 thousand credit badge · 1 year member badge
Message 72498 - Posted: 5 Apr 2022, 0:27:19 UTC - in response to Message 72472.  

3 of my 505 Inconclusive went to Valid. It's getting caught up!

Bill K
ID: 72498 · Rating: 0
Septimus
Joined: 8 Nov 11
Posts: 186
Credit: 2,375,109
RAC: 5,136
2 million credit badge · 11 year member badge
Message 72512 - Posted: 5 Apr 2022, 13:49:49 UTC - in response to Message 72498.  

Some of mine too. I have a lot today that got validated straight away as well. I'm slightly worried that the waiting-for-validation number on the server has shot up to over 55,000; maybe it's just timing.
ID: 72512 · Rating: 0