Welcome to MilkyWay@home

Server Downtime 3/21 1PM EST


Max_Pirx

Joined: 13 Dec 17
Posts: 46
Credit: 1,627,717,962
RAC: 1,996,518
1 billion credit badge · 4 year member badge
Message 72174 - Posted: 23 Mar 2022, 11:06:58 UTC - in response to Message 72172.  

Maybe your completed WUs are still in the validation queue


maybe, or maybe not... one can only guess
ID: 72174 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 157
Credit: 128,083,833
RAC: 23,230
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72175 - Posted: 23 Mar 2022, 11:50:07 UTC

... I just heard a rumor (over the grapevine), that certain tasks are being deliberately deleted (of course before they are validated) ...
Which is good, because that way the queue can be cleared of strange tasks from unhappy users ...

So I am very worried, that soon (in a couple of weeks) all of mine will be gone ...
Oh, dear ...

To the rest of the crunchers: Have a nice day!
ID: 72175 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,646,354
RAC: 217,984
200 million credit badge · 11 year member badge
Message 72176 - Posted: 23 Mar 2022, 11:54:49 UTC - in response to Message 72175.  

... I just heard a rumor (over the grapevine), that certain tasks are being deliberately deleted (of course before they are validated) ...
Which is good, because that way the queue can be cleared of strange tasks from unhappy users ...

So I am very worried, that soon (in a couple of weeks) all of mine will be gone ...
Oh, dear ...

To the rest of the crunchers: Have a nice day!
Either your joke was really funny or I've had too much homebrew.
ID: 72176 · Rating: 0
mikey
Joined: 8 May 09
Posts: 2870
Credit: 465,668,815
RAC: 68,892
300 million credit badge · 13 year member badge · extraordinary contributions badge
Message 72177 - Posted: 23 Mar 2022, 12:00:41 UTC - in response to Message 72175.  

... I just heard a rumor (over the grapevine), that certain tasks are being deliberately deleted (of course before they are validated) ...
Which is good, because that way the queue can be cleared of strange tasks from unhappy users ...

So I am very worried, that soon (in a couple of weeks) all of mine will be gone ...
Oh, dear ...

To the rest of the crunchers: Have a nice day!


I can't imagine the blowback MW, and BOINC as well, would get if MW started deleting tasks just because someone bitched and moaned in a forum!! That alone could set BOINC back 10 years with the upper-level crunchers who run 100+ CPU cores and tip-of-the-spear GPUs, let alone here at MW. IOW, I believe this rumor is just BS and you can safely ignore it!!
ID: 72177 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 157
Credit: 128,083,833
RAC: 23,230
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72178 - Posted: 23 Mar 2022, 12:01:17 UTC - in response to Message 72176.  

Either your joke was really funny or I've had too much homebrew.

... pass some of that homebrew over ...
ID: 72178 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 157
Credit: 128,083,833
RAC: 23,230
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72180 - Posted: 23 Mar 2022, 12:05:25 UTC - in response to Message 72177.  

Mikey and Peter:

... that was a joke ...

Enjoy life !! Please ...
ID: 72180 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,646,354
RAC: 217,984
200 million credit badge · 11 year member badge
Message 72181 - Posted: 23 Mar 2022, 12:08:16 UTC - in response to Message 72178.  
Last modified: 23 Mar 2022, 12:12:07 UTC

Either your joke was really funny or I've had too much homebrew.

... pass some of that homebrew over ...
Are you quite quite sure? It's 23% alcohol. Not sure how legal it is. Not sure what the shipping cost would be (assuming you aren't in the UK)?

EDIT: You're in Germany. It costs £6 per litre to send to you. Assuming your alcohol in the shops is of a similar price to the UK, you can get the equivalent for £6.50 in the shops. No, we don't use your Euro rubbish over here. Freedom!
ID: 72181 · Rating: 0
Wailing Angus Beef
Joined: 24 Dec 07
Posts: 26
Credit: 1,062,492,557
RAC: 2,937,533
1 billion credit badge · 14 year member badge
Message 72186 - Posted: 23 Mar 2022, 13:42:26 UTC

No admin has the time to intentionally delete WUs of specific users. And Max_Pirx need not worry. He simply needs to look at his stats at one of the stats sites, like BOINCStats, and see he is indeed earning cobblestones for this project.
https://www.boincstats.com/stats/61/user/detail/1307417/lastDays
ID: 72186 · Rating: 0
poppinfresh99
Joined: 28 Feb 22
Posts: 13
Credit: 462,336
RAC: 4,042
100 thousand credit badge
Message 72187 - Posted: 23 Mar 2022, 14:27:31 UTC - in response to Message 72186.  

The following thread from Universe@home says their project would have failed with HDDs instead of their SSDs. Perhaps MilkyWay@home's server should switch (at least partially) to SSDs?
https://universeathome.pl/universe/forum_thread.php?id=627

The thread also explains that people are coming from the paused WCG project. I am one of those (sorry!). I have put my CPUs towards the "N-Body Simulation" here.
ID: 72187 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,646,354
RAC: 217,984
200 million credit badge · 11 year member badge
Message 72188 - Posted: 23 Mar 2022, 14:34:02 UTC - in response to Message 72187.  

The following thread from Universe@home says their project would have failed with HDDs instead of their SSDs. Perhaps MilkyWay@home's server should switch (at least partially) to SSDs?
https://universeathome.pl/universe/forum_thread.php?id=627

The thread also explains that people are coming from the paused WCG project. I am one of those (sorry!). I have put my CPUs towards the "N-Body Simulation" here.
The SSD was invented a decade ago, I can't believe anyone is still using hard disks for servers!
ID: 72188 · Rating: 0
Kiska
Joined: 31 Mar 12
Posts: 55
Credit: 90,728,893
RAC: 592,461
50 million credit badge · 10 year member badge
Message 72189 - Posted: 23 Mar 2022, 15:58:47 UTC - in response to Message 72188.  

The following thread from Universe@home says their project would have failed with HDDs instead of their SSDs. Perhaps MilkyWay@home's server should switch (at least partially) to SSDs?
https://universeathome.pl/universe/forum_thread.php?id=627

The thread also explains that people are coming from the paused WCG project. I am one of those (sorry!). I have put my CPUs towards the "N-Body Simulation" here.
The SSD was invented a decade ago, I can't believe anyone is still using hard disks for servers!


I can think of a reason or two.

Big data sets are one of them. Cost is the other: HDDs have a much lower cost per TB than SSDs.

Also, if you're using consumer SSDs, their rated endurance is quite low for a server environment. E.g. a 960GB WD enterprise SSD: https://www.newegg.com/western-digital-gold-960gb/p/20-250-139 vs. a 1TB consumer SSD: https://www.newegg.com/western-digital-1tb-black-sn850-nvme/p/N82E16820250161
The 1TB consumer SSD has an endurance of 600TBW, while the enterprise drive is rated for 1.4PBW.

If the server is running ZFS, it would be a good idea to use L2ARC on an SSD; it'll increase performance without too much cost.
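
To put rough numbers on those endurance ratings, here is a minimal sketch. The 600 TBW and 1.4 PBW figures are the ones quoted above; the daily write volume is a made-up value for illustration only, since nothing in this thread says how much the MilkyWay@home server actually writes per day.

```python
# Rough endurance arithmetic for the two drives compared above.
# ASSUMED_DAILY_WRITES_TB is hypothetical; substitute a measured value.

def years_until_worn(tbw_rating_tb: float, writes_tb_per_day: float) -> float:
    """Years of service until the rated terabytes-written figure is reached."""
    return tbw_rating_tb / writes_tb_per_day / 365.0

CONSUMER_TBW = 600.0       # 1 TB consumer SSD, rated 600 TBW
ENTERPRISE_TBW = 1400.0    # 960 GB enterprise SSD, rated 1.4 PBW = 1400 TBW
ASSUMED_DAILY_WRITES_TB = 0.5  # hypothetical server write load

for name, tbw in (("consumer", CONSUMER_TBW), ("enterprise", ENTERPRISE_TBW)):
    years = years_until_worn(tbw, ASSUMED_DAILY_WRITES_TB)
    print(f"{name}: about {years:.1f} years at {ASSUMED_DAILY_WRITES_TB} TB/day")
```

Whatever the real write load turns out to be, the ratio between the two ratings stays the same: the enterprise drive is rated for a little over twice the writes of the consumer one.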
ID: 72189 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,646,354
RAC: 217,984
200 million credit badge · 11 year member badge
Message 72192 - Posted: 23 Mar 2022, 17:15:39 UTC - in response to Message 72189.  

I can think of a reason or two.

Big data sets are one of them. Cost is the other: HDDs have a much lower cost per TB than SSDs.
Yeah, 5 times cheaper and 50 times slower. Not a reasonable choice.

Also, if you're using consumer SSDs, their rated endurance is quite low for a server environment. E.g. a 960GB WD enterprise SSD: https://www.newegg.com/western-digital-gold-960gb/p/20-250-139 vs. a 1TB consumer SSD: https://www.newegg.com/western-digital-1tb-black-sn850-nvme/p/N82E16820250161
The 1TB consumer SSD has an endurance of 600TBW, while the enterprise drive is rated for 1.4PBW.
Enterprise SSDs are not much more expensive. Rosetta is on them, Universe (a small low budget project) is on them, Sidock (another small low budget project) is on them.
ID: 72192 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,646,354
RAC: 217,984
200 million credit badge · 11 year member badge
Message 72193 - Posted: 23 Mar 2022, 17:16:38 UTC - in response to Message 72189.  

I can think of a reason or two.

Big data sets are one of them. Cost is the other: HDDs have a much lower cost per TB than SSDs.
Yeah, 5 times cheaper and 50 times slower. Not a reasonable choice.

Also, if you're using consumer SSDs, their rated endurance is quite low for a server environment. E.g. a 960GB WD enterprise SSD: https://www.newegg.com/western-digital-gold-960gb/p/20-250-139 vs. a 1TB consumer SSD: https://www.newegg.com/western-digital-1tb-black-sn850-nvme/p/N82E16820250161
The 1TB consumer SSD has an endurance of 600TBW, while the enterprise drive is rated for 1.4PBW.
Enterprise SSDs are not much more expensive. Rosetta is on them, Universe (a small low budget project) is on them, Sidock (another small low budget project) is on them.

By the time the SSDs wear out, there will be ones 5 times better available anyway.
ID: 72193 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 347
Credit: 77,442,135
RAC: 148,423
50 million credit badge · 3 year member badge
Message 72197 - Posted: 23 Mar 2022, 18:30:48 UTC

Hey all, I'm going to address a bunch of recent comments:

am guessing you'll keep server stats(*) exports on a 12 hourly export while the server is running a bit behind?

*ie server_status.php and /stats/ dir


I haven't made any changes to the frequency of the status page updates. I think that some of the DB queries that need to be done to update that page take a long time when there are a lot of stale WUs in the system, like when the transitioner backlog is large. This can impact the frequency that the page updates, especially when I'm killing slow DB tasks (which might be related to updating that page).

Come on Tom, please admit what century this server was made in. Rebuilds should be a couple of hours. List some specs, I dare you.


Looks like 4 SCSI SMC3108 HDDs running in RAID 5. The rebuild is taking so long because the DB is constantly updating. The current server hardware was purchased 3 years ago, I believe.

But I want to know why Tom is using outdated equipment and why he won't reveal what the specs are. A server which is this slow is absurd.


I'm not hiding some conspiracy, and I'm not embarrassed about the hardware specs. I don't have control over what hardware is on the server, although I am able to relay problems to my supervisor. I have been working closely with them regarding this most recent drive failure.

Well, it's the other way around (more or less) - we (at least some of us) are spending our money/resources to help with their research and it's only fair to get the best possible use of our resources and not waste our money.
I do MW@H because I like that kind of science and I am interested in the broader topic (cosmology). However, if my machine time and money are going to waste here I'd like to know that so that I can rearrange my priorities.
The only updates from the admins are very vague - 'just going to switch this and that off... going to purge whatever... things are going to be slow for a bit... the server is a bit unstable... etc.' which is not very helpful. Besides, whatever they do on their end, does not improve things on our end so far. The server keeps limping with erratic WU supply and, as far as I can tell, wasting our money and resources.
I'd like to see more detailed updates on what is wrong with the server, are there any plans on how to fix it, what is the time scale, etc. Even if the admins have no idea what is wrong and how to fix it (nothing wrong if this is the case), it is only fair to inform us so that we can plan accordingly.


Thank you very much for your volunteering! We certainly appreciate it. Your time and effort are not going to waste - we just published work that came from the Nbody application, and we are working hard on publishing recent results from the Separation application. These things just take a lot of time and effort on our end that can't really be crowd-sourced.

Regarding the updates, I tend to make them somewhat vague because I figure that people don't want to read tech jargon. I can be more specific if people would prefer that. In the past I've given overviews and then gone in-depth for some topics, so I can try to do that more as well.

Right now we are just hoping that when the drive rebuilds and the current backlog of tasks clears, we will return to service "as normal". However, if that is not the case (it's taking a long time for things to get back to normal), then we will have to figure out what is causing these problems. I'm not sure of the exact cause of a lot of it - I figured that it was related to the drive failure - but it could be something else. If and when we decide to take more action, I will communicate that to you.

I'm not too upset, I can just turn Einstein/Universe on as well to keep everything doing something. But I agree that communication is nice. I'd love to know the specs of the server (Rosetta has them openly displayed on the webpage), what device failed, how much it would cost to buy something better, and if we should perhaps have a whip-round for some cash to get it. Clearly this server is just on the limit of managing, and with one disk missing it collapses in a heap. Faster hardware would make everything run smoothly all the time.


It's disappointing that the server appears to be fragile enough that this drive failure caused such a large issue. We are trying to discuss ways of avoiding this in the future, but it is not a conversation that has a fast turnaround time.

But where is the science output from the completed tasks in the last 3-4 weeks?


A single separation run takes several months to complete, and then the data has to get analyzed, sometimes more runs need to go up, and then we have to write the publication, get peer reviewed, put out a press release, and all while teaching/taking classes and also working on other research projects. The point is, 3-4 weeks is not a reasonable amount of time to expect scientific feedback. We do have someone working on updating the science pages, because those are very behind. I'm not sure how far along that process is though.

According to the server stats, nearly 4 million tasks got validated recently (at least they disappeared from the validation pending stock), but did not materialize as valid tasks with credit... where are they?


Did you end up getting the credit? I sure hope so, otherwise I would want to try to figure something out to make sure that you all got the credit you deserve.

Enterprise SSDs are not much more expensive. Rosetta is on them, Universe (a small low budget project) is on them, Sidock (another small low budget project) is on them.


We're looking into purchasing SSDs, although it would still be several thousand dollars to upgrade. No guarantee that this will go anywhere, though.
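
For readers wondering why a RAID 5 rebuild competes with a busy database, here is a toy sketch of how RAID 5 parity works in general. It is not a model of the actual controller or array on this server. The block contents and the helper name `xor_blocks` are made up for illustration, and real RAID 5 also rotates the parity block across drives, which is ignored here.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive array: 3 data blocks plus 1 parity block.
data = [b"WU-0001!", b"WU-0002!", b"WU-0003!"]
parity = xor_blocks(data)
stripe = data + [parity]

# Simulate losing the second drive, then rebuild its block from the survivors.
lost_index = 1
survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
rebuilt = xor_blocks(survivors)

print(rebuilt == stripe[lost_index])
```

The point of the sketch is that every rebuilt block needs a read from each surviving drive, so the rebuild and normal database traffic end up sharing the same disks.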
ID: 72197 · Rating: 0
Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 347
Credit: 77,442,135
RAC: 148,423
50 million credit badge · 3 year member badge
Message 72198 - Posted: 23 Mar 2022, 18:36:22 UTC

I'd like to remind everyone that I'm not in charge of this project, I don't control funding or hardware, or even do a substantial amount of programming for the project. I'm just a grad student who works on the project, but my thesis isn't actually on MilkyWay@home Separation. I just happen to be the most public-facing person in the group, so it seems that I'm in charge of a lot more than I actually am.

I come from a physics background, and while I have IT/programming experience, I am not adept at dealing with a lot of the server bugs. Additionally, I am only able to dedicate ~10% or less of my time to the server and the project, so I apologize when it takes work a long time to get done. In my opinion we need to hire a graduate student or other staff member to work on this project full time. That's what the volunteers deserve. Most of the time I'm just trying to keep the server working so that the science can still get done.

I don't want this to sound like I'm making excuses! I just want people to remember that I'm not trying to hide anything, defraud you or your time, or have any insincere goals. I'm just another overworked grad student trying to do what they can to help things move along. :)

Apologies for the long stint of problems that we've had lately, and I'll try to be more communicative about specifics moving forward so that we can work on these issues together.
ID: 72198 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 157
Credit: 128,083,833
RAC: 23,230
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72199 - Posted: 23 Mar 2022, 18:41:10 UTC - in response to Message 72197.  

Thanks for the info, Tom!
...
We're looking into purchasing SSDs, although it would still be several thousand dollars to upgrade. No guarantee that this will go anywhere, though.

What do you mean by this? You probably won't switch to SSDs?

Well, you can count me in on donating!
ID: 72199 · Rating: 0
Septimus
Joined: 8 Nov 11
Posts: 138
Credit: 1,649,207
RAC: 8,731
1 million credit badge · 10 year member badge
Message 72200 - Posted: 23 Mar 2022, 18:43:50 UTC - in response to Message 72198.  

Thanks for that explanation, really appreciated.
ID: 72200 · Rating: 0
[TA]Skillz
Joined: 28 May 17
Posts: 9
Credit: 676,741,448
RAC: 5,662,335
500 million credit badge · 5 year member badge
Message 72201 - Posted: 23 Mar 2022, 18:45:32 UTC

Would shutting the project down for a day (or however long it takes) help speed up the rebuilding process?

Seems to me you can just stop sending out work, collect all the work currently out, and rebuild the new hard drive much faster than trying to keep things going while it rebuilds.

Even going as far as taking the project/server completely offline once all outstanding work has been returned to let the server rebuild itself without the DB constantly changing.
ID: 72201 · Rating: 0
San-Fernando-Valley
Joined: 13 Apr 17
Posts: 157
Credit: 128,083,833
RAC: 23,230
100 million credit badge · 5 year member badge · extraordinary contributions badge
Message 72202 - Posted: 23 Mar 2022, 18:50:34 UTC - in response to Message 72198.  

Tom:

You are doing fine.
Thanks for your time and efforts.
We (well, at least I) appreciate it.

But it is good that you reminded the crunchers how the situation actually is!
And what your part is and what you have to deal with.

As we say here: You are sitting between two chairs.

Cheers -
ID: 72202 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 5 Jul 11
Posts: 705
Credit: 273,646,354
RAC: 217,984
200 million credit badge · 11 year member badge
Message 72205 - Posted: 23 Mar 2022, 19:10:55 UTC - in response to Message 72198.  

I'd like to remind everyone that I'm not in charge of this project, I don't control funding or hardware, or even do a substantial amount of programming for the project. I'm just a grad student who works on the project, but my thesis isn't actually on MilkyWay@home Separation. I just happen to be the most public-facing person in the group, so it seems that I'm in charge of a lot more than I actually am.
OK, thanks for letting us know. I thought you were the head guy; it does say "Project administrator, Project developer, Project tester, Project scientist" against your username!

Several thousand sounds a lot for SSDs. What's the storage capacity of the RAID set?
ID: 72205 · Rating: 0

©2022 Astroinformatics Group