Server Downtime March 28, 2022 (12 hours starting 00:00 UTC)

Author	Message
Keith Myers Joined: 24 Jan 11 Posts: 738 Credit: 565,320,242 RAC: 15,927	Message 72732 - Posted: 12 Apr 2022, 19:13:51 UTC Looks like the servers are getting caught up, at least for Separation. My Pendings and Inconclusives stacks have reduced to a third of what they were previously. Finally going to show some RAC increase after the past two weeks of stagnation.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72737 - Posted: 13 Apr 2022, 3:49:00 UTC - in response to Message 72728. Last modified: 13 Apr 2022, 4:03:09 UTC Nope, tested it, it can write or read that continuously. Even if what you were saying were true, I would think the server has big peak loads which an SSD would handle well. And then of course there's the obvious advantage of almost zero seek time. With a server the disk needs to access multiple parts at once. Moving the heads all over the place is just absurd for 1000 users trying to get different things at the same time. I see you didn't test long enough then. Here is a Tom's Hardware review of the Crucial P1 SSD I mentioned: https://www.tomshardware.com/reviews/crucial-p1-nvme-ssd-qlc,5852-3.html I'll quote a snippet: Official write specifications are only part of the performance picture. Most SSD makers implement an SLC cache buffer, which is a fast area of SLC-programmed flash that absorbs incoming data. Sustained write speeds can suffer tremendously once the workload spills outside of the SLC cache and into the "native" TLC or QLC flash. We hammer the SSDs with sequential writes for 15 minutes to measure both the size of the SLC buffer and performance after the buffer is saturated. 1TB variant: The Intel 660p is faster than the P1 for the first 20 seconds of this heavy write workload, but after that, the P1 took the lead until the buffer was full. Crucial's P1 wrote 149GB of data before its write speed degraded from 1.7GB/s down to an average of 106MB/s. 500GB variant: Crucial's P1 features a rather large SLC write cache. It helps the SSD to absorb about 73GB of data at a rate of 1GB/s before it fills. This is sufficient for most consumer workloads, but after that, performance suffers drastically. We all know when you add more bits to a NAND cell, write performance suffers without an SLC cache. But in the Crucial P1's case, performance is dreadful. 
After its SLC cache exhausts, the native direct to QLC write speed is just 60MB/s on average. I was talking about the real world, not benchmarks. And my WD doesn't ever go below 400MB/sec, considerably more than the 130MB/sec of my hard disk. And that disk virtually never achieves 130, because that has to be one single sequential access, again not real world. The other access is in a physically different place on the drive! Now, I've got a very cheap 1TB NVMe, the cheapest I could find. Change that to a larger one with, say, 4TB, even with the same tech, just 4 of them in parallel, which is what the bigger ones do, and it could write 4 times as fast. Universe got SSDs; the admin says it was night and day. Rosetta runs 72 SSDs. Yeah right, try accessing even two files at once. The heads jump back and forth and your 250 becomes 2. Now consider MW has thousands of users. And pray tell me what businesses were using before the advent of SSDs. A single 15k rpm drive can do about 700 TPS on a database; that should be sufficient for MySQL for the BOINC server to work with. Also, SETI@home worked fine off of HDDs and that project was the true definition of a shoestring budget. Before SSDs we weren't transferring that much data. No idea why you think I've never seen enterprise setups. I've worked in universities and schools with 1000s of users accessing the server. I've used disks and SSDs. But I bought decent stuff that was nowhere near its maximum capabilities in normal use, so when disks failed, it rebuilt them easily. You did get a disk controller that handles that, right? You didn't have the CPU doing all the work? What do you think ZFS is? For some context, ZFS works best when a direct ZFS-to-disk path is present, so that ZFS can determine disk health, detect corruption, handle disks etc. And hardware RAID cards present a barrier to this direct access. 
So you are correct in that we are letting the CPU handle all of that work, and it works quite well. While 2 weeks is a long time to rebuild 7PB, we weren't in any danger of losing data, and the other reason is so we can use it as a practical demonstration to current ICT students (we still need to teach, and what better way to do that than a real world example using real data). And we are going to use ZFS to build out our new data centre for SKA data; I believe the build is 500PB of raw HDD storage and something like 440PB of usable space. Hint hint: ZFS previously stood for zettabyte file system. It's a way to get round outdated Heath Robinson equipment you found in the skip. Oooh 2003. Hint: it's now 2022. Ultra 320 is older than my Renault, which is worth £150. Sealed lead acid (SLA) batteries are a 1930s era technology, yet they are still being manufactured for cars, etc. And especially UPS batteries; since I am working as a part time freight handler, I get to pick up said devices (that being UPSs and their batteries). And APC, Eaton plus whoever else makes UPSs still sell them brand new with SLA batteries! You can get just as much current from an SLA as any other battery, so they're fine if you just want power but aren't concerned about size or weight. But they're utterly useless for, say, electric cars due to their weight and charge density. I use them for my UPS. Actually I use non-sealed ones as they're far cheaper. What's the danger of my UPS turning upside down?
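The SLC-cache figures quoted from the Tom's Hardware review in this post can be turned into a rough sustained-throughput estimate. The sketch below is only illustrative: it assumes a simple two-phase model (full cache speed until the cache is full, then the native QLC speed), ignores any cache recovery during idle time, and uses hypothetical function names. The only numbers taken from the thread are the 149GB at 1.7GB/s then 106MB/s figures (1TB variant) and the 73GB at 1GB/s then 60MB/s figures (500GB variant).

```python
def sustained_write_seconds(total_gb, cache_gb, cache_mb_s, native_mb_s):
    """Seconds needed to write total_gb in a two-phase model (assumption):
    the first cache_gb go at cache_mb_s, the remainder at native_mb_s."""
    fast_gb = min(total_gb, cache_gb)
    slow_gb = max(total_gb - cache_gb, 0.0)
    return fast_gb * 1000 / cache_mb_s + slow_gb * 1000 / native_mb_s


def average_mb_s(total_gb, cache_gb, cache_mb_s, native_mb_s):
    """Average throughput in MB/s over the whole write."""
    seconds = sustained_write_seconds(total_gb, cache_gb, cache_mb_s, native_mb_s)
    return total_gb * 1000 / seconds


# Figures quoted in the review snippet (1TB and 500GB Crucial P1).
P1_1TB = dict(cache_gb=149, cache_mb_s=1700, native_mb_s=106)
P1_500GB = dict(cache_gb=73, cache_mb_s=1000, native_mb_s=60)

if __name__ == "__main__":
    for label, drive in (("1TB", P1_1TB), ("500GB", P1_500GB)):
        for total_gb in (50, 200, 500):
            avg = average_mb_s(total_gb, **drive)
            print(f"{label}: writing {total_gb} GB averages {avg:.0f} MB/s")
```

Under this model a write that fits inside the cache runs at full speed, while a much larger continuous write averages out close to the native speed, which is the gap between the "real world" and "benchmark" positions argued in the thread.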

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72738 - Posted: 13 Apr 2022, 3:50:52 UTC - in response to Message 72732. Looks like the servers are getting caught up, at least for Separation. My Pendings and Inconclusives stacks have reduced to a third of what they were previously. Finally going to show some RAC increase after the past two weeks of stagnation. Not sure what's going on. Tom said the rebuild had finished, and we saw the validation queue dropping, then it started increasing, now it's dropping again. The hardware must be as good now as it was before Xmas, so what's changed? There aren't any projects I know of down that would cause an influx of more users here.

mikey Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0	Message 72752 - Posted: 13 Apr 2022, 10:41:15 UTC - in response to Message 72738. Looks like the servers are getting caught up, at least for Separation. My Pendings and Inconclusives stacks have reduced to a third of what they were previously. Finally going to show some RAC increase after the past two weeks of stagnation. Not sure what's going on. Tom said the rebuild had finished, and we saw the validation queue dropping, then it started increasing, now it's dropping again. The hardware must be as good now as it was before Xmas, so what's changed? There aren't any projects I know of down that would cause an influx of more users here. World Community Grid is still down.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72754 - Posted: 13 Apr 2022, 10:47:17 UTC - in response to Message 72752. Looks like the servers are getting caught up, at least for Separation. My Pendings and Inconclusives stacks have reduced to a third of what they were previously. Finally going to show some RAC increase after the past two weeks of stagnation. Not sure what's going on. Tom said the rebuild had finished, and we saw the validation queue dropping, then it started increasing, now it's dropping again. The hardware must be as good now as it was before Xmas, so what's changed? There aren't any projects I know of down that would cause an influx of more users here. World Community Grid is still down. That wouldn't produce an influx of GPU people as they don't do much GPU work. I assume it's the GPU Separation stuff that overloads the MW server.

Keith Myers Joined: 24 Jan 11 Posts: 738 Credit: 565,320,242 RAC: 15,927	Message 72763 - Posted: 13 Apr 2022, 19:23:40 UTC Normally a database access is the slowest part of any BOINC transaction. And right now the hugely inflated database from the 13M N-body tasks is slowing everything down. Until that gets back down to historical norm levels, I don't think the project will become healthy again. I have zero pendings now on my Separation work. Processing is back to normal with one-in, one-out sequences.

unixchick Joined: 21 Feb 22 Posts: 66 Credit: 817,008 RAC: 0	Message 72783 - Posted: 14 Apr 2022, 13:57:20 UTC - in response to Message 72763. Normally a database access is the slowest part of any BOINC transaction. And right now the hugely inflated database from the 13M N-body tasks is slowing everything down. Until that gets back down to historical norm levels, I don't think the project will become healthy again. I have zero pendings now on my Separation work. Processing is back to normal with one-in, one-out sequences. I'm seeing this too, Keith. I can finally make MW the main project and get Separation WUs reliably again. Just wanted to confirm and say a thank you to Tom!

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72810 - Posted: 15 Apr 2022, 6:07:34 UTC - in response to Message 72763. Normally a database access is the slowest part of any BOINC transaction. And right now the hugely inflated database from the 13M N-body tasks is slowing everything down. Until that gets back down to historical norm levels, I don't think the project will become healthy again. I have zero pendings now on my Separation work. Processing is back to normal with one-in, one-out sequences. We'd better all grab N-bodys then.

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 72817 - Posted: 15 Apr 2022, 8:19:27 UTC - in response to Message 72810. ... We'd better all grab N-bodys then. Not really, or? I've tried to do CPU N-Body and I'm just getting (very many) _0 tasks only. Would this not just make the situation worse?

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72819 - Posted: 15 Apr 2022, 8:35:01 UTC - in response to Message 72817. ... We'd better all grab N-bodys then. Not really, or? I've tried to do CPU N-Body and I'm just getting (very many) _0 tasks only. Would this not just make the situation worse? I thought the problem was the big database of unsent Nbody work. If we used up half of that, the database would be smaller and quicker to access?

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 72821 - Posted: 15 Apr 2022, 8:43:47 UTC - in response to Message 72819. ... We'd better all grab N-bodys then. Not really, or? I've tried to do CPU N-Body and I'm just getting (very many) _0 tasks only. Would this not just make the situation worse? I thought the problem was the big database of unsent Nbody work. If we used up half of that, the database would be smaller and quicker to access? Well, depends - I was thinking of the "validation inconclusive" tasks, which have been waiting for a long time. So, the work generator is turned off. No real new tasks are being generated. Now we have to work ourselves through the queue, hoping to get, every now and then, a resend? OK, got it now! Wonder how long that will take?

Septimus Joined: 8 Nov 11 Posts: 205 Credit: 2,905,403 RAC: 0	Message 72825 - Posted: 15 Apr 2022, 9:16:34 UTC - in response to Message 72821. Last modified: 15 Apr 2022, 9:37:22 UTC I made a suggestion in the other thread that everybody, all users, who seem to number 6,000 a day on average across both projects, do 2,000 each and see if that helps. Depends on how many CPUs people throw at them; I find 6 on an i7 is optimum, takes about 3.75 mins a unit. My feeling is nothing will get validated and there will be 27 million tasks in the database, but I hope I am wrong.
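The numbers in this suggestion can be checked with a quick back-of-the-envelope calculation. The sketch below is only a rough estimate: it assumes that "3.75 mins a unit" is the wall-clock time of one task while 6 tasks run side by side, and the function names are hypothetical. The 6,000 users, 2,000 tasks each, 6 concurrent tasks and 3.75 minutes come from the post; the 13M backlog figure comes from Message 72763.

```python
import math

USERS_PER_DAY = 6_000      # active users quoted in the post
TASKS_PER_USER = 2_000     # suggested quota per user
CONCURRENT_TASKS = 6       # tasks run side by side on one i7
MINUTES_PER_TASK = 3.75    # assumed wall time per task at that concurrency
NBODY_BACKLOG = 13_000_000 # "13M N-body tasks" from Message 72763


def total_tasks(users, per_user):
    """Tasks cleared if every user completes the suggested quota."""
    return users * per_user


def hours_per_user(tasks, concurrent, minutes_per_task):
    """Wall-clock hours for one machine to finish its quota,
    processing tasks in batches of `concurrent`."""
    batches = math.ceil(tasks / concurrent)
    return batches * minutes_per_task / 60


if __name__ == "__main__":
    cleared = total_tasks(USERS_PER_DAY, TASKS_PER_USER)
    hours = hours_per_user(TASKS_PER_USER, CONCURRENT_TASKS, MINUTES_PER_TASK)
    print(f"Tasks cleared: {cleared:,} of a {NBODY_BACKLOG:,} backlog")
    print(f"Time per user: about {hours:.1f} hours")
```

On these assumptions the quota covers most of the 13M backlog and costs each participant a little under a day of crunching, although every _0 task that is returned still needs a matching result before it validates.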

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72827 - Posted: 15 Apr 2022, 10:44:30 UTC - in response to Message 72825. I made a suggestion in the other thread that everybody, all users, who seem to number 6,000 a day on average across both projects, do 2,000 each and see if that helps. Depends on how many CPUs people throw at them; I find 6 on an i7 is optimum, takes about 3.75 mins a unit. My feeling is nothing will get validated and there will be 27 million tasks in the database, but I hope I am wrong. Engaging 100 cores... if it breaks the server, it was your idea.

Septimus Joined: 8 Nov 11 Posts: 205 Credit: 2,905,403 RAC: 0	Message 72829 - Posted: 15 Apr 2022, 10:49:10 UTC - in response to Message 72827. I made a suggestion in the other thread that everybody, all users, who seem to number 6,000 a day on average across both projects, do 2,000 each and see if that helps. Depends on how many CPUs people throw at them; I find 6 on an i7 is optimum, takes about 3.75 mins a unit. My feeling is nothing will get validated and there will be 27 million tasks in the database, but I hope I am wrong. Engaging 100 cores... if it breaks the server, it was your idea. Fine, I'll take the blame... 100 cores, wow... impressed.

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72830 - Posted: 15 Apr 2022, 10:51:07 UTC - in response to Message 72829. Last modified: 15 Apr 2022, 10:53:34 UTC I made a suggestion in the other thread that everybody, all users, who seem to number 6,000 a day on average across both projects, do 2,000 each and see if that helps. Depends on how many CPUs people throw at them; I find 6 on an i7 is optimum, takes about 3.75 mins a unit. My feeling is nothing will get validated and there will be 27 million tasks in the database, but I hope I am wrong. Engaging 100 cores... if it breaks the server, it was your idea. Fine, I'll take the blame... 100 cores, wow... impressed. It's an addiction I really should stop; my electric bill is beyond a joke. But it provides heating for the things you see in my picture. Which in turn get horny and make more, which pays for the electricity. So it's their hobby really. Not sure they understand astrophysics though. I'm in the UK too if you ever want to buy one.

San-Fernando-Valley Joined: 13 Apr 17 Posts: 256 Credit: 604,411,638 RAC: 0	Message 72831 - Posted: 15 Apr 2022, 11:15:08 UTC I just hope getting new tasks between batches isn't as tedious as on Separation ... I just started my old XP-Celeron X79 board and will throw ALL the cores at N-Body. It was so dusty that I couldn't find the power-on switch!

Septimus Joined: 8 Nov 11 Posts: 205 Credit: 2,905,403 RAC: 0	Message 72833 - Posted: 15 Apr 2022, 11:25:37 UTC - in response to Message 72831. Fingers crossed...

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72835 - Posted: 15 Apr 2022, 11:44:09 UTC - in response to Message 72831. I just hope getting new tasks between batches isn't as tedious as on Separation ... I just started my old XP-Celeron X79 board and will throw ALL the cores at N-Body. It was so dusty that I couldn't find the power-on switch! I get dust on things in a few days; parrots make a lot of dust and I have 15 of them. Frequently I see a computer overheating and have to blow dust out of it. Should be easy enough to get Nbody; those don't seem to have the delay Separation has. I think all the GPUs on Separation slow the server down.

Septimus Joined: 8 Nov 11 Posts: 205 Credit: 2,905,403 RAC: 0	Message 72852 - Posted: 15 Apr 2022, 16:43:31 UTC - in response to Message 72835. How many have you managed to do, Peter, and more importantly, did you get any validated?

Mr P Hucker Joined: 5 Jul 11 Posts: 993 Credit: 377,180,214 RAC: 758	Message 72853 - Posted: 15 Apr 2022, 16:47:12 UTC - in response to Message 72852. Last modified: 15 Apr 2022, 16:47:58 UTC How many have you managed to do, Peter, and more importantly, did you get any validated? 1587. And 5. https://milkyway.cs.rpi.edu/milkyway/results.php?userid=167211&offset=0&show_names=0&state=0&appid=2