Welcome to MilkyWay@home

WCG Friends

Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,516,839
RAC: 27,077
Message 75115 - Posted: 6 Mar 2023, 11:38:11 UTC - in response to Message 75107.  

Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done.
That's good, all my tasks expire tomorrow.
Yup, and those of us who had GPU jobs (with their initial 3-day deadline) or lots of retry jobs (large queues?) have already-expired tasks waiting, so it'll be interesting to see what happens to those - most of my tasks expire late on Monday or on Tuesday as I run small queues and didn't have any non-GPU short deadline tasks when it went down (phew!)...

Ah well, it is what it is; just hoping that they don't restart until it is really ready :-)

Cheers - Al.


Maybe they could restart just enough to take back the tasks that are already out there, without sending out any new ones, process as many of the returning results as they can, and then open the floodgates for us to get more tasks.
ID: 75115
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75117 - Posted: 6 Mar 2023, 15:38:26 UTC - in response to Message 75115.  

Maybe they could restart just enough to take back the tasks that are already out there, without sending out any new ones, process as many of the returning results as they can, and then open the floodgates for us to get more tasks.

That would be sensible :-) -- however, they have to turn most of the BOINC server stuff on or user attempts to report tasks won't work; it's a matter of how selective they can be (or choose to be...)

The weak option is to rely on the upload delays preventing user systems from downloading lots of work straight away; the strong option is either not to start the download server processes (if that can be done independently of the upload server processes in their set-up) or to [temporarily] disable work-unit generation. And, given an earlier WCG message about why there was a shortage of work, I'm sure they can manage to stop new work being created in several different ways! :-)
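
For anyone curious what "being selective" could look like in practice: on a typical BOINC server the daemons (feeder, work generator, validators and so on) are listed in the project's config.xml, and individual entries can be flagged as disabled so the start script skips them. The sketch below shows one way such a flag could be flipped for chosen daemons; the file path, the daemon names and the <disabled> element are assumptions about a generic BOINC set-up, not a description of WCG's actual configuration.

```python
# Hypothetical sketch: mark selected BOINC daemons (e.g. the work generator
# and feeder) as disabled in a project's config.xml so that uploads and
# reporting can keep running while no new work is generated or handed out.
# The path, daemon names and <disabled> element are assumptions about a
# generic BOINC server layout, not WCG's actual configuration.
import xml.etree.ElementTree as ET

CONFIG = "/home/boincadm/projects/example/config.xml"  # assumed location
DISABLE = ("work_generator", "feeder")                  # illustrative names

tree = ET.parse(CONFIG)
for daemon in tree.getroot().iter("daemon"):
    cmd = daemon.findtext("cmd", default="")
    if any(name in cmd for name in DISABLE):
        flag = daemon.find("disabled")
        if flag is None:
            flag = ET.SubElement(daemon, "disabled")
        flag.text = "1"  # the start script is assumed to skip disabled daemons

tree.write(CONFIG)
print("Flagged as disabled:", ", ".join(DISABLE))
```

Flipping the flags back (and restarting those daemons) would then be the "open the floodgates" step once the backlog has been digested.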

At about 15:00 UTC today they tweeted that they're still working with the data centre to get things up and running. Here's hoping it doesn't take too much longer and (as you say) that they make an effort to try to give the [limited] network connections and the validators an easy time to start with!

Cheers - Al.

P.S. I've had a monitoring script running throughout, checking whether my API scripts can see the servers or not - this exercises both the authentication/authorization services and, if I can get past those, read access to parts of the BOINC database. So far, I've never been able to get past authorization at any point during the outage :-( -- I'll keep watching...
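
For illustration only, a stripped-down probe in that spirit might look like the sketch below; the URL, endpoint and token handling are invented placeholders rather than the real WCG API, and the idea is simply to distinguish "server unreachable" from "reachable but refusing authorization".

```python
# Hypothetical availability probe, loosely in the spirit of the monitoring
# script described above. The base URL, endpoint and token are invented
# placeholders -- this is not the actual WCG API.
import datetime
import urllib.error
import urllib.request

BASE_URL = "https://example.org/api"   # placeholder, not the real endpoint
TOKEN = "REPLACE_ME"                   # placeholder credential

def probe(url: str) -> str:
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {TOKEN}"})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return f"OK ({resp.status})"            # got past authorization
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}"                   # reachable, but e.g. 401/403
    except (urllib.error.URLError, OSError) as exc:
        return f"DOWN ({exc})"                      # not reachable at all

if __name__ == "__main__":
    stamp = datetime.datetime.utcnow().isoformat(timespec="seconds")
    print(stamp, probe(f"{BASE_URL}/members/example_user/results"))
```

Run from cron every few minutes, something like this gives a timestamped log of when the servers start answering again.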
ID: 75117
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75122 - Posted: 6 Mar 2023, 19:48:29 UTC

Further update at about 19:00 UTC:
Update: Unfortunately, additional hardware problem on the storage server besides the RAID card are preventing us from restarting. Working with the data center on the alternative solutions.
I wonder if this is going to take as long to resolve as was the case for the disk failure here, or the upload server filestore issues at CPDN a few weeks ago. If so, I think we'll see a few more of the smaller contributors bailing out for good :-(

We are, unfortunately, seeing the worst effects of an under-resourced transition. I don't know whether they'll ever be able to get the systems up to the former level again (philanthropy doesn't seem to be that popular at the moment, and I guess Krembil central doesn't see benefits in adding to the [limited?] support it offers to Igor and his small team), but I'd rather have a system that stutters occasionally than no system at all :-)

Patient, but mildly irritated - Al.
ID: 75122
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75123 - Posted: 6 Mar 2023, 21:29:38 UTC

Further update at about 20:45 UTC

Update: unfortunately, the RAID controller was not the root cause of our storage system failure, the PCI bus failed. Data center is in the process of moving the disks to an alternate system and we will post updates as we progress. Once again, thank you for your patience.
Hmmm - reasonably frequent updates here :-)

I wonder if that means the RAID controller wasn't actually faulty, or whether the controller permanently upset the bus (or vice versa).

At least it seems as if the data centre takes responsibility for finding replacement hardware, which is better than it could have been (so not the worst type of "single point of failure"...)

Cheers - Al.
ID: 75123
Link
Joined: 19 Jul 10
Posts: 594
Credit: 18,960,931
RAC: 5,509
Message 75127 - Posted: 8 Mar 2023, 11:01:38 UTC

Thanks for posting the updates here. :-)
ID: 75127
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75129 - Posted: 8 Mar 2023, 18:42:23 UTC
Last modified: 8 Mar 2023, 18:44:09 UTC

Latest update (around 16:30 UTC 8th March)

Update: As of this morning, the data center continues to work on booting the temporary replacement DSS 7000 storage system. They are attempting multiple alternative strategies to resolve current failures.
Sounds like there are a lot of cobwebs to blow out of that bit of kit! :-) -- As far as I can determine (from resellers' sites[1]) it uses a now-discontinued Xeon E5 processor model, so it's not exactly new tech.

Someone posted a picture of what appears to be a DSS 7000 unit and said "It's only 90 drives, what could go wrong?..." (although that's a fully populated 4-processor version; there's also a dual Xeon 45 drive version) -- if they manage to get this working, I wonder how long it will be before there's another failure; perhaps we should organize a collection for a brand new storage device?

And if anyone who reads this can actually confirm or correct the specification, feel free to do so :-)

Cheers - Al.

[1] I tried finding specs on Dell's site, but there wasn't anything immediately obvious and useful...
ID: 75129
Profile Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 75130 - Posted: 8 Mar 2023, 19:36:19 UTC - in response to Message 75129.  

Latest update (around 16:30 UTC 8th March)

Update: As of this morning, the data center continues to work on booting the temporary replacement DSS 7000 storage system. They are attempting multiple alternative strategies to resolve current failures.
Sounds like there are a lot of cobwebs to blow out of that bit of kit! :-) -- As far as I can determine (from resellers' sites[1]) it uses a now-discontinued Xeon E5 processor model, so it's not exactly new tech.

Someone posted a picture of what appears to be a DSS 7000 unit and said "It's only 90 drives, what could go wrong?..." (although that's a fully populated 4-processor version; there's also a dual Xeon 45 drive version) -- if they manage to get this working, I wonder how long it will be before there's another failure; perhaps we should organize a collection for a brand new storage device?

And if anyone who reads this can actually confirm or correct the specification, feel free to do so :-)

Cheers - Al.

[1] I tried finding specs on Dell's site, but there wasn't anything immediately obvious and useful...


I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers, but who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
ID: 75130
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75131 - Posted: 8 Mar 2023, 22:14:54 UTC - in response to Message 75130.  

I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers, but who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
As far as I am aware, no hardware moved from IBM to Krembil (not even the production data disks!). WCG/Krembil will be using whatever they can get access to, and that is almost certainly far more constrained by budget than the IBM set-up was.

Cheers - Al.
ID: 75131
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,516,839
RAC: 27,077
Message 75132 - Posted: 9 Mar 2023, 0:40:11 UTC - in response to Message 75131.  

I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers, but who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
As far as I am aware, no hardware moved from IBM to Krembil (not even the production data disks!). WCG/Krembil will be using whatever they can get access to, and that is almost certainly far more constrained by budget than the IBM set-up was.

Cheers - Al.


They had to have gotten at least the data off the disks, if not the disks themselves, because all of my data in both of my accounts transferred over just fine. I can see why IBM would do backups and give them to Krembil, but why not just give Krembil the disks themselves and have Krembil give them new or similar ones in return? Part of the problem with handing over the disks, though, would be dissimilar hardware, or even different software versions on the hardware running those disks, causing problems. I know that it's not always 'plug and play' and is sometimes a lot more 'plug and pray'.
ID: 75132
Profile Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 75133 - Posted: 9 Mar 2023, 15:59:56 UTC - in response to Message 75132.  
Last modified: 9 Mar 2023, 16:03:33 UTC

I switched to SiDock for my COVID support when I ran out of WCG work. Tasks usually take 2-3 days to complete, which is a bummer.
Last month I contributed (as best as I can calculate) at least 2 of the 8 points the GPU Users group got for SiDock.
https://www.boincgames.com/sprint_details.php?id=11

SiDock has only CPU tasks; I have seen no indication that they are working on OpenCL or CUDA applications.
ID: 75133
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75134 - Posted: 9 Mar 2023, 19:07:03 UTC

Latest update (around 17:30 UTC 9th March):

Update: The "new" system did recognize the data hardware RAIDs. All have been rebuilt, and the data center is attempting to repair the OS drives/RAID.
I always worry when I see the term "rebuilt" in reference to a hardware RAID controller -- sometimes that means "wiped and rebuilt" -- but I presume that's not the case here. Hopefully, all that was needed was to ensure that the RAID structures were recognized and consistent, with a (possibly quite long) delay while doing an integrity check...

Cheers - Al.

P.S. I wonder if the O/S drives are SSDs? For cost reasons, I doubt that the data disks are...
ID: 75134
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,516,839
RAC: 27,077
Message 75135 - Posted: 10 Mar 2023, 12:25:11 UTC - in response to Message 75134.  

Latest update (around 17:30 UTC 9th March):

Update: The "new" system did recognize the data hardware RAIDs. All have been rebuilt, and the data center is attempting to repair the OS drives/RAID.
I always worry when I see the term "rebuilt" in reference to a hardware RAID controller -- sometimes that means "wiped and rebuilt" -- but I presume that's not the case here. Hopefully, all that was needed was to ensure that the RAID structures were recognized and consistent, with a (possibly quite long) delay while doing an integrity check...

Cheers - Al.

P.S. I wonder if the O/S drives are SSDs? For cost reasons, I doubt that the data disks are...


It would be nice if the OS drives were NVMe drives; they are blazingly fast if you haven't tried them yet, and your OS will boot in a few seconds, much like the jump from platters to SSDs!! NVMe prices are also dropping like rocks: I got a 250 GB Samsung 980 for under $40, and 1 TB and 2 TB ones are coming down as well.
ID: 75135
Link
Joined: 19 Jul 10
Posts: 594
Credit: 18,960,931
RAC: 5,509
Message 75136 - Posted: 10 Mar 2023, 14:34:19 UTC - in response to Message 75131.  

I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers, but who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
As far as I am aware, no hardware moved from IBM to Krembil (not even the production data disks!). WCG/Krembil will be using whatever they can get access to, and that is almost certainly far more constrained by budget than the IBM set-up was.
Since AFAIK IBM was running WCG in their cloud and not on dedicated hardware, moving any hardware was pretty much impossible.
ID: 75136
Link
Joined: 19 Jul 10
Posts: 594
Credit: 18,960,931
RAC: 5,509
Message 75137 - Posted: 10 Mar 2023, 14:46:58 UTC - in response to Message 75132.  
Last modified: 10 Mar 2023, 14:47:10 UTC

They had to have gotten at least the data off the disks if not the disks themselves because all of my data in both of my accounts transferred over just fine
It's 2023; there's no need to ship data on discs or tapes, not even for TBs of data: you just copy it over the internet. IIRC even SETI stopped shipping data from Arecibo on hard drives once the internet connection to Arecibo was upgraded to something fast enough that they could transfer everything without disturbing others.
ID: 75137
alanb1951

Joined: 16 Mar 10
Posts: 210
Credit: 106,064,209
RAC: 24,097
Message 75138 - Posted: 10 Mar 2023, 20:54:23 UTC

Latest update (around 19:20 UTC on 10th March):

Update: The storage server was revived yesterday late afternoon. Both database filesystems mounted as before, but the science filesystem did not. It needs a repair; erasing the old log first.
I'm not sure what that might mean if there's any data loss :-( -- that might depend on what they meant by "[old] log", of course...

I presume that means "not until Monday at the earliest" -- I doubt there'll be activity over the weekend anyway given last weekend's unavailability of data centre support...

And on another note:

Thanks to Link for the two previous posts; my impression was also that WCG/IBM was "in the cloud" at the time of the transfer, and I seem to recall mention of data being shipped using rsync but I wasn't sure whether that was WCG speaking or user speculation... I'd love to know how much data had to be transferred :-)
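
As an aside, a bulk migration like that is usually little more than a scripted rsync run repeated until the remaining delta is small; the sketch below is purely hypothetical, with invented hosts and paths, and is not a description of how WCG actually moved its data.

```python
# Hypothetical sketch of an rsync-based bulk copy -- the sort of thing
# "shipped using rsync" usually boils down to. Hosts and paths are invented.
import subprocess

SRC = "olduser@old-host.example.org:/srv/project_data/"  # invented source
DST = "/data/project_data/"                              # invented destination

cmd = [
    "rsync",
    "-a",                # archive mode: preserve permissions, times, links
    "-z",                # compress in transit
    "--partial",         # keep partial files so an interrupted run can resume
    "--info=progress2",  # overall progress rather than per-file noise
    SRC,
    DST,
]
subprocess.run(cmd, check=True)
```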

Cheers - Al.
ID: 75138
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,516,839
RAC: 27,077
Message 75139 - Posted: 11 Mar 2023, 11:55:18 UTC - in response to Message 75136.  

As far as I am aware, no hardware moved from IBM to Krembil (not even the production data disks!). WCG/Krembil will be using whatever they can get access to, and that is almost certainly far more constrained by budget than the IBM set-up was.


Since AFAIK IBM was running WCG in their cloud and not on dedicated hardware, moving any hardware was pretty much impossible.


Doesn't that mean it still has to be on a piece of physical hardware someplace, even if in a 'virtual' space on that hardware? Granted, you don't need to dedicate a physical server-sized room for things as much anymore, but ultimately it still has to run on hardware someplace, doesn't it? I talked to a guy at my house the other day whose friend runs a company with warehouses full of server space (it has grown over time), which is where a lot of companies rent space to do whatever they do; he provides the hardware, which is then accessed through the cloud. The companies themselves then manage the 'virtual space' they pay for, which is cheaper in the short run as they can be up and running much faster with less money.

PrimeGrid just had a crunching Challenge, TDP, and during the one-day Hill Climb stage several crunchers used 'virtual PCs' to move into the podium positions; one said that for about US$40 per day you can rent a top-of-the-line CPU for 24 hours. I don't know if that includes a GPU or just the CPU, as I didn't go that deep into it. I think the place they rented from was called TSC, but I'm not positive of that.
ID: 75139
Link
Joined: 19 Jul 10
Posts: 594
Credit: 18,960,931
RAC: 5,509
Message 75140 - Posted: 11 Mar 2023, 12:13:31 UTC - in response to Message 75138.  

I'd love to know how much data had to be transferred :-)
They never posted that, or at least I don't think I've read anything about it. But I wouldn't expect any insane amount of data, in particular since WCG deleted most inactive users in the past few years and all WUs were finished before the transfer. Rosetta has 1TB in their database server, and I doubt they are running it close to the limit while having nearly 1.4 million users with credit (WCG had a bit over 800k incl. the deleted ones, but according to BOINCstats actually just 93826 still in the database at the time of transfer). PrimeGrid has some info on their server status page: around 800GB total space used for a running project with over 350k users. If I add to that everything I still remember from SETI about the hardware that project was running on, in particular the database sizes they were talking about when it would no longer fit completely into RAM and caused a slowdown, my best guess for the transfer of a mothballed WCG is around 500GB for everything, probably even less than that.

Of course I might be completely wrong. ;-)
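
As a rough cross-check, here is the per-user scaling implied by the figures quoted above, applied to the ~94k users said to remain in the WCG database at transfer time. It only covers the user-proportional part of the data, so the real total (result files, science data, web content) would sit somewhere above these database-only numbers.

```python
# Back-of-envelope scaling using only the figures quoted in the post above.
# This covers the user-proportional part of the database only; result files
# and science data are not included, so treat it as a rough lower bound.
TB = 1000**4
GB = 1000**3

references = {
    "Rosetta":   (1 * TB,   1_400_000),  # ~1 TB database, ~1.4M users with credit
    "PrimeGrid": (800 * GB,   350_000),  # ~800 GB total,   ~350k users
}
WCG_USERS = 93_826  # users reportedly still in the database at transfer time

for name, (size, users) in references.items():
    per_user = size / users
    estimate = per_user * WCG_USERS
    print(f"{name}: ~{per_user / 1e6:.1f} MB/user -> ~{estimate / GB:.0f} GB scaled to WCG")
```

Both reference points come out well under the ~500GB guess, so the guess at least doesn't look contradicted by other projects' published numbers.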
ID: 75140
Link
Joined: 19 Jul 10
Posts: 594
Credit: 18,960,931
RAC: 5,509
Message 75141 - Posted: 11 Mar 2023, 12:29:46 UTC - in response to Message 75139.  

Doesn't that mean it still has to be on a piece of physical hardware someplace, even if in a 'virtual' space on that hardware?

Not if they have deleted that VM by now; they will not store it forever. They will probably apply the same policy as for any other customer that cancels their contract. Considering it's now more than a year since the shutdown at IBM, I strongly doubt that VM still exists. The move was pretty fast once the decision was made, so it seems there was not much time left for WCG@IBM to keep hosting that VM. What might still exist is a backup of the transferred data at Krembil.
ID: 75141
Profile mikey
Joined: 8 May 09
Posts: 3321
Credit: 520,516,839
RAC: 27,077
Message 75143 - Posted: 12 Mar 2023, 12:10:45 UTC - in response to Message 75141.  

Doesn't that mean it still has to be on a piece of physical hardware someplace, even if in a 'virtual' space on that hardware?


Not if they have deleted that VM by now; they will not store it forever. They will probably apply the same policy as for any other customer that cancels their contract. Considering it's now more than a year since the shutdown at IBM, I strongly doubt that VM still exists. The move was pretty fast once the decision was made, so it seems there was not much time left for WCG@IBM to keep hosting that VM. What might still exist is a backup of the transferred data at Krembil.


That would make sense; the only reason to keep it would be if they shut down the server and it has just been sitting there all this time doing nothing at all. But knowing IBM (my f-i-l worked there), they have probably repurposed or replaced it by now.
ID: 75143
Link
Joined: 19 Jul 10
Posts: 594
Credit: 18,960,931
RAC: 5,509
Message 75147 - Posted: 12 Mar 2023, 16:27:41 UTC - in response to Message 75143.  
Last modified: 12 Mar 2023, 16:28:49 UTC

they have probably repurposed or replaced it by now.
Since this is a cloud environment, it repurposes itself as needed; that's the whole point of it. All they need to do is shut the VM down.
ID: 75147