WCG Friends
Author | Message |
---|---|
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
WCG seems to have stopped sending updates to Boincstats as well. Been like that for a week I would say. It has been raised on the relevant thread but no comment from Krembil. Doesn’t seem to be any work at all today; at least we can continue here. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
"WCG seems to have stopped sending updates to Boincstats as well. Been like that for a week I would say. It has been raised on the relevant thread but no comment from Krembil." Yup, mine have stopped getting ARP tasks now too. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
"WCG seems to have stopped sending updates to Boincstats as well. Been like that for a week I would say. It has been raised on the relevant thread but no comment from Krembil." And of course I got some more new work yesterday!! |
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
"WCG seems to have stopped sending updates to Boincstats as well. Been like that for a week I would say. It has been raised on the relevant thread but no comment from Krembil." I did try today but no tasks were available; I don’t run ARP. Boincstats not updated either. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
"And of course I got some more new work yesterday!!" I'm getting no data from WCG in BoincStats either. I attached a new PC to WCG today and of course it won't show up on my 'devices' list, so I can't put it in the correct venue, so it won't get any new tasks!! |
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
Looks like the relevant switches are back in the ON position. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
"Looks like the relevant switches are back in the ON position." Oops, maybe not. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
"Looks like the relevant switches are back in the ON position." I'm not convinced yet they actually know how to run a Boinc Project. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
"Looks like the relevant switches are back in the ON position." Think there is a growing number that would agree with that. No work today, it seems. |
Joined: 16 Mar 10 Posts: 211 Credit: 106,507,167 RAC: 21,669 |
"I'm not convinced yet they actually know how to run a Boinc Project."
Before I comment on the above, an observation -- I wonder how many people realize that the new WCG team is probably just the Jurisica Lab MCM team (with dependency on outside agencies for some aspects of hardware and networking...) -- I speculate that Krembil aren't actually pouring support in for Igor and company now they realize what they've "bought into"...
And now, a couple of comments on the quotes...
It's not really a BOINC project - it's a pre-BOINC project that got "massaged" to fit it into the BOINC universe rather than there being a complete "reboot" [like Apple do on hardware changes!] way back then... And a lot of the initial problems Jurisica Lab had were with the IBM/WCG stuff and the user-facing components (forums, web-site login, web-site missing components, inter-database communications[1]) - unfortunately, experience and relevant expertise are needed to solve such things, and with IBM out of the picture where's the experience?...
The upload/download issues were/are probably a product of not having optimal infrastructure and [of course] wouldn't show up as a crisis until the system came under stress - given the change of physical platform(s) this is another "learn by experience" situation, and whilst it's unfortunate it's also understandable -- the fact that it has taken so long to resolve (and may still not be fully fixed) is likely to be down [in part] to dependency on an outside agency (SHARCNET?) for network stuff, as some changes may require said agency to do work that won't happen instantly on demand...
[And now all we need is a certain regular WCG forum poster to climb in to point out that the above is all about symptoms, not causes, and that the real problem is WCG's failure to communicate :-) ...]
Regarding available work -- I've been getting a steady flow of MCM work since the recovery from the situation that took several critical services off-line (there's a News thread about that at WCG...), but will concede that OPN1 and ARP1 work has been less available. I suspect that in both those cases it may have as much to do with [bi-directional?] data-flow between the scientists and WCG as anything else; we already know that there's no OPNG work because the scientists are prepping up for a new target (or targets) for the GPU version, and we don't know how much more CPU work there might be for the existing target...
Mikey, if I recall you're only asking for ARP1 (and HST1?) so at the moment you are out of luck :-(
Looking at wingman returns for ARP1 and MCM1 I see far more "No Reply" or "Not Started by Deadline"[2] tasks than I would normally expect for tasks with a 6-day deadline; it's not so prevalent with OPN1 as that uses Adaptive Replication so a lot of tasks don't need wingmen in the first place... I wonder if any existing restrictions on "unreliable" systems getting new work in bulk got lifted for the migration and haven't been put back.
So, overall, I don't think the WCG situation is "terminal" but my jury is out on whether things are getting to a point where they might be able to announce official restart (rather than the "still testing" mode that a lot of folks don't seem to realize still applies!) - I suspect there are probably a few months more before that'll be realistic.
And I have had some experience of being in a team of one [or two if I was lucky] trying to balance a 48-hour working day with some modicum of personal life, so now I know somewhat more about how non-standard the WCG set-up was/is I'm perhaps a lot more forgiving than many others... (I wonder how many of the more vociferous complainants on the WCG forums can even code/program, let alone have worked as DBAs and/or SysAdmins!)
Cheers - Al.
[1] Whilst all the standard BOINC-specific stuff resides where BOINC expects to find it, the forums have their own system, as does a lot of the stuff about user statistics. It appears that a lot of that resides in a completely separate database (possibly carried over from pre-BOINC days?...)
[2] WCG doesn't flag Not Started by Deadline explicitly on the user web pages and the available API feeds, but such tasks are easily recognized by having an Error reply (with nothing but the client version listed) returned close to the deadline. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
Well said and I 100% agree with you!! EXCEPT I am now trying to get some OPN1, GPU, tasks and they aren't coming through either. I DO have a lot of other GPU tasks, and on some PCs it is telling me that, but on other PCs it just says there are none available. |
Joined: 16 Mar 10 Posts: 211 Credit: 106,507,167 RAC: 21,669 |
For information -- OPN1 has now officially joined OPNG in being suspended whilst setting up work for the new target(s) is being done by the scientists. There's a thread about it in the News forum... So we now have to wait for both CPU and GPU work for OPN; my Pi4 is unhappy because its other BOINC project (TN-Grid) is effectively off at present because of recurrent file server issues (which are, apparently, beyond the control of the TN-Grid folks...) but my GPUs are working here and at Einstein... Cheers - Al. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
"For information -- OPN1 has now officially joined OPNG in being suspended whilst setting up work for the new target(s) is being done by the scientists. There's a thread about it in the News forum..." It figures!! I think I used to run the BRP4 tasks here on my own RPis; it takes at least a 4 GB one though, and they are not fast nor do they give a lot of credits. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
Seems like WCG has problems; it has been down for two days so far. Anyone heard anything? |
Joined: 16 Mar 10 Posts: 211 Credit: 106,507,167 RAC: 21,669 |
"Seems like WCG has problems; it has been down for two days so far. Anyone heard anything?"
There has been a limited amount of information about this outage on their Twitter feed (which I can see [though I'm not "on Twitter"]) and there may or may not be more about it on Facebook... In summary, they had a RAID controller failure which took out their network file server. A replacement controller has been provided and service may resume some time later on 3rd March.
Latest tweet, from around 08:00 (in WCG's time-zone) on 3rd March: "Update: The borrowed RAID card worked and the drive layout was recognized, so we have all data safe (there is also a tape backup, but accessing that would be slower). Data center managed a full boot and we expect we will resume operation later today."
Note the "borrowed" -- I think they owe SHARCNET a controller card, but at least there wasn't a [possibly long] wait whilst they sourced one to solve the immediate problem :-)
Note also that they [deliberately] didn't give an actual deadline. This makes sense because there's probably quite a lot of checking needed before it is safe to resume; just because they can see the file-store doesn't guarantee that everything on it is in a viable condition for user service to resume (especially with the huge backlog of uploads that will hammer the upload server(s) once they turn that back on!...)
Cheers - Al. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,896,120 RAC: 55 |
Thanks for that Al….much appreciated. |
Joined: 8 May 09 Posts: 3326 Credit: 521,533,668 RAC: 49,421 |
"Just wanted to make a thread for those of us here temporarily while WCG is moved to chat." Actually your badge is up to 500K right now...WOO HOO!!! And WCG IS working somewhat STILL; I have a stack of GPU tasks to return but they aren't doing it; it says there's 'server problems' STILL!! But I only rarely get ARP tasks and haven't gotten a TB task since they came back on line in Krembil!! |
Joined: 16 Mar 10 Posts: 211 Credit: 106,507,167 RAC: 21,669 |
For information, in case anyone looks here... :-)
Latest tweets from WCG at about 20:00 their time on Friday 3rd:
"Update: We have confirmed all the data is intact and have replaced the RAID controller, but we are still having some issues with getting the new hardware production ready. Unfortunately, data center staff will not be able to help us over the weekend."
Note the comment about lack of weekend support in this situation; I guess that means we won't see any proper signs of life before Monday at the soonest.
"Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done."
With luck, a lot of the tasks will actually be marked for validation as they finally get uploaded... For those that don't, there is a mechanism for re-validating work that hasn't been assimilated, but it entails determining the work-unit numbers of the tasks in question in order to feed them into [multiple uses of] an ops PHP script... Someone may have quite a lot of "research" to do for that :-)
Unhappy but patient - Al.
P.S. I presume the need for support means there are systems involved that WCG can't/should not restart without supervision :-) I wonder if, being short of hardware, they're running some stuff on servers that have other purposes too...
[Edited to add to my comment on the deadline tweet...] |
Joined: 19 Jul 10 Posts: 601 Credit: 19,085,027 RAC: 6,085 |
"Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done." That's good, all my tasks expire tomorrow. |
Joined: 16 Mar 10 Posts: 211 Credit: 106,507,167 RAC: 21,669 |
"Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done." "That's good, all my tasks expire tomorrow."
Yup, and those of us who had GPU jobs (with their initial 3-day deadline) or lots of retry jobs (large queues?) have already-expired tasks waiting, so it'll be interesting to see what happens to those - most of my tasks expire late on Monday or on Tuesday as I run small queues and didn't have any non-GPU short deadline tasks when it went down (phew!)...
Ah well, it is what it is; just hoping that they don't restart until it is really ready :-)
Cheers - Al. |
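
Footnote [2] in Al's earlier post describes how "Not Started by Deadline" tasks can be spotted even though WCG doesn't flag them: an Error reply with nothing but the client version listed, returned close to the deadline. As a rough illustration only, that heuristic might be sketched in Python as below. The `TaskResult` fields, the "version" text match and the 6-hour window are all assumptions made up for this sketch; none of them come from WCG's web pages or API feeds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TaskResult:
    """Hypothetical task record; field names are illustrative, not WCG's schema."""
    outcome: str            # e.g. "Error", "Valid", "No Reply"
    deadline: datetime      # when the task was due back
    returned_at: datetime   # when the server received the reply
    log_lines: List[str]    # lines of the result log shown for the task


def looks_not_started_by_deadline(task: TaskResult,
                                  window: timedelta = timedelta(hours=6)) -> bool:
    """Apply the footnote [2] heuristic to one task record.

    The 6-hour default for "close to the deadline" is an arbitrary assumption.
    """
    if task.outcome != "Error":
        return False
    # "Nothing but the client version listed": exactly one non-empty log line,
    # and that line mentions a version (assumed wording).
    lines = [ln.strip() for ln in task.log_lines if ln.strip()]
    only_version = len(lines) == 1 and "version" in lines[0].lower()
    # "Returned close to the deadline": within the window either side of it.
    near_deadline = abs(task.returned_at - task.deadline) <= window
    return only_version and near_deadline
```

Anyone trying this against real wingman data would need to map the actual fields onto `TaskResult` and tune the window by eye; it is a guess at a classification rule, not something WCG provides.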
©2024 Astroinformatics Group