Welcome to MilkyWay@home

WCG Friends

Message boards : Cafe MilkyWay : WCG Friends

Septimus

Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 74831 - Posted: 18 Dec 2022, 11:31:07 UTC - in response to Message 74830.  

WCG seems to have stopped sending updates to Boincstats as well. Been like that for a week I would say. It has been raised on the relevant thread but no comment from Krembil.


Projects don't "send" the stats; they create a file containing our stats that all the stats sites then download. Krembil has probably stopped creating the file for some reason, so no stats.


Thanks for that Mikey…I guess whatever they were doing they have stopped. Looking at the number of messages in their forums I get the distinct impression they are overwhelmed with the whole WCG project.


I think so too BUT the Africa Rainfall tasks do seem to have stopped backing off on their transfers, at least in the last 2 days they have


Doesn’t seem to be any work at all today, at least we can continue here.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 74832 - Posted: 18 Dec 2022, 12:43:32 UTC - in response to Message 74831.  

Yup mine have stopped getting ARP tasks now too

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 74833 - Posted: 19 Dec 2022, 11:54:42 UTC - in response to Message 74832.  

And of course I got some more new work yesterday!!

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 74834 - Posted: 19 Dec 2022, 17:44:33 UTC - in response to Message 74833.  

I did try today but no tasks available, I don’t run ARP. Boincstats not updated either.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 74835 - Posted: 19 Dec 2022, 18:30:02 UTC - in response to Message 74834.  

And of course I got some more new work yesterday!!


I did try today but no tasks available, I don’t run ARP. Boincstats not updated either.


I'm getting no data from WCG in BoincStats either.

I attached a new PC to WCG today and, of course, it won't show up on my 'devices' list, so I can't put it in the correct venue and it won't get any new tasks!!

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 74839 - Posted: 20 Dec 2022, 9:47:58 UTC - in response to Message 74835.  

Looks like the relevant switches are back in the ON position.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 74841 - Posted: 20 Dec 2022, 12:40:23 UTC - in response to Message 74839.  

Looks like the relevant switches are back in the ON position.


Oops maybe not.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 74843 - Posted: 21 Dec 2022, 0:20:11 UTC - in response to Message 74841.  

Looks like the relevant switches are back in the ON position.


Oops maybe not.


I'm not convinced yet they actually know how to run a Boinc Project

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 74844 - Posted: 21 Dec 2022, 12:47:08 UTC - in response to Message 74843.  

Looks like the relevant switches are back in the ON position.


Oops maybe not.


I'm not convinced yet they actually know how to run a Boinc Project


Think there is a growing number that would agree with that. No work today it seems.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,443,688
RAC: 36,727
Message 74845 - Posted: 21 Dec 2022, 21:26:42 UTC - in response to Message 74844.  

I'm not convinced yet they actually know how to run a Boinc Project

Think there is a growing number that would agree with that. No work today it seems.


Before I comment on the above, an observation -- I wonder how many people realize that the new WCG team is probably just the Jurisica Lab MCM team (with dependency on outside agencies for some aspects of hardware and networking...) -- I speculate that Krembil aren't actually pouring support in for Igor and company now they realize what they've "bought into"...

And now, a couple of comments on the quotes...

It's not really a BOINC project - it's a pre-BOINC project that got "massaged" to fit it into the BOINC universe rather than there being a complete "reboot" [like Apple do on hardware changes!] way back then... And a lot of the initial problems Jurisica Lab had were with the IBM/WCG stuff and the user-facing components (forums, web-site login, web-site missing components, inter-database communications[1]) - unfortunately, experience and relevant expertise are needed to solve such things, and with IBM out of the picture where's the experience?...

The upload/download issues were/are probably a product of not having optimal infrastructure and [of course] wouldn't show up as a crisis until the system came under stress - given the change of physical platform(s) this is another "learn by experience" situation, and whilst it's unfortunate it's also understandable -- the fact that it has taken so long to resolve (and may still not be fully fixed) is likely to be down [in part] to dependency on an outside agency (SHARCNET?) for network stuff, as some changes may require said agency to do work that won't happen instantly on demand...

[And now all we need is a certain regular WCG forum poster to climb in to point out that the above is all about symptoms, not causes, and that the real problem is WCG's failure to communicate :-) ...]

Regarding available work -- I've been getting a steady flow of MCM work since the recovery from the situation that took several critical services off-line (there's a News thread about that at WCG...), but will concede that OPN1 and ARP1 work has been less available. I suspect that in both those cases it may have as much to do with [bi-directional?] data-flow between the scientists and WCG as anything else; we already know that there's no OPNG work because the scientists are prepping up for a new target (or targets) for the GPU version, and we don't know how much more CPU work there might be for the existing target... Mikey, if I recall you're only asking for ARP1 (and HST1?) so at the moment you are out of luck :-(

Looking at wingman returns for ARP1 and MCM1 I see far more "No Reply" or "Not Started by Deadline"[2] tasks than I would normally expect for tasks with a 6-day deadline; it's not so prevalent with OPN1 as that uses Adaptive Replication so a lot of tasks don't need wingmen in the first place... I wonder if any existing restrictions on "unreliable" systems getting new work in bulk got lifted for the migration and haven't been put back.

So, overall, I don't think the WCG situation is "terminal" but my jury is out on whether things are getting to a point where they might be able to announce official restart (rather than the "still testing" mode that a lot of folks don't seem to realize still applies!) - I suspect there are probably a few months more before that'll be realistic. And I have had some experience of being in a team of one [or two if I was lucky] trying to balance a 48-hour working day with some modicum of personal life, so now I know somewhat more about how non-standard the WCG set-up was/is I'm perhaps a lot more forgiving than many others... (I wonder how many of the more vociferous complainants on the WCG forums can even code/program, let alone have worked as DBAs and/or SysAdmins!)

Cheers - Al.

[1] Whilst all the standard BOINC-specific stuff resides where BOINC expects to find it, the forums have their own system, as does a lot of the stuff about user statistics. It appears that a lot of that resides in a completely separate database (possibly carried over from pre-BOINC days?...)

[2] WCG doesn't flag Not Started by Deadline explicitly on the user web pages and the available API feeds, but such tasks are easily recognized by having an Error reply (with nothing but the client version listed) returned close to the deadline.
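
For anyone who wants to count such tasks from their own results, the rule of thumb in [2] could be sketched roughly as below. This is only an illustration: the `TaskResult` record, its field names, the check for the word "version" and the 12-hour window are all assumptions made for the example, not anything taken from the WCG pages or API feeds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TaskResult:
    """Hypothetical record; field names are illustrative, not WCG's schema."""
    status: str            # e.g. "Error", "Valid", "No Reply"
    returned: datetime     # when the reply reached the server
    deadline: datetime     # task deadline
    log_lines: List[str]   # lines of the result log listed for the task


def looks_not_started(task: TaskResult,
                      window: timedelta = timedelta(hours=12)) -> bool:
    """Guess whether an Error reply is really a 'Not Started by Deadline' task."""
    if task.status != "Error":
        return False
    # Assumption: an unstarted task's log holds a single non-blank line,
    # and that line is the client version.
    non_blank = [line for line in task.log_lines if line.strip()]
    only_version = len(non_blank) == 1 and "version" in non_blank[0].lower()
    # Assumption: "close to the deadline" means within `window` either side.
    near_deadline = abs(task.returned - task.deadline) <= window
    return only_version and near_deadline
```

The window is the part most worth tuning by eye; a genuine early error from a wingman would also return Error, so the deadline check is what separates the two cases in this sketch.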

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 74846 - Posted: 22 Dec 2022, 4:27:34 UTC - in response to Message 74845.  



Well said and I 100% agree with you!!

EXCEPT I am now trying to get some OPN1, GPU, tasks and they aren't coming through either. I DO have a lot of other GPU tasks, and on some PCs it is telling me that, but on other PCs it just says there are none available.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,443,688
RAC: 36,727
Message 74847 - Posted: 23 Dec 2022, 2:49:21 UTC

For information -- OPN1 has now officially joined OPNG in being suspended whilst setting up work for the new target(s) is being done by the scientists. There's a thread about it in the News forum...

So we now have to wait for both CPU and GPU work for OPN; my Pi4 is unhappy because its other BOINC project (TN-Grid) is effectively off at present because of recurrent file server issues (which are, apparently, beyond the control of the TN-Grid folks...) but my GPUs are working here and at Einstein...

Cheers - Al.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 74850 - Posted: 23 Dec 2022, 10:42:45 UTC - in response to Message 74847.  

For information -- OPN1 has now officially joined OPNG in being suspended whilst setting up work for the new target(s) is being done by the scientists. There's a thread about it in the News forum...

So we now have to wait for both CPU and GPU work for OPN; my Pi4 is unhappy because its other BOINC project (TN-Grid) is effectively off at present because of recurrent file server issues (which are, apparently, beyond the control of the TN-Grid folks...) but my GPUs are working here and at Einstein...

Cheers - Al.


It figures!! I think I used to run the BRP4 tasks here on my own RPis; it takes at least a 4 GB one, though, and they are not fast, nor do they give a lot of credits.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 75101 - Posted: 3 Mar 2023, 11:42:15 UTC - in response to Message 74850.  

Seems like WCG has problems; it has been down for two days so far. Anyone heard anything?

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,443,688
RAC: 36,727
Message 75102 - Posted: 3 Mar 2023, 18:52:01 UTC - in response to Message 75101.  

Seems like WCG has problems; it has been down for two days so far. Anyone heard anything?
There has been a limited amount of information about this outage on their Twitter feed (which I can see [though I'm not "on Twitter"]) and there may or may not be more about it on Facebook...

In summary, they had a RAID controller failure which took out their network file server. A replacement controller has been provided and service may resume some time later on 3rd March.

Latest tweet, from around 08:00 (in WCG's time-zone) on 3rd March:
Update: The borrowed RAID card worked and the drive layout was recognized, so we have all data safe (there is also a tape backup, but accessing that would be slower). Data center managed a full boot and we expect we will resume operation later today.
Note the "borrowed" -- I think they owe SHARCNET a controller card, but at least there wasn't a [possibly long] wait whilst they sourced one to solve the immediate problem :-)

Note also that they [deliberately] didn't give an actual deadline. This makes sense because there's probably quite a lot of checking needed before it is safe to resume; just because they can see the file-store doesn't guarantee that everything on it is in a viable condition for user service to resume (especially with the huge backlog of uploads that will hammer the upload server(s) once they turn that back on!...)

Cheers - Al.

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,882,853
RAC: 272
Message 75103 - Posted: 3 Mar 2023, 19:36:45 UTC - in response to Message 75102.  

Thanks for that, Al… much appreciated.

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,943,543
RAC: 22,328
Message 75104 - Posted: 4 Mar 2023, 4:41:13 UTC - in response to Message 71915.  

Just wanted to make a thread for those of us here temporarily while WCG is moved to chat.

I wanted to point out that WCG has posted a couple of articles worth reading while down.

How is everyone doing? The MW project has been going up and down a bit since we got here, but has everyone got it running ok besides that?
Any questions?

Being the badge hound I am, I'll show off my 100k star badge.


Actually your badge is up to 500K right now...WOO HOO!!!

And WCG IS working somewhat STILL. I have a stack of GPU tasks to return but they aren't going through; it says there are 'server problems' STILL!!
But I only rarely get ARP tasks and haven't gotten a TB task since they came back online at Krembil!!

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,443,688
RAC: 36,727
Message 75105 - Posted: 4 Mar 2023, 17:30:21 UTC
Last modified: 4 Mar 2023, 17:49:54 UTC

For information, in case anyone looks here... :-)

Latest tweets from WCG at about 20:00 their time on Friday 3rd:

Update: We have confirmed all the data is intact and have replaced the RAID controller, but we are still having some issues with getting the new hardware production ready. Unfortunately, data center staff will not be able to help us over the weekend.
Note the comment about lack of weekend support in this situation; I guess that means we won't see any proper signs of life before Monday at the soonest.

Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done.
With luck, a lot of the tasks will actually be marked for validation as they finally get uploaded. For those that don't, there is a mechanism for re-validating work that hasn't been assimilated, but it entails determining the work-unit numbers of the tasks in question in order to feed them into [multiple uses of] an ops PHP script... Someone may have quite a lot of "research" to do for that :-)

Unhappy but patient - Al.

P.S. I presume the need for support means there are systems involved that WCG can't/should not restart without supervision :-) I wonder if, being short of hardware, they're running some stuff on servers that have other purposes too...

[Edited to add to my comment on the deadline tweet...]

Link
Joined: 19 Jul 10
Posts: 578
Credit: 18,845,239
RAC: 856
Message 75106 - Posted: 5 Mar 2023, 10:33:01 UTC - in response to Message 75105.  

Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done.
That's good, all my tasks expire tomorrow.

alanb1951
Joined: 16 Mar 10
Posts: 208
Credit: 105,443,688
RAC: 36,727
Message 75107 - Posted: 5 Mar 2023, 12:43:33 UTC - in response to Message 75106.  

Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done.
That's good, all my tasks expire tomorrow.
Yup, and those of us who had GPU jobs (with their initial 3-day deadline) or lots of retry jobs (large queues?) have already-expired tasks waiting, so it'll be interesting to see what happens to those - most of my tasks expire late on Monday or on Tuesday as I run small queues and didn't have any non-GPU short deadline tasks when it went down (phew!)...

Ah well, it is what it is; just hoping that they don't restart until it is really ready :-)

Cheers - Al.

©2024 Astroinformatics Group