
Posts by alanb1951

41) Message boards : News : New Poll Regarding GPU Application of N-Body (Message 75728)
Posted 19 Jun 2023 by alanb1951
Post:
We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now.
And someone would probably have to commit to amending the OpenCL code and/or the OpenCL-related code in the CPU-based part of the application whenever there was a relevant change to the science code in the "CPU-only" application code :-)

I get the impression that this application isn't one of those simple ones where the GPU-based calculations are more or less unchangeable and can be completely controlled via parameter settings. If all it is doing is (for instance) FFTs or simple optimization of a matrix, the only issue would be "Can it be made efficient enough to make it worth doing?" However, if it would entail either lots of shuffling data around on the GPU or frequent movement of data to and from the GPU between GPU-worthy sections of computation, that might be a completely different matter!

And, of course, if part of making it efficient enough entails "messing" with support libraries or adding hacks to facilitate using the GPU for one task whilst another one is doing CPU-intensive stuff, that would have to be done very carefully, especially if end users might be using their GPUs for more than one BOINC application -- I recall an issue over at WCG a while back which was to do with something in that area :-(

Without an expert eye on the code, we can't know what performance issues (global memory usage, bandwidth between motherboard and GPU, et cetera) there might be. Whilst I share the hope that it might be possible to do something GPU-wise, I would be unsurprised if an [unbiased] expert outsider decided it wasn't worth it...

Cheers - Al.
42) Message boards : News : Separation Project Coming To An End (Message 75497)
Posted 13 Jun 2023 by alanb1951
Post:
Tom,

It's always a bit sad to see a project come to an end, but it's also good to know the goals have been achieved. Well done to all concerned!

Regarding actual shut-down, I'm inclined to agree with Ian&Steve C's comment on the timeline in his first post. Has the current Separation run already reached a point where it's effectively work for work's sake, or is there still serious value in processing the rest of this batch? It really has to be your call as to when to stop generating new work :-)

If it's a matter of trying to get the current run to converge as quickly as possible, I'll stick around until that happens (as, I suspect, will many others); otherwise I might as well switch to N-Body only at once, resuming GPU work if and only if you have to do re-runs based on the review process.

Regarding GPU users finding other things to do, I suspect quite a few of us will just give more time to Einstein@home :-)

Good luck with the paper! I'll continue to watch progress on all projects here with great interest...

Cheers - Al.

P.S. there is still the occasional post on the SETI@home site asking when that is going to restart -- expect the same to happen here :-)
43) Message boards : Number crunching : n=body tasks failing (Message 75352)
Posted 29 Apr 2023 by alanb1951
Post:
mikey,

Glad you got it working. I just noticed that I'd suggested an equals-sign, where (of course) it should've been a space! Sorry about that, and glad it didn't bite!

Cheers - Al.

P.S. Late reply because of constant outages of the MW web site...
44) Message boards : Number crunching : n=body tasks failing (Message 75349)
Posted 27 Apr 2023 by alanb1951
Post:
On multiple PCs and multiple OSes, every single n-body task has failed within a few seconds. Anyone have any ideas why?

mikey
I looked at a random sample of results marked as Error from a handful of your machines: every one had the same reported error:

Argument parsing error: --nthreads>2: unknown option

Note the greater-than sign, which should be an equals sign...

Quick fix in app_config.xml? :-)
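
For illustration, here's a minimal app_config.xml of the sort meant (using a space in the cmdline, per the correction in the post above) - the app name milkyway_nbody, the mt plan class and the thread count are assumptions on my part, so match them to whatever your client reports:

    <app_config>
      <app_version>
        <app_name>milkyway_nbody</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>2</avg_ncpus>
        <!-- a space between option and value: not '=' and not '>' -->
        <cmdline>--nthreads 2</cmdline>
      </app_version>
    </app_config>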

Hope that clears things up!

Cheers - Al.
45) Message boards : Cafe MilkyWay : WCG Friends (Message 75152)
Posted 13 Mar 2023 by alanb1951
Post:
Latest update (about 21:10 UTC 13th March)

The website is back online, but the recovery process continues. As a result, not all functionality is available. Until we can restart BOINC and the database, stats/contributions are not accurate. We will provide further updates as we progress. Thank you for your patience.
"The website is back online" -- that's a matter of opinion :-) -- initially I couldn't navigate away from the home page, either timing out or failing to get through the login process as it failed to load the post-login page and just put the login prompt up again! Eventually, I managed to get onto the forums but it was so painfully slow that I'm still trying to work out how one or two of the regulars had managed to make a post since the forum restart :-) (And speculation on why it's like that won't make it go any faster, of course...)

I wonder if they could have improved the phraseology a bit to warn about probable slowness, or elected not to bring the web site up more or less straight away...

Ah, well -- it's a start...

Cheers - Al.
46) Message boards : Cafe MilkyWay : WCG Friends (Message 75138)
Posted 10 Mar 2023 by alanb1951
Post:
Latest update (around 19:20 UTC on 10th March):

Update: The storage server was revived yesterday late afternoon. Both database filesystems mounted as before, but the science filesystem did not. It needs a repair; erasing the old log first.
I'm not sure what that might mean if there's any data loss :-( -- that might depend on what they meant by "[old] log", of course...

I presume that means "not until Monday at the earliest" -- I doubt there'll be activity over the weekend anyway given last weekend's unavailability of data centre support...

And on another note:

Thanks to Link for the two previous posts; my impression was also that WCG/IBM was "in the cloud" at the time of the transfer, and I seem to recall mention of data being shipped using rsync but I wasn't sure whether that was WCG speaking or user speculation... I'd love to know how much data had to be transferred :-)

Cheers - Al.
47) Message boards : Cafe MilkyWay : WCG Friends (Message 75134)
Posted 9 Mar 2023 by alanb1951
Post:
Latest update (around 17:30 UTC 9th March):

Update: The "new" system did recognize the data hardware RAIDs. All have been rebuilt, and the data center is attempting to repair the OS drives/RAID.
I always worry when I see the term "rebuilt" in reference to a hardware RAID controller -- sometimes that means "wiped and rebuilt" -- but I presume that's not the case here. Hopefully, all that was needed was to ensure that the RAID structures were recognized and consistent, with a (possibly quite long) delay while doing an integrity check...

Cheers - Al.

P.S. I wonder if the O/S drives are SSDs? For cost reasons, I doubt that the data disks are...
48) Message boards : Cafe MilkyWay : WCG Friends (Message 75131)
Posted 8 Mar 2023 by alanb1951
Post:
I assume Krembil got this system from IBM when they took over WCG. IBM makes their own servers but who knows, maybe IBM went for a cheaper Dell system when putting WCG together back then.
As far as I am aware, no hardware moved from IBM to Krembil (not even the production data disks!). WCG/Krembil will be using whatever they can get access to, and that is almost certainly far more constrained by budget than the IBM set-up was.

Cheers - Al.
49) Message boards : Cafe MilkyWay : WCG Friends (Message 75129)
Posted 8 Mar 2023 by alanb1951
Post:
Latest update (around 16:30 UTC 8th March)

Update: As of this morning, the data center continues to work on booting the temporary replacement DSS 7000 storage system. They are attempting multiple alternative strategies to resolve current failures.
Sounds like there are a lot of cobwebs to blow out of that bit of kit! :-) -- As far as I can determine (from resellers' sites[1]), it uses a now-discontinued Xeon E5 processor model, so it's not exactly new tech.

Someone posted a picture of what appears to be a DSS 7000 unit and said "It's only 90 drives, what could go wrong?..." (although that's a fully populated 4-processor version; there's also a dual-Xeon 45-drive version) -- if they manage to get this working, I wonder how long it will be before there's another failure; perhaps we should organize a collection for a brand new storage device?

And if anyone who reads this can actually confirm or correct the specification, feel free to do so :-)

Cheers - Al.

[1] I tried finding specs on Dell's site, but there wasn't anything immediately obvious and useful...
50) Message boards : Cafe MilkyWay : WCG Friends (Message 75123)
Posted 6 Mar 2023 by alanb1951
Post:
Further update at about 20:45 UTC

Update: unfortunately, the RAID controller was not the root cause of our storage system failure, the PCI bus failed. Data center is in the process of moving the disks to an alternate system and we will post updates as we progress. Once again, thank you for your patience.
Hmmm - reasonably frequent updates here :-)

I wonder if that means the RAID controller wasn't actually faulty, or whether the controller permanently upset the bus (or vice versa)...

At least it seems as if the data centre takes responsibility for finding replacement hardware, which is better than it could have been (so not the worst type of "single point of failure"...)

Cheers - Al.
51) Message boards : Cafe MilkyWay : WCG Friends (Message 75122)
Posted 6 Mar 2023 by alanb1951
Post:
Further update at about 19:00 UTC:
Update: Unfortunately, additional hardware problems on the storage server besides the RAID card are preventing us from restarting. Working with the data center on the alternative solutions.
I wonder if this is going to take as long to resolve as was the case for the disk failure here or the upload server filestore issues at CPDN a few weeks ago. If so, I think we'll see a few more of the smaller contributors bailing out for good :-(

We are, unfortunately, seeing the worst effects of an under-resourced transition. I don't know whether they'll ever be able to get the systems up to the former level again (philanthropy doesn't seem to be that popular at the moment, and I guess Krembil central doesn't see benefits in adding to the [limited?] support it offers to Igor and his small team), but I'd rather have a system that stutters occasionally than no system at all :-)

Patient, but mildly irritated - Al.
52) Message boards : Cafe MilkyWay : WCG Friends (Message 75117)
Posted 6 Mar 2023 by alanb1951
Post:
Maybe they could restart enough to just take back the tasks that are already out there but not send any new ones, get as many of the ones coming back processed as they can, then open the floodgates to us getting more tasks.

That would be sensible :-) -- however, they have to turn most of the BOINC server stuff on or user attempts to report tasks won't work; it's a matter of how selective they can be (or choose to be...)

The weak option is to depend on the delays in uploading stopping user systems from being able to download lots of work straight away, but the strong option is either not to start the download server processes (if that can be done independently of the upload server processes in their set-up) or to [temporarily] disable work-unit generation. And, given an earlier WCG message about why there was a shortage of work, I'm sure they can manage to stop new work being created in several different ways! :-)

At about 15:00 UTC today they tweeted that they're still working with the data centre to get things up and running. Here's hoping it doesn't take too much longer and (as you say) that they make an effort to try to give the [limited] network connections and the validators an easy time to start with!

Cheers - Al.

P.S. I've had a monitoring script running throughout, checking whether my API scripts can see the servers or not - this exercises both the authentication/authorization services and, if I can get past those, read access to parts of the BOINC database. So far, I've never been able to get past authorization at any point during the outage :-( -- I'll keep watching...
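
For the curious, the probe amounts to something like the sketch below. It's a minimal illustration only: the endpoint URL and query parameters are assumptions for the purpose of the example, not WCG's documented API, so substitute whatever your own scripts use:

    import time
    import urllib.request
    import urllib.error

    # Hypothetical member-results endpoint: a successful fetch exercises
    # authentication/authorization and then a read from the BOINC database.
    PROBE_URL = ("https://www.worldcommunitygrid.org/api/members/"
                 "YOUR_NAME/results?code=YOUR_VERIFICATION_CODE&limit=1")

    def probe() -> str:
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=30) as resp:
                return f"OK (HTTP {resp.status})"
        except urllib.error.HTTPError as e:
            # Auth layer is up but refused us (what I've seen so far)
            return f"reachable, but refused (HTTP {e.code})"
        except (urllib.error.URLError, TimeoutError) as e:
            return f"unreachable ({e})"

    while True:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), probe(), flush=True)
        time.sleep(15 * 60)  # poll every 15 minutes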
53) Message boards : Cafe MilkyWay : WCG Friends (Message 75107)
Posted 5 Mar 2023 by alanb1951
Post:
Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done.
That's good, all my tasks expire tomorrow.
Yup, and those of us who had GPU jobs (with their initial 3-day deadline) or lots of retry jobs (large queues?) have already-expired tasks waiting, so it'll be interesting to see what happens to those - most of my tasks expire late on Monday or on Tuesday as I run small queues and didn't have any non-GPU short deadline tasks when it went down (phew!)...

Ah well, it is what it is; just hoping that they don't restart until it is really ready :-)

Cheers - Al.
54) Message boards : Cafe MilkyWay : WCG Friends (Message 75105)
Posted 4 Mar 2023 by alanb1951
Post:
For information, in case anyone looks here... :-)

Latest tweets from WCG at about 20:00 their time on Friday 3rd:

Update: We have confirmed all the data is intact and have replaced the RAID controller, but we are still having some issues with getting the new hardware production ready. Unfortunately, data center staff will not be able to help us over the weekend.
Note the comment about lack of weekend support in this situation; I guess that means we won't see any proper signs of life before Monday at the soonest.

Additionally, the deadline of all existing WUs that are partially done will be extended and accepted once the hardware change is done.
With luck, a lot of the tasks will actually be marked for validation as they finally get uploaded... For those that aren't, there is a mechanism for re-validating work that hasn't been assimilated, but it entails determining the work-unit numbers of the tasks in question in order to feed them into [multiple runs of] an ops PHP script... Someone may have quite a lot of "research" to do for that :-)

Unhappy but patient - Al.

P.S. I presume the need for support means there are systems involved that WCG can't/should not restart without supervision :-) I wonder if, being short of hardware, they're running some stuff on servers that have other purposes too...

[Edited to add to my comment on the deadline tweet...]
55) Message boards : Cafe MilkyWay : WCG Friends (Message 75102)
Posted 3 Mar 2023 by alanb1951
Post:
Seems like WCG has problems; it's been down for two days so far. Anyone heard anything?
There has been a limited amount of information about this outage on their Twitter feed (which I can see [though I'm not "on Twitter"]) and there may or may not be more about it on Facebook...

In summary, they had a RAID controller failure which took out their network file server. A replacement controller has been provided and service may resume some time later on 3rd March.

Latest tweet, from around 08:00 (in WCG's time-zone) on 3rd March:
Update: The borrowed RAID card worked and the drive layout was recognized, so we have all data safe (there is also a tape backup, but accessing that would be slower). Data center managed a full boot and we expect we will resume operation later today.
Note the "borrowed" -- I think they owe SHARCNET a controller card, but at least there wasn't a [possibly long] wait whilst they sourced one to solve the immediate problem :-)

Note also that they [deliberately] didn't give an actual deadline. This makes sense because there's probably quite a lot of checking needed before it is safe to resume; just because they can see the file-store doesn't guarantee that everything on it is in a viable condition for user service to resume (especially with the huge backlog of uploads that will hammer the upload server(s) once they turn that back on!...)

Cheers - Al.
56) Message boards : Number crunching : Support for Intel ARC GPU ? (Message 75095)
Posted 1 Mar 2023 by alanb1951
Post:
Hello,

Is there any plan to support Intel ARC GPU ?
They're supported by Primegrid and Einstein@home and they're pretty decent in terms of performance.

I've got an ARC A750 on a system dedicated to BOINC since November and it has been crunching WUs 24/7 since :)
As far as I am aware, the latest Intel GPUs don't have built-in FP64 support (so OpenCL would have to fall back on software emulation...)

As the GPU app here is FP64, that would seem to make it a non-starter, I'm afraid.
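
For anyone wanting to check a particular card: clinfo reports double-precision support per device, so something along these lines (the grep pattern may need tweaking for your clinfo version) will show whether cl_khr_fp64 is advertised:

    clinfo | grep -iE 'device name|fp64'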

Cheers - Al.
57) Message boards : Application Code Discussion : Requesting but not getting new tasks for NVIDIA GPU (Message 75051)
Posted 12 Feb 2023 by alanb1951
Post:
MilkyWay tasks are OpenCL, not CUDA, and BOINC hasn't spotted OpenCL drivers so you need to install the NVIDIA OpenCL stuff.

Not sure how to do that on openSUSE - I think making sure you have an OpenCL ICD loader (ocl-icd-libopencl1 on my Ubuntu systems) should do the trick...

You can check the existing OpenCL devices with clinfo...
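
Roughly, something like the following; the Ubuntu package names are the ones I use, but the openSUSE ones are a guess on my part, so check with your package manager first:

    # Ubuntu/Debian: ICD loader plus the clinfo diagnostic tool
    sudo apt install ocl-icd-libopencl1 clinfo

    # openSUSE equivalent (package names are a guess - verify first!)
    sudo zypper install libOpenCL1 clinfo

    # Quick check that the NVIDIA platform and device(s) show up
    clinfo -l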

Cheers - Al.

P.S. the best place for queries like this is probably the Number Crunching forum or the Unix/Linux forum -- you might've got an answer from an openSUSE user! :-)
58) Message boards : Cafe MilkyWay : WCG Friends (Message 74847)
Posted 23 Dec 2022 by alanb1951
Post:
For information -- OPN1 has now officially joined OPNG in being suspended whilst the scientists set up work for the new target(s). There's a thread about it in the News forum...

So we now have to wait for both CPU and GPU work for OPN; my Pi4 is unhappy because its other BOINC project (TN-Grid) is effectively off at present due to recurrent file server issues (which are, apparently, beyond the control of the TN-Grid folks...) but my GPUs are working here and at Einstein...

Cheers - Al.
59) Message boards : Cafe MilkyWay : WCG Friends (Message 74845)
Posted 21 Dec 2022 by alanb1951
Post:
I'm not convinced yet they actually know how to run a Boinc Project

Think there is a growing number that would agree with that. No work today it seems.


Before I comment on the above, an observation -- I wonder how many people realize that the new WCG team is probably just the Jurisica Lab MCM team (with dependency on outside agencies for some aspects of hardware and networking...) -- I speculate that Krembil aren't actually pouring support in for Igor and company now that they realize what they've "bought into"...

And now, a couple of comments on the quotes...

It's not really a BOINC project - it's a pre-BOINC project that got "massaged" to fit into the BOINC universe rather than getting a complete "reboot" [like Apple do on hardware changes!] way back then... And a lot of the initial problems Jurisica Lab had were with the IBM/WCG stuff and the user-facing components (forums, web-site login, missing web-site components, inter-database communications[1]) - unfortunately, experience and relevant expertise are needed to solve such things, and with IBM out of the picture, where's the experience?...

The upload/download issues were/are probably a product of not having optimal infrastructure and [of course] wouldn't show up as a crisis until the system came under stress - given the change of physical platform(s) this is another "learn by experience" situation, and whilst it's unfortunate it's also understandable -- the fact that it has taken so long to resolve (and may still not be fully fixed) is likely to be down [in part] to dependency on an outside agency (SHARCNET?) for network stuff, as some changes may require said agency to do work that won't happen instantly on demand...

[And now all we need is a certain regular WCG forum poster to climb in to point out that the above is all about symptoms, not causes, and that the real problem is WCG's failure to communicate :-) ...]

Regarding available work -- I've been getting a steady flow of MCM work since the recovery from the situation that took several critical services off-line (there's a News thread about that at WCG...), but I will concede that OPN1 and ARP1 work has been less available. I suspect that in both those cases it may have as much to do with [bi-directional?] data-flow between the scientists and WCG as anything else; we already know that there's no OPNG work because the scientists are prepping for a new target (or targets) for the GPU version, and we don't know how much more CPU work there might be for the existing target... Mikey, if I recall correctly you're only asking for ARP1 (and HST1?), so at the moment you are out of luck :-(

Looking at wingman returns for ARP1 and MCM1 I see far more "No Reply" or "Not Started by Deadline"[2] tasks than I would normally expect for tasks with a 6-day deadline; it's not so prevalent with OPN1 as that uses Adaptive Replication so a lot of tasks don't need wingmen in the first place... I wonder if any existing restrictions on "unreliable" systems getting new work in bulk got lifted for the migration and haven't been put back.

So, overall, I don't think the WCG situation is "terminal", but the jury is still out on whether things are getting to a point where they might be able to announce an official restart (rather than the "still testing" mode that a lot of folks don't seem to realize still applies!) - I suspect it'll be a few more months before that's realistic. And I have had some experience of being in a team of one [or two if I was lucky] trying to balance a 48-hour working day with some modicum of personal life, so now that I know somewhat more about how non-standard the WCG set-up was/is, I'm perhaps a lot more forgiving than many others... (I wonder how many of the more vociferous complainants on the WCG forums can even code, let alone have worked as DBAs and/or SysAdmins!)

Cheers - Al.

[1] Whilst all the standard BOINC-specific stuff resides where BOINC expects to find it, the forums have their own system, as does a lot of the stuff about user statistics. It appears that a lot of that resides in a completely separate database (possibly carried over from pre-BOINC days?...)

[2] WCG doesn't flag Not Started by Deadline explicitly on the user web pages and the available API feeds, but such tasks are easily recognized by having an Error reply (with nothing but the client version listed) returned close to the deadline.
60) Message boards : Number crunching : Daily graphs of server_status (Message 74553)
Posted 24 Oct 2022 by alanb1951
Post:
...
Tom already said the problem is not enough memory in the current Server but the IT people are in charge of moving the stuff over to the new Server they already have ready and waiting

Hmmm, waiting for what?


Presumably the long-running N-body WUs have exacerbated the problem?
The two projects have separate validators, so the only effect one might have on the other is memory demands. However, it's not N-Body that's having severe backlog problems, if the state of my current tasks in progress or waiting for validation is anything to go by...

As at about 20:00 UTC on 24th October, a typical _0 N-Body task seems to take about 20 hours to pass through the validator (which, for some reason, doesn't allow it to validate without a wingman...) and have a new task sent out for a second opinion. Whilst that isn't brilliant, it's nowhere near as bad as the situation for Separation tasks!...

As at about the same time as the above, a typical _0 Separation task seems to take over 6 days to pass through the validator (whether it ends up self-validating or not) -- if it doesn't validate without a wingman, two further opinions will be sought, but when _1 comes back the transitioner will note that 3 results are required, so it will spin off another task without troubling the validator[1], and that's pretty quick!

It looks as if each million extra tasks awaiting validation adds about a day to the processing time. Bearing in mind that (in theory) more than 10% of initial tasks should end up needing a second (and third!) opinion, clearing out the backlog should produce a reasonable amount of available work, albeit more slowly... (And, of course, tasks that return with an error state should produce new tasks without needing to engage the work-unit generator, the same as second retries...) It has been suggested that it might be a good idea to turn off the generation of new work for Separation for a while -- I'd second that suggestion!
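
To put rough numbers on that (purely illustrative - these figures come from the observations above, not from any server data):

    # Back-of-envelope, using the figures quoted above (all assumptions):
    DAYS_PER_MILLION = 1.0   # ~1 day of validator latency per 1M-task backlog
    RESEND_RATE = 0.10       # >10% of _0 tasks fail to self-validate

    backlog = 6_000_000      # backlog implied by the ~6-day _0 latency
    latency_days = DAYS_PER_MILLION * backlog / 1_000_000
    retries = int(backlog * RESEND_RATE * 2)  # a _1 and a _2 per failed self-validation

    print(f"~{latency_days:.0f} days for a _0 Separation task to validate")
    print(f"~{retries:,} retry tasks available without making new work units")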

Cheers - Al.

[1] Unless the MilkyWay team has done something really strange to the core BOINC stuff, I don't think time-out or error-driven retries and "third opinion" tasks should go anywhere near the work-unit generator so they should not be held up if the validator is not involved (unlike "second opinion" tasks). If Tom knows otherwise, I'd be interested to know what they did!

P.S. "Waiting for what?" -- as mentioned elsewhere, they seem to be waiting for the IT people to sort out the migration...


