Welcome to MilkyWay@home

Posts by Mr McGill

1) Message boards : News : New MilkyWay@Home Developer (Message 67733)
Posted 27 Aug 2018 by Mr McGill
Post:
Currently, periodic angle parameters θ and φ are not allowed to wrap from 3.14 to 0
or from 0 to 3.14 making it possible for optimizations to get stuck on a boundary of these
parameters even though the best value lies within the constraints. If the best value for one
of these parameters is on the opposite side of the periodic boundary from the current search
neighborhood, the optimizer will continue approaching the boundary causing it to get stuck.


You mean to tell me that, after all this time, all you had to do was allow the angle to roll out to 3.15 by quickly zeroing any value attempted above Pi? Quite possibly it was more complex to implement than that, but it would have solved so many N-Body lockups, wouldn't it?
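For illustration only, the wrap-around behaviour the quoted news post describes could be sketched like this. These helper names are hypothetical and not MilkyWay@home's actual optimizer code; the point is just that wrapping lets a search step cross the Pi/0 boundary instead of piling up against it:

```python
import math

def wrap_angle(theta, period=math.pi):
    """Wrap an angle into [0, period) so an optimizer step that
    overshoots Pi lands back near 0 instead of being clamped."""
    return theta % period

def periodic_distance(a, b, period=math.pi):
    """Shortest separation between two angles once wrapping is allowed.
    Without wrapping, 3.1 and 0.05 look ~3.05 apart; with it, ~0.09."""
    d = abs(a - b) % period
    return min(d, period - d)
```

With wrapping in place, a search neighborhood near 3.1 is actually close to an optimum near 0.05, so the optimizer no longer gets stuck on the boundary.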
2) Message boards : News : Nbody 1.68 release (Message 67421)
Posted 2 May 2018 by Mr McGill
Post:
Sorry for the necro, but a thought came to me. Back when the 'new' N-Body version started, faults in N-Body processes seemed reduced, and the problem of tasks getting stuck with no further progress looked improved. Now, though, I have seen only 3 reach completion in as many weeks, with all the others sitting at 3 or 4 hours of processing time and a wildly variable estimate to completion (from 10 hours to 3 weeks) depending on where the process got stuck. A worst case quoted something like 157 days when it stalled at ~0.05% for a few hours (just a little beyond the deadline *cough*).

Another thought came to me this morning as I cleared an overnight frozen N-Body: if everyone hitting a faulty work unit passes it on, will it be passed back into the pool without examination for other machines to attempt? If a packet of data strikes such a bug, will it move up the priority chain to be solved more quickly, increasing the 'density' of faulty work units being computed? It just seems odd that so many of these units in particular are faulting when none of the other variants hit errors (which I would hope suggests there is nothing wrong with the logical processors of this device; otherwise I'll need to look at fixing it).


TLDR:
-The number of N-Body work units locking up seems to be getting worse: are they being prioritized by your distribution system?
-Do your systems have a way to measure or monitor how many times such work units are passed back and forth without reaching completion?

Side query A:
-At one point your server stopped sending N-Bodies to my machine (which solved the problem well enough), presumably due to the 'high error' solution you rolled out, but now they are back and as glitchy as ever. How many processes do I need to abort to be thrown back onto that list (if that is even how it happened)?


Side gripe B:
As a personal complaint: I do wish BOINC could reduce the total CPU% being utilized. Its insistence on using 100% of the processor X% of the time causes thermal spikes that can lead to thermal throttling. Poor laptop.

This morning's N-Body lockup happened at 1 hr 24 min, at ~15%, in de_nbody_4_19_2018_v168_20k__Data_1_1523906284_51420
3) Message boards : News : Nbody 1.68 release (Message 67058)
Posted 10 Feb 2018 by Mr McGill
Post:
I had great hopes the new-model N-Body would resolve the faults, but I am still getting N-Body tasks trapped. Sometimes suspending helps (so far I am at around 5 successes to 20 failures); restarting has so far not.

Second thought: with our N-Body fails being a pain in the proverbial, are they responsible for some of the failed reporting? In particular, their runtime can exceed their report deadline, which could also cause failures in single-processor tasks paused to run a multicore N-Body that never reaches completion.
4) Message boards : News : Validation Inconclusive Errors (Message 66964)
Posted 14 Jan 2018 by Mr McGill
Post:
I own a measly laptop and have been trying to contribute what computation I can, but the N-Body tasks have an interesting property: at least half tend to get 'caught' in their own processing, to the point where, after a day of running, they claim they will be done in two more days, or occasionally stay trapped for a few days and report a completion time beyond the reporting deadline.

Is there a way to allow processes that get stuck at a certain step to automatically abort and report the troublesome stage back to you guys, to permit improvements to the code?
Right now, if after a few hours the time remaining starts ballooning, I abort manually; there is no way a process that has been running for 2 hours at 95% complete should still be at 95% an hour later, which is something I have witnessed.
I note that BOINC doesn't even notice when the predicted completion time ends up beyond the report deadline and cancel the process then, which is what was previously happening to N-Body before I started manually monitoring its progress.
I don't know why N-Body is doing this. I have offered it plenty of memory (which it hasn't used), it steadily reduces its CPU usage as it jams up, and BOINC has access to a few gigabytes of storage which it has barely scratched during this computation, so I am puzzled whether the issue is a bottleneck or a crashing component.
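The manual abort rule described above (a task frozen at the same percentage for an hour) could in principle be automated with a simple stall check. A minimal sketch, assuming something periodically samples the task's fraction-done; the function name and thresholds are hypothetical, not part of BOINC or the N-Body client:

```python
def is_stalled(history, window=3, eps=1e-4):
    """Given recent fraction-done samples (oldest first, taken at a
    fixed interval), flag a stall when progress over the last
    `window` samples is below `eps` (i.e. effectively zero)."""
    if len(history) < window:
        return False  # not enough samples yet to judge
    return history[-1] - history[-window] < eps
```

A watchdog built on this could abort the task and report the stage it froze at, rather than leaving the user to eyeball the ballooning time-remaining estimate.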




©2021 Astroinformatics Group