Welcome to MilkyWay@home

Posts by Richard Haselgrove

41) Message boards : Number crunching : AMD R9 290X does not receive any GPU work (Message 63188)
Posted 2 Mar 2015 by Richard Haselgrove
Post:
Syntax error: there's a line missing.

<app_version>
<app_name>milkyway_separation__modified_fit</app_name>
<version_num>136</version_num>
<file_ref>
<file_name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>milkyway</name>
</app>

It's much easier to see if the author had used indenting in the original file source - and if this project would finish updating their web code, so that <> would render correctly in [ pre ] and [ code ] blocks.

    <app_version>
        <app_name>milkyway_separation__modified_fit</app_name>
        <version_num>136</version_num>
        <file_ref>
            <file_name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</file_name>
            <main_program/>
        </file_ref>
    <app>
        <name>milkyway</name>
    </app>
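
A quick way to catch this kind of slip without relying on indentation is to run the fragment through an XML parser. What follows is only a minimal sketch in Python, not anything the project supplies: the helper name check_fragment is made up for illustration, and wrapping the fragment in an <app_info> root is an assumption about the file's outer element.

```python
import xml.etree.ElementTree as ET

# Fragment as posted, with the closing </app_version> line missing.
BROKEN = """
<app_version>
    <app_name>milkyway_separation__modified_fit</app_name>
    <version_num>136</version_num>
    <file_ref>
        <file_name>milkyway_separation__modified_fit_1.36_windows_x86_64.exe</file_name>
        <main_program/>
    </file_ref>
<app>
    <name>milkyway</name>
</app>
"""

# Same fragment with the missing line restored after </file_ref>.
FIXED = BROKEN.replace("</file_ref>\n", "</file_ref>\n</app_version>\n")


def check_fragment(fragment):
    """Hypothetical helper: return None if well-formed, else the parser's message."""
    try:
        # Wrap in a single root (assumed name) so that several
        # top-level elements can be parsed together.
        ET.fromstring("<app_info>" + fragment + "</app_info>")
    except ET.ParseError as err:
        return str(err)
    return None


print("broken:", check_fragment(BROKEN))
print("fixed: ", check_fragment(FIXED))
```

The parser reports the line and column where the nesting stops making sense, which is usually close to the missing tag. It only checks well-formedness: it says nothing about whether the files named in the fragment actually exist.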

Missing files: if you use an app_info file, you are responsible for supplying all the files referenced. As I said to Arkayn, it would be helpful if he could supply download links to go with his app_info example. I don't know this project well enough (and not at all for ATI cards), so I wouldn't trust my own advice even if I tried to work it out.

I'd also remove Claggy's SpaceTM from <executable />, making it <executable/> - though I think modern versions of the BOINC client should cope with it.
42) Message boards : Number crunching : AMD R9 290X does not receive any GPU work (Message 63177)
Posted 23 Feb 2015 by Richard Haselgrove
Post:
That has both the GPU and CPU apps included, but not NBODY.

Perhaps you should give him download urls for the applications as well?
43) Message boards : News : New Nbody version 1.48 (Message 63170)
Posted 20 Feb 2015 by Richard Haselgrove
Post:
Reporting on the split-run experiment. I ran six tasks with one thread to the first checkpoint, and multi-threaded thereafter. It shows in the stderr_txt:

<stderr_txt>
<search_application> milkyway_nbody 1.48 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 4 processors
Using OpenMP 3 max threads on a system with 4 processors
Poor likelihood. Returning worst case.
<search_likelihood>-9999999.900000000400000</search_likelihood>
21:46:43 (8552): called boinc_finish

</stderr_txt>

So far, three WUs have validated against the extended quorum of three completed tasks, and three are still waiting for 64-bit clients to finish them off. No errors reported.

Wingmate 520623 was interesting:

Using OpenMP 16 max threads on a system with 24 processors

He's using an older BOINC v7.0.64 client, so he can't be using the same facility as me to reduce the thread count - maybe the server has been set to a maximum of 16 threads. But he's getting nothing like a 16:1 CPU time to runtime ratio for nbody tasks.
44) Message boards : News : New Nbody version 1.48 (Message 63167)
Posted 20 Feb 2015 by Richard Haselgrove
Post:
It looks as if the new v1.48 application has been deployed as a 64-bit application only.

32-bit Windows computers are still being allocated work, but are trying to run it with the old v1.46 application - and failing. That causes unnecessary waste and delay in validating the results from hosts which are still 'unreliable' and need the full 3-way validation.
45) Message boards : News : New Nbody version 1.48 (Message 63165)
Posted 19 Feb 2015 by Richard Haselgrove
Post:
Interesting thoughts. I wonder if Sidd can enlighten us where the parallelisation 'sweet spot' is, or if they'd like us to try and find it. Considering the MT phase only, I'd imagine that there comes a point where the overhead of managing and synchronising multiple threads exceeds the benefit - but I wouldn't know whether the tipping point is above or below 15 threads.

I'm currently verifying that the Application configuration tools available in BOINC v7.4.36 allow thread control - initially, limiting the active thread count to 3, so that other projects can continue to make progress on one core while nbody runs.
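
For anyone wanting to try the same thing, here is a rough sketch of how such an app_config.xml could be generated. Treat every specific as an assumption to check against your own client_state.xml and the BOINC documentation: the application name milkyway_nbody, the plan class mt, and the --nthreads command-line flag are not confirmed anywhere in this thread, and build_app_config is just an illustrative helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, threads):
    """Build an app_config.xml tree limiting one app version to `threads` threads."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Number of CPUs the client should budget for each task of this version.
    ET.SubElement(version, "avg_ncpus").text = str(threads)
    # Command line handed to the application; the flag name is an assumption.
    ET.SubElement(version, "cmdline").text = "--nthreads %d" % threads
    return root


# Assumed names: check the real app name and plan class before using this.
config = build_app_config("milkyway_nbody", "mt", 3)
print(ET.tostring(config, encoding="unicode"))
```

The printed XML would go in the project's folder under the BOINC data directory as app_config.xml, and the client has to re-read its config files before the change takes effect.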

The next test is to run a bundle of tasks to the first checkpoint with an app_config thread limit of one, with the intention of changing app_config when they've all been prepped, and running the MT phase with a 3-thread app_config. Very labour intensive, and not amenable to scripted automation, but might be an interesting proof-of-concept. If it works, the project might consider splitting the app at the end of the initialisation phase.

a) Send out a single threaded task to perform initialisation
b) Return the initialisation data generated as an output file
c) Send out the initialisation data as an input file to a new, multithreaded, simulation task.
46) Message boards : News : New Nbody version 1.48 (Message 63163)
Posted 18 Feb 2015 by Richard Haselgrove
Post:
Therefore, for now, we removed all multithreading of the initialization of the dwarf galaxy. This is why it will run only on one thread until the initialization completes. However, after performing a speed profile on the code, it was determined that a majority of the run time of the previous code was spent on a single function (thanks to Roland Judd for catching that!). We optimized this function, leading to a great decrease in the run time.

Cheers,
Sidd

To put some figures on that. I'm running on an i5 laptop (2 cores, 4 threads). Current task was estimated at 5 hours 26 mins. The single-threaded initialisation phase lasted for 18 minutes. (I think the initialisation lasted roughly the same time for the previous task, estimated at 50 minutes, but I didn't have Process Explorer open to monitor). Would I be right in assuming that the initialisation would be expected to have a constant duration for any given host, no matter how long the expected task duration?

During initialisation, the application doesn't checkpoint, and doesn't report any progress %age. I'm running the current recommended BOINC v7.4.36, which estimates and reports a 'pseudo progress %age' to reassure the casual observer if the application fails to supply any real %age.

But when the application switches to true multithreaded mode after initialisation, it

a) checkpoints
b) reports the true progress to that stage, which it reports as zero%

So, the quite significant 'pseudo progress' (5.4%, for this task), is thrown away, and the progress bar regresses to the origin.

That's all 'by the book' - just reporting it as a 'boincification artefact' that might catch some users unawares.
47) Message boards : News : New Nbody version 1.48 (Message 63162)
Posted 18 Feb 2015 by Richard Haselgrove
Post:
Good call, Richard, on asking the correct question.
However, I've already aborted my 2 long-running v1.46 tasks.

If in doubt, answer your own question. This for v1.48 - I doubt they've changed that particular parameter between runs.

    <rsc_fpops_est>	    41011100000000.000000  </rsc_fpops_est>
    <rsc_fpops_bound>	410111000000000000.000000  </rsc_fpops_bound>

I make that ten thousand times the estimate :o
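
For anyone who wants to check the arithmetic on their own tasks, a small Python sketch along these lines will do it. The wrapper element <workunit> and the function name fpops_ratio are only there for illustration; the two values are the ones quoted above.

```python
import xml.etree.ElementTree as ET

# The two lines quoted above, as they appear in the workunit description.
SNIPPET = """
<rsc_fpops_est>     41011100000000.000000  </rsc_fpops_est>
<rsc_fpops_bound>   410111000000000000.000000  </rsc_fpops_bound>
"""


def fpops_ratio(snippet):
    """Return rsc_fpops_bound divided by rsc_fpops_est for a workunit snippet."""
    # Wrap in a single root element (illustrative name) so it parses as XML.
    root = ET.fromstring("<workunit>" + snippet + "</workunit>")
    # float() ignores the padding whitespace around the numbers.
    est = float(root.findtext("rsc_fpops_est"))
    bound = float(root.findtext("rsc_fpops_bound"))
    return bound / est


ratio = fpops_ratio(SNIPPET)
print("bound is", ratio, "times the estimate")
```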

I was suspicious, but even I didn't expect it to be that bad!

(and I see they've half-updated the web forum code, just to make things even more interesting)
48) Message boards : News : New Nbody version 1.48 (Message 63160)
Posted 18 Feb 2015 by Richard Haselgrove
Post:
If it reads version 1.46, that's the bugged version and those units will never finish. Abandon those units then force an update for milkyway@home to get v1.48.

Did we ever get absolute confirmation from a project admin, that some of the 1.46 units would never complete?

You should ask whether the tasks will ever complete *successfully*. They are guaranteed to complete, because the BOINC client will kill them eventually for over-running.

I'm not sure how long they will run on for. The traditional setting for a BOINC project in general is ten times the initial estimated runtime, but I wouldn't be surprised if this project had given themselves a bit of extra headroom to protect themselves from what they might see as premature exits.

If you're feeling masochistic, check the ratio between <rsc_fpops_est> and <rsc_fpops_bound> for the tasks in question.
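
As a rough guide to what that ratio means in practice: assuming the runtime estimate and the abort limit are both derived from the same speed figure, the limit is just the estimate scaled by the ratio. A minimal sketch, with made-up numbers and an illustrative function name:

```python
def max_elapsed_seconds(estimated_runtime_s, fpops_est, fpops_bound):
    """Approximate time at which a task would be killed for over-running.

    Assumes estimate and limit both come from the same speed figure,
    so the limit is the estimate multiplied by bound / est.
    """
    return estimated_runtime_s * (fpops_bound / fpops_est)


# Hypothetical task: estimated at one hour, with the traditional
# ten-times headroom between estimate and bound.
limit = max_elapsed_seconds(3600.0, 10**13, 10**14)
print("killed after roughly", limit / 3600.0, "hours")
```

The client's own estimate shifts as it learns how fast the host really is, so treat the result as an order of magnitude rather than a deadline.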
49) Message boards : News : New Modfit Runs (Message 63091)
Posted 2 Feb 2015 by Richard Haselgrove
Post:
It would be useful if there were a fully documented list of app_info parameters and their likely effect - a definitive list rather than the piecemeal guesswork I have stumbled across in various forums. Oh well...

Start with

http://boinc.berkeley.edu/wiki/Anonymous_platform
50) Message boards : Number crunching : What is the cause of these 'validate errors' (Message 63074)
Posted 25 Jan 2015 by Richard Haselgrove
Post:
My belief is that at some point in the dim and distant past, somebody set an obscure configuration policy on the server, and subsequently forgot all about it:

Adaptive Replication
51) Message boards : Number crunching : N-Body long processing time (Message 62628)
Posted 26 Oct 2014 by Richard Haselgrove
Post:
Mine just died.

"Aborting task de_nbody_08_05_orphan_sim_3_1412191206_12667_1: exceeded disk limit: 56.86MB > 50.00MB"

That 50MB number must be from Milkyway@Home's programming? My settings allocate several GB to all the projects.

Yes then that is an error within the app.

And the same error - which might be a problem with the application coding (causing too much to be written to disk), or might be a problem with the BOINC deployment (not specifying a high enough value for the expected amount of data to be written) - has been reported repeatedly on these message boards for over four months (Jacob Klein, 20 June, message 61923).

I don't run MilkyWay any more, because the administrators appear to pay no attention to the specific error messages which are reported to them by users.
52) Message boards : News : New Separation Modfit Version 1.36 (Message 62585)
Posted 16 Oct 2014 by Richard Haselgrove
Post:
Your Win32 machines will no longer be receiving Modfit work units so that won't be a problem any more.

I was aware of the ongoing Win32 problems and that you had subsequently blocked Modfit from Win32 machines. I was just trying to add some specifics to the discussion - that my Win32 errors were immediate computation errors [-1073741515 (0xffffffffc0000135) Unknown error number]. Other posts on this thread seemed to indicate that some Win32 units were failing after processing for significant amounts of time.

Isn't 0xc0000135 (expressing it in 32 bits) the very old and very well known 'The application failed to initialize properly' STATUS_DLL_NOT_FOUND?
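
The conversion between the three forms quoted above is just a matter of keeping the low 32 bits. A short Python sketch, with an illustrative helper name and a lookup table holding only the one status code mentioned in this thread:

```python
def to_unsigned32(code):
    """Reinterpret an exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


# Only the code discussed here; this is not a complete table.
KNOWN_STATUS = {0xC0000135: "STATUS_DLL_NOT_FOUND"}

# The signed decimal and sign-extended hex forms from the error message.
for raw in (-1073741515, 0xFFFFFFFFC0000135):
    status = to_unsigned32(raw)
    print(hex(status), KNOWN_STATUS.get(status, "unknown"))
```

Both forms reduce to the same 32-bit value, which is the one to search for.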

Shouldn't somebody be checking to see what DLLs are required by the application (http://www.dependencywalker.com/ is your friend), what DLLs are being sent out, and why there's a difference between the two?
53) Message boards : News : New Version of Separation Modified Fit (1.32) (Message 62356)
Posted 17 Sep 2014 by Richard Haselgrove
Post:
SLRE,

Are you using the most recent NVidia drivers for your cards?

Jake W.

There are hints at SETI@Home of a possible OpenCL problem with NVidia driver 340.52 - the problems observed so far relate to Compute Capability 1.x cards only, but that could be the tip of the iceberg.

NVidia have reproduced the observed problem and are investigating. https://developer.nvidia.com/nvbugs/cuda/edit/1554016 (accessible to registered developers only)
54) Message boards : News : Milkyway/Bitcoin Utopia Update (Message 62261)
Posted 5 Sep 2014 by Richard Haselgrove
Post:
I have never had an issue with BU's website and I access it from home and work.

I'm sorry, Blurf, but - and without meaning any disrespect to you - that means absolutely nothing.

I hadn't ever received a virus from the British Telecom support desk mail server until that night, either - but their lack of vigilance allowed a virus into their systems, and they very generously shared it with me and several thousand other customers.

I mentioned "the social engineering skills of malware-writers" last time. This is what I meant - and I actually quite admire them for it.

I was working late, around 9pm, alone on a client's site (commissioning a new server, IIRC). I saw the infected email incoming on my personal laptop, thought it was suspicious, checked it with my (up-to-date) AV software - no detection. I forwarded the attachment to the Symantec reporting portal - received automated acknowledgement, but nothing further. This was all happening on the Friday of the US Thanksgiving weekend, and I received the infected email within about an hour of what was later reported as the time of first detection in the wild. My feeling is that the time of release was very carefully chosen as one where global security watchfulness was at one of its lowest points in the year. That, and the chosen release vectors (BT, plus NTL - at that time the largest cable service provider in the UK), are highly unlikely to be random.

I was on the phone to BT until well after midnight that night, trying to persuade them that they had a serious problem, but sadly failing.

The next morning, Saturday, I rang a UK security specialist, Sophos, and received what I feel was an exemplary response (I'm not connected with the firm in any way: I just rang as any ordinary joe public).

The person who answered the phone seemed technically astute: listened to my description, and said 'sounds like it's worth a look - I'll call in an engineer', and gave me instructions for secure submission of the sample. A couple of hours later they rang back to say it was definitely malicious, carried two separate payloads, and was previously unknown and undetected by their existing product. By about 1 pm, they had a definition hotfix available, and asked me to test it: by 4 pm, the finished hotfix was available for their customers to download. Meanwhile, BT were still refusing to acknowledge that they had a serious problem. The rest you can read in the links I posted earlier.

I tell that story at some length (and with feeling) because it's a very clear example of why heuristic (behavioural) scanning was added to the anti-malware arsenal - at that time (almost 13 years ago), almost all virus detections depended on the signature-matching which failed so dismally in the BadtransB case.

And that's why I, personally, would never disregard any malware report without investigation. Some users here may recognise my name from the SETI@Home user forums: I am the person responsible for the final assembly and distribution of the "Lunatics" optimised application installer package. Like BOINC itself, I provide a Windows executable file for users to download and run: it requests Administrator privileges while running, and drops a payload of other Windows executables. That makes me a perfect virus distribution vector: you may have got some inkling by now of the responsibilities that I accept that my volunteer role places on me.
55) Message boards : News : Milkyway/Bitcoin Utopia Update (Message 62259)
Posted 5 Sep 2014 by Richard Haselgrove
Post:
Sadly, you are not alone in your complacent attitude. Please remember: not every alarm is a false one. I invite you to read these two write-ups (of the same event), which as you will see I was personally involved in.

http://www.zdnet.com/bt-mails-virus-to-customers-3040139746/
http://www.computerweekly.com/news/2240043325/BT-Openworld-sends-virus-to-customers

Edit: and when lax security attitudes meet the social engineering skills of malware-writers, the results can be significantly damaging. Follow-up to the same story:

http://www.theregister.co.uk/2002/01/31/badtransb_tops_virus_charts/
http://virus.wikia.com/wiki/Badtrans
56) Message boards : News : Milkyway/Bitcoin Utopia Update (Message 62257)
Posted 4 Sep 2014 by Richard Haselgrove
Post:
I think that the last few posts have displayed very poor judgement, to the point of gullibility and possibly culpable neglect.

Yes, the majority of BOINC malware reports are false positives - I've debunked a few myself, and mostly used a clean virustotal scan to do so. But I don't think that justifies ignoring reports without investigation. Use skill, judgement and best security practices to evaluate a potential threat and decide on your course of action.

Rightly or wrongly, Bitcoin Utopia has established a love-hate relationship with the rest of the BOINC community. Some people love it because of the funds it is raising for scientific projects like this one, and for the credits it awards. Other people - sadly - come close to hating it: the funds raised are actually very small, and the credits have skewed the cross-project statistics. It is just possible (but reprehensible if true) that someone in the latter group may have injected malware. Proceed with caution.

One of the arguments I use to support a 'false alarm' judgement on virus scares is: "if the file has been deployed many hundreds of thousands of times from a secure project server over a period of years, and not been detected as a virus over all that time, then it probably doesn't contain a virus". Bitcoin Utopia is a relatively young project, and its applications are rapidly changing: I don't think it can shelter behind that defense.

There was discussion on the BOINC developers mailing list within the last 24 hours - I quote verbatim:

*Deprecated*: mysql_pconnect(): The mysql extension is deprecated and will be removed in the future: use mysqli or PDO instead in
*/home/boincadm/projects/bitcoinutopia/html/inc/db.inc* on line *52*

The error message is on the front page and also on ops front page. How to
fix it?

Did you ever figure this out? I too am seeing this annoying message. I
recently upgraded both the web code and the server code, and I am still getting
it.

We changed this line in util.inc:
ini_set('display_errors', true);
to this:
ini_set('display_errors', false);

That removed some of the messages but there are still pages where it shows
up.

Henri.

I'm sorry, but when the response to a security warning is to shoot the messenger - closely related to burying your head in the sand - then I don't have confidence that a proper anti-malware security protocol is demonstrably in place. Again, another reason for urging caution.
57) Message boards : Number crunching : No usable GPUs found (Message 62102)
Posted 1 Aug 2014 by Richard Haselgrove
Post:
Catalyst 12.2 for Windows XP doesn't have OpenCL drivers, even though the download page when it was first released claimed that it did.

When I unpacked the installation package to my machine, this file existed:

C:\AMD\Support\12-1_xp32_dd_ccc\Packages\Apps\OpenCL\OpenCL.msi

See if you can find it in your C:\AMD\Support\12-2_xp32_dd_ccc tree.
58) Message boards : Number crunching : Wasted 1.6 million CPU seconds!!!!!!! (Message 62009)
Posted 4 Jul 2014 by Richard Haselgrove
Post:
I've 'switched off' Milky Way as a source of Tasks. I spent 1.6 million CPU seconds only for the Task to blow up 20 or so hours before the deadline. In the 'Number Crunching' Board I see plenty of concurrent references to Users with similar behaviour. But I haven't seen any explanation from Milky Way Experts.
Here's the message from the death of my exploding Task.

772937494 577749921 578820 22 Jun 2014, 3:40:43 UTC 4 Jul 2014, 0:22:43 UTC Error while computing 888,657.72 1,642,079.00 --- MilkyWay@Home N-Body Simulation

I believe that is WAAAAAY too long for an N-Body unit to run, next time abort it and move on to the next unit. Your valid N-Body units took about 14,000 and 6,000 seconds, if you pass double the 14k number and the percentage complete isn't about 90%, I would move on. Not EVERY unit runs on every machine, some units are just too full of things that our pc's can't figure out. Aborting them puts them back in the queue for someone else to try, as long as you don't abort hundreds of units over a short period of time it won't make a bit of difference to your crunching.

Check the task, Mikey. Result ID 772937494

It was Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED like all the others - nothing to do with time.

Until the administrators and developers here learn how to drive their BOINC server properly, n(o)body is going to get any work done.
59) Message boards : News : New N-Body Runs (Message 61948)
Posted 25 Jun 2014 by Richard Haselgrove
Post:
The workunits take vastly different amounts of time to complete. This is a problem that we at MW@Home have been working on to assign appropriate credit to crunchers. Our goal is to ultimately perfect this art and prevent the assignment of non-useful simulations that are time-expensive. You are right to say that for the same simulation, the wall clock time on your 4 core machine should be half that of your dual AMD cores. I can go into more detail. If you send me a private message, I would be glad to explain the science of how the workunits are very difficult to assess computationally. I hope that I can answer any questions you may have.

Jake

Have you talked with David Anderson about CreditNew's crediting of MT tasks? I mean *really* talked to him, challenging his position with facts, rather than just receiving the standard speech that it works?

YAFU (YoYo's Beta project) are having the same problem: http://boinc.berkeley.edu/dev/forum_thread.php?id=9317
60) Message boards : Number crunching : BOINC IPs (Message 61943)
Posted 24 Jun 2014 by Richard Haselgrove
Post:
Ok, since this is more of a BOINC issue and not a Milkyway-specific issue, please post your concern on the BOINC forum for further assistance.

Thank you.

On the contrary: once the BOINC client has been set up and configured by an authorised system administrator, it only needs to contact the servers of the project(s) it is attached to.

Any system administrator with network experience will quickly find:

C:\>ping milkyway.cs.rpi.edu

Pinging milkyway2.phys.rpi.edu [128.113.126.23] with 32 bytes of data:

and be able to identify upload and download servers similarly. That's the most basic of DNS lookups.

You will also need to know which ports to open for network access. Most BOINC projects use http on the standard port 80: some (notably WCG) use https on port 443. No other ports need to be opened.



©2024 Astroinformatics Group