Welcome to MilkyWay@home

Posts by Richard Haselgrove

21) Message boards : News : Windows Users-please abort Nbody tasks (Message 63610)
Posted 16 May 2015 by Richard Haselgrove
Post:
There is no "MilkyWay@Home N-Body Simulation" checkbox in Preferences. There isn't even a Preferences. There is a Tools -> Computing Preferences dialog box. Should I be looking elsewhere?

Yes, you should be looking at your project preferences page on this website. Go there via your account page, or follow this link:

http://milkyway.cs.rpi.edu/milkyway/prefs.php?subset=project
22) Message boards : News : Windows Users-please abort Nbody tasks (Message 63604)
Posted 16 May 2015 by Richard Haselgrove
Post:
A task takes more than 32 hours running. Is normal? Thanks

If it's a Windows Nbody task, it's broken. But then, it's normally broken.
23) Message boards : News : New Nbody Version 1.50 (Message 63593)
Posted 15 May 2015 by Richard Haselgrove
Post:
My task has now hit 24 hours, at 100%.
Should I let it continue to run?
And why or why not?

Frustrated.

What does Process Explorer say about what it's doing?
24) Message boards : News : New Nbody Version 1.50 (Message 63591)
Posted 15 May 2015 by Richard Haselgrove
Post:
I'm also noticing the same behavior, on:
ps_nbody_5_12_15_orphan_sim_1_1431361804_28199_0
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1117230409

Admins:
1: Is the behavior (reserving multiple cores, despite using only 1) ... expected behavior?
2: Is the behavior of going to 100%, then still running for hours/days after that ... expected behavior?
3: Will the task ever end?


It would help tremendously, if you could very thoroughly describe the expected behavior for these work units. People are aborting them, because the tasks look odd/broken, and if they're not broken, you need to do a better job of communicating your expectations.

Thanks in advance for your reply,
Jacob

I second Jacob's call for an 'expected' (developer's viewpoint) description of the runtime profile, in terms of CPU usage over time.

But I would also urge users to monitor this new application with additional tools, not just BOINC Manager.

I've just got back home after a few days away, and I won't even attempt to run one of these tasks (it will be under Windows) until later in the weekend. But what I've just read from several different users is a perfect description of what BOINC v7.4.xx is designed to display when no actual work at all is being reported by the science application. That might be because of an error in the progress reporting or checkpointing functions, or it might mean that nothing is being done. That gradual approach, getting closer and closer to 99.999999% done, but never quite reaching 100%, is exactly what you should see if an application stalls at startup and goes nowhere.
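
To picture that "closer and closer, but never quite there" display, here is a minimal sketch of an asymptotic placeholder progress figure. It is not BOINC's code: the exponential form, the function name and the one-hour estimate are all assumptions, chosen only to reproduce the shape described above.

```python
import math


def simulated_fraction_done(elapsed_s: float, estimated_s: float) -> float:
    """Illustrative placeholder progress (assumed model, not BOINC's own).

    The value rises quickly at first and then creeps towards 1.0
    without the underlying task having to report anything.
    """
    if estimated_s <= 0:
        raise ValueError("estimated_s must be positive")
    return 1.0 - math.exp(-elapsed_s / estimated_s)


if __name__ == "__main__":
    estimate = 3600.0  # hypothetical one-hour runtime estimate
    for hours in (0.5, 1, 2, 4, 8):
        shown = simulated_fraction_done(hours * 3600.0, estimate)
        print(f"{hours:>4} h elapsed: {shown * 100:.4f}% shown")
```

A display that follows a curve like this tells you about elapsed time only, which is why an independent tool is needed to see whether the process is really using the CPU.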
25) Message boards : Number crunching : 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED (Message 63511)
Posted 4 May 2015 by Richard Haselgrove
Post:
I reported this problem to the BOINC developers, and got this reply from David Anderson:

I looked at this and couldn't immediately see the problem.
The BOINC client deletes everything in a slot directory before using it for a new job.
If a deletion fails (e.g. because a file is in use by another app) it doesn't use
that slot directory.
I verified this by opening some Word docs in slot directories.

Notes:

* There's a "slot_debug" log flag for messages related to slot directories.
Unfortunately it doesn't print messages about failed file deletions; I'll add this.
* The "disk limit exceeded" errors refer to the per-job disk limit, not the user's
disk usage preferences; I'll change the message to clarify this.
* Apps aren't responsible for cleaning out their slot dirs; BOINC does this. It
may be that BOINC is failing to delete VM images because they're still in use by
the VirtualBox executive.

Bottom line: I'll need some more info to debug this.
If anyone is seeing this reproducibly, let me know.
Otherwise we'll release a client with more debugging output to help us investigate.

-- David

So, help needed.

Under what circumstances does the CMS .vdi image get left behind? Is there a difference between successful task completions and abnormal (error) exits?

Can the .vdi be deleted manually? Immediately? Later? After BOINC restart? After reboot?

Does BOINC ever clean it up by itself, say after a client restart?

And anything else you can think of.

Could somebody pass David's message over to CERN/CMS-dev, please? I don't even have an invitation code to create a posting account.
26) Message boards : Number crunching : 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED (Message 63484)
Posted 30 Apr 2015 by Richard Haselgrove
Post:
A bad driver by itself wouldn't cause a disk limit error - unless it's spewing out yards and yards of error messages. Look in the slot directory...

I'm afraid I'm not sure I know what to look for. These tasks are failing right away, after only 1-2 seconds of run time. I don't see anything changing in the slot directory when this happens (\ProgramData\BOINC\slots, right?).

Well, each task gets allocated to one particular numbered folder in there as it starts - which one is visible via the 'properties' button while it's active, but a couple of seconds doesn't give you much time to investigate.

Each slot should be empty, unless there's a running task using it. Might be worth (re-)starting BOINC with GPU activity disabled, and emptying any slots which should be empty but aren't. Then, the next task you allow to run should occupy the lowest-numbered empty slot - watch that, and see if anything (big) appears in it.
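
If watching the folders by hand is too fiddly, a small script can do the looking. This is only a sketch: the slots path is the Windows one mentioned above, the function name is made up for illustration, and it only reads the directories. It deletes nothing.

```python
from pathlib import Path


def slot_key(path):
    """Sort numbered slots numerically, anything else after them."""
    return (0, int(path.name)) if path.name.isdigit() else (1, path.name)


def non_empty_slots(slots_dir):
    """Return {slot_name: [(relative_file, size_bytes), ...]} for slots holding files.

    Files are listed largest first, so an oversized leftover stands out.
    """
    found = {}
    root = Path(slots_dir)
    for slot in sorted((p for p in root.iterdir() if p.is_dir()), key=slot_key):
        files = [
            (f.relative_to(slot).as_posix(), f.stat().st_size)
            for f in slot.rglob("*")
            if f.is_file()
        ]
        if files:
            found[slot.name] = sorted(files, key=lambda item: item[1], reverse=True)
    return found


if __name__ == "__main__":
    # Assumed default location on Windows; adjust for your own data directory.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        for name, files in non_empty_slots(slots).items():
            print(f"slot {name}:")
            for rel, size in files:
                print(f"  {size:>14,d}  {rel}")
    else:
        print(f"{slots} not found on this machine")
```

Run it once with BOINC stopped (every slot should come back empty), then again while the suspect task is starting.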
27) Message boards : Number crunching : 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED (Message 63481)
Posted 29 Apr 2015 by Richard Haselgrove
Post:
...I think it's to do with <rsc_disk_bound>...

That's what I was thinking, too, of course. However, I now think I might have a GPU hardware problem -- all the tasks I've checked that errored out for me have been completed by another host without a problem. If the tasks I ran had bad parameters, would the same task work for another host?

When I upgraded the video driver, I went to the AMD website, downloaded and ran their auto-detect tool, and let it pick and install a new driver. Is there anything else I need to install?

A bad driver by itself wouldn't cause a disk limit error - unless it's spewing out yards and yards of error messages. Look in the slot directory, as I said at Einstein.
28) Message boards : Number crunching : 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED (Message 63476)
Posted 28 Apr 2015 by Richard Haselgrove
Post:
I'm with Ananas' original thought - I think it's to do with <rsc_disk_bound>. There's a parallel thread at Einstein - Maximum disk usage exceeded - where ritterm cross-posted, and some related discussion in Results showing "Aborted by user".
29) Message boards : Number crunching : Completed, validation inconclusive, credit pending (Message 63303)
Posted 30 Mar 2015 by Richard Haselgrove
Post:
See my reference to Adaptive Replication
30) Message boards : Number crunching : Boinc not switching away from Milkyway (Message 63295)
Posted 28 Mar 2015 by Richard Haselgrove
Post:
Ah. My fault, should have checked. "Darwin 14.1.0". LHC doesn't have a current application for Macs.

You'll find that many BOINC people (including me) assume Windows far too readily. Best to state your allegiance loud and proud when you start a new thread. Now I check, you did say "Mac OS Yosemite", but I'd forgotten that by the time we got on to LHC. Sorry again.
31) Message boards : Number crunching : Boinc not switching away from Milkyway (Message 63291)
Posted 28 Mar 2015 by Richard Haselgrove
Post:
If you highlight the LHC project in BOINC Manager, and click the 'Update' button, what sequence of messages (request and reply) do you get in the event log?
32) Message boards : Number crunching : Boinc not switching away from Milkyway (Message 63288)
Posted 28 Mar 2015 by Richard Haselgrove
Post:
I do not understand how the scheduling priority works, nor do I know where to control it from. Can you point me in an approximately right direction?

I don't think any of us claim to understand how it works, but some of us watch it in action and gain some ability to predict what it's going to do next. Similarly, you can't control the scheduling process, but you can influence it.

To your specific points:

LHC often doesn't have work, but just at the moment they're really busy and would appreciate your help. Make sure you're attached using the correct url - they did change it a while back.
http://lhcathomeclassic.cern.ch/sixtrack/

POGS - no personal knowledge.

Priority - Zero is the highest priority; any non-zero numbers will be negative and hence lower priority. Projects at zero, or closest to zero, will be fetched from first - unless there's some reason, stated lower down the event log, why the particular project can't be contacted at the moment.

The exact numbers are immaterial, and will change from minute to minute anyway. Only the relationship between the numbers - which is bigger, which smaller - matters.
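
As a toy illustration of "only the relationship matters", the ordering can be sketched like this. The priority values are invented for the example, and the function is hypothetical, not anything in the client.

```python
def fetch_order(priorities):
    """Return project names, closest to zero (highest priority) first.

    `priorities` maps project name -> priority, where 0 is the highest
    possible value and everything else is negative.
    """
    return sorted(priorities, key=lambda name: priorities[name], reverse=True)


# Hypothetical snapshot; real values change from minute to minute.
example = {"SETI@home": -1.25, "MilkyWay@home": -0.02, "LHC@home": 0.0}
print(fetch_order(example))  # ['LHC@home', 'MilkyWay@home', 'SETI@home']
```

Doubling or halving every number in the snapshot gives the same order, which is the point.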

The nearest thing there is to a specification is ClientSchedOctTen - the design specification as at October 2010. No, it hasn't been updated since then. You might want to note two comments in that document:

"This will tend to get large (max-min) clumps of work for a single project, and variety will be lower than the current policy."
"The recent estimated credit REC(P) of a project is maintained by the client, with an averaging half-life of, say, a month."

These two points are linked: I prefer to run with a REC half-life of 1 day, rather than the actual default of 10 days, and my clumps are smaller. REC half-life can be controlled by an option in the client configuration.
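
To see why a shorter half-life gives smaller clumps, here is a rough sketch of plain half-life decay. It assumes simple exponential forgetting, which is my reading of "averaging half-life" rather than the client's actual code, and the function name is invented. I believe the relevant option is <rec_half_life_days> in cc_config.xml, but check the client configuration documentation before relying on that.

```python
def remaining_weight(days_elapsed: float, half_life_days: float) -> float:
    """Fraction of past credit still remembered after `days_elapsed` days.

    Assumed model: the remembered amount halves every `half_life_days`.
    """
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (days_elapsed / half_life_days)


if __name__ == "__main__":
    for half_life in (1, 10):
        kept = remaining_weight(3, half_life)
        print(f"half-life {half_life:>2} d: {kept:.1%} of 3-day-old work still counts")
```

With a 1-day half-life the client forgets a burst of work within a few days, so it swings back to the other projects sooner. With 10 days the memory lingers and the swings are longer.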
33) Message boards : Number crunching : Boinc not switching away from Milkyway (Message 63285)
Posted 28 Mar 2015 by Richard Haselgrove
Post:
I am running BOINC ver. 7.4.36 with SETI@ and MILKYWAY@ projects on a Mac OS Yosemite. BOINC switches from SETI to MW just fine after the prescribed 30 minutes as per the directive in preferences, but then does not switch back to SETI and stays crunching MW.

The only way I can force it to switch to SETI is to suspend MW, then SETI picks up, and the cycle repeats. Switches to MW after 30 minutes, and clings on to it indefinitely.

I tried all I can think of, but to no avail. Any ideas?

-mm-

Boinc uses Recent Average Credit to determine which project to run if they all have 100% set as their Resource Share. In your case Seti has a RAC of 3,094 while MW has a RAC of 114. It is trying to even them out, so no, it won't switch back very often, or until the workunit deadlines force it to switch. You can change the Resource Share on each project's webpage under Your Account, Preferences for this project; each venue has its own setting, and the default is 100%.

It actually uses Recent Estimated Credit (REC), which is generated and stored internally, rather than the public RAC which you can see on these pages.

The principle is much the same, except that
1) REC follows the Cobblestone standard for all projects, so that over- and under-paying projects don't skew the resource share.
2) REC doesn't rely on validation, so late awards from tardy validations don't confuse the picture.

You can see the current state of play, including relative priority for work fetch, by enabling the <work_fetch_debug> logging flag for the local client. Be warned that this places very verbose data in the Event Log.
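
For anyone who hasn't set a logging flag before, this sketch prints the sort of cc_config.xml fragment involved. The helper name is made up, and the script only prints text: if you already have a cc_config.xml in the BOINC data directory, merge the flag into it by hand rather than overwriting the file.

```python
import xml.etree.ElementTree as ET


def cc_config_with_flags(flags):
    """Build cc_config.xml text with each named log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or later
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(cc_config_with_flags(["work_fetch_debug"]))
```

After saving the file, get the client to re-read its config files (or restart it), and remember to set the flag back to 0 once you've seen enough.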
34) Message boards : News : MilkyWay@home Server Maintenance (Message 63274)
Posted 26 Mar 2015 by Richard Haselgrove
Post:
Not visible here, with the same sort order, in either Chrome or Internet Explorer. What browser are you using?

Ah, belay that - dredged up a distant memory. The error message appears if (and only if) you have the community preference "Don't move sticky posts to top" checked. We've come across that before .....

..... on 25 September 2014, at SETI: message 1577639. It was fixed the same day.

Edit: probably commit 6c73f71ceeeed4dc5639925692ae92b6d50dc3c5, but make sure you get them all this time.
35) Message boards : Number crunching : q660 to xeon 5460 (Message 63268)
Posted 25 Mar 2015 by Richard Haselgrove
Post:
LOTS of pc's have problems with those multi-thread units.

And we haven't seen Sidd coming back to fix them for almost three weeks.
36) Message boards : News : New Nbody version 1.48 (Message 63267)
Posted 25 Mar 2015 by Richard Haselgrove
Post:
The biggest problem with initialization taking longer and not checkpointing comes when initialization takes longer than 60 minutes. By default BOINC switches tasks every 60 minutes. I just figured out today that for the past few weeks one of my boxes has spent 60 minutes running Nbody initialization, then it switches to a different project for 60 minutes, then it restarts the same initialization from scratch. Basically it has completely wasted half of the processing time since upgrading to 1.48.

That setting "Switch between apps every x minutes" is more sophisticated than you think. I was under the impression that it was supposed to switch away from an app only when it had checkpointed, or was being pre-empted by a task that was in deadline jeopardy. So... the user's setting is basically just a suggestion for BOINC to try to switch if it can, but only when a checkpoint occurs.

So maybe you're hitting on a BOINC bug there? Are you using the latest version of BOINC? Are you using the "Leave application in memory" option? And can you describe the steps necessary to easily reproduce the problem?

There are well-known weaknesses in MT scheduling. More commonly in the opposite direction:

If you are running a multiplicity of other CPU projects, and BOINC decides to schedule an MT task, it will wait until one of the other project tasks is preemptible - checkpointed, task exit, or something like that. But as soon as the MT task starts, all the other tasks (well, enough to meet the MT thread count) will be preempted immediately, ready or not.

I'd not heard of the reverse problem, MT preempting before checkpoint, before - but it's worth investigating, to find what is triggering the switch. Something in 'even higher priority', perhaps? And as Jacob says, if the MT task is kept in memory when suspended, it shouldn't restart from square zero.
37) Message boards : Number crunching : How to constrain workload on an NVIDIA GPU (Message 63243)
Posted 18 Mar 2015 by Richard Haselgrove
Post:
The same way as at SETI - use an application configuration file.

So far as I know, Einstein@Home is the only project which enables you to control GPU usage via your account settings on their website.
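
As a starting point, this sketch prints an app_config.xml of the usual shape. Everything specific in it is an assumption: the app name "milkyway" should be checked against the names your own client shows for this project, and the usage figures are placeholders, not recommendations.

```python
import xml.etree.ElementTree as ET


def app_config_xml(app_name, gpu_usage, cpu_usage):
    """Build app_config.xml text for one application's GPU versions.

    gpu_usage is the fraction of a GPU each task is counted as using,
    so 0.5 lets two tasks share a card and 1.0 allows one at a time.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or later
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values: one task per GPU, a sliver of CPU to feed it.
    print(app_config_xml("milkyway", 1.0, 0.05))
```

The resulting file goes in the project's own folder under the BOINC data directory, and the client needs to re-read its config files before it takes effect.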
38) Message boards : News : New Nbody version 1.48 (Message 63218)
Posted 12 Mar 2015 by Richard Haselgrove
Post:
My next WU seemed to run fine, but can’t validate—looks like the bugs in 1.46 are still biting …

Well, as I pointed out three weeks ago, v1.46 is still being issued to 32-bit hosts, and still failing - the v1.48 deployment was incomplete.

Has anybody seen Sid(d)? Please tell him...
39) Message boards : News : New Nbody version 1.48 (Message 63206)
Posted 8 Mar 2015 by Richard Haselgrove
Post:
Try to use independent tools - Windows would be Task Manager and Process Explorer, I don't know what the Mac OS X equivalents would be - to distinguish between 'what BOINC says is going on' and 'what is really happening'.

Recent BOINCs give users a simulated 'pseudo %age' display to avoid anxieties. But it sounds as if this task is making no real progress at all, and all the time estimates are simulations.

You mention thermal problems and BOINC being 'throttled'. Is that a restriction on the number of cores used, or the proportion of time BOINC is allowed to run? There is a particular problem with the nbody tasks not checkpointing during the initialisation phase: if BOINC interrupts computations during this time (especially if applications are not kept in memory while suspended), you might be re-winding to the very beginning at every interruption.
40) Message boards : Number crunching : AMD R9 290X does not receive any GPU work (Message 63190)
Posted 2 Mar 2015 by Richard Haselgrove
Post:
Well, as I said last time, with app_info.xml it's YOUR responsibility to supply any needed files yourself. If any are missing, you'll have to fetch them and place them in the same folder as app_info.xml itself.

Since the main download url for this project is 'http://milkyway.cs.rpi.edu/milkyway/download/', I'd suggest you try

http://milkyway.cs.rpi.edu/milkyway/download/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_ati_101.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_separation__modified_fit_1.36_windows_x86_64.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_separation_1.20_windows_x86_64__opencl_amd_ati.exe
http://milkyway.cs.rpi.edu/milkyway/download/milkyway_separation_1.20_windows_x86_64.exe

WARNING: Limited warranty.

I have tested that all four of the files referenced in the app_info.xml posted by Arkayn exist on the server and can be downloaded from those links. But I have NO IDEA whether they are the right files for your computer, or the right files for this project's current research (in particular, I see that v1.36 is current on the project Applications page, but I have no idea why the v1.20 'separation' files were recommended).

Also, please note that if the project research changes, YOU will have to make matching changes in app_info.xml and supply the files referenced. Keep an eye on these message boards, watching out in particular for any announcement of a "New Separation Modfit Version" in the News area.
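
A quick way to check the "same folder" requirement is to let a script compare app_info.xml against the directory it sits in. This is a sketch under assumptions: it expects the file to be well-formed XML with each file listed as <file_info><name>...</name></file_info>, and the function name is invented. It reads only, and downloads nothing.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def missing_files(app_info_path):
    """List files named in app_info.xml that are absent from its folder."""
    app_info = Path(app_info_path)
    root = ET.parse(app_info).getroot()
    missing = []
    for info in root.iter("file_info"):
        name = (info.findtext("name") or "").strip()
        if name and not (app_info.parent / name).is_file():
            missing.append(name)
    return missing


if __name__ == "__main__":
    if len(sys.argv) > 1:
        absent = missing_files(sys.argv[1])
        if absent:
            print("Missing from the project folder:")
            for name in absent:
                print("  " + name)
        else:
            print("Every referenced file is present.")
    else:
        print("usage: pass the path to app_info.xml")
```

Anything it reports as missing is a file you'll need to fetch yourself before BOINC will run the application.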

