Welcome to MilkyWay@home

Posts by Jacob Klein

21) Message boards : Number crunching : Ever longer N-Bodies (Message 61923)
Posted 20 Jun 2014 by Jacob Klein
Post:
Can an admin please investigate this issue?
I too am getting the "EXIT_DISK_LIMIT_EXCEEDED" errors.


Outcome Computation error
Client state Compute error
Exit status 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

<core_client_version>7.4.2</core_client_version>
<![CDATA[
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.40 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 8 max threads on a system with 8 processors


de_nbody_06_10_orphan_sim_1_1398336302_1400388_3
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=771700839
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=576425580

de_nbody_06_10_orphan_sim_1_1398336302_1406796_2
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=771728077
http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=576815078

22) Message boards : Number crunching : Must set rsc_memory_bound correctly (Message 61459)
Posted 1 Apr 2014 by Jacob Klein
Post:
It looks like this change is being reverted for now, per David's email below.
So, there is no longer an immediate need to correct the value...
But please consider setting it correctly at some point, in case it gets used by the client in the future.


> Date: Mon, 31 Mar 2014 18:53:33 -0700
> From: d..a@ssl.berkeley.edu
> To: b..c_alpha@ssl.berkeley.edu
> Subject: Re: [boinc_alpha] 7.3.14 - Heads up - Memory bound enforcement
>
> On further thought, I'm going to change things back to the way they were, namely
>
> 1) workunit.rsc_memory_bound is used only by the server;
> it won't send a job if rsc_memory_bound > host's available RAM
> 2) the client aborts a job if working set size > host's available RAM
> 3) the client will run a set of jobs only if the sum of their WSSs
> fits in available RAM
> (i.e. if a job's WSS is close to all available RAM,
> it would run that job and nothing else)
>
> The reason for not aborting jobs when WSS > rsc_memory_bound is that
> it requires projects to come up with very accurate estimates of RAM usage,
> which I don't think is feasible in general.
> Also, it will lead to lots of aborted jobs, which is bad for volunteer morale.
>
> -- David
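
For anyone trying to picture the three rules in David's email, here is a rough sketch of how I read them. It is not BOINC code: the `Job` class, the function names, and the greedy ordering in rule 3 are all my own assumptions, and sizes are simply treated as bytes.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: sizes are tracked in bytes


@dataclass
class Job:
    name: str
    rsc_memory_bound: float  # project's estimate, in bytes
    wss: float               # measured working set size, in bytes


def server_may_send(job, host_available_ram):
    """Rule 1: the server skips a job whose bound exceeds available RAM."""
    return job.rsc_memory_bound <= host_available_ram


def client_should_abort(job, host_available_ram):
    """Rule 2: abort only when the working set exceeds available RAM."""
    return job.wss > host_available_ram


def pick_runnable(jobs, host_available_ram):
    """Rule 3: run jobs, in the order given, while their WSSs fit in RAM."""
    chosen, used = [], 0.0
    for job in jobs:
        if used + job.wss <= host_available_ram:
            chosen.append(job)
            used += job.wss
    return chosen
```

With these rules, a job whose working set is larger than its `rsc_memory_bound` keeps running as long as it still fits in the host's available RAM, which is the whole point of the revert.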
23) Message boards : Number crunching : Must set rsc_memory_bound correctly (Message 61458)
Posted 1 Apr 2014 by Jacob Klein
Post:
MilkyWay Team:

You need to change your work unit parameters so that <rsc_memory_bound> is set correctly. BOINC 7.3.14 alpha (and potentially future versions) will read that value, compare it to the working set size, and auto-abort the work unit if the working set exceeds the bound.

As of right now, I am getting errors due to your incorrect settings.

For example:
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=703493619
Exit status 198 (0xc6) EXIT_MEM_LIMIT_EXCEEDED
<core_client_version>7.3.14</core_client_version>
<![CDATA[
<message>
working set size > workunit.rsc_memory_bound: 97.08MB > 47.68MB
</message>
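
A quick sanity check of the numbers in that message, as a sketch only. It assumes the client's "MB" means 2**20 bytes, and the helper names are made up for illustration. Under that assumption, 47.68MB is what a bound of 50,000,000 bytes would print as, though I can't confirm that is the value actually configured.

```python
import math

MIB = 1024 * 1024  # assumption: "MB" in the client message means 2**20 bytes


def to_bytes(megabytes):
    """Convert a size printed in MB back to bytes."""
    return megabytes * MIB


def exceeds_bound(wss_mb, bound_mb):
    """Mirror the comparison in the error message: WSS > rsc_memory_bound."""
    return to_bytes(wss_mb) > to_bytes(bound_mb)


def min_bound_bytes(wss_mb):
    """Smallest whole-byte bound that would have covered this working set."""
    return math.ceil(to_bytes(wss_mb))


print(exceeds_bound(97.08, 47.68))        # the comparison that failed
print(round(50_000_000 / MIB, 2))         # how 5e7 bytes would be printed
print(min_bound_bytes(97.08))             # bound needed for this task
```

The last line gives the bare minimum for this one task; a real bound would want some headroom on top of it.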

Could you please promptly fix this?

Regards,
Jacob Klein
24) Message boards : News : N-Body 1.38 (Message 60964)
Posted 4 Feb 2014 by Jacob Klein
Post:
No problem. The BOINC developers are nice guys, and they'll fix legitimate problems that get reported, especially those involving idle resources.
25) Message boards : News : N-Body 1.38 (Message 60956)
Posted 4 Feb 2014 by Jacob Klein
Post:
As alluded to in the previous posts, I have found another problem using MT apps. As TJ said, they want to use all cores. In my case, I am still using all cores; however, I am running another WU in high priority (RNA World). In this situation the MT WU is not run and sits idle. Idle WUs are nonetheless figured into the work buffer computation, making BOINC think that I have lots of WUs waiting to run. Consequently, the other threads start to run dry.

Because of this problem, I am now considering all MT apps (for any project) to be off-limits. Is it possible to separate the MT from non-MT n-Body WUs and add as a project preference?

I also wonder if it is possible to create a preference setting to limit the number of CPUs used by MT apps. ...I should make a request on the BOINC forums...



I believe this particular issue is with the BOINC scheduler, and will be corrected with the next public release of BOINC.

I had the same problem (idle CPUs due to high-priority RNA World task that was running combined with several downloaded MT tasks that could not get started). So I reported it, and they changed it so that BOINC will additionally schedule an MT task, even if it temporarily over-commits the system, thus no cores are left idle. You can get the latest Beta to see - BOINC v7.2.39 - http://boinc.berkeley.edu/download_all.php

Also, regarding limiting the number of CPUs for an MT task... You can actually specify "how many CPUs to consider as scheduled" and also "what command line argument to use to dictate how many threads to create". I think you'll also need the latest Beta to do it, but the options are available (even though they say 7.3) - http://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration ... I use the following to tell BOINC to consider only 6 CPUs scheduled for a MW MT task (that way my GPU tasks and high-priority RNA World task can remain scheduled). I don't change the command line, so I knowingly overcommit the system:

<!-- Milkyway@Home -->
<app_config>
   <!-- MilkyWay@Home N-Body Simulation -->
   <!-- Set it up so BOINC only "schedules" 6 CPUs for the task -->
   <app_version>
      <app_name>milkyway_nbody</app_name>
      <plan_class>mt</plan_class>
      <avg_ncpus>6</avg_ncpus>
      <!-- <ngpus>x</ngpus>                 -->
      <!-- <cmdline>--nthreads 8</cmdline>  -->
   </app_version>
</app_config>
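
If you want to double-check what value a file like this actually carries before restarting the client, something along these lines works. It is just a sketch using Python's standard library, not anything BOINC ships, and `scheduled_cpus` is a name I made up. The embedded copy leaves out the commented-out lines, because strict XML does not allow "--" inside a comment and a strict parser may reject the `--nthreads` one.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the config above (active elements only).
APP_CONFIG = """\
<app_config>
   <app_version>
      <app_name>milkyway_nbody</app_name>
      <plan_class>mt</plan_class>
      <avg_ncpus>6</avg_ncpus>
   </app_version>
</app_config>
"""


def scheduled_cpus(xml_text, app_name, plan_class):
    """Return avg_ncpus for the matching <app_version>, or None if absent."""
    root = ET.fromstring(xml_text)
    for version in root.iter("app_version"):
        if (version.findtext("app_name") == app_name
                and version.findtext("plan_class") == plan_class):
            value = version.findtext("avg_ncpus")
            return float(value) if value is not None else None
    return None


print(scheduled_cpus(APP_CONFIG, "milkyway_nbody", "mt"))
```

This only confirms the file parses and which value it holds; whether the client honors it still depends on the BOINC version, as noted above.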
26) Message boards : News : Users Auto-Aborting Work Units (Message 60954)
Posted 4 Feb 2014 by Jacob Klein
Post:
If I'm reading this correctly, you are referring to "MT" (multi-threaded) tasks in general, where they use multiple virtual cores to get the task done, instead of working as an "ST" (single-threaded) task which only uses 1 virtual core.

The thing is... BOINC is already set up to handle this just fine. It won't overcommit your system (unless it must due to high-priority tasks), it won't undercommit your system, and it properly records REC (recent estimated credit) such that your RS (resource share) percentages are honored across your projects. Sure, other projects can't work concurrently with the MT task, but BOINC is constantly keeping track of the work done, to ensure RS is honored before the MT task and afterward.

There is nothing inherently wrong with MT tasks. They've just been designed to use multiple threads/cores to get the task done quicker.

I'm not sure if it is set up this way, but... if MilkyWay had/has the MT tasks put into their own application, then "disabling" them would be as easy as editing the project preferences to disable that application. Though, I still don't see why you guys don't want to run MT tasks.
27) Message boards : News : Users Auto-Aborting Work Units (Message 60055)
Posted 29 Sep 2013 by Jacob Klein
Post:
I understand your guess.
I require details and steps, instead of guesswork.
28) Message boards : News : Users Auto-Aborting Work Units (Message 60053)
Posted 29 Sep 2013 by Jacob Klein
Post:
I'm not trying to cause trouble.
I'm looking for clarification of the phrase "setting their BOINC clients to auto abort work units from specific applications."
29) Message boards : News : Users Auto-Aborting Work Units (Message 60049)
Posted 29 Sep 2013 by Jacob Klein
Post:
Hey Jake,

I'm a BOINC beta-tester, and was curious -- how is it even possible to "auto-abort" a work unit? Can you explain the procedure/settings necessary to pull that off? Does it require using a non-standard BOINC Manager/program?

Once I know more details, if it's something that can be controlled with the standard BOINC Manager/client, I might get in contact with the BOINC developers to try to prevent it.

Let me know,
Thanks,
Jacob
30) Message boards : News : N-Body 1.18 (Message 58860)
Posted 14 Jun 2013 by Jacob Klein
Post:
Now, THAT is interesting. Look closer.
The code has logic to keep a task set as "deadline miss - high priority"... if it has PREVIOUSLY been marked "high priority". I think that's the "thrashing prevention" message you see there.
That may be key to getting into the scenario that has the problem.
"Previously high priority, to the point where BOINC would keep it high priority to prevent thrashing... and now in a situation where it's no longer a deadline miss."

So, serious question, what happens when this occurs on your system? Does it just run the GPU task, and leave the other 3.96 CPUs completely idle?
31) Message boards : News : N-Body 1.18 (Message 58853)
Posted 14 Jun 2013 by Jacob Klein
Post:
But I think the current v7.0.64 scheduler is failing to obey policy (2) - the relationship between MT and GPU tasks.

Revision: 97ee3a38f265653d6b16bd5611df3ece4b2eef91
Author: David Anderson <davea@ssl.berkeley.edu>
Date: 22/09/2009 00:23:40
Message:
- client: tweak CPU scheduling policy to avoid running multithread apps overcommitted.
Actually: allow overcommitment but only a fractional CPU
(so that, e.g., we can run a GPU app and a 4-CPU app on a 4-CPU host)


Interesting. Sorry, I was not aware of that 2009 checkin where the concession was made. Perhaps this will inspire me to take on an MT task to test exactly how it works.

Are you saying that, if the MT task is scheduled first, and then GPU tasks are allowed to also be scheduled, then BOINC will schedule both [correctly]?

And are you also saying that, if the GPU task is scheduled first, and then the MT task is allowed to also be scheduled, then BOINC will only schedule the GPU task [incorrectly]?

If those statements are how it is currently behaving, and that inconsistency exists, then it does indeed sound like a BOINC scheduling bug.

[Edit] I just tested with 7.1.15 alpha, and it appears that BOINC is correctly scheduling for me. If the 8-CPU MT task is started first, BOINC will allow 2 GPU tasks that each use 0.5 CPU. If the 2 GPU tasks are started first, BOINC will allow the 8-CPU MT task to also be scheduled. Am I trying to reproduce the bug incorrectly?
32) Message boards : News : N-Body 1.18 (Message 58844)
Posted 14 Jun 2013 by Jacob Klein
Post:
Richard,

I'm not trying to hijack a thread here, but... I wanted to point out that the scheduling policy may not be bugged here.

There's a basic "Job Scheduling" section within the BOINC documentation, found here: http://boinc.berkeley.edu/trac/wiki/ClientSched
Those 4 bullet points drive the main scheduling.

But I think, additionally (per the emails below from April, where I asked David Anderson about it a bit)... the scheduler makes sure never to commit more than #ncpus when running a multi-thread (mt) task. Conversely, it won't schedule an mt task if the resulting cpu usage would be more than #ncpus.

What's happening, in your case (I think), is that the mt task runs high-priority for a bit (as "Job Scheduling" bullet 2), then at some point BOINC thinks it can finish by deadline, so... it re-evaluates the task list to do scheduling, and schedules a GPU task (as "Job Scheduling" bullet 3), and then when deciding to schedule the mt task (as "Job Scheduling" bullet 4), it does not schedule it, since doing so would result in cpu usage more than #ncpus. If you have enough GPU tasks to always be running at least 1 GPU task, it will never re-schedule the mt task until BOINC puts it in high-priority mode again.

Believe it or not... I think it may be working correctly, and I cannot think of a better design. If you can, you might dig up the boinc_alpha thread, and reply to it. Maybe a design that treats mt tasks and gpu tasks as equals, above regular-cpu tasks?

I don't run mt tasks, so I don't have behavioral proof, but I'm betting your behavior matches the policies described here; ie: no bug.

Regards,
Jacob Klein

------------------------------------------------------------------------------------
Date: Tue, 2 Apr 2013 14:49:34 -0700
From: davea@ssl.berkeley.edu
To: boinc_alpha@ssl.berkeley.edu
Subject: Re: [boinc_alpha] Using app_config.xml <cpu_usage>2</cpu_usage> results in underloading/overloading CPU

These are both consistent with the current job-scheduling policy:

1) If a multi-thread job is running, the scheduler won't run a CPU job
if doing so would exceed #CPUs.

2) the scheduler will run GPU jobs until the CPU load is #CPUs+1
(but not beyond that).

There is a rationale for both of these, though they are both
open to debate.

-- David

------------------------------------------------------------------------------------
Date: Thu, 4 Apr 2013 13:48:52 -0700
From: davea@ssl.berkeley.edu
To: jacob_w_klein@msn.com
CC: boinc_alpha@ssl.berkeley.edu
Subject: Re: [boinc_alpha] Using app_config.xml <cpu_usage>2</cpu_usage> results in underloading/overloading CPU

Jacob:

There are situations where overloading the CPUs, even slightly,
can greatly reduce overall throughput.

1) multithread apps. Typically these are tightly coupled,
meaning that each thread uses results recently computed by other threads,
and if these are not available it has to wait.
Multithread apps perform best if all the threads are running all the time.
If 1 thread is descheduled (e.g. because there are 8.5 runnable
threads on an 8-CPU system) this causes the other threads to
sleep and wake up, which has an overhead.
If this happens frequently it has a cascading effect
and this can reduce the throughput of the app by a large factor (like 25%).

2) GPU apps. If the CPU part of a GPU app is descheduled,
the GPU is idle until it's scheduled again.
This can be on the order of a second, esp. on Windows.

In such cases, throughput can be much higher e.g. with
7.5 threads than with 8.5 threads.

So I'm reluctant to make any changes based on our collective intuition,
because that intuition may not be correct.

Ideally, some computer science graduate student would do a research
project to study this issue with a range of real-world applications,
and we could use the findings to improve our policies.
However, the academic computer science world pretends that
volunteer computing doesn't exist, so no such study has been done
or is likely to get done.

-- David
------------------------------------------------------------------------------------
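
To make the two rules from David's 2 April email concrete, here is a toy model of how I read them. It is a sketch under my own assumptions, not the client's real scheduler: the `Task` class and `schedule` function are invented, the 0.5 CPU figure for the GPU task is made up, and rule 1 is simplified to cap every non-GPU task (mt included) at #ncpus.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: float            # CPUs the task is considered to use
    uses_gpu: bool = False


def schedule(candidates, ncpus):
    """Walk candidates in priority order and return the tasks allowed to run."""
    running, load = [], 0.0
    for task in candidates:
        new_load = load + task.cpus
        # Rule 2: GPU jobs may push the CPU load up to ncpus + 1.
        # Rule 1 (simplified): any other job must keep the load within ncpus.
        limit = ncpus + 1 if task.uses_gpu else ncpus
        if new_load <= limit:
            running.append(task)
            load = new_load
    return running


gpu = Task("gpu", 0.5, uses_gpu=True)   # 0.5 CPU is a made-up figure
mt = Task("mt", 4)                      # 4-thread task on a 4-CPU host

print([t.name for t in schedule([gpu, mt], ncpus=4)])  # GPU task considered first
print([t.name for t in schedule([mt, gpu], ncpus=4)])  # mt task considered first
```

In this model the order matters: with the GPU task considered first, the mt task no longer fits under #ncpus and is left out, while with the mt task first both run. That is the behavior I described above, where the mt task waits until BOINC puts it back in high-priority mode.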



©2024 Astroinformatics Group