Message-ID: <alpine.DEB.0.9999.0710262134180.24687@chino.kir.corp.google.com>
Date: Sat, 27 Oct 2007 10:45:45 -0700 (PDT)
From: David Rientjes <rientjes@...gle.com>
To: Paul Jackson <pj@....com>
cc: Lee.Schermerhorn@...com, clameter@....com,
akpm@...ux-foundation.org, ak@...e.de, linux-kernel@...r.kernel.org
Subject: Re: [patch 2/2] cpusets: add interleave_over_allowed option
On Fri, 26 Oct 2007, Paul Jackson wrote:
> Choice A:
> as it does today, the second node in the task's cpuset or it could
> mean
>
> Choice B:
> the fourth node in the cpuset, if available, just as
> it did in the case above involving a cpuset on nodes 10 and 11.
>
Thanks for describing the situation with MPOL_PREFERRED so thoroughly.
I prefer Choice B because it does not force mempolicies to have any
dependence on cpusets with regard to what nodemask is passed.
[rientjes@...ads ~]$ man set_mempolicy | grep -i cpuset | wc -l
0
It would be very good to store the passed nodemask to set_mempolicy in
struct mempolicy, as you've already recommended for MPOL_INTERLEAVE, so
that you can try to match the intent of the application as much as
possible. But since cpusets are built on top of mempolicies, I don't
think there's any reason why we should respect any nodemask in terms of
the current cpuset context, whether it's preferred or interleave.
So if you were to pass a nodemask with only the fourth node set for an
MPOL_PREFERRED mempolicy, the correct behavior would be to prefer the
fourth node on the system or, if constrained by cpusets, the fourth node
in the cpuset. If the cpuset has fewer than four nodes, the behavior
should be undefined (probably implemented to just cycle the set of
mems_allowed until you reach the fourth entry). That's the result of
confining a task that obviously wants access to more nodes to a cpuset
that is too small -- it's a userspace mistake and an abuse of cpusets,
so the task does not get what it expects.
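To make the remapping concrete, here is a minimal sketch of "the Nth node
in the cpuset, cycling if there are fewer than N". It is only an
illustration: nth_allowed_node() is a hypothetical helper, not an existing
kernel function, and a plain 64-bit mask stands in for nodemask_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): map a cpuset-relative node
 * index onto the set of allowed nodes.
 *
 * allowed: bitmask of nodes the task may use (stand-in for mems_allowed)
 * rel:     0-based index requested by the application (3 == "fourth node")
 *
 * Returns the node id of the rel-th set bit in allowed, wrapping around
 * when allowed contains fewer than rel + 1 nodes.
 */
int nth_allowed_node(uint64_t allowed, unsigned int rel)
{
	unsigned int weight = 0;
	uint64_t m;
	int node;

	/* Count the allowed nodes (clear the lowest set bit each pass). */
	for (m = allowed; m; m &= m - 1)
		weight++;

	/* An empty mask has no sensible answer. */
	assert(weight > 0);

	/* Cycle through the allowed set if it is too small. */
	rel %= weight;

	/* Walk the set bits in ascending order until we reach entry rel. */
	for (node = 0; node < 64; node++) {
		if (allowed & (UINT64_C(1) << node)) {
			if (rel == 0)
				return node;
			rel--;
		}
	}
	return -1; /* not reached: weight > 0 guarantees a hit */
}
```

With nodes 0-7 allowed, asking for the fourth node (rel = 3) gives node 3.
In a cpuset on nodes 10 and 11, the same request wraps around and gives
node 11.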
That concept isn't actually new: we already restrict tasks to a certain
amount of memory by writing to the mems file, and it doesn't matter that
a task would have access to more memory when unconstrained by cpusets.
You've placed it in a cpuset that wasn't prepared to deal with
what the task was asking for. At least in the MPOL_PREFERRED case you
describe above, it'll be dealt with much more pleasantly by at least
giving it a preferred node as opposed to OOM killing it when a task has
exhausted its available cpuset-constrained memory.
I'd prefer a solution where mempolicies can always be described and used
without ever considering cpusets. Then, a sane implementation will
configure the cpuset accordingly to accommodate its tasks' mempolicies. We
don't want to get into a situation where we refuse to attach a task to a
cpuset because it has fewer nodes than the preferred node requires, for
example, but we can fall back to better behavior by at least giving it a
preferred node in the MPOL_PREFERRED case.
David