Message-Id: <1193674988.5035.93.camel@localhost>
Date: Mon, 29 Oct 2007 12:23:08 -0400
From: Lee Schermerhorn <Lee.Schermerhorn@...com>
To: David Rientjes <rientjes@...gle.com>
Cc: Christoph Lameter <clameter@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andi Kleen <ak@...e.de>, Paul Jackson <pj@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [patch 2/2] cpusets: add interleave_over_allowed option
On Sat, 2007-10-27 at 12:16 -0700, David Rientjes wrote:
> On Fri, 26 Oct 2007, David Rientjes wrote:
>
> > Hacking and requiring an updated version of libnuma to allow empty
> > nodemasks to be passed is a poor solution; if mempolicy's are supposed to
> > be independent from cpusets, then what semantics does an empty nodemask
> > actually imply when using MPOL_INTERLEAVE? To me, it means the entire
> > set_mempolicy() should be a no-op, and that's exactly how mainline
> > currently treats it _as_well_ as libnuma. So justifying this change in
> > the man page is respectable, but passing an empty nodemask just doesn't
> > make sense.
> >
>
> Another reason that passing an empty nodemask to set_mempolicy() doesn't
> make sense is that libnuma uses numa_set_interleave_mask(&numa_no_nodes)
> to disable interleaving completely.
>
David: as we discussed when you contacted me off-list about this, the
libnuma API and the system call interface are two quite different APIs.
For example, numa_set_interleave_mask(&numa_no_nodes) does not pass
MPOL_INTERLEAVE with an empty mask to set_mempolicy(). Rather it
"installs" an MPOL_DEFAULT policy which internally just deletes the
task's mempolicy, allowing fallback to system default policy. I would
not propose to change this behavior, nor break libnuma in any way.
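
To make the distinction concrete, here is a minimal sketch of how a library-level wrapper could turn an empty interleave mask into a policy reset instead of an MPOL_INTERLEAVE request. This is not libnuma's source: `sketch_choose_policy` and `struct sketch_request` are hypothetical names, and the mode values are copied from <linux/mempolicy.h> so the example builds without NUMA headers.

```c
#include <stddef.h>

/* Mode values as defined in <linux/mempolicy.h>, redefined here under
 * hypothetical names so the sketch compiles without NUMA headers. */
enum { SKETCH_MPOL_DEFAULT = 0, SKETCH_MPOL_INTERLEAVE = 3 };

/* Hypothetical description of the set_mempolicy() call a wrapper would make. */
struct sketch_request {
    int mode;                   /* policy mode to install */
    const unsigned long *mask;  /* nodemask to pass, or NULL for none */
};

/* Map a library-level interleave mask (one word, bit n = node n)
 * to the system call arguments the wrapper would use. */
struct sketch_request sketch_choose_policy(const unsigned long *mask)
{
    struct sketch_request req;

    if (mask == NULL || *mask == 0) {
        /* Empty mask: drop the task policy rather than interleave. */
        req.mode = SKETCH_MPOL_DEFAULT;
        req.mask = NULL;
    } else {
        /* Non-empty mask: request interleaving over exactly these nodes. */
        req.mode = SKETCH_MPOL_INTERLEAVE;
        req.mask = mask;
    }
    return req;
}
```

With this shape, an empty mask never reaches the kernel as MPOL_INTERLEAVE, so giving that combination a new meaning at the system call level would not change what the library call does.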
For others who weren't involved in the off-list exchange, here's an
excerpt from my response to David:
[
At the libnuma level, I think we need an explicit
"numa_set_interleave_allowed()"--analogous to "numa_set_localalloc()".
The current "numa_alloc_interleaved()" should, I think, allocate on all
*allowed* nodes, rather than all nodes. It can do this using the sys
call interface as defined.
Independent of cpuset-independent interleave, an application needs to
pass a valid subset of the current mems allowed to
"numa_alloc_interleaved_subset()". An application can now obtain the
mems_allowed using the MPOL_F_MEMS_ALLOWED flag that I added, but we
need a libnuma wrapper for this as well. [Yeah, this info can change at
any time, but that's always been the case....]
"numa_interleave_memory()" is essentially mbind(), I think [not looking
at the libnuma source code at this moment]. Maybe provide
"numa_interleave_memory_allowed(void *mem, size_t size)" ???
Finally, I think we need to add a query function:
"nodemask_t numa_get_mems_allowed()" to return the mask of valid nodes
in the current context [cpuset]. This would just be a wrapper around
get_mempolicy() with the MPOL_F_MEMS_ALLOWED flag.
]
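
The query wrapper proposed at the end of the excerpt could look roughly like the sketch below. It is an assumption-laden illustration, not a proposed libnuma patch: `sketch_get_mems_allowed`, `struct sketch_nodemask`, and the 1024-node limit are hypothetical, and it calls the raw system call so that it does not depend on <numaif.h>. It assumes a Linux kernel that supports MPOL_F_MEMS_ALLOWED; on kernels without NUMA support, or where the call is not permitted, it simply reports failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag value from <linux/mempolicy.h>, defined locally in case the
 * installed headers predate it. */
#ifndef MPOL_F_MEMS_ALLOWED
#define MPOL_F_MEMS_ALLOWED (1 << 2)
#endif

/* Hypothetical sizing; libnuma's real nodemask_t may differ. */
#define SKETCH_MAX_NODES 1024
#define SKETCH_BITS_PER_WORD (8 * sizeof(unsigned long))
#define SKETCH_MASK_WORDS (SKETCH_MAX_NODES / SKETCH_BITS_PER_WORD)

struct sketch_nodemask {
    unsigned long bits[SKETCH_MASK_WORDS];
};

/* Fill *out with the nodes the calling task may allocate on.
 * Returns 0 on success, -1 on failure with errno set by the kernel. */
int sketch_get_mems_allowed(struct sketch_nodemask *out)
{
    long rc;

    assert(out != NULL);
    memset(out, 0, sizeof(*out));

    /* The mode and addr arguments are not used with MPOL_F_MEMS_ALLOWED. */
    rc = syscall(SYS_get_mempolicy, NULL, out->bits,
                 (unsigned long)SKETCH_MAX_NODES, NULL,
                 (unsigned long)MPOL_F_MEMS_ALLOWED);
    return rc == 0 ? 0 : -1;
}
```

As noted in the excerpt, the result is only a snapshot: the cpuset can change right after the call returns.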
Couple of comments on the above:
1. "the sys call interface as defined" in the 2nd paragraph of the
excerpt refers to my patch that uses null/empty nodemask to indicate "all
allowed".
2. As this thread progresses, you've discussed relaxing the requirement
that applications pass a valid subset of mems_allowed. I.e., something
that was illegal becomes legal. An API change, I think. But, a
backward compatible one, so that's OK, right? :-)
3. If we do change the semantics of the mempolicy system calls to allow
nodes outside of the cpuset, then maybe we don't need to query the mems
allowed. I still find it useful, but not absolutely necessary--e.g., to
construct a nodemask that will be acceptable in the current cpuset.
4. I looked at libnuma source. numa_interleave_memory() does use
mbind() which, again, does not complain about nodemasks that include
non-allowed nodes.
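
As an illustration of points 2 and 3, this is a sketch of how an application might check or trim a requested nodemask against the mems allowed before making the call. The names `sketch_is_subset` and `sketch_clamp_to_allowed` are hypothetical, and the single-word mask (bit n = node n) is a simplification that assumes no more nodes than bits in an unsigned long.

```c
#include <stdbool.h>

/* True when every node in 'requested' is also present in 'allowed'. */
bool sketch_is_subset(unsigned long requested, unsigned long allowed)
{
    return (requested & ~allowed) == 0;
}

/* Drop any requested nodes that the current cpuset does not allow. */
unsigned long sketch_clamp_to_allowed(unsigned long requested,
                                      unsigned long allowed)
{
    return requested & allowed;
}
```

The clamped mask can come back empty when the two sets do not overlap, so a caller still has to decide what to do in that case.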
Another thing occurs to me: perhaps numactl would need an additional
'nodes' specifier such as 'allowed'. Alternatively, 'all' could be
redefined to mean 'all allowed'. This is independent of how you specify
'all allowed' to the system call.
Regards,
Lee