Message-ID: <alpine.DEB.0.9999.0710282130320.32474@chino.kir.corp.google.com>
Date:	Sun, 28 Oct 2007 21:47:58 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Paul Jackson <pj@....com>
cc:	clameter@....com, Lee.Schermerhorn@...com,
	akpm@...ux-foundation.org, ak@...e.de, linux-kernel@...r.kernel.org
Subject: Re: [patch 2/2] cpusets: add interleave_over_allowed option

On Sun, 28 Oct 2007, Paul Jackson wrote:

> And, unless someone in the know tells us otherwise, I have to assume
> that this could break them.  Now, the odds are that they simply don't
> run that solution stack on any system making active use of cpusets,
> so the odds are this would be no problem for them.  But I don't
> presently have enough knowledge of their situation to take that risk.
> 

If we can't identify any applications that would be broken by this, why
not simply implement Choice B and then, if we hear complaints, add your
hack to revert to Choice A behavior based on the get_mempolicy() call
that you said is always part of libnuma?

The problem that I see with immediately offering both choices is that we 
don't know if anybody is actually reverting back to Choice A behavior 
because libnuma, by default, would use it.  That's going to make it very 
painful to remove later because we've supported both options and have made 
libnuma and {get,set}_mempolicy() arguments ambiguous.  We should only 
support both choices if they will both be used and there's no hard 
evidence to suggest that at this point.

> But dual support is pretty easy so far as the kernel code is concerned.
> It's just a few nodes_remap() calls optionally invoked at a few key
> spots in mm/mempolicy.c.  Consequently there won't be a big hurry to
> remove Choice A.
> 

You earlier insisted on ease of documentation for the MPOL_INTERLEAVE 
case and now this dual support that you're proposing is going to make the 
documentation very difficult to understand for anyone who simply wants to 
use mempolicies.

Others even in this thread have had a hard enough time understanding the 
difference between the two choices and you explained them very thoroughly.  
It's going to be much more trouble than it's worth, I predict.

> There is no "_then_ attach the task to a cpuset."  On systems with
> kernels configured with CONFIG_CPUSETS=y, all tasks are in a cpuset
> all the time.  Moreover, from a practical point of view, on large
> systems managed with cpuset based mechanisms, almost all tasks are in
> cpusets that do not include all nodes, for the entire life of the task.
> 

And that application would already need to know which nodes it has access
to before it issues its set_mempolicy(MPOL_PREFERRED) call if it truly
relies on Choice A behavior.  So unless these tasks either parse
Mems_allowed from /proc/pid/status and pick one of those nodes as their
preferred node, or are guaranteed to always be attached to a cpuset with a
known set of nodes so that they know in advance which node to prefer,
Choice A can't possibly be what they want.
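
To make the amount of work concrete, here is a rough userspace sketch of
what such a task would have to do before choosing a preferred node.  It
assumes that Mems_allowed is printed as comma-separated 32-bit hex words,
most significant word first; the helper names are hypothetical and are not
part of libnuma or libcpuset.

```python
def parse_mems_allowed(status_text):
    """Return the sorted node ids set in the Mems_allowed mask.

    Assumes the mask is printed as comma-separated, zero-padded 32-bit
    hex words, most significant word first, e.g. "00000000,0000000a".
    """
    for line in status_text.splitlines():
        # The trailing colon keeps other keys with the same prefix
        # from matching.
        if line.startswith("Mems_allowed:"):
            hex_words = line.split(":", 1)[1].strip().split(",")
            mask = int("".join(hex_words), 16)
            return [n for n in range(mask.bit_length()) if (mask >> n) & 1]
    return []


def read_own_mems_allowed():
    """Hypothetical helper: parse the calling task's own status file."""
    with open("/proc/self/status") as f:
        return parse_mems_allowed(f.read())


def pick_preferred_node(allowed_nodes):
    """Illustrative choice only: prefer the lowest allowed node, if any."""
    return allowed_nodes[0] if allowed_nodes else None
```

A task would call read_own_mems_allowed(), pass the result to
pick_preferred_node(), and only then issue set_mempolicy(MPOL_PREFERRED)
with that node.  The answer can still go stale if the cpuset's 'mems'
changes afterwards.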

> > Yet the 'mems' file would still be system-wide; otherwise it would be 
> > impossible to expand the memory your cpuset has access to.
> 
> I had to read that a couple of times to make sense of it.  I take that
> it means that the node numbering used in each cpuset's 'mems' file has
> to be system-wide.  Yes, agreed.
> 
> (Well, actually, the node numbering of each cpusets 'mems' file could
> be relative to its parent cpusets 'mem' numbers, but let's not go
> there, as this discussion is already sufficiently complicated ;)
> 

I appreciate that very much.

> Would it meet the need that prompted your initial patch set if we
> added Choice B memory policy node numbering, but left Choice A as the
> kernel default, with a per-task option (perhaps invokable by a new
> option to one of the {get,set}_mempolicy() calls) to choose Choice B?
> 

The need I was addressing with my initial patchset was that when a cpuset
is expanded, any MPOL_INTERLEAVE memory policy of attached tasks
automatically gets expanded as well.  This discussion has somewhat diverged
from that, but I hope you still support what we earlier talked about in
terms of adding a field to struct mempolicy to remember the intended
nodemask the application asked to interleave over.

> This lets us get Choice B out there, and lets the two main libraries,
> libnuma and libcpuset, dynamically adapt to whichever Choice is active
> for the current task.
> 
> Unchanged applications and existing binaries would simply continue with
> Choice A.  With one additional line of code, a user application could
> get Choice B, with its ability for example to request MPOL_INTERLEAVE
> over all cpuset allowed nodes, where the kernel automatically adapts
> that to changing cpuset changes from larger 'mems' to smaller 'mems'
> and back to larger 'mems' again.
> 

You don't actually need to choose between the two choices for adapting 
MPOL_INTERLEAVE over _all_ allowed cpuset nodes.

I thought what we agreed upon, and what you were going to implement, was
adding a nodemask_t to struct mempolicy for the intended nodemask of the
memory policy and then ANDing it with pol->cpuset_mems_allowed.  That
completely satisfies my needs and those of my applications that want to
allocate over all available nodes (by simply passing numa_all_nodes to
set_mempolicy(MPOL_INTERLEAVE)).  If I wanted to interleave only over a
subset, the choices would matter.
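
As a rough model of that idea (plain Python integers standing in for
nodemask_t, not kernel code), the policy remembers the nodemask the
application asked for and recomputes the effective mask each time the
cpuset's mems change.  The class, the `intended` field name and the 64-node
ALL_NODES constant are assumptions made for this sketch; only
cpuset_mems_allowed is a name taken from the discussion above.

```python
ALL_NODES = (1 << 64) - 1  # stand-in for numa_all_nodes; 64 nodes assumed


def nodes_of(mask):
    """List the node ids whose bits are set in an integer nodemask."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


class InterleavePolicy:
    """Toy model of an MPOL_INTERLEAVE policy that keeps its intent."""

    def __init__(self, intended, cpuset_mems_allowed):
        # Hypothetical field: the mask the application originally passed.
        self.intended = intended
        self.rebind(cpuset_mems_allowed)

    def rebind(self, cpuset_mems_allowed):
        # Effective interleave set = intended AND currently allowed.
        self.cpuset_mems_allowed = cpuset_mems_allowed
        self.nodes = self.intended & cpuset_mems_allowed


# Interleave over everything allowed; the cpuset starts with nodes 0-1.
pol = InterleavePolicy(ALL_NODES, 0b0011)
start = nodes_of(pol.nodes)      # nodes 0 and 1
pol.rebind(0b1111)               # cpuset expanded to nodes 0-3
expanded = nodes_of(pol.nodes)   # nodes 0 through 3
```

Because the intended mask is never overwritten, shrinking the cpuset and
growing it again restores the full interleave set, with no need to pick
between the two node numbering choices in the all-nodes case.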

		David