linux-kernel - Re: [Documentation] State of CPU controller in cgroup v2

Message-ID: <20160820155659.GA16906@mtj.duckdns.org>
Date:   Sat, 20 Aug 2016 11:56:59 -0400
From:   Tejun Heo <tj@...nel.org>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        kernel-team@...com,
        "open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Paul Turner <pjt@...gle.com>, Li Zefan <lizefan@...wei.com>,
        Linux API <linux-api@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2

Hello, Andy.

On Wed, Aug 17, 2016 at 01:18:24PM -0700, Andy Lutomirski wrote:
> >   2-1-1. Process Granularity
> >
> >   For memory, because an address space is shared between all threads
> >   of a process, the terminal consumer is a process, not a thread.
> >   Separating the threads of a single process into different memory
> >   control domains doesn't make semantical sense.  cgroup v2 ensures
> >   that all controller can agree on the same organization by requiring
> >   that threads of the same process belong to the same cgroup.
> 
> I haven't followed all of the history here, but it seems to me that
> this argument is less accurate than it appears.  Linux, for better or
> for worse, has somewhat orthogonal concepts of thread groups
> (processes), mms, and file tables.  An mm has VMAs in it, and VMAs can
> reference things (files, etc) that hold resources.  (Two mms can share
> resources by mapping the same thing or using fork().)  File tables
> hold files, and files can use resources.  Both of these are, at best,
> moderately good approximations of what actually holds resources.
> Meanwhile, threads (tasks) do syscalls, take page faults, *allocate*
> resources, etc.
> 
> So I think it's not really true to say that the "terminal consumer" of
> anything is a process, not a thread.

The terminal consumer is actually the mm context.  A task may be the
allocating entity but not always for itself.

This becomes clear whenever an entity is allocating memory on behalf
of someone else - get_user_pages(), khugepaged, swapoff and so on (and
likely userfaultfd too).  When a task is trying to add a page to a
VMA, the task might not have any relationship with the VMA other than
that it's operating on it for someone else.  The page has to be
charged to whoever is responsible for the VMA and the only ownership
which can be established is the containing mm_struct.

While an mm_struct technically may not map to a process, it is a very
close approximation which is hardly ever broken in practice.
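
The charging rule above can be illustrated with a toy Python model.
This is only a sketch: the Task, Mm and charge_target names are
hypothetical and do not mirror kernel data structures.  The one idea it
captures is that a page is charged to the cgroup tied to the mm which
owns the VMA, not to whoever happens to perform the allocation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mm:
    """Toy stand-in for an mm_struct; `cgroup` is its memory control domain."""
    name: str
    cgroup: str


@dataclass
class Task:
    """Toy stand-in for a task; helpers such as kernel threads have no mm."""
    name: str
    cgroup: str
    mm: Optional[Mm] = None


def charge_target(allocating_task: Task, vma_owner: Mm) -> str:
    """Return the cgroup to charge for a page added to a VMA of `vma_owner`.

    The allocating task is deliberately ignored: it may be operating on
    the VMA on someone else's behalf.
    """
    return vma_owner.cgroup


# Hypothetical cgroup paths, chosen only for the illustration.
app_mm = Mm("app", cgroup="/apps/db")
app = Task("db-worker", cgroup="/apps/db", mm=app_mm)
helper = Task("khugepaged-like", cgroup="/", mm=None)

own_fault = charge_target(app, app_mm)         # task faults on its own mm
foreign_fault = charge_target(helper, app_mm)  # helper works for someone else
print(own_fault, foreign_fault)
```

Both calls return the same cgroup, which is the point: the answer depends
on the mm alone.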

> While it's certainly easier to think about assigning processes to
> cgroups, and I certainly agree that, in the common case, it's the
> right thing to do, I don't see why requiring it is a good idea.  Can
> we turn this around: what actually goes wrong if cgroup v2 were to
> allow assigning individual threads if a user specifically requests it?

Consider the scenario where you have somebody faulting on behalf of a
foreign VMA, but the thread who created and is actively using that VMA
is in a different cgroup than the process leader.  Who are we going to
charge?  All possible answers seem erratic.
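
To make the ambiguity concrete, here is a small Python sketch of that
scenario.  All names (Thread, candidate_charge_targets, the cgroup
paths) are made up for illustration, and the three policies listed are
just the obvious options, not anything the kernel implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    name: str
    cgroup: str


def candidate_charge_targets(faulter: Thread, vma_user: Thread, leader: Thread) -> dict:
    """Map each plausible charging policy to the cgroup it would pick."""
    return {
        "charge the faulting task": faulter.cgroup,
        "charge the thread using the VMA": vma_user.cgroup,
        "charge the process leader": leader.cgroup,
    }


# Threads of one process split across cgroups, plus a foreign faulter.
leader = Thread("main", "/app/main")
worker = Thread("worker", "/app/batch")
foreign = Thread("helper", "/system")

targets = candidate_charge_targets(foreign, worker, leader)
distinct = set(targets.values())
for policy, cgroup in targets.items():
    print(f"{policy}: {cgroup}")
```

Every policy lands in a different cgroup here.  If all threads of a
process are required to share one cgroup, the last two candidates
coincide and match the cgroup associated with the mm.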

Please note that I agree that thread granularity can be useful for
some resources; however, my points are 1. it should be scoped so that
the resource distribution tree as a whole can be shared across
different resources, and 2. the cgroup filesystem interface isn't a
good interface for the purpose.  I'll continue the second point below.

> >   there are other reasons to enforce process granularity.  One
> >   important one is isolating system-level management operations from
> >   in-process application operations.  The cgroup interface, being a
> >   virtual filesystem, is very unfit for multiple independent
> >   operations taking place at the same time as most operations have to
> >   be multi-step and there is no way to synchronize multiple accessors.
> >   See also [5] Documentation/cgroup-v2.txt, "R-2. Thread Granularity"
> 
> I don't buy this argument at all.  System-level code is likely to
> assign single process *trees*, which are a different beast entirely.
> I.e. you fork, move the child into a cgroup, and that child and its
> children stay in that cgroup.  I don't see how the thread/process
> distinction matters.

Good point on the multi-process issue.  This is something which nagged
me a bit while working on rgroup, although I have to point out that
the issue here is one of not going far enough rather than the approach
being wrong.  There are limitations to scoping it to individual
processes but that doesn't negate the underlying problem or the
usefulness of in-process control.

For system-level and process-level operations to not step on each
other's toes, they need to agree on the granularity boundary -
system-level should be able to treat an application hierarchy as a
single unit.  A possible solution is allowing rgroup hierarchies to
span across process boundaries and implementing cgroup migration
operations which treat such hierarchies as a single unit.  I'm not yet
sure whether the boundary should be at program groups or rgroups.

> On the contrary: with cgroup namespaces, one could easily create a
> cgroup namespace, shove a process in it, and let that process delegate
> its threads to child cgroups however it likes.  (Well, children of the
> namespace root.)

cgroup namespace solves just one piece of the whole problem and not in
a very robust way.  It's okay for containers but not so for individual
applications.

* Using a namespace is neither trivial nor dependable.  It requires
  explicit mount setups, and, more importantly, an application can't
  rely on a specific namespace setup being there, unlike a
  setpriority() extension.  This affects application designs in the
  first place and severely hampers the accessibility and thus
  usefulness of in-application resource control.

* While it makes the names consistent from inside, it doesn't solve
  the atomicity issues when system and application operate on the
  subtree concurrently.

  Imagine a system-level operation trying to relocate the namespace.
  The symbolic names can be made to stay the same before and after,
  but that's about it.  During migration, depending on how migration
  is implemented, some may see paths linking back to the old or the
  new location.  Even the open files for the filesystem knobs wouldn't
  work after such a migration.

* It's just a bad interface if one has to use setpriority(2) to set a
  thread priority but, for thread groups, has to resort to opening a
  file, parsing a path, opening another file, and writing a number
  string which uses a completely different value range.
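
The contrast in the last point can be sketched in Python.  The sketch
rests on assumptions which are not in this mail: the knob name
cpu.weight and its 1..10000 range come from later cgroup v2 kernels, and
the helper names (set_thread_nice, own_cgroup_path, write_cpu_weight)
are hypothetical.  A temporary directory stands in for the cgroup mount
so nothing on the real system is touched.

```python
import os
import tempfile
import threading


def set_thread_nice(nice: int) -> None:
    """Single call: set the calling thread's nice value (range -20..19).

    Relies on Linux applying setpriority(2) with PRIO_PROCESS and a
    thread ID to that thread only.  Not invoked below, since lowering
    the nice value needs privileges.
    """
    os.setpriority(os.PRIO_PROCESS, threading.get_native_id(), nice)


def own_cgroup_path(proc_cgroup_text: str) -> str:
    """Parse the cgroup v2 entry ("0::<path>") from /proc/self/cgroup text."""
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    raise LookupError("no cgroup v2 entry found")


def write_cpu_weight(cgroup_dir: str, weight: int) -> None:
    """Write a weight as a decimal string to the cpu.weight knob (assumed name)."""
    if not 1 <= weight <= 10000:
        raise ValueError("cpu.weight must be within 1..10000")
    with open(os.path.join(cgroup_dir, "cpu.weight"), "w") as knob:
        knob.write(f"{weight}\n")


# Step 1: "open a file, parse a path" (sample text replaces /proc/self/cgroup).
sample = "12:cpu,cpuacct:/legacy\n0::/user.slice/app\n"
path = own_cgroup_path(sample)

# Step 2: "open another file, write a number string" (fake mount point).
with tempfile.TemporaryDirectory() as fake_root:
    cgroup_dir = os.path.join(fake_root, path.lstrip("/"))
    os.makedirs(cgroup_dir)
    write_cpu_weight(cgroup_dir, 200)
    with open(os.path.join(cgroup_dir, "cpu.weight")) as knob:
        written = knob.read()
```

The thread case is one system call with a small integer, while the group
case takes two file accesses, a parsing step, and a string in a
different value range.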

> >   2-1-2. No Internal Process Constraint
> >
> >   cgroup v2 does not allow processes to belong to any cgroup which has
> >   child cgroups when resource controllers are enabled on it (the
> >   notable exception being the root cgroup itself).
> 
> Can you elaborate on this exception?  How do you get any of the
> supposed benefits of not having processes and cgroups exist as
> siblings when you make an exception for the root?  Similarly, if you
> make an exception for the root, what do you do about cgroup namespaces
> where the apparent root isn't the global root?

Having a special case doesn't necessarily get in the way of benefiting
from a set of general rules.  The root cgroup is inherently special as
it has to be the catch-all scope for entities and resource
consumptions which can't be tied to any specific consumer - irq
handling, packet rx, journal writes, memory reclaim from global memory
pressure and so on.  None of the sub-cgroups has to worry about them.

These base-system operations are special regardless of cgroup and we
already have sometimes crude ways to affect their behaviors where
necessary through sysctl knobs, priorities on specific kernel threads
and so on.  cgroup doesn't change the situation all that much.  What
gets left in the root cgroup usually are the base-system operations
which are outside the scope of cgroup resource control in the first
place and the cgroup resource graph can treat the root as an opaque anchor
point.

There can be other ways to deal with the issue; however, treating root
cgroup this way has the big advantage of minimizing the gap between
configurations without and with cgroups both in terms of mental model
and implementation.

Hopefully, the case of a namespace root is clear now.  If it's gonna
have a sub-hierarchy, it can't itself contain processes, whereas the
system root just contains base-system entities and resources which a
namespace root doesn't have to worry about.  Ignoring base-system
stuff, a namespace root is topologically in the same position as the
system root in the cgroup resource graph.
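
As a rough illustration of the rule and its single exception, the
following Python sketch checks a toy cgroup tree.  The Cgroup class and
the violations helper are hypothetical, and the condition is a
simplified reading of the quoted documentation text rather than the
kernel's actual check.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    processes: list = field(default_factory=list)
    children: list = field(default_factory=list)
    controllers_enabled: bool = False
    is_system_root: bool = False  # only the real root is exempt


def violations(cg: Cgroup) -> list:
    """Return names of cgroups in the subtree which hold processes
    while also having children with resource controllers enabled."""
    bad = []
    if (cg.processes and cg.children and cg.controllers_enabled
            and not cg.is_system_root):
        bad.append(cg.name)
    for child in cg.children:
        bad.extend(violations(child))
    return bad


# A namespace root is an ordinary cgroup here, so it gets no exemption.
leaf = Cgroup("ns/app", processes=["app"])
ns_root = Cgroup("ns", processes=["init"], children=[leaf],
                 controllers_enabled=True)
root = Cgroup("/", processes=["kthreadd"], children=[ns_root],
              controllers_enabled=True, is_system_root=True)

found = violations(root)
print(found)
```

The system root holds a process next to child cgroups without being
flagged, while the namespace root is reported until its processes are
moved into a leaf.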

Thanks.

-- 
tejun