linux-kernel - Re: cgroup namespace and user namespace interactions

Message-ID: <CAOviyaj+hxAuu46sqpxNPxT37MBQo5OB_qfWVyr9uis3X5Sruw@mail.gmail.com>
Date:	Fri, 29 Apr 2016 15:54:03 +1000
From:	Aleksa Sarai <cyphar@...har.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: cgroup namespace and user namespace interactions

> The new cgroup namespace currently only allows for superficial
> interaction with the user namespace (before allowing mounting and
> similar operations, it checks whether the user has the right
> capabilities in the user namespace it was created in). However, there is one
> glaring feature that appears to be missing from the new cgroup
> namespace implementation: unprivileged user namespaces can't modify
> their sub-hierarchy. This is particularly frustrating for the
> containerisation community, where we are working on adding support for
> "rootless containers" in runC (the execution driver of Docker)[1]. It
> essentially means that we can't use cgroup resource limiting to limit
> *the resources of our own processes*. It also makes things like the
> freezer cgroup unusable.
>
> Here follows how I think we can solve this issue: the most obvious way
> of dealing with this would be (in the cgroupv1 view) to create a new
> subtree in every controller when you CLONE_NEWCGROUP. This new subtree
> is the root of the process's cgroup hierarchy. This doesn't affect any
> resource control, but it will result in the process only being able to
> affect its *own* resources. However, for cgroupv2 we have the "No
> Internal Process Constraint". So, maybe we could also move all of the
> other processes into a sibling subtree (with the *exact same* access
> permissions as the parent). Thus, the operation would look like this:
>
> - C0 - P00
>        \ P01
>        \ P02 (about to unshare)
>
> becomes
>
> - C0 - C00 - P00
>              \ P01
>        \ C01 - P02
>
> But then we have C00 which is just a waste of cycles (it doesn't have
> any resource settings). So maybe there's some optimisation we can do
> there, but that's as far as I've gotten into thinking about how to
> deal with the constraints of cgroupv2. After that's been solved we can
> reuse how we store the user namespace the cgroup was created in
> (cgroup_namespace.user_ns), and just check that whatever user is
> trying to modify the cgroup has CAP_SYS_ADMIN in that user namespace.
>
> Do you think this would work? Are there any recommendations on how
> we can make this work better? Also, can you clarify whether
> CLONE_NEWCGROUP only works for cgroupv2, or does it also work on
> cgroupv1 (we haven't yet transitioned to cgroupv2 in runC)?
>
> Thanks.
>
> [1]: https://github.com/opencontainers/runc/pull/774


Does anyone have an opinion on this proposal?
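
To make the cgroupv2 part of the quoted proposal easier to discuss, here is
a toy userspace model of the split in the diagram above. It is only a sketch
and not kernel code: the Cgroup class and split_for_new_cgroupns() are
hypothetical names invented for illustration, and the child naming (C00, C01)
simply follows the diagram and assumes C0 has no existing children.

```python
# Toy model of the proposed cgroupv2 split. Nothing here is a kernel API;
# Cgroup and split_for_new_cgroupns are hypothetical illustration names.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.procs = []
        if parent is not None:
            parent.children.append(self)


def split_for_new_cgroupns(cg, proc):
    """Move `proc` into a new child of `cg` and every other process into a
    sibling child, so that `cg` itself no longer holds any processes."""
    if proc not in cg.procs:
        raise ValueError(f"{proc} is not in {cg.name}")
    others = Cgroup(cg.name + "0", parent=cg)  # C00: same permissions as C0
    own = Cgroup(cg.name + "1", parent=cg)     # C01: root of the new namespace
    others.procs = [p for p in cg.procs if p != proc]
    own.procs = [proc]
    cg.procs = []                              # no internal processes remain
    return own


c0 = Cgroup("C0")
c0.procs = ["P00", "P01", "P02"]
ns_root = split_for_new_cgroupns(c0, "P02")
```

After the call, C0 holds no processes directly, C00 holds P00 and P01, and
C01 holds only P02. The model says nothing about resource settings on C00,
which is exactly the part that still looks wasteful.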

-- 
Aleksa Sarai (cyphar)
www.cyphar.com