Message-ID: <CAN6Zp5wmMOR+UAfk7D3gXGqFk+VX93yMebd61URtrCMV-Dvfpg@mail.gmail.com>
Date: Wed, 22 Jul 2015 14:10:34 -0400
From: Vincent Batts <vbatts@...il.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Aditya Kali <adityakali@...gle.com>, linux-api@...r.kernel.org,
Linux Containers <containers@...ts.linux-foundation.org>,
serge.hallyn@...ntu.com, linux-kernel@...r.kernel.org,
luto@...capital.net, tj@...nel.org, cgroups@...r.kernel.org,
mingo@...hat.com
Subject: Re: [PATCHv1 0/8] CGroup Namespaces
Has there been further movement on CLONE_NEWCGROUP outside of this?
vb
On Sun, Oct 19, 2014 at 12:54 AM, Eric W. Biederman
<ebiederm@...ssion.com> wrote:
> Aditya Kali <adityakali@...gle.com> writes:
>
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes from RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>> mounts the cgroup hierarchy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>> your cgroupns-root.
>>
>> More details in the writeup below.
>
> This definitely looks like the right direction to go, and something that
> in some form or another I had been asking for since cgroups were merged.
> So I am very glad to see this work moving forward.
>
> I had hoped that we might just be able to be clever with remounting
> cgroupfs but 2 things stand in the way.
> 1) /proc/<pid>/cgroup (but proc could capture that).
> 2) providing a hard guarantee that tasks stay within a subset of the
> cgroup hierarchy.
>
> So I think this clearly meets the requirements for a new namespace.
>
> We need to have the discussion on chmod of files on cgroupfs. There is
> a notion that has floated around that only systemd or only root (with
> the appropriate capabilities) should be allowed to set resource limits
> in cgroupfs. In practical reality that is nonsense. If an attribute
> is properly bound in its hierarchy it should be safe to change.
>
> Not all attributes are properly bound to hierarchy and some are or at
> least were dangerous for anyone except root to set. So I suggest that a
> CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
> to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
> a cgroup attribute from root.
>
> That would be complementary work, and not strictly tied to the cgroup
> namespaces but unprivileged cgroup namespaces don't make much sense
> without that work.
>
> Eric
>
>> Background
>> Cgroups and Namespaces are used together to create “virtual”
>> containers that isolate the host environment from the processes
>> running in the container. But since cgroups themselves are not
>> “virtualized”, a task is always able to see the global cgroup view
>> through the cgroupfs mount and via the /proc/self/cgroup file.
>>
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> This exposure of cgroup names to the processes running inside a
>> container results in some problems:
>> (1) The container names are typically host-container-management-agent
>> (systemd, docker/libcontainer, etc.) data and leaking its name (or
>> leaking the hierarchy) reveals too much information about the host
>> system.
>> (2) It makes the container migration across machines (CRIU) more
>> difficult as the container names need to be unique across the
>> machines in the migration domain.
>> (3) It makes it difficult to run container management tools (like
>> docker/libcontainer, lmctfy, etc.) within virtual containers
>> without adding dependency on some state/agent present outside the
>> container.
>>
>> Note that the feature proposed here is completely different than the
>> “ns cgroup” feature which existed in the Linux kernel until recently.
>> The ns cgroup also attempted to connect cgroups and namespaces by
>> creating a new cgroup every time a new namespace was created. It did
>> not solve any of the above mentioned problems and was later dropped
>> from the kernel. Incidentally though, it used the same config option
>> name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>> With unified cgroup hierarchy
>> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>> have a much more coherent cgroup view and it's easy to associate a
>> container with a single cgroup. This also allows us to virtualize the
>> cgroup view for tasks inside the container.
>>
>> The new CGroup Namespace allows a process to “unshare” its cgroup
>> hierarchy starting from the cgroup it's currently in.
>> For example:
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>> $ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>> [ns]$ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> # From within the new cgroupns, the process sees that it's in the root cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>> # From global cgroupns:
>> $ cat /proc/<pid>/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> # Unshare cgroupns along with userns and mountns
>> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> # sets up uid/gid map and exec’s /bin/bash
>> $ ~/unshare -c -u -m
>>
>> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>> # hierarchy.
>> [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> [ns]$ ls -l /tmp/cgroup
>> total 0
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>> filesystem root for the namespace specific cgroupfs mount.
>>
>> The virtualization of /proc/self/cgroup file combined with restricting
>> the view of cgroup hierarchy by namespace-private cgroupfs mount
>> should provide a completely isolated cgroup view inside the container.
>>
>> In its current form, the cgroup namespaces patchset provides the following
>> behavior:
>>
>> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>> the process calling unshare is running.
>> For example, if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>> For the init_cgroup_ns, this is the real root (“/”) cgroup
>> (identified in code as cgrp_dfl_root.cgrp).
>>
>> (2) The cgroupns-root cgroup does not change even if the namespace
>> creator process later moves to a different cgroup.
>> $ ~/unshare -c # unshare cgroupns in some cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> [ns]$ mkdir sub_cgrp_1
>> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (3) Each process gets its CGROUPNS specific view of
>> /proc/<pid>/cgroup.
>> (a) Processes running inside the cgroup namespace will be able to see
>> cgroup paths (in /proc/self/cgroup) only inside their root cgroup:
>> [ns]$ sleep 100000 & # From within unshared cgroupns
>> [1] 7353
>> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (b) From global cgroupns, the real cgroup path will be visible:
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>> (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>> path will be visible:
>> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>> [ns2]$ cat /proc/7353/cgroup
>> [ns2]$
>> This is the same as when the cgroup hierarchy is not mounted at all.
>> (In a correct container setup though, it should not be possible to
>> access PIDs in another container in the first place.)
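The three views in (3) amount to a path-relativization rule. Below is a minimal Python sketch of that rule as described in the writeup, not the kernel implementation. The name `visible_cgroup_path` is hypothetical, and returning `None` stands in for the empty output seen from a sibling cgroupns.

```python
from typing import Optional


def visible_cgroup_path(real_path: str, ns_root: str) -> Optional[str]:
    """Return the cgroup path a reader in a cgroupns rooted at ns_root sees.

    Hypothetical helper modelling the behavior described in (3).
    Returns None when the cgroup lies outside ns_root (nothing is shown).
    """
    real = [p for p in real_path.split("/") if p]
    root = [p for p in ns_root.split("/") if p]
    # The cgroup is visible only if ns_root is a prefix of its real path.
    if real[:len(root)] != root:
        return None
    # Report the path relative to the cgroupns-root.
    return "/" + "/".join(real[len(root):])


real = "/batchjobs/c_job_id1/sub_cgrp_1"
print(visible_cgroup_path(real, "/batchjobs/c_job_id1"))  # case (a)
print(visible_cgroup_path(real, "/"))                     # case (b)
print(visible_cgroup_path(real, "/batchjobs/c_job_id2"))  # case (c)
```

Comparing path components rather than raw string prefixes keeps a cgroup such as /batchjobs/c_job_id10 from being mistaken for a child of /batchjobs/c_job_id1.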
>>
>> (4) Processes inside a cgroupns are not allowed to move out of the
>> cgroupns-root. This is true even if a privileged process in global
>> cgroupns tries to move the process out of its cgroupns-root.
>>
>> # From global cgroupns
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>> # cgroupns-root for 7353 is /batchjobs/c_job_id1
>> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>> -bash: echo: write error: Operation not permitted
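The restriction in (4) can be modelled as a containment check on the destination cgroup. This is a rough Python sketch under the assumption that a move is rejected with EPERM whenever the destination is outside the task's cgroupns-root. `is_within` and `check_move` are hypothetical names, not functions from the patch-set.

```python
import errno


def is_within(path: str, root: str) -> bool:
    """True if path equals root or lies below it (component-wise compare)."""
    parts = [p for p in path.split("/") if p]
    root_parts = [p for p in root.split("/") if p]
    return parts[:len(root_parts)] == root_parts


def check_move(task_ns_root: str, dest_cgroup: str) -> int:
    """Return 0 if the move is allowed, errno.EPERM otherwise.

    Hypothetical model of (4): the writer's privileges are deliberately
    not an input, since the writeup says even a privileged process in the
    global cgroupns cannot move the task out of its cgroupns-root.
    """
    return 0 if is_within(dest_cgroup, task_ns_root) else errno.EPERM


# The example above: 7353 has cgroupns-root /batchjobs/c_job_id1.
print(check_move("/batchjobs/c_job_id1", "/batchjobs/c_job_id2"))
print(check_move("/batchjobs/c_job_id1", "/batchjobs/c_job_id1/sub_cgrp_1"))
```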
>>
>> (5) Setns to another cgroup namespace is allowed only when:
>> (a) process has CAP_SYS_ADMIN in its current userns
>> (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
>> (c) the process's current cgroup is a descendant of the cgroupns-root of the
>> target namespace.
>> (d) the target cgroupns-root is a descendant of the current cgroupns-root.
>> The last check (d) prevents processes from escaping their cgroupns-root by
>> attaching to parent cgroupns. Thus, setns is allowed only when the process
>> is trying to restrict itself to a deeper cgroup hierarchy.
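The four setns conditions in (5) can be read as a single predicate. The Python sketch below illustrates that reading and is not the code from the patch-set. `SetnsRequest` and `setns_allowed` are hypothetical names, and treating "descendant" as "descendant or the same cgroup" is an assumption the writeup does not spell out.

```python
from dataclasses import dataclass


@dataclass
class SetnsRequest:
    cap_sys_admin_in_current_userns: bool  # condition (a)
    cap_sys_admin_in_target_userns: bool   # condition (b)
    current_cgroup: str                    # real path of the caller's cgroup
    current_ns_root: str                   # caller's cgroupns-root
    target_ns_root: str                    # cgroupns-root of the target ns


def _descendant_or_self(path: str, ancestor: str) -> bool:
    """Assumption: a cgroup counts as a descendant of itself."""
    parts = [p for p in path.split("/") if p]
    anc = [p for p in ancestor.split("/") if p]
    return parts[:len(anc)] == anc


def setns_allowed(req: SetnsRequest) -> bool:
    """Apply checks (a) through (d) from the list above."""
    return (
        req.cap_sys_admin_in_current_userns
        and req.cap_sys_admin_in_target_userns
        # (c) the caller already sits under the target cgroupns-root
        and _descendant_or_self(req.current_cgroup, req.target_ns_root)
        # (d) the target root is at or below the caller's current root
        and _descendant_or_self(req.target_ns_root, req.current_ns_root)
    )
```

With these checks, a task in /batchjobs/c_job_id1/sub_cgrp_1 whose cgroupns-root is "/" may enter a namespace rooted at /batchjobs/c_job_id1, while the reverse direction fails check (d).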
>>
>> (6) When some thread from a multi-threaded process unshares its
>> cgroup-namespace, the new cgroupns gets applied to the entire
>> process (all the threads). This should be OK since
>> unified-hierarchy only allows process-level containerization. So
>> all the threads in the process will have the same cgroup. And both
>> - changing cgroups and unsharing namespaces - are protected under
>> threadgroup_lock(task).
>>
>> (7) The cgroup namespace is alive as long as there is at least one
>> process inside it. When the last process exits, the cgroup
>> namespace is destroyed. The cgroupns-root and the actual cgroups
>> remain though.
>>
>> (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>> the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>> The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>> container management tools to be run inside the containers transparently.
>>
>> Implementation
>> The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>> branch). It's fairly non-intrusive and provides the above-mentioned
>> features.
>>
>> Possible extensions of CGROUPNS:
>> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>> capabilities to restrict cgroups to administrative users. CGroup
>> namespaces could be of help here. With cgroup namespaces, it might
>> be possible to delegate administration of sub-cgroups under a
>> cgroupns-root to the cgroupns owner.
>
>> ---
>> fs/kernfs/dir.c | 53 +++++++++---
>> fs/kernfs/mount.c | 48 +++++++++++
>> fs/proc/namespaces.c | 3 +
>> include/linux/cgroup.h | 41 +++++++++-
>> include/linux/cgroup_namespace.h | 62 +++++++++++++++
>> include/linux/kernfs.h | 5 ++
>> include/linux/nsproxy.h | 2 +
>> include/linux/proc_ns.h | 4 +
>> include/uapi/linux/sched.h | 3 +-
>> init/Kconfig | 9 +++
>> kernel/Makefile | 1 +
>> kernel/cgroup.c | 139 ++++++++++++++++++++++++++------
>> kernel/cgroup_namespace.c | 168 +++++++++++++++++++++++++++++++++++++++
>> kernel/fork.c | 2 +-
>> kernel/nsproxy.c | 19 ++++-
>> 15 files changed, 518 insertions(+), 41 deletions(-)
>> create mode 100644 include/linux/cgroup_namespace.h
>> create mode 100644 kernel/cgroup_namespace.c
>>
>> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
>> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
>> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
>> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
>> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
>> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
>> [PATCHv1 7/8] cgroup: cgroup namespace setns support
>> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
>> _______________________________________________
>> Containers mailing list
>> Containers@...ts.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers