linux-kernel - Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field

Message-ID: <877fg06uf9.fsf@x220.int.ebiederm.org>
Date:	Thu, 14 Apr 2016 11:12:42 -0500
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	"Serge E. Hallyn" <serge@...lyn.com>
Cc:	Tejun Heo <tj@...nel.org>, linux-api@...r.kernel.org,
	adityakali@...gle.com,
	Linux Containers <containers@...ts.osdl.org>,
	cgroups@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field

"Serge E. Hallyn" <serge@...lyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@...ssion.com):
>> "Serge E. Hallyn" <serge@...lyn.com> writes:
>> 
>> > This is so that userspace can distinguish a mount made in a cgroup
>> > namespace from a bind mount from a cgroup subdirectory.
>> 
>> To do that do you need to print the path, or is an extra option that
>> reveals nothing except that it was a cgroup mount sufficient?
>> 
>> Is there any practical difference between a mount in a namespace and a
>> bind mount?
>> 
>> Given the way the conversation has been going I think it would be good
>> to see the answers to these questions.  Perhaps I missed it but I
>> haven't seen the answers to those questions.
>
> Yup, I tried to answer those in my last email, let me try again.
>
> Let's say I start a container using cgroup namespaces, /lxc/x1.  It mounts
> freezer at /sys/fs/cgroup so it has field three of mountinfo as /lxc/x1,
> and /sys/fs/cgroup/ is the path to the container's cgroup (/lxc/x1).  In
> that container, I start another container x1, not using cgroup namespaces.
> It also wants a cgroup mount, and a common way to handle that (to prevent
> the container from rewriting its limits) is to mount a tmpfs at
> /sys/fs/cgroup, create /sys/fs/cgroup/lxc/x1, and bind mount
> /sys/fs/cgroup/lxc/x1 from the parent container onto /sys/fs/cgroup/lxc/x1
> in the child container.
> Now for that bind mount, the mountinfo field 3 will show /lxc/x1/lxc/x1,
> with mount target /sys/fs/cgroup/lxc/x1, while /proc/self/cgroup for a task
> in that container will show '/lxc/x1'.  Unless it has been moved into
> /lxc/x1/lxc/x1 in the container (/lxc/x1/lxc/x1/lxc/x1 on the host)...
> Every time I've thought "maybe we can just..." I've found a case where it
> wouldn't work.
>
> At first in lxc we simply said if /proc/self/ns/cgroup exists assume that
> the cgroupfs mounts are not bind mounts.  However, old userspace (and
> container drivers) on new kernels is certainly possible, especially an
> older distro in a container on a newer distro on the host.  That completely
> breaks with this approach.
>
> I also personally think there *is* value in letting a task know its
> place on the system, so hiding the full cgroup path is imo not only not
> a valid goal, it's counter-productive.  Part of making for better
> virtualization is to give userspace all the info it needs about its
> current limits.  Consider that with the unified hierarchy, you cannot
> have tasks in a cgroup that also has child cgroups - except for the
> root.  Cgroup namespaces do not make an exception for this, so knowing
> that you are not in the absolute cgroup root actually can prevent you
> from trying something that cannot work.  Or, I suppose, at least
> understanding why you're unable to do what you're trying to do (namely
> your container manager messed up).  I point this out because finding
> a way to only show the namespaced root in field 3 of mountinfo would
> fix the base problem, but at the cost of hiding useful information
> from a container.

It is just the superblock show_path method.  And regardless of how
useful the rest of your mount option turns out to be, implementing
show_path appears to be fundamentally the right thing in this context,
as that field appears to have the same issue as /proc/self/cgroup.

Eric