[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5435AE41.20105@landley.net>
Date: Wed, 08 Oct 2014 16:36:01 -0500
From: Rob Landley <rob@...dley.net>
To: Andy Lutomirski <luto@...capital.net>,
"Eric W. Biederman" <ebiederm@...ssion.com>
CC: Andrew Vagin <avagin@...allels.com>,
Andrey Vagin <avagin@...nvz.org>,
Linux FS Devel <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Linux API <linux-api@...r.kernel.org>,
Andrey Vagin <avagin@...il.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Cyrill Gorcunov <gorcunov@...nvz.org>,
Pavel Emelyanov <xemul@...allels.com>,
Serge Hallyn <serge.hallyn@...onical.com>
Subject: Re: [PATCH] [RFC] mnt: add ability to clone mntns starting with the
current root
On 10/08/14 14:31, Andy Lutomirski wrote:
> On Wed, Oct 8, 2014 at 12:23 PM, Eric W. Biederman
> <ebiederm@...ssion.com> wrote:
>> Andy Lutomirski <luto@...capital.net> writes:
>>>> Maybe we want to say that rootfs should not be used if we are going to
>>>> create containers...
>>
>> Today it is an assumption of the vfs that rootfs is mounted. With
>> rootfs mounted and pivot_root at the base of the mount stack you can
>> make as minimal of a set of mounts as the vfs allows.
>>
>> Removing rootfs from the vfs requires an audit of everything that
>> manipulates mounts. It is not remotely a local excercise.
>
> Would it be a less invasive audit to allow different mount namespaces
> to have different rootfses?
I.E. The same way different namespaces have different init tasks?
The abstraction containers has implemented here should be logically
consistent.
>>> Could we have an extra rootfs-like fs that is always completely empty,
>>> doesn't allow any writes, and can sit at the bottom of container
>>> namespace hierarchies? If so, and if we add a new syscall that's like
>>> pivot_root (or unshare) but prunes the hierarchy, then we could switch
>>> to that rootfs then.
>>
>> Or equally have something that guarantees that rootfs is empty and
>> read-only at the time the normal root filesystem is mounted. That is
>> certainly a much more localized change if we want to go there.
>>
>> I am half tempted to suggest that mount --move /some/path / be updated
>> to make the old / just go away (perhaps to be replaced with a read-only
>> empty rootfs). That gets us into figuring out if we break userspace
>> which is a big challenge.
>
> Hence my argument for a new syscall or entirely new operation.
I'm still waiting for somebody to explain to my why chroot() shouldn't
be changed to do this instead of adding a new syscall. (At least when
mount namespace support is enabled.)
> mount(2) and friends are way too multiplexed right now. I just found
> yet another security bug due to the insanely complicated semantics of
> the vfs syscalls. (Yes, a different one from the one yesterday.)
As the guy who rewrote busybox mount 3 times, and who just implemented a
brand new one (toybox) from scratch:
It's a bit fiddly, yes.
> A new operation kills several birds with one stone. It could look like:
>
> int mntns_change_root(int dfd, const char *path, int flags);
>
> return -EPERM if chrooted.
Really?
> Returns -EINVAL if path (relative to dfd) isn't a mountmount.
Requiring that chroot() only be called on mountpoints would break
existing semantics, which gets us back to new systemcall instead of
changing behavior of existing one.
If I recall, the first line of pushback against merging the openvz code
as is was "buckets of new syscalls". Pushback against adding a new
system call is understandable. Why can't we fix chroot() now that we
have the tools to do so?
> Otherwise it disconnects path from the existing
> hierarchy, attaches a permanently-empty read-only rootfs under it,
> makes it the root of the mntns, and does the root refs fixup. The old
> hierarchy gets thrown out.
We have a chroot() syscall. We don't use it for containers because it
doesn't do what we want. Does it currently do what _anybody_ wants?
> Systemd could use this, too.
While that's a strong argument against it, I'm willing to overlook it.
Rob
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists