Message-ID: <20160307050332.GA14163@mail.hallyn.com>
Date: Sun, 6 Mar 2016 23:03:32 -0600
From: "Serge E. Hallyn" <serge@...lyn.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: "Serge E. Hallyn" <serge@...lyn.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Serge Hallyn <serge.hallyn@...ntu.com>,
Seth Forshee <seth.forshee@...onical.com>,
lkml <linux-kernel@...r.kernel.org>,
Stéphane Graber <stgraber@...ntu.com>
Subject: Re: user namespace and fully visible proc and sys mounts
On Sun, Mar 06, 2016 at 07:49:14PM -0800, Andy Lutomirski wrote:
> On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <serge@...lyn.com> wrote:
> > On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
> >> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@...ssion.com> wrote:
> >> >
> >> > "Serge E. Hallyn" <serge.hallyn@...ntu.com> writes:
> >> >
> >> > > Hi,
> >> > >
> >> > > So we've been over this many times... but unfortunately there is more
> >> > > breakage to report. Regular privileged and unprivileged containers
> >> > > work all right for us. But running an unprivileged container inside a
> >> > > privileged container is blocked.
> >> > >
> >> > > When creating privileged containers, lxc by default does a few things:
> >> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
> >> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> >> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> >> > > (because this container is not in a user namespace) then moves
> >> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> >> > > /sys/devices/virtual/net as writeable.
> >> > >
> >> > > If any of these are left enabled, unprivileged containers can't be
> >> > > started. If all are disabled, then they can be.
> >> > >
> >> > > Can we find a way to make these not block remounts in child user
> >> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
> >> >
> >> > Are any of these overmounts done for the purpose of security? It
> >> > appears the /proc/sys and /sys mounts being made read-only is for that
> >> > purpose.
> >> >
> >> > If none of the mounts are for security the easy solution that works
> >> > today is to also mount /proc and /sys somewhere else in your container
> >> > so that the permission check for mounting a new copy passes.
> >>
> >> Can we use the big hammer approach on /proc/sys? Specifically, what
> >> if we made it so that /proc mounts created in a non-root namespace
> >> *only* see things that are scoped to the active namespaces, and only
> >> those over which the mounter has capabilities? We could have mount
> >> options for this.
> >
> > Of course the problem is precisely non-user-namespaced containers which
> > do own and have capabilities over the /proc/sys/ files. For user-namespaced
> > containers /proc/sys/ isn't really an issue.
>
> What I mean is:
>
> mount -o nsonly=user,net -t proc none /proc
>
> would show the list of processors and things scoped to the current
> userns and netns, would *not* show global sysctls, and would fail
> unless the caller has appropriate caps over the userns and netns.
> This would work even if the old procfs is not fully visible.
Gah, so apparently I'd forgotten the workaround I'd implemented - I
thought things had regressed, but they haven't, I'd just missed a step.
Sorry for the noise. I don't want to make things more complicated or
more brittle when we can make it work as is - thanks.
-serge
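
For reference, the workaround Eric suggests earlier in the thread (keeping an
extra copy of /proc and /sys mounted elsewhere in the container, with nothing
mounted over it) can be sanity-checked from userspace. Below is a minimal
Python sketch, assuming the mountinfo line format documented in proc(5). The
function name find_full_mounts and the rule "a whole-filesystem mount with no
submounts beneath it counts as unobscured" are illustrative assumptions. The
kernel's real visibility check is more involved (it also looks at mount
flags, for example), so treat this as a rough approximation only.

```python
def find_full_mounts(mountinfo_text, fstypes=("proc", "sysfs")):
    """Map each whole-filesystem proc/sysfs mount point to its submounts.

    Hypothetical helper: an empty list means nothing is mounted beneath
    that copy, which is only a rough stand-in for the kernel's own check.
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount; fields after are per-filesystem.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields = left.split()
        root, mount_point = fields[3], fields[4]
        fstype = right.split()[0]
        mounts.append((root, mount_point, fstype))

    result = {}
    for root, mount_point, fstype in mounts:
        # Skip bind mounts of a subtree (root != "/"), e.g. /proc/sys
        # bind-mounted read-only over itself.
        if fstype not in fstypes or root != "/":
            continue
        prefix = mount_point.rstrip("/") + "/"
        result[mount_point] = [
            other for _, other, _ in mounts if other.startswith(prefix)
        ]
    return result
```

Feeding it the contents of /proc/self/mountinfo from inside the privileged
container would show whether any proc or sysfs copy is left without
overmounts. The mount point chosen for the extra copy is up to the container
configuration; nothing in the thread names one.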