linux-kernel - Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

Message-ID: <1401005530.2322.43.camel@dabdike.int.hansenpartnership.com>
Date:	Sun, 25 May 2014 12:12:10 +0400
From:	James Bottomley <James.Bottomley@...senPartnership.com>
To:	Serge Hallyn <serge.hallyn@...ntu.com>
Cc:	Marian Marinov <mm@...com>, Andy Lutomirski <luto@...capital.net>,
	"Serge E. Hallyn" <serge@...lyn.com>,
	"Michael H. Warfield" <mhw@...tsend.com>,
	Arnd Bergmann <arnd@...db.de>,
	LXC development mailing-list 
	<lxc-devel@...ts.linuxcontainers.org>,
	Richard Weinberger <richard@....at>,
	LKML <linux-kernel@...r.kernel.org>,
	Serge Hallyn <serge.hallyn@...onical.com>,
	Jens Axboe <axboe@...nel.dk>
Subject: Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user
 namespaces

On Sat, 2014-05-24 at 22:25 +0000, Serge Hallyn wrote:
> Quoting James Bottomley (James.Bottomley@...senPartnership.com):
> > On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
> > > On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> > > > Quoting Andy Lutomirski (luto@...capital.net):
> > > >> On May 15, 2014 1:26 PM, "Serge E. Hallyn" <serge@...lyn.com> wrote:
> > > >>> 
> > > >>> Quoting Richard Weinberger (richard@....at):
> > > >>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
> > > >>>>> Quoting Richard Weinberger (richard.weinberger@...il.com):
> > > >>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman <gregkh@...uxfoundation.org> wrote:
> > > >>>>>>> Then don't use a container to build such a thing, or fix the build scripts to not do that :)
> > > >>>>>> 
> > > >>>>>> I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM
> > > >>>>>> would much better fit in. Please don't put more complexity into containers. They are already horrible
> > > >>>>>> complex and error prone.
> > > >>>>> 
> > > >>>>> I, naturally, disagree :)  The only use case which is inherently not valid for containers is running a
> > > >>>>> kernel.  Practically speaking there are other things which likely will never be possible, but if someone 
> > > >>>>> offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
> > > >>>>> 
> > > >>>>> "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected,
> > > >>>>> resulting in the development of pid namespaces.  "We have to work out (x) first" can be valid (and I can
> > > >>>>> think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
> > > >>>>> 
> > > >>>>> Finally, saying "containers are complex and error prone" is conflating several large suites of userspace
> > > >>>>> code and many kernel features which support them.  Being more precise would, if the argument is valid, lend
> > > >>>>> it a lot more weight.
> > > >>>> 
> > > >>>> We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the
> > > >>>> internals better I also wrote my own userspace to create/start containers. There are so many things which can
> > > >>>> hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a
> > > >>>> user is allowed to mount filesystems.
> > > >>> 
> > > >>> That is currently not the case.  They can mount some virtual filesystems and do bind mounts, but cannot mount
> > > >>> most real filesystems.  This keeps us protected (for now) from potentially unsafe superblock readers in the 
> > > >>> kernel.
> > > >>> 
> > > >>>> Ask Andy, he found already lots of nasty things...
> > > >> 
> > > >> I don't think I have anything brilliant to add to this discussion right now, except possibly:
> > > >> 
> > > >> ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an
> > > >> untrusted user can cause a block device to appear.  That user doesn't need permission to mount it
> > > > 
> > > > Interesting point.  This would further suggest that we absolutely must ensure that a loop device which shows up in
> > > > the container does not also show up in the host.
> > > 
> > > Can I suggest the usage of the devices cgroup to achieve that?
> > 
> > Not really ... cgroups impose resource limits, it's namespaces that
> > impose visibility separations.  In theory this can be done with the
> > device namespace that's been proposed; however, a simpler way is simply
> > to rm the device node in the host and mknod it in the guest.  I don't
> > really see host visibility as a huge problem: in a shared OS
> > virtualisation it's not really possible securely to separate the guest
> > from the host (only vice versa).
> > 
> > But I really don't think we want to do it this way.  Giving a container
> > the ability to do a mount is too dangerous.  What we want to do is
> > intercept the mount in the host and perform it on behalf of the guest as
> > host root in the guest's mount namespace.  If you do it that way, it
> 
> That doesn't help the problem of guests being able to provide bad input
> for (basically fuzz) the in-kernel filesystem code.  So apparently I'm
> suffering a failure of the imagination - what problem exactly does it solve?

Well, there are two types of fuzzing.  One is on sys_mount, which this
would help with, because the host filters the mount, including all its
parameters, and may even redo the mount (from direct to bind, etc.).

The other is fuzzing within the filesystem itself.  If you're thinking
the system can be compromised that way, then yes, I agree, but it's the
same vulnerability an unvirtualised host would have, so I don't
necessarily see it as our problem.

The problem a vectored mount solves is that we don't want root in the
container to have unfettered access to sys_mount: vectoring lets the
host vet every call and execute the ones it likes in the context of
real root (possibly after modifying the parameters).
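
To make the vetting step concrete, here is a minimal sketch of the kind
of host-side policy I have in mind.  It is an illustration only, not an
existing interface: MountRequest, vet(), the allow-lists and the
/srv/guest-images path are all hypothetical names, and nothing here
actually calls mount.  It only decides what the host would be willing
to execute on the guest's behalf.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical policy tables; a real host would load these from its own config.
ALLOWED_FSTYPES = {"proc", "sysfs", "tmpfs"}
ALLOWED_OPTIONS = {"ro", "nosuid", "nodev", "noexec"}
# Hypothetical host-prepared trees that may be bind-mounted in place of a
# direct device mount requested by the guest.
BIND_SOURCES = {"/dev/loop0": "/srv/guest-images/loop0"}


@dataclass(frozen=True)
class MountRequest:
    source: str
    target: str
    fstype: str
    options: frozenset
    bind: bool = False


def vet(req: MountRequest) -> Optional[MountRequest]:
    """Return the request the host is willing to execute, or None to refuse."""
    # Refuse relative targets and targets containing ".." components.
    if not req.target.startswith("/") or ".." in req.target.split("/"):
        return None
    # Drop any option the host does not recognise, then force nosuid and nodev.
    options = frozenset((req.options & ALLOWED_OPTIONS) | {"nosuid", "nodev"})
    # Virtual filesystems on the allow-list pass through with cleaned options.
    if req.fstype in ALLOWED_FSTYPES:
        return replace(req, options=options)
    # Redo a direct device mount as a bind mount of a host-prepared tree.
    if req.source in BIND_SOURCES:
        return replace(req, source=BIND_SOURCES[req.source], fstype="none",
                       options=options, bind=True)
    # Everything else is refused.
    return None
```

With this policy, a guest request to mount /dev/loop0 as ext4 comes back
as a bind mount of the host-prepared tree, a tmpfs request goes through
with nosuid and nodev forced on, and a request for any other device is
refused.  The host would then perform whatever vet() returns as real
root inside the guest's mount namespace.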

James

