linux-kernel - Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

Message-ID: <1462395979.14310.133.camel@HansenPartnership.com>
Date:	Wed, 04 May 2016 17:06:19 -0400
From:	James Bottomley <James.Bottomley@...senPartnership.com>
To:	Djalal Harouni <tixxdz@...il.com>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Chris Mason <clm@...com>, tytso@....edu,
	Serge Hallyn <serge.hallyn@...onical.com>,
	Josh Triplett <josh@...htriplett.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andy Lutomirski <luto@...nel.org>,
	Seth Forshee <seth.forshee@...onical.com>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-security-module@...r.kernel.org,
	Dongsu Park <dongsu@...ocode.com>,
	David Herrmann <dh.herrmann@...glemail.com>,
	Miklos Szeredi <mszeredi@...hat.com>,
	Alban Crequy <alban.crequy@...il.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> This is version 2 of the VFS:userns support portable root filesystems
> RFC. Changes since version 1:
> 
> * Update documentation and remove some ambiguity about the feature.
>   Based on Josh Triplett's comments.
> * Use a new email address to send the RFC :-)
> 
> 
> This RFC tries to explore how to support filesystem operations inside
> user namespaces using only VFS and a per mount namespace solution.
> This allows taking advantage of user namespace separation without
> introducing any change at the filesystem level. All this is handled
> with the virtual view of mount namespaces.
> 
> 
> 1) Presentation:
> ================
> 
> The main aim is to support portable root filesystems and allow 
> containers, virtual machines and other cases to use the same root 
> filesystem. For security reasons, filesystems can't be mounted 
> inside user namespaces, and mounting them outside will not solve the 
> problem since they will show up with the wrong UIDs/GIDs. Read and 
> write operations will also fail and so on.
> 
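To make "show up with the wrong UIDs/GIDs" concrete: a user namespace translates IDs through /proc/<pid>/uid_map, whose lines have the form "inside-ID outside-ID length", and a host ID that falls outside every mapped range is reported inside the namespace as the overflow UID (65534 by default). The following Python model covers only that lookup; parse_uid_map and host_to_ns are hypothetical helpers, and nothing here reads a real /proc file.

```python
from typing import List, Tuple

OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid

# One uid_map line: (first ID inside the namespace, first ID outside, length)
Extent = Tuple[int, int, int]


# Hypothetical helpers modelling uid_map semantics; not kernel code.
def parse_uid_map(text: str) -> List[Extent]:
    """Parse uid_map-style text: one 'inside outside length' triple per line."""
    extents = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            extents.append((inside, outside, length))
    return extents


def host_to_ns(host_id: int, extents: List[Extent]) -> int:
    """Return the ID that a process inside the namespace sees for a host ID."""
    for inside, outside, length in extents:
        if outside <= host_id < outside + length:
            return inside + (host_id - outside)
    return OVERFLOW_UID  # IDs with no mapping are reported as the overflow ID


if __name__ == "__main__":
    # Container X1 from the example: host 1000000..1065535 is 0..65535 inside.
    x1 = parse_uid_map("0 1000000 65536")
    # A root filesystem whose files are owned by plain uid 0 on disk:
    print(host_to_ns(0, x1))        # 65534: root-owned files look like "nobody"
    print(host_to_ns(1000000, x1))  # 0
```

This is why an unshifted image mounted on the host is unusable inside the container: its on-disk uid 0 is simply not part of X1's mapping.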
> The current userspace solution is to automatically chown the whole 
> root filesystem before starting a container, example:
> (host) init_user_ns  1000000:1065536  => (container) user_ns_X1 0:65535
> (host) init_user_ns  2000000:2065536  => (container) user_ns_Y1 0:65535
> (host) init_user_ns  3000000:3065536  => (container) user_ns_Z1 0:65535
> ...
> 
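The "chown the whole root filesystem" step quoted above can be sketched as a tree walk that adds a fixed host offset to every owner. This is a hypothetical Python helper, not any container runtime's actual code, and it assumes a single contiguous range such as 0..65535 onto 1000000..1065535, as in the X1 example.

```python
import os
from typing import List, Tuple


# Hypothetical helpers illustrating the userspace chown approach.
def shift_id(container_id: int, host_base: int, count: int = 65536) -> int:
    """Map a container ID in 0..count-1 onto the host range starting at host_base."""
    if not 0 <= container_id < count:
        raise ValueError(f"ID {container_id} is outside the mapped range")
    return host_base + container_id


def shift_tree(root: str, host_base: int, count: int = 65536,
               dry_run: bool = True) -> List[Tuple[str, int, int]]:
    """Shift the owner of every entry under root by host_base.

    Returns (path, new_uid, new_gid) for each entry. With dry_run=False the new
    ownership is written to disk, which ties the tree to this one mapping.
    """
    paths = [root]
    # os.walk does not follow symlinks by default; symlinks to directories
    # appear in dirnames, so both lists are collected.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changes = []
    for path in paths:
        st = os.lstat(path)  # lstat/lchown act on symlinks themselves
        new_uid = shift_id(st.st_uid, host_base, count)
        new_gid = shift_id(st.st_gid, host_base, count)
        if not dry_run:
            os.lchown(path, new_uid, new_gid)  # needs privilege on the host
        changes.append((path, new_uid, new_gid))
    return changes
```

Moving the same image from X1 to Y1 would mean running the walk again with host_base=2000000 and rewriting every inode, which is the cost the RFC wants to avoid.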
> Every time a chown is called, files are changed and so on... This
> prevents having portable filesystems that you can throw anywhere
> and boot. Having an extra step to adapt the filesystem to the current
> mapping and persist it does not allow verifying its integrity, it
> makes snapshots and migration a bit harder, and probably has other
> limitations...
> 
> It seems that there are multiple ways to allow user namespaces to
> combine nicely with filesystems, but none of them is that easy. Bind
> mounting and pinning the user namespace at mount time will not work:
> bind mounts share the same super block, hence you may end up
> working on the wrong vfsmount context and there is no easy way to get
> out of that...

So this option was discussed at the recent LSF/MM summit.  The most
supported suggestion was that you'd use a new internal fs type that had
a struct mount with a new superblock and would copy the underlying
inodes but substitute its own with modified ->getattr/->setattr calls
that did the uid shift.  In many ways it would be a remapping bind
which would look similar to overlayfs but be a lot simpler.

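The remapping-bind suggestion above can be modelled in a few lines of Python: the upper inode keeps no ownership of its own and only shifts IDs in getattr/setattr, so one unmodified image can back several containers. Inode, ShiftedInode and host_base are hypothetical names for this sketch; an in-kernel version would be C implementing ->getattr/->setattr in struct inode_operations.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    """Stand-in for an inode of the underlying, unshifted filesystem."""
    uid: int
    gid: int


class ShiftedInode:
    """Hypothetical remapping-bind inode: stores no ownership of its own and
    shifts IDs in getattr/setattr on top of the wrapped lower inode."""

    def __init__(self, lower: Inode, host_base: int, count: int = 65536):
        self.lower = lower
        self.host_base = host_base
        self.count = count

    def _up(self, disk_id: int) -> int:
        if not 0 <= disk_id < self.count:
            raise ValueError(f"on-disk ID {disk_id} is not covered by the shift")
        return self.host_base + disk_id

    def _down(self, host_id: int) -> int:
        disk_id = host_id - self.host_base
        if not 0 <= disk_id < self.count:
            raise ValueError(f"host ID {host_id} is outside this mount's range")
        return disk_id

    def getattr(self) -> tuple:
        """Report the lower owner shifted up into the host ID range."""
        return self._up(self.lower.uid), self._up(self.lower.gid)

    def setattr(self, uid: int, gid: int) -> None:
        """Shift a requested owner back down before storing it on the lower inode."""
        self.lower.uid = self._down(uid)
        self.lower.gid = self._down(gid)


if __name__ == "__main__":
    disk = Inode(uid=0, gid=0)           # portable image: plain root ownership
    x1 = ShiftedInode(disk, 1000000)     # view for container X1
    y1 = ShiftedInode(disk, 2000000)     # view for container Y1
    print(x1.getattr(), y1.getattr())    # (1000000, 1000000) (2000000, 2000000)
    x1.setattr(1000033, 1000033)         # X1 chowns a file to its uid 33
    print(disk.uid, y1.getattr())        # 33 (2000033, 2000033)
```

Because only the lower inode is persisted, the on-disk image stays in the plain 0..65535 range regardless of which container last wrote to it.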
> Using the user namespace in the super block seems the way to go, and
> there is the "Support fuse mounts in user namespaces" [1] patches 
> which seem nice but perhaps too complex!?

So I don't think that does what you want.  The fuse project I've used
before to do uid/gid shifts for build containers is bindfs:

https://github.com/mpartel/bindfs/

It allows a --map argument where you specify pairs of uids/gids to map
(tedious for large ranges, but the map can be fixed to use uid:range
instead of individual pairs).

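To see why pair-by-pair mapping is tedious for container ranges, compare a map that lists every ID with one that stores only (start, target, count) extents. This is a generic Python sketch of that data-structure difference; it does not reproduce bindfs's --map syntax or code, and the uid:range form is the extension suggested above, not an existing option.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


# Hypothetical helpers; not bindfs code.
def pairs_for_range(from_start: int, to_start: int, count: int) -> Dict[int, int]:
    """Expand a range into one explicit from->to pair per ID."""
    return {from_start + i: to_start + i for i in range(count)}


class RangeMap:
    """Keep (from_start, to_start, count) extents instead of one entry per ID.

    Assumes the extents do not overlap.
    """

    def __init__(self, extents: List[Tuple[int, int, int]]):
        self.extents = sorted(extents)
        self.starts = [start for start, _, _ in self.extents]

    def lookup(self, from_id: int) -> Optional[int]:
        i = bisect_right(self.starts, from_id) - 1  # last extent starting <= from_id
        if i >= 0:
            start, to_start, count = self.extents[i]
            if from_id < start + count:
                return to_start + (from_id - start)
        return None  # not mapped
```

A container-sized range is a single extent in RangeMap but 65,536 entries when spelled out pair by pair.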
> There is also the overlayfs solution, and finally the VFS layer
> solution.
> 
> We present here a simple VFS solution, everything is packed inside 
> VFS, filesystems don't need to know anything (except probably XFS, 
> and special operations inside union filesystems). Currently it 
> supports ext4, btrfs and overlayfs. Changes to filesystems are 
> small, just parse the vfs_shift_uids and vfs_shift_gids options 
> during mount and set the appropriate flags into the super_block
> structure.
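The per-filesystem change described in the quote amounts to recognising two extra mount options and recording them on the superblock. A Python model of that parsing step follows; the ShiftFlags names and values are hypothetical, and in the kernel this would be C inside each filesystem's own option parser.

```python
from enum import Flag
from typing import Tuple


class ShiftFlags(Flag):
    """Hypothetical superblock flags; the RFC's real flag names may differ."""
    NONE = 0
    SHIFT_UIDS = 1
    SHIFT_GIDS = 2


_OPTION_FLAGS = {
    "vfs_shift_uids": ShiftFlags.SHIFT_UIDS,
    "vfs_shift_gids": ShiftFlags.SHIFT_GIDS,
}


def parse_shift_options(options: str) -> Tuple[ShiftFlags, str]:
    """Split a comma-separated mount option string into shift flags and the
    options the filesystem itself still has to handle."""
    flags = ShiftFlags.NONE
    rest = []
    for opt in filter(None, options.split(",")):
        if opt in _OPTION_FLAGS:
            flags |= _OPTION_FLAGS[opt]
        else:
            rest.append(opt)
    return flags, ",".join(rest)
```

For example, "rw,vfs_shift_uids,noatime,vfs_shift_gids" yields both flags plus the remaining "rw,noatime".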

So this looks a little daunting.  It sprays the VFS with knowledge
about the shifts and requires support from every underlying filesystem.
A simple remapping bind filesystem would be a lot simpler and require
no underlying filesystem support.

James