linux-kernel - Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqzlej60z57.fsf@ackerleytng-cloudtop.c.googlers.com>
Date:   Wed, 05 Apr 2023 21:58:44 +0000
From:   Ackerley Tng <ackerleytng@...gle.com>
To:     Christian Brauner <brauner@...nel.org>
Cc:     kvm@...r.kernel.org, linux-api@...r.kernel.org,
        linux-arch@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, qemu-devel@...gnu.org, aarcange@...hat.com,
        ak@...ux.intel.com, akpm@...ux-foundation.org, arnd@...db.de,
        bfields@...ldses.org, bp@...en8.de, chao.p.peng@...ux.intel.com,
        corbet@....net, dave.hansen@...el.com, david@...hat.com,
        ddutile@...hat.com, dhildenb@...hat.com, hpa@...or.com,
        hughd@...gle.com, jlayton@...nel.org, jmattson@...gle.com,
        joro@...tes.org, jun.nakajima@...el.com,
        kirill.shutemov@...ux.intel.com, linmiaohe@...wei.com,
        luto@...nel.org, mail@...iej.szmigiero.name, mhocko@...e.com,
        michael.roth@....com, mingo@...hat.com, naoya.horiguchi@....com,
        pbonzini@...hat.com, qperret@...gle.com, rppt@...nel.org,
        seanjc@...gle.com, shuah@...nel.org, steven.price@....com,
        tabba@...gle.com, tglx@...utronix.de, vannapurve@...gle.com,
        vbabka@...e.cz, vkuznets@...hat.com, wanpengli@...cent.com,
        wei.w.wang@...el.com, x86@...nel.org, yu.c.zhang@...ux.intel.com
Subject: Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify
 mount for memfd_restricted


Thanks again for your review!

Christian Brauner <brauner@...nel.org> writes:
> On Tue, Apr 04, 2023 at 03:53:13PM +0200, Christian Brauner wrote:
>> On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
>> >
>> > ...
>> >
>> > -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>> > +static int restrictedmem_create(struct vfsmount *mount)
>> >  {
>> >  	struct file *file, *restricted_file;
>> >  	int fd, err;
>> >
>> > -	if (flags)
>> > -		return -EINVAL;
>> > -
>> >  	fd = get_unused_fd_flags(0);

>> Any reasons the file descriptors aren't O_CLOEXEC by default? I don't
>> see any reasons why we should introduce new fdtypes that aren't
>> O_CLOEXEC by default. The "don't mix-and-match" train has already left
>> the station anyway as we do have seccomp noitifer fds and pidfds both of
>> which are O_CLOEXEC by default.


Thanks for pointing this out. I agree with using O_CLOEXEC, but didn’t
notice this before. Let us discuss this under the original series at
[1].

>> >  	if (fd < 0)
>> >  		return fd;
>> >
>> > -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
>> > +	if (mount)
>> > +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0,  
>> VM_NORESERVE);
>> > +	else
>> > +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
>> > +
>> >  	if (IS_ERR(file)) {
>> >  		err = PTR_ERR(file);
>> >  		goto err_fd;
>> > @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int,  
>> flags)
>> >  	return err;
>> >  }
>> >
>> > +static bool is_shmem_mount(struct vfsmount *mnt)
>> > +{
>> > +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;

>> This can just be if (mnt->mnt_sb->s_magic == TMPFS_MAGIC).


Will simplify this in the next revision.

>> > +}
>> > +
>> > +static bool is_mount_root(struct file *file)
>> > +{
>> > +	return file->f_path.dentry == file->f_path.mnt->mnt_root;

>> mount -t tmpfs tmpfs /mnt
>> touch /mnt/bla
>> touch /mnt/ble
>> mount --bind /mnt/bla /mnt/ble
>> fd = open("/mnt/ble")
>> fd_restricted = memfd_restricted(fd)

>> IOW, this doesn't restrict it to the tmpfs root. It only restricts it to
>> paths that refer to the root of any tmpfs mount. To exclude bind-mounts
>> that aren't bind-mounts of the whole filesystem you want:

>> path->dentry == path->mnt->mnt_root &&
>> path->mnt->mnt_root == path->mnt->mnt_sb->s_root


Will adopt this in the next revision and add a selftest to check
this. Thanks for pointing this out!

>> > +}
>> > +
>> > +static int restrictedmem_create_on_user_mount(int mount_fd)
>> > +{
>> > +	int ret;
>> > +	struct fd f;
>> > +	struct vfsmount *mnt;
>> > +
>> > +	f = fdget_raw(mount_fd);
>> > +	if (!f.file)
>> > +		return -EBADF;
>> > +
>> > +	ret = -EINVAL;
>> > +	if (!is_mount_root(f.file))
>> > +		goto out;
>> > +
>> > +	mnt = f.file->f_path.mnt;
>> > +	if (!is_shmem_mount(mnt))
>> > +		goto out;
>> > +
>> > +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

>> With the current semantics you're asking whether you have write
>> permissions on the /mnt/ble file in order to get answer to the question
>> whether you're allowed to create an unlinked restricted memory file.
>> That doesn't make much sense afaict.


That's true. Since mnt_want_write() already checks for write permissions
and this syscall creates an unlinked file on the mount, we don't have to
check permissions on the file then. Will remove this in the next
revision!

>> > +	if (ret)
>> > +		goto out;
>> > +
>> > +	ret = mnt_want_write(mnt);
>> > +	if (unlikely(ret))
>> > +		goto out;
>> > +
>> > +	ret = restrictedmem_create(mnt);
>> > +
>> > +	mnt_drop_write(mnt);
>> > +out:
>> > +	fdput(f);
>> > +
>> > +	return ret;
>> > +}
>> > +
>> > +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
>> > +{
>> > +	if (flags & ~RMFD_USERMNT)
>> > +		return -EINVAL;
>> > +
>> > +	if (flags == RMFD_USERMNT) {

>> Why do you even need this flag? It seems that @mount_fd being < 0 is
>> sufficient to indicate that a new restricted memory fd is supposed to be
>> created in the system instance.


I'm hoping to have this patch series merged after Chao's patch series
introduces the memfd_restricted() syscall [1].

This flag is necessary to indicate the validity of the second argument.

With this flag, we can definitively return an error if the fd is
invalid, which I think is a better experience for the userspace
programmer than if we just silently default to the kernel mount when the
fd provided is invalid.

>> > +		if (mount_fd < 0)
>> > +			return -EINVAL;
>> > +
>> > +		return restrictedmem_create_on_user_mount(mount_fd);
>> > +	} else {
>> > +		return restrictedmem_create(NULL);
>> > +	}
>> > +}

>> I have to say that I'm very confused by all of this the more I look at  
>> it.

>> Effectively memfd restricted functions as a wrapper filesystem around
>> the tmpfs filesystem. This is basically a weird overlay filesystem.
>> You're allocating tmpfs files that you stash in restrictedmem files.
>> I have to say that this seems very hacky. I didn't get this at all at
>> first.

>> So what does the caller get if they call statx() on a restricted memfd?
>> Do they get the device number of the tmpfs mount and the inode numbers
>> of the tmpfs mount? Because it looks like they would:

>> static int restrictedmem_getattr(struct user_namespace *mnt_userns,
>> 				 const struct path *path, struct kstat *stat,
>> 				 u32 request_mask, unsigned int query_flags)
>> {
>> 	struct inode *inode = d_inode(path->dentry);
>> 	struct restrictedmem *rm = inode->i_mapping->private_data;
>> 	struct file *memfd = rm->memfd;

>> 	return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,

> This is pretty broken btw, because @path refers to a restrictedmem path
> which you're passing to a tmpfs iop...

> I see that in

> 	return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
> 					     request_mask, query_flags);

> this if fixed but still, this is... not great.


Thanks, this will be fixed in the next revision by rebasing on Chao's
latest code.

>> 					     request_mask, query_flags);

>> That @memfd would be a struct file allocated in a tmpfs instance, no? So
>> you'd be calling the inode operation of the tmpfs file meaning that
>> struct kstat will be filled up with the info from the tmpfs instance.

>> But then if I call statfs() and check the fstype I would get
>> RESTRICTEDMEM_MAGIC, no? This is... unorthodox?

>> I'm honestly puzzled and this sounds really strange. There must be a
>> better way to implement all of this.

>> Shouldn't you try and make this a part of tmpfs proper? Make a really
>> separate filesystem and add a memfs library that both tmpfs and
>> restrictedmemfs can use? Add a mount option to tmpfs that makes it a
>> restricted tmpfs?

This was discussed earlier in the patch series introducing
memfd_restricted and this approach was taken to better manage ownership
of required functionalities between two subsystems. Please see
discussion beginning [2]

[1] ->  
https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/T/.
[2] ->  
https://lore.kernel.org/lkml/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com/