linux-kernel - Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170731164625.GB4477@magnolia>
Date:   Mon, 31 Jul 2017 09:46:25 -0700
From:   "Darrick J. Wong" <darrick.wong@...cle.com>
To:     Dan Williams <dan.j.williams@...el.com>
Cc:     Jan Kara <jack@...e.cz>, linux-nvdimm@...ts.01.org,
        Dave Chinner <david@...morbit.com>,
        linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
        Jeff Moyer <jmoyer@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>, luto@...nel.org,
        linux-fsdevel@...r.kernel.org,
        Ross Zwisler <ross.zwisler@...ux.intel.com>,
        Christoph Hellwig <hch@....de>
Subject: Re: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.
> 
> The additional check that is added for this flag, beyond what we are
> already doing for swapfiles, is to truncate or abort writes that try to
> extend the file size.
> 
> Cc: Jan Kara <jack@...e.cz>
> Cc: Jeff Moyer <jmoyer@...hat.com>
> Cc: Christoph Hellwig <hch@....de>
> Cc: Ross Zwisler <ross.zwisler@...ux.intel.com>
> Cc: Alexander Viro <viro@...iv.linux.org.uk>
> Suggested-by: Dave Chinner <david@...morbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@...cle.com>
> Signed-off-by: Dan Williams <dan.j.williams@...el.com>
> ---
>  fs/attr.c          |   10 ++++++++++
>  fs/namei.c         |    3 ++-
>  fs/open.c          |    6 ++++++
>  fs/read_write.c    |    3 +++
>  fs/xfs/xfs_ioctl.c |    6 ++++++
>  include/linux/fs.h |    2 ++
>  mm/filemap.c       |    9 +++++++++
>  7 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 135304146120..8573e364bd06 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>   */
>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>  {
> +	if (IS_IOMAP_IMMUTABLE(inode)) {
> +		/*
> +		 * Any size change is disallowed. Size increases may
> +		 * dirty metadata that an application is not prepared to
> +		 * sync, and a size decrease may expose free blocks to
> +		 * in-flight DMA.
> +		 */
> +		return -ETXTBSY;
> +	}
> +
>  	if (inode->i_size < offset) {
>  		unsigned long limit;
>  
> diff --git a/fs/namei.c b/fs/namei.c
> index ddb6a7c2b3d4..588f1135c42c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
>  		return -EPERM;
>  
>  	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
> -	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
> +	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)
> +	    || IS_IOMAP_IMMUTABLE(inode))

This caught my eye because why wouldn't we be able to unlink a file
whose block map is immutable?  Link count has "nothing" to do with that.

:)

Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd,
we're not allowed to touch the link count since (we assume) nobody's
who's interested in *inode is going to call fsync on it to flush the
metadata to disk?

If that's the case, then shouldn't there be a corresponding may_create
check to prevent us from linking a file into a directory?

(I don't really see why link/unlink shouldn't be allowed...)

>  		return -EPERM;
>  	if (isdir) {
>  		if (!d_is_dir(victim))
> diff --git a/fs/open.c b/fs/open.c
> index 35bb784763a4..7395860d7164 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -ETXTBSY;
>  
>  	/*
> +	 * We cannot allow any allocation changes on an iomap immutable file

If it's a (regular allocation) or (unshare blocks) fallocate call, we
could return zero immediately since no writes will be able to hit
ENOSPC because we know that the block mappings are there and no CoW is
required.

> +	 */
> +	if (IS_IOMAP_IMMUTABLE(inode))
> +		return -ETXTBSY;

I've been wondering in the back of my head if we could allow a
size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end?
The use case I had in mind was extending files without having to turn
off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the
memory and then increase i_size, so userspace should never be able to
access storage that isn't totally finalized.

OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so
maybe everyone would prefer to shoot down this use case now?

> +
> +	/*
>  	 * Revalidate the write permissions, in case security policy has
>  	 * changed since the files were opened.
>  	 */
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 0cc7033aa413..dc673be7c7cb 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>  	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>  		return -ETXTBSY;
>  
> +	if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
> +		return -ETXTBSY;
> +
>  	/* Don't reflink dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index e75c40a47b7d..2e64488bc4de 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
>  		goto out_put_tmp_file;
>  	}
>  
> +	if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
> +	    IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
> +		error = -EINVAL;
> +		goto out_put_tmp_file;
> +	}

Someone's going to have to audit more of the XFS ioctls to make sure
that none of them can touch the block mapping, but as this is an RFC it
doesn't need to be done right this instant.

> +
>  	/*
>  	 * We need to ensure that the fds passed in point to XFS inodes
>  	 * before we cast and access them as XFS structures as we have no
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..0a254b768855 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1829,6 +1829,7 @@ struct super_operations {
>  #else
>  #define S_DAX		0	/* Make all the DAX code disappear */
>  #endif
> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>  
>  /*
>   * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -1867,6 +1868,7 @@ struct super_operations {
>  #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
>  #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
>  #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>  
>  #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
>  				 (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..e4a6529da2bd 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(pos >= inode->i_sb->s_maxbytes))
>  		return -EFBIG;
>  
> +	/* Are we about to mutate the block map on an immutable file? */
> +	if (IS_IOMAP_IMMUTABLE(inode)
> +			&& (pos + iov_iter_count(from) > i_size_read(inode))) {
> +		if (pos < i_size_read(inode))
> +			iov_iter_truncate(from, i_size_read(inode) - pos);

Writes past the end get truncated to stop at EOF?  I'd have thought that
would be an error since userspace is asking for metadata updates it
previously said it would never need...

--D

> +		else
> +			return -ETXTBSY;
> +	}
> +
>  	iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
>  	return iov_iter_count(from);
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html