linux-kernel - Re: [PATCH 2/2] xfs: fix stale data exposure problem when punch hole, collapse range or zero range across a delalloc extent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200106215755.GB472651@magnolia>
Date:   Mon, 6 Jan 2020 13:57:55 -0800
From:   "Darrick J. Wong" <darrick.wong@...cle.com>
To:     yu kuai <yukuai3@...wei.com>
Cc:     bfoster@...hat.com, dchinner@...hat.com, sandeen@...deen.net,
        cmaiolino@...hat.com, hch@....de, linux-xfs@...r.kernel.org,
        linux-kernel@...r.kernel.org, zhengbin13@...wei.com,
        yi.zhang@...wei.com, houtao1@...wei.com
Subject: Re: [PATCH 2/2] xfs: fix stale data exposure problem when punch
 hole, collapse range or zero range across a delalloc extent

On Thu, Dec 26, 2019 at 09:47:21PM +0800, yu kuai wrote:
> In xfs_file_fallocate, when punch hole, zero range or collapse range is
> performed, xfs_fulsh_unmap_range() need to be called first. However,
> xfs_map_blocks will convert the whole extent to real, even if there are
> some blocks not related. Furthermore, the unrelated blocks will hold stale
> data since xfs_fulsh_unmap_range didn't flush the correspond dirty pages
> to disk.
> 
> In this case, if user shutdown file system through xfsioctl with cmd
> 'XFS_IOC_GOINGDOWN' and arg 'XFS_FSOP_GOING_FLAGS_LOGFLUSH'. All the
> completed transactions will be flushed to disk, while dirty pages will
> never be flushed to disk. And after remount, the file will hold stale
> data.

Waitaminute, what problem are you trying to solve?

You have a file with a huge delalloc extent because we just wrote a
bunch of 'X' characters to part of a file:

---dddddddddddddddd

Then you want to fallocate or something in the middle of that:

---dddddddddddddddd
           ^^^^------ collapse range these blocks

So we xfs_flush_unmap_range to kill the pagecache on that range:

---dddddddddddddddd
           ^^^^------ xfs_flush_unmap_range()

This triggers writeback, which can convert the entire delalloc range to
a single extent:

---rrrrrrrrrrrrrrrr
           ^^^^^^^^-- This is the range we are writing back
   ^^^^^^^^---------- This range doesn't undergo writeback, but we wrote
                      the extent tree anyway

After committing that update to the log, the fs goes down, which leaves
us with the following after we reboot, mount, and recover the fs:

---rrrrrrrrrrrrrrrr
           ^^^^^^^^-- This part contains 'X'
   ^^^^^^^^---------- This range never underwent writeback, so it's full
		      of junk from the previous owner of the space

So your solution is to split the delalloc reservation to constrain the
allocation to the range that's being operated on?

If so, I think a better solution (at least from the perspective of
reducing fragmentation) would be to map the extent unwritten and force a
post-writeback conversion[1] but I got shot down for performance reasons
the last time I suggested that.

--D

[1] https://lore.kernel.org/linux-xfs/155259894630.30230.10064390935593758177.stgit@magnolia/

> Fix the problem by spliting delalloc extent before xfs_flush_unmap_range
> is called.
> 
> Signed-off-by: yu kuai <yukuai3@...wei.com>
> ---
>  fs/xfs/xfs_file.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c93250108952..5398102feec9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -786,6 +786,50 @@ xfs_break_layouts(
>  
>  	return error;
>  }
> +int
> +try_split_da_extent(
> +	struct xfs_inode	*ip,
> +	loff_t			offset,
> +	loff_t			len)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fileoff_t		start = XFS_B_TO_FSBT(mp, offset);
> +	xfs_fileoff_t		end = XFS_B_TO_FSBT(mp, offset + len - 1);
> +	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> +	struct xfs_iext_cursor	cur;
> +	struct xfs_bmbt_irec	imap;
> +	int error;
> +
> +	/*
> +	 * if start belong to a delalloc extent and it's not the first block,
> +	 * split the extent at start.
> +	 */
> +	if (xfs_iext_lookup_extent(ip, ifp, start, &cur, &imap) &&
> +	    imap.br_startblock != HOLESTARTBLOCK &&
> +	    isnullstartblock(imap.br_startblock) &&
> +	    start > imap.br_startoff) {
> +		error = xfs_bmap_split_da_extent(ip, start);
> +		if (error)
> +			return error;
> +		ip->i_d.di_nextents--;
> +	}
> +
> +	/*
> +	 * if end + 1 belong to a delalloc extent and it's not the first block,
> +	 * split the extent at end + 1.
> +	 */
> +	if (xfs_iext_lookup_extent(ip, ifp, end + 1, &cur, &imap) &&
> +	    imap.br_startblock != HOLESTARTBLOCK &&
> +	    isnullstartblock(imap.br_startblock) &&
> +	    end + 1 > imap.br_startoff) {
> +		error = xfs_bmap_split_da_extent(ip, end + 1);
> +		if (error)
> +			return error;
> +		ip->i_d.di_nextents--;
> +	}
> +
> +	return 0;
> +}
>  
>  #define	XFS_FALLOC_FL_SUPPORTED						\
>  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
> @@ -842,6 +886,9 @@ xfs_file_fallocate(
>  	 */
>  	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE |
>  		    FALLOC_FL_COLLAPSE_RANGE)) {
> +		error = try_split_da_extent(ip, offset, len);
> +		if (error)
> +			goto out_unlock;
>  		error = xfs_flush_unmap_range(ip, offset, len);
>  		if (error)
>  			goto out_unlock;
> -- 
> 2.17.2
>