linux-kernel - Re: [PATCH v3 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <87fs3a7f39.fsf@doe.com>
Date:   Tue, 19 Sep 2023 13:46:26 +0530
From:   Ritesh Harjani (IBM) <ritesh.list@...il.com>
To:     Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-ext4@...r.kernel.org,
        Theodore Ts'o <tytso@....edu>
Cc:     linux-kernel@...r.kernel.org, Jan Kara <jack@...e.cz>
Subject: Re: [PATCH v3 1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure

Ojaswin Mujoo <ojaswin@...ux.ibm.com> writes:

> ** Short Version **
>
> In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> never zero out the range of bh that is not under write, causing whatever stale
> data is present in the folio at that time to be written out to disk. To fix this
> mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten().
>
> ** Long Version **
>
> The issue mentioned above was resulting in two different bugs:
>
> 1. On block size < page size case in ext4, generic/269 was reliably
> failing with dioread_nolock. The state of the write was as follows:
>
>   * The write was extending i_size.
>   * The last block of the file was fallocated and had an unwritten extent
>   * We were near ENOSPC and hence we were switching to non-delayed alloc
>     allocation.
>
> In this case, the back trace that triggers the bug is as follows:
>
>   ext4_da_write_begin()
>     /* switch to nodelalloc due to low space */
>     ext4_write_begin()
>       ext4_should_dioread_nolock() // true since mount flags still have delalloc
>       __block_write_begin(..., ext4_get_block_unwritten)
>         __block_write_begin_int()
>           for(each buffer head in page) {
>             /* first iteration, this is bh1 which contains i_size */
>             if (!buffer_mapped)
>               get_block() /* returns bh with only UNWRITTEN and MAPPED */
>             /* second iteration, bh2 */
>               if (!buffer_mapped)
>                 get_block() /* we fail here, could be ENOSPC */
>           }
>           if (err)
>             /*
>              * this would zero out all new buffers and mark them uptodate.
>              * Since bh1 was never marked new, we skip it here which causes
>              * the bug later.
>              */
>             folio_zero_new_buffers();
>       /* ext4_wrte_begin() error handling */
>       ext4_truncate_failed_write()
>         ext4_truncate()
>           ext4_block_truncate_page()
>             __ext4_block_zero_page_range()
>               if(!buffer_uptodate())
>                 ext4_read_bh_lock()
>                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
>                     BUG_ON(buffer_unwritten(bh)); /* !!! */
>
> 2. The second issue is stale data exposure with page size >= blocksize
> with dioread_nolock. The conditions needed for it to happen are same as
> the previous issue ie dioread_nolock around ENOSPC condition. The issue
> is also similar where in __block_write_begin_int() when we call
> ext4_get_block_unwritten() on the buffer_head and the underlying extent
> is unwritten, we get an unwritten and mapped buffer head. Since it is
> not new, we never zero out the partial range which is not under write,
> thus writing stale data to disk. This can be easily observed with the
> following reproducer:
>
>  fallocate -l 4k testfile
>  xfs_io -c "pwrite 2k 2k" testfile
>  # hexdump output will have stale data in from byte 0 to 2k in testfile
>  hexdump -C testfile
>
> NOTE: To trigger this, we need dioread_nolock enabled and write happening via
> ext4_write_begin(), which is usually used when we have -o nodealloc. Since
> dioread_nolock is disabled with nodelalloc, the only alternate way to call
> ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie
> ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when
> ext4 is almost full like the way generic/269 was triggering it in Issue 1 above.
> This might make the issue harder to hit. Hence, for reliable replication, I used
> the below patch to temporarily allow dioread_nolock with nodelalloc and then
> mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale
> data issue 100% of times:
>
> @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>   if (ext4_should_journal_data(inode))
>     return 0;
>   /* temporary fix to prevent generic/422 test failures */
> - if (!test_opt(inode->i_sb, DELALLOC))
> -   return 0;
> + // if (!test_opt(inode->i_sb, DELALLOC))
> + //  return 0;
>   return 1;
>  }
>
> After applying this patch to mark buffer as NEW, both the above issues are
> fixed.
>
> Signed-off-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
> Reviewed-by: Jan Kara <jack@...e.cz>
> ---
>  fs/ext4/inode.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c490f05e2ba..8b286a800193 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -789,10 +789,22 @@ int ext4_get_block(struct inode *inode, sector_t iblock,
>  int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
>  			     struct buffer_head *bh_result, int create)
>  {
> +	int ret = 0;
> +
>  	ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
>  		   inode->i_ino, create);
> -	return _ext4_get_block(inode, iblock, bh_result,
> +	ret = _ext4_get_block(inode, iblock, bh_result,
>  			       EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> +
> +	/*
> +	 * If the buffer is marked unwritten, mark it as new to make sure it is
> +	 * zeroed out correctly in case of partial writes. Otherwise, there is
> +	 * a chance of stale data getting exposed.
> +	 */
> +	if (ret == 0 && buffer_unwritten(bh_result))
> +		set_buffer_new(bh_result);
> +
> +	return ret;
>  }

Doing above was my first thought too when we were discussing
the issue internally. But I am glad the discussions happened with Jan to
understand the right reason behing MAP_NEW flag. 

While at it we also noticed that the ext4_map_blocks() returns different
map->m_flags set for the same extent type when passed with different flags
in it's argument :). So it quickly become confusing. Thanks for sorting
out the unknowns.

This patch looks good to me too! Feel free to add - 

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>