linux-kernel - Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents mapping loop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <m5drn6xauyaksmui7b3vpua24ttgmjnwsi3sgavpelxlcwivsw@6bpmobqvpw7f>
Date: Thu, 19 Jun 2025 18:21:59 +0200
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz, 
	ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com, yukuai3@...wei.com, 
	yangerkun@...wei.com
Subject: Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the
 extents mapping loop

On Wed 11-06-25 19:16:21, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@...wei.com>
> 
> During the process of writing back folios, if
> mpage_map_and_submit_extent() exits the extent mapping loop due to an
> ENOSPC or ENOMEM error, it may result in stale data or filesystem
> inconsistency in environments where the block size is smaller than the
> folio size.
> 
> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
> some buffers may have already be mapped. If we exit the mapping loop
> prematurely, the folio data within the mapped range will not be written
> back, and the file's disk size will not be updated. Once the transaction
> that includes this range of extents is committed, this can lead to stale
> data or filesystem inconsistency.
> 
> Fix this by submitting the current processing partial mapped folio and
> update the disk size to the end of the mapped range.
> 
> Suggested-by: Jan Kara <jack@...e.cz>
> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> ---
>  fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3a086fee7989..d0db6e3bf158 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>  	return 0;
>  }
>  
> +/*
> + * This is used to submit mapped buffers in a single folio that is not fully
> + * mapped for various reasons, such as insufficient space or journal credits.
> + */
> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
> +{
> +	struct inode *inode = mpd->inode;
> +	struct folio *folio;
> +	int ret;
> +
> +	folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
> +	if (IS_ERR(folio))
> +		return PTR_ERR(folio);
> +
> +	ret = mpage_submit_folio(mpd, folio);
> +	if (ret)
> +		goto out;
> +	/*
> +	 * Update first_page to prevent this folio from being released in
> +	 * mpage_release_unused_pages(), it should not equal to the folio
> +	 * index.
> +	 *
> +	 * The first_page will be reset to the aligned folio index when this
> +	 * folio is written again in the next round. Additionally, do not
> +	 * update wbc->nr_to_write here, as it will be updated once the
> +	 * entire folio has finished processing.
> +	 */
> +	mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;

Well, but there can be many folios between mpd->first_page and pos. And
this way you avoid cleaning them up (unlocking them and dropping elevated
refcount) before we restart next loop. How is this going to work?

Also I don't see in this patch where mpd->first_page would get set back to
retry writing this folio. What am I missing?

> +	WARN_ON_ONCE((folio->index == mpd->first_page) ||
> +		     !folio_contains(folio, pos >> PAGE_SHIFT));
> +out:
> +	folio_unlock(folio);
> +	folio_put(folio);
> +	return ret;
> +}
> +
>  /*
>   * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
>   *				 mpd->len and submit pages underlying it for IO
> @@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			 */
>  			if ((err == -ENOMEM) ||
>  			    (err == -ENOSPC && ext4_count_free_clusters(sb))) {
> -				if (progress)
> +				/*
> +				 * We may have already allocated extents for
> +				 * some bhs inside the folio, issue the
> +				 * corresponding data to prevent stale data.
> +				 */
> +				if (progress) {
> +					if (mpage_submit_buffers(mpd, disksize))
> +						goto invalidate_dirty_pages;
>  					goto update_disksize;
> +				}
>  				return err;
>  			}
>  			ext4_msg(sb, KERN_CRIT,
> @@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			*give_up_on_write = true;
>  			return err;
>  		}
> +		disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
> +				inode->i_blkbits;

I don't think setting disksize like this is correct in case
mpage_map_and_submit_buffers() below fails (when extent covers many folios
and we don't succeed in writing them all). In that case we may need to keep
disksize somewhere in the middle of the extent.

Overall I don't think we need to modify disksize handling here. It is fine
to leave (part of) the extent dangling beyond disksize until we retry the
writeback in these rare cases.

>  		progress = 1;
>  		/*
>  		 * Update buffer state, submit mapped pages, and get us new
> @@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			goto update_disksize;
>  	} while (map->m_len);
>  
> +	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>  update_disksize:
>  	/*
>  	 * Update on-disk size after IO is submitted.  Races with
>  	 * truncate are avoided by checking i_size under i_data_sem.
>  	 */
> -	disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>  	if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>  		int err2;
>  		loff_t i_size;

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR