Message-ID: <14966764-5bbc-48a9-9d56-841255cfe3c6@huaweicloud.com>
Date: Fri, 20 Jun 2025 12:57:18 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com,
yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents
mapping loop
On 2025/6/20 0:21, Jan Kara wrote:
> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> During the process of writing back folios, if
>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>> inconsistency in environments where the block size is smaller than the
>> folio size.
>>
>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>> some buffers may have already been mapped. If we exit the mapping loop
>> prematurely, the folio data within the mapped range will not be written
>> back, and the file's disk size will not be updated. Once the transaction
>> that includes this range of extents is committed, this can lead to stale
>> data or filesystem inconsistency.
>>
>> Fix this by submitting the partially mapped folio currently being
>> processed and updating the disk size to the end of the mapped range.
>>
>> Suggested-by: Jan Kara <jack@...e.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>> ---
>> fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3a086fee7989..d0db6e3bf158 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>> return 0;
>> }
>>
>> +/*
>> + * This is used to submit mapped buffers in a single folio that is not fully
>> + * mapped for various reasons, such as insufficient space or journal credits.
>> + */
>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>> +{
>> + struct inode *inode = mpd->inode;
>> + struct folio *folio;
>> + int ret;
>> +
>> + folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>> + if (IS_ERR(folio))
>> + return PTR_ERR(folio);
>> +
>> + ret = mpage_submit_folio(mpd, folio);
>> + if (ret)
>> + goto out;
>> + /*
>> + * Update first_page to prevent this folio from being released in
>> + * mpage_release_unused_pages(); it should not equal the folio
>> + * index.
>> + *
>> + * The first_page will be reset to the aligned folio index when this
>> + * folio is written again in the next round. Additionally, do not
>> + * update wbc->nr_to_write here, as it will be updated once the
>> + * entire folio has finished processing.
>> + */
>> + mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
>
> Well, but there can be many folios between mpd->first_page and pos. And
> this way you avoid cleaning them up (unlocking them and dropping elevated
> refcount) before we restart next loop. How is this going to work?
>
Hmm, I don't think there can be many folios between mpd->first_page and
pos. All of the fully mapped folios should have been unlocked by
mpage_folio_done(), and there is no elevated refcount since we always
call folio_batch_release() once we finish processing the folios.
mpage_release_unused_pages() is only used to clean up unsubmitted folios.
For example, suppose we have a filesystem with a 4KB block size, and
mpage_prepare_extent_to_map() finds 4 order-2 folios that need to be mapped:
first_page               next_page
|                           |
[HHHH][HHHH][HHHH][HHHH]     H: hole   L: locked
 LLLL  LLLL  LLLL  LLLL
In the first round of mpage_map_and_submit_extent(), we map the first
two folios along with the first two pages of the third folio;
mpage_map_and_submit_buffers() then submits and unlocks the first two
folios, while also updating mpd->first_page to the beginning of the
third folio.
       first_page   next_page
            |           |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  LLLL  LLLL      W: mapped U: unlocked
In the second round of mpage_map_and_submit_extent(), we fail to map
the remaining blocks and call mpage_submit_buffers() to submit and
unlock this partially mapped folio. Additionally, we advance
mpd->first_page.
          first_page  next_page
               |          |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  UUUU  LLLL      W: mapped U: unlocked
Then, we break out to ext4_do_writepages(), which calls
mpage_release_unused_pages() to unlock the last folio.
          first_page  next_page
               |          |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  UUUU  UUUU      W: mapped U: unlocked
In the next round of ext4_do_writepages(), mpage_prepare_extent_to_map()
restarts processing the third folio and resets mpd->first_page to its
beginning.
       first_page   next_page
            |           |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  LLLL  LLLL      W: mapped U: unlocked
> Also I don't see in this patch where mpd->first_page would get set back to
> retry writing this folio. What am I missing?
>
We already have this; please refer to the following section of
mpage_prepare_extent_to_map():
static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
pgoff_t index = mpd->first_page;
...
mpd->map.m_len = 0;
mpd->next_page = index;
...
while (index <= end) {
...
if (mpd->map.m_len == 0)
mpd->first_page = folio->index;
...
}
>> + WARN_ON_ONCE((folio->index == mpd->first_page) ||
>> + !folio_contains(folio, pos >> PAGE_SHIFT));
>> +out:
>> + folio_unlock(folio);
>> + folio_put(folio);
>> + return ret;
>> +}
>> +
>> /*
>> * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
>> * mpd->len and submit pages underlying it for IO
>> @@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> */
>> if ((err == -ENOMEM) ||
>> (err == -ENOSPC && ext4_count_free_clusters(sb))) {
>> - if (progress)
>> + /*
>> + * We may have already allocated extents for
>> + * some bhs inside the folio, issue the
>> + * corresponding data to prevent stale data.
>> + */
>> + if (progress) {
>> + if (mpage_submit_buffers(mpd, disksize))
>> + goto invalidate_dirty_pages;
>> goto update_disksize;
>> + }
>> return err;
>> }
>> ext4_msg(sb, KERN_CRIT,
>> @@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> *give_up_on_write = true;
>> return err;
>> }
>> + disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
>> + inode->i_blkbits;
>
> I don't think setting disksize like this is correct in case
> mpage_map_and_submit_buffers() below fails (when extent covers many folios
> and we don't succeed in writing them all). In that case we may need to keep
> disksize somewhere in the middle of the extent.
>
> Overall I don't think we need to modify disksize handling here. It is fine
> to leave (part of) the extent dangling beyond disksize until we retry the
> writeback in these rare cases.
OK, this is indeed a rare case. Let's keep it as it is.
Thanks,
Yi.
>
>> progress = 1;
>> /*
>> * Update buffer state, submit mapped pages, and get us new
>> @@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> goto update_disksize;
>> } while (map->m_len);
>>
>> + disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>> update_disksize:
>> /*
>> * Update on-disk size after IO is submitted. Races with
>> * truncate are avoided by checking i_size under i_data_sem.
>> */
>> - disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>> if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>> int err2;
>> loff_t i_size;
>
> Honza