Message-ID: <14966764-5bbc-48a9-9d56-841255cfe3c6@huaweicloud.com>
Date: Fri, 20 Jun 2025 12:57:18 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com,
yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 2/6] ext4: fix stale data if it bail out of the extents
mapping loop
On 2025/6/20 0:21, Jan Kara wrote:
> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> During the process of writing back folios, if
>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>> inconsistency in environments where the block size is smaller than the
>> folio size.
>>
>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>> some buffers may have already been mapped. If we exit the mapping loop
>> prematurely, the folio data within the mapped range will not be written
>> back, and the file's disk size will not be updated. Once the transaction
>> that includes this range of extents is committed, this can lead to stale
>> data or filesystem inconsistency.
>>
>> Fix this by submitting the partially mapped folio currently being
>> processed and updating the disk size to the end of the mapped range.
>>
>> Suggested-by: Jan Kara <jack@...e.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>> ---
>> fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3a086fee7989..d0db6e3bf158 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>> return 0;
>> }
>>
>> +/*
>> + * This is used to submit mapped buffers in a single folio that is not fully
>> + * mapped for various reasons, such as insufficient space or journal credits.
>> + */
>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>> +{
>> + struct inode *inode = mpd->inode;
>> + struct folio *folio;
>> + int ret;
>> +
>> + folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>> + if (IS_ERR(folio))
>> + return PTR_ERR(folio);
>> +
>> + ret = mpage_submit_folio(mpd, folio);
>> + if (ret)
>> + goto out;
>> + /*
>> + * Update first_page to prevent this folio from being released in
>> + * mpage_release_unused_pages(); it should not equal the folio
>> + * index.
>> + *
>> + * The first_page will be reset to the aligned folio index when this
>> + * folio is written again in the next round. Additionally, do not
>> + * update wbc->nr_to_write here, as it will be updated once the
>> + * entire folio has finished processing.
>> + */
>> + mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
>
> Well, but there can be many folios between mpd->first_page and pos. And
> this way you avoid cleaning them up (unlocking them and dropping elevated
> refcount) before we restart next loop. How is this going to work?
>
Hmm, I don't think there can be many folios between mpd->first_page and
pos. All of the fully mapped folios should have been unlocked by
mpage_folio_done(), and there is no elevated refcount since we always
call folio_batch_release() once we finish processing the folios.
mpage_release_unused_pages() is only used to clean up unsubmitted folios.
For example, suppose we have a filesystem with a 4KB block size, and
mpage_prepare_extent_to_map() finds 4 order-2 folios that need to be mapped:
first_page               next_page
|                           |
[HHHH][HHHH][HHHH][HHHH]     H: hole   L: locked
 LLLL  LLLL  LLLL  LLLL
In the first round of mpage_map_and_submit_extent(), we map the first
two folios along with the first two pages of the third folio;
mpage_map_and_submit_buffers() then submits and unlocks the first two
folios, while also updating mpd->first_page to the beginning of the
third folio.
       first_page   next_page
            |           |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  LLLL  LLLL      W: mapped U: unlocked
In the second round of mpage_map_and_submit_extent(), we fail to map
the remaining blocks and call mpage_submit_buffers() to submit and
unlock this partially mapped folio. Additionally, we advance
mpd->first_page.
          first_page  next_page
               |          |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  UUUU  LLLL      W: mapped U: unlocked
Then, we break out to ext4_do_writepages(), which calls
mpage_release_unused_pages() to unlock the last folio.
          first_page  next_page
               |          |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  UUUU  UUUU      W: mapped U: unlocked
In the next round of ext4_do_writepages(), mpage_prepare_extent_to_map()
restarts processing the third folio and resets mpd->first_page to its
beginning.
       first_page   next_page
            |           |
[WWWW][WWWW][WWHH][HHHH]     H: hole   L: locked
 UUUU  UUUU  LLLL  LLLL      W: mapped U: unlocked
> Also I don't see in this patch where mpd->first_page would get set back to
> retry writing this folio. What am I missing?
>
We already have this; please refer to the following section of
mpage_prepare_extent_to_map():
static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
{
pgoff_t index = mpd->first_page;
...
mpd->map.m_len = 0;
mpd->next_page = index;
...
while (index <= end) {
...
if (mpd->map.m_len == 0)
mpd->first_page = folio->index;
...
}
>> + WARN_ON_ONCE((folio->index == mpd->first_page) ||
>> + !folio_contains(folio, pos >> PAGE_SHIFT));
>> +out:
>> + folio_unlock(folio);
>> + folio_put(folio);
>> + return ret;
>> +}
>> +
>> /*
>> * mpage_map_and_submit_extent - map extent starting at mpd->lblk of length
>> * mpd->len and submit pages underlying it for IO
>> @@ -2412,8 +2448,16 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> */
>> if ((err == -ENOMEM) ||
>> (err == -ENOSPC && ext4_count_free_clusters(sb))) {
>> - if (progress)
>> + /*
>> + * We may have already allocated extents for
>> + * some bhs inside the folio, issue the
>> + * corresponding data to prevent stale data.
>> + */
>> + if (progress) {
>> + if (mpage_submit_buffers(mpd, disksize))
>> + goto invalidate_dirty_pages;
>> goto update_disksize;
>> + }
>> return err;
>> }
>> ext4_msg(sb, KERN_CRIT,
>> @@ -2432,6 +2476,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> *give_up_on_write = true;
>> return err;
>> }
>> + disksize = ((loff_t)(map->m_lblk + map->m_len)) <<
>> + inode->i_blkbits;
>
> I don't think setting disksize like this is correct in case
> mpage_map_and_submit_buffers() below fails (when extent covers many folios
> and we don't succeed in writing them all). In that case we may need to keep
> disksize somewhere in the middle of the extent.
>
> Overall I don't think we need to modify disksize handling here. It is fine
> to leave (part of) the extent dangling beyond disksize until we retry the
> writeback in these rare cases.
OK, this is indeed a rare case. Let's keep it as it is.
Thanks,
Yi.
>
>> progress = 1;
>> /*
>> * Update buffer state, submit mapped pages, and get us new
>> @@ -2442,12 +2488,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> goto update_disksize;
>> } while (map->m_len);
>>
>> + disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>> update_disksize:
>> /*
>> * Update on-disk size after IO is submitted. Races with
>> * truncate are avoided by checking i_size under i_data_sem.
>> */
>> - disksize = ((loff_t)mpd->first_page) << PAGE_SHIFT;
>> if (disksize > READ_ONCE(EXT4_I(inode)->i_disksize)) {
>> int err2;
>> loff_t i_size;
>
> Honza