Message-ID: <49596299-8cd5-4b43-ba32-cf2b404236a7@huaweicloud.com>
Date: Sat, 21 Jun 2025 12:42:11 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com,
yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 2/6] ext4: fix stale data if it bails out of the extent
mapping loop
On 2025/6/20 23:21, Jan Kara wrote:
> On Fri 20-06-25 12:57:18, Zhang Yi wrote:
>> On 2025/6/20 0:21, Jan Kara wrote:
>>> On Wed 11-06-25 19:16:21, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@...wei.com>
>>>>
>>>> During the process of writing back folios, if
>>>> mpage_map_and_submit_extent() exits the extent mapping loop due to an
>>>> ENOSPC or ENOMEM error, it may result in stale data or filesystem
>>>> inconsistency in environments where the block size is smaller than the
>>>> folio size.
>>>>
>>>> When mapping a discontinuous folio in mpage_map_and_submit_extent(),
>>>> some buffers may have already been mapped. If we exit the mapping loop
>>>> prematurely, the folio data within the mapped range will not be written
>>>> back, and the file's disk size will not be updated. Once the transaction
>>>> that includes this range of extents is committed, this can lead to stale
>>>> data or filesystem inconsistency.
>>>>
>>>> Fix this by submitting the partially mapped folio that is currently
>>>> being processed and updating the disk size to the end of the mapped
>>>> range.
>>>>
>>>> Suggested-by: Jan Kara <jack@...e.cz>
>>>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>>>> ---
>>>> fs/ext4/inode.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>> 1 file changed, 48 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 3a086fee7989..d0db6e3bf158 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -2362,6 +2362,42 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>>> return 0;
>>>> }
>>>>
>>>> +/*
>>>> + * This is used to submit mapped buffers in a single folio that is not fully
>>>> + * mapped for various reasons, such as insufficient space or journal credits.
>>>> + */
>>>> +static int mpage_submit_buffers(struct mpage_da_data *mpd, loff_t pos)
>>>> +{
>>>> + struct inode *inode = mpd->inode;
>>>> + struct folio *folio;
>>>> + int ret;
>>>> +
>>>> + folio = filemap_get_folio(inode->i_mapping, mpd->first_page);
>>>> + if (IS_ERR(folio))
>>>> + return PTR_ERR(folio);
>>>> +
>>>> + ret = mpage_submit_folio(mpd, folio);
>>>> + if (ret)
>>>> + goto out;
>>>> + /*
>>>> + * Update first_page to prevent this folio from being released in
>>>> + * mpage_release_unused_pages(); it must not equal the folio
>>>> + * index.
>>>> + *
>>>> + * The first_page will be reset to the aligned folio index when this
>>>> + * folio is written again in the next round. Additionally, do not
>>>> + * update wbc->nr_to_write here, as it will be updated once the
>>>> + * entire folio has finished processing.
>>>> + */
>>>> + mpd->first_page = round_up(pos, PAGE_SIZE) >> PAGE_SHIFT;
>>>
>>> Well, but there can be many folios between mpd->first_page and pos. And
>>> this way you avoid cleaning them up (unlocking them and dropping elevated
>>> refcount) before we restart next loop. How is this going to work?
>>>
>>
>> Hmm, I don't think there can be many folios between mpd->first_page and
>> pos. All of the fully mapped folios should be unlocked by
>> mpage_folio_done(), and there is no elevated refcount since we always
>> call folio_batch_release() once we finish processing the folios.
>
> Indeed. I forgot that mpage_map_one_extent() will shorten mpd->map->m_len
> to the number of currently mapped blocks.
>
>> mpage_release_unused_pages() is used to clean up unsubmitted folios.
>>
>> For example, suppose we have a 4kB block size filesystem and we found 4
>> order-2 folios need to be mapped in the mpage_prepare_extent_to_map().
>>
>> first_page              next_page
>> |                       |
>> [HHHH][HHHH][HHHH][HHHH]    H: hole  L: locked
>>  LLLL  LLLL  LLLL  LLLL
>>
>> In the first round in mpage_map_and_submit_extent(), we map the first
>> two folios along with the first two pages of the third folio;
>> mpage_map_and_submit_buffers() should then submit and unlock the first
>> two folios, while also updating mpd->first_page to the beginning of the
>> third folio.
>>
>>             first_page  next_page
>>             |           |
>> [WWWW][WWWW][WWHH][HHHH]    H: hole    L: locked
>>  UUUU  UUUU  LLLL  LLLL     W: mapped  U: unlocked
>>
>> In the second round in mpage_map_and_submit_extent(), we fail to map
>> the remaining blocks and call mpage_submit_buffers() to submit and
>> unlock this partially mapped folio. Additionally, we advance
>> mpd->first_page.
>>
>>                first_page next_page
>>                |        /
>> [WWWW][WWWW][WWHH][HHHH]    H: hole    L: locked
>>  UUUU  UUUU  UUUU  LLLL     W: mapped  U: unlocked
>
> Good. But what if we have a filesystem with 1k blocksize and order 0
> folios? I mean situation like:
>
> first_page              next_page
> |                       |
> [HHHH][HHHH][HHHH][HHHH]    H: hole  L: locked
>  L     L     L     L
>
> Now we map first two folios.
>
> first_page              next_page
> |                       |
> [MMMM][MMMM][HHHH][HHHH]    H: hole  L: locked
>              L     L
>
> Now mpage_map_one_extent() maps half of the folio and fails to extend the
> transaction further:
>
> first_page              next_page
> |                       |
> [MMMM][MMMM][MMHH][HHHH]    H: hole  L: locked
>              L     L
>
> and mpage_submit_buffers() now shifts mpd->first_page like:
>
>                    first_page
>                    |    next_page
>                    |    |
> [MMMM][MMMM][MMHH][HHHH]    H: hole  L: locked
>              L     L
>
> and it never gets reset back?
>
> I suspect you thought that a failure to extend the transaction in the
> middle of an order-0 folio should not happen because you reserve credits
> for a full page's worth of writeback? But those credits could be
> exhausted by the time we get to mapping the third folio because
> mpage_map_one_extent() only ensures there are credits for mapping one
> extent.
Oops, you are right. Sorry, it was my mistake.
>
> And I think reserving credits for just one extent is fine even from the
> beginning (as I wrote in my comment to patch 4). We just need to handle
> this partial case
Yeah.
> which should be possible by just leaving
> mpd->first_page untouched and leave unlocking to
> mpage_release_unused_pages().
I was going to use this solution, but it breaks the semantics of
mpage_release_unused_pages() and triggers the
BUG_ON(folio_test_writeback(folio)) in that function. I don't want to
drop this BUG_ON since I think it's somewhat useful.
> But I can be missing some effects, the writeback code is really complex...
Indeed, this code confused me for a long time. Thank you for patiently
correcting the mistakes in my patch.
> BTW long-term the code may be easier to follow if we replace
> mpd->first_page and mpd->next_page with logical block based or byte based
> indexing. Now when we have large order folios, page is not that important
> concept for writeback anymore.
>
I suppose we should do this conversion now.
Thanks,
Yi.