Message-ID: <3aafd643-3655-420e-93fa-25d0d0ff4f32@huaweicloud.com>
Date: Fri, 6 Jun 2025 14:54:21 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com,
yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH 1/5] ext4: restart handle if credits are insufficient
during writepages
On 2025/6/5 22:04, Jan Kara wrote:
> On Fri 30-05-25 14:28:54, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> After large folios are supported on ext4, writing back a sufficiently
>> large and discontinuous folio may consume a significant number of
>> journal credits, placing considerable strain on the journal. For
>> example, in a 20GB filesystem with 1K block size and 1MB journal size,
>> writing back a 2MB folio could require thousands of credits in the
>> worst-case scenario (when each block is discontinuous and distributed
>> across different block groups), potentially exceeding the journal size.
>>
>> Fix this by making the write-back process first reserve credits for one
>> page and attempt to extend the transaction if the credits are
>> insufficient. In particular, if the credits for a transaction reach
>> their upper limit, stop the handle and initiate a new transaction.
>>
>> Note that since we do not support partial folio writeouts, some blocks
>> within this folio may have been allocated. These allocated extents are
>> submitted through the current transaction, but the folio itself is not
>> submitted. To prevent stale data and potential deadlocks in ordered
>> mode, only the dioread_nolock mode supports this solution, as it always
>> allocates unwritten extents.
>>
>> Suggested-by: Jan Kara <jack@...e.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>
> Couple of simplification suggestions below and one bigger issue we need to
> deal with.
>
Thanks a lot for your comments and suggestions.
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index be9a4cba35fd..5ef34c0c5633 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1680,6 +1680,7 @@ struct mpage_da_data {
>> unsigned int do_map:1;
>> unsigned int scanned_until_end:1;
>> unsigned int journalled_more_data:1;
>> + unsigned int continue_map:1;
>> };
>>
>> static void mpage_release_unused_pages(struct mpage_da_data *mpd,
>> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>> *
>> * @handle - handle for journal operations
>> * @mpd - extent to map
>> + * @needed_blocks - journal credits needed for one writepages iteration
>> + * @check_blocks - journal credits needed for map one extent
>> * @give_up_on_write - we set this to true iff there is a fatal error and there
>> * is no hope of writing the data. The caller should discard
>> * dirty pages to avoid infinite loops.
>> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>> */
>> static int mpage_map_and_submit_extent(handle_t *handle,
>> struct mpage_da_data *mpd,
>> + int needed_blocks, int check_blocks,
>> bool *give_up_on_write)
>> {
>> struct inode *inode = mpd->inode;
>> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> ext4_io_end_t *io_end = mpd->io_submit.io_end;
>> struct ext4_io_end_vec *io_end_vec;
>>
>> + mpd->continue_map = 0;
>> +
>> io_end_vec = ext4_alloc_io_end_vec(io_end);
>> if (IS_ERR(io_end_vec))
>> return PTR_ERR(io_end_vec);
>> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> err = mpage_map_and_submit_buffers(mpd);
>> if (err < 0)
>> goto update_disksize;
>> + if (!map->m_len)
>> + goto update_disksize;
>> +
>> + /*
>> + * For mapping a folio that is sufficiently large and
>> + * discontinuous, the current handle credits may be
>> + * insufficient, try to extend the handle.
>> + */
>> + err = __ext4_journal_ensure_credits(handle, check_blocks,
>> + needed_blocks, 0);
>> + if (err < 0)
>> + goto update_disksize;
>
> IMO it would be more logical to have __ext4_journal_ensure_credits() in
> mpage_map_one_extent() where the handle is actually used. Also there it
> would be pretty logical to do:
>
> /* Make sure transaction has enough credits for this extent */
> needed_credits = ext4_chunk_trans_blocks(inode, 1);
> err = ext4_journal_ensure_credits(handle, needed_credits, 0);
>
> No need to extend the transaction by more than we need for this current
> extent and also no need to propagate needed credits here.
>
> If __ext4_journal_ensure_credits() cannot extend the transaction, we can
> just return -EAGAIN (or something like that) and make sure the retry logic
> on ENOSPC or similar transient errors in mpage_map_and_submit_extent()
> applies properly.
Yeah, that would be simpler.
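If I understand correctly, something along these lines (an untested sketch
of that suggestion, not the final change; if I read
__ext4_journal_ensure_credits() correctly, a return value > 0 means the
running transaction cannot be extended any further):

	/* In mpage_map_one_extent(), before calling ext4_map_blocks() */
	int needed_credits, err;

	/* Make sure the transaction has enough credits for this extent. */
	needed_credits = ext4_chunk_trans_blocks(inode, 1);
	err = ext4_journal_ensure_credits(handle, needed_credits, 0);
	if (err < 0)
		return err;
	if (err > 0) {
		/*
		 * The running transaction cannot be extended any further,
		 * bail out and let the retry logic in
		 * mpage_map_and_submit_extent() restart the handle.
		 */
		return -EAGAIN;
	}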
>
>> + /*
>> + * The credits for the current handle and transaction have
>> + * reached their upper limit, stop the handle and initiate a
>> + * new transaction. Note that some blocks in this folio may
>> + * have been allocated, and these allocated extents are
>> + * submitted through the current transaction, but the folio
>> + * itself is not submitted. To prevent stale data and
>> + * potential deadlock in ordered mode, only the
>> + * dioread_nolock mode supports this.
>> + */
>> + if (err > 0) {
>> + WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
>> + mpd->continue_map = 1;
>> + err = 0;
>> + goto update_disksize;
>> + }
>> } while (map->m_len);
>>
>> update_disksize:
>> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>> if (!err)
>> err = err2;
>> }
>> + if (!err && mpd->continue_map)
>> + ext4_get_io_end(io_end);
>> +
>
> IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
> ext4_do_writepages() if we see we need to continue doing mapping for the
> current io_end.
>
> That way it would be also more obvious that you've just reintroduced
> deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> writeback"). This is actually a fundamental thing because for
> ext4_journal_stop() to complete, we may need IO on the folio to finish
> which means we need io_end to be processed. Even if we avoided the awkward
> case with sync handle described in 646caa9c8e196, to be able to start a new
> handle we may need to complete a previous transaction commit to be able to
> make space in the journal.
Yeah, you are right, I missed the full folios that were attached to the
same io_end in the previous rounds. If we continue with this solution,
I think we should split the io_end and submit the previous one, which
includes those full folios, before the previous transaction is
committed.
>
> Thinking some more about this holding ioend for a folio with partially
> submitted IO is also deadlock prone because mpage_prepare_extent_to_map()
> can call folio_wait_writeback() which will effectively wait for the last
> reference to ioend to be dropped so that underlying extents can be
> converted and folio_writeback bit cleared.
I don't understand this one. mpage_prepare_extent_to_map() should call
folio_wait_writeback() for the partial folio currently being processed,
not for the previous full folios that were attached to the io_end,
because mpd->first_page is moved forward in mpage_folio_done() once we
complete each previous full folio. Besides, in my current solution the
current partial folio is not submitted, so the folio_writeback flag is
never set; how would this deadlock happen?
>
> So what I think we need to do is that if we submit part of the folio and
> cannot submit it all, we just redirty the folio and bail out of the mapping
> loop (similarly as in ENOSPC case).
After looking at the ENOSPC case again, I found that the handling of
ENOSPC before we enabled large folios is also wrong; it may cause stale
data with a 1K block size. Suppose we are processing four bhs on a dirty
page. We map the first bh, and the corresponding io_vec is added to the
io_end with the unwritten flag set. However, when we attempt to map
the second bh, we bail out of the loop due to ENOSPC. At this point,
ext4_do_writepages()->ext4_put_io_end() will convert the extent of the
first bh to written. However, since the folio has not been submitted
(mpage_submit_folio() only submits a full folio), this will trigger a
stale data issue. Is that right? I suppose we also have to write the
partial folio out in this case.
> Then once IO completes
> mpage_prepare_extent_to_map() is able to start working on the folio again.
> Since we cleared dirty bits in the buffers we should not be repeating the
> work we already did...
>
Hmm, it looks like this solution should work. We should introduce a
partial folio version of mpage_submit_folio(), call it, and redirty
the folio once we need to bail out of the loop due to insufficient
space or journal credits. However, ext4_bio_write_folio() also handles
the fscrypt case; I'm not familiar with fscrypt, so I'm not sure it
can handle a partial page properly. I'll give it a try.
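Something like the sketch below is what I have in mind (untested and just
to illustrate the idea; the helper name is made up, and since
ext4_bio_write_folio() currently clears the dirty bits of buffers beyond
@len, it would need a partial-folio variant as well, fscrypt handling
aside):

	static int mpage_submit_partial_folio(struct mpage_da_data *mpd,
					      struct folio *folio, size_t len)
	{
		int err;

		/* Write out only the head of the folio that was mapped. */
		err = ext4_bio_write_folio(&mpd->io_submit, folio, len);
		if (err < 0)
			return err;

		/* Keep the folio dirty so the unmapped tail is retried later. */
		folio_redirty_for_writepage(mpd->wbc, folio);
		return 0;
	}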
Thanks,
Yi.
>
>> return err;
>> }
>>
>> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>> handle_t *handle = NULL;
>> struct inode *inode = mpd->inode;
>> struct address_space *mapping = inode->i_mapping;
>> - int needed_blocks, rsv_blocks = 0, ret = 0;
>> + int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
>> struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>> struct blk_plug plug;
>> bool give_up_on_write = false;
>> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>
>> while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
>> /* For each extent of pages we use new io_end */
>> - mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>> if (!mpd->io_submit.io_end) {
>> - ret = -ENOMEM;
>> - break;
>> + mpd->io_submit.io_end =
>> + ext4_init_io_end(inode, GFP_KERNEL);
>> + if (!mpd->io_submit.io_end) {
>> + ret = -ENOMEM;
>> + break;
>> + }
>> }
>>
>> WARN_ON_ONCE(!mpd->can_map);
>> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>> */
>> BUG_ON(ext4_should_journal_data(inode));
>> needed_blocks = ext4_da_writepages_trans_blocks(inode);
>> + check_blocks = ext4_chunk_trans_blocks(inode,
>> + MAX_WRITEPAGES_EXTENT_LEN);
>>
>> /* start a new transaction */
>> handle = ext4_journal_start_with_reserve(inode,
>> - EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
>> + EXT4_HT_WRITE_PAGE, needed_blocks,
>> + mpd->continue_map ? 0 : rsv_blocks);
>> if (IS_ERR(handle)) {
>> ret = PTR_ERR(handle);
>> ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
>> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>> ret = mpage_prepare_extent_to_map(mpd);
>> if (!ret && mpd->map.m_len)
>> ret = mpage_map_and_submit_extent(handle, mpd,
>> + needed_blocks, check_blocks,
>> &give_up_on_write);
>> /*
>> * Caution: If the handle is synchronous,
>> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>> ext4_journal_stop(handle);
>> } else
>> ext4_put_io_end(mpd->io_submit.io_end);
>> - mpd->io_submit.io_end = NULL;
>> + if (ret || !mpd->continue_map)
>> + mpd->io_submit.io_end = NULL;
>>
>> if (ret == -ENOSPC && sbi->s_journal) {
>> /*
>> --
>> 2.46.1
>>