Message-ID: <3aafd643-3655-420e-93fa-25d0d0ff4f32@huaweicloud.com>
Date: Fri, 6 Jun 2025 14:54:21 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com,
 yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH 1/5] ext4: restart handle if credits are insufficient
 during writepages

On 2025/6/5 22:04, Jan Kara wrote:
> On Fri 30-05-25 14:28:54, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@...wei.com>
>>
>> After large folios are supported on ext4, writing back a sufficiently
>> large and discontinuous folio may consume a significant number of
>> journal credits, placing considerable strain on the journal. For
>> example, in a 20GB filesystem with 1K block size and 1MB journal size,
>> writing back a 2MB folio could require thousands of credits in the
>> worst-case scenario (when each block is discontinuous and distributed
>> across different block groups), potentially exceeding the journal size.
>>
>> Fix this by making the writeback process first reserve credits for
>> one page and attempt to extend the transaction if those credits turn
>> out to be insufficient. In particular, if the credits for a
>> transaction reach their upper limit, stop the handle and initiate a
>> new transaction.
>>
>> Note that since we do not support partial folio writeout, some blocks
>> within this folio may already have been allocated. These allocated
>> extents are submitted through the current transaction, but the folio
>> itself is not submitted. To prevent stale data and potential deadlocks
>> in ordered mode, only the dioread_nolock mode supports this solution,
>> as it always allocates unwritten extents.
>>
>> Suggested-by: Jan Kara <jack@...e.cz>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> 
> Couple of simplification suggestions below and one bigger issue we need to
> deal with.
> 
Thanks a lot for your comments and suggestions.
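
To make the worst case in the commit message concrete (rough numbers,
assuming each mapped block dirties its own block bitmap and group
descriptor plus an extent tree block): a 2MB folio with 1K blocks
contains 2048 blocks, so a fully discontinuous writeback needs on the
order of 2048 * 3 = ~6000 credits, while a 1MB journal with 1K blocks
holds only 1024 blocks in total.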


>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index be9a4cba35fd..5ef34c0c5633 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1680,6 +1680,7 @@ struct mpage_da_data {
>>  	unsigned int do_map:1;
>>  	unsigned int scanned_until_end:1;
>>  	unsigned int journalled_more_data:1;
>> +	unsigned int continue_map:1;
>>  };
>>  
>>  static void mpage_release_unused_pages(struct mpage_da_data *mpd,
>> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   *
>>   * @handle - handle for journal operations
>>   * @mpd - extent to map
>> + * @needed_blocks - journal credits needed for one writepages iteration
>> + * @check_blocks - journal credits needed for mapping one extent
>>   * @give_up_on_write - we set this to true iff there is a fatal error and there
>>   *                     is no hope of writing the data. The caller should discard
>>   *                     dirty pages to avoid infinite loops.
>> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
>>   */
>>  static int mpage_map_and_submit_extent(handle_t *handle,
>>  				       struct mpage_da_data *mpd,
>> +				       int needed_blocks, int check_blocks,
>>  				       bool *give_up_on_write)
>>  {
>>  	struct inode *inode = mpd->inode;
>> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  	ext4_io_end_t *io_end = mpd->io_submit.io_end;
>>  	struct ext4_io_end_vec *io_end_vec;
>>  
>> +	mpd->continue_map = 0;
>> +
>>  	io_end_vec = ext4_alloc_io_end_vec(io_end);
>>  	if (IS_ERR(io_end_vec))
>>  		return PTR_ERR(io_end_vec);
>> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  		err = mpage_map_and_submit_buffers(mpd);
>>  		if (err < 0)
>>  			goto update_disksize;
>> +		if (!map->m_len)
>> +			goto update_disksize;
>> +
>> +		/*
>> +		 * For mapping a folio that is sufficiently large and
>> +		 * discontinuous, the current handle credits may be
>> +		 * insufficient; try to extend the handle.
>> +		 */
>> +		err = __ext4_journal_ensure_credits(handle, check_blocks,
>> +				needed_blocks, 0);
>> +		if (err < 0)
>> +			goto update_disksize;
> 
> IMO it would be more logical to have __ext4_journal_ensure_credits() in
> mpage_map_one_extent() where the handle is actually used. Also there it
> would be pretty logical to do:
> 
> 		/* Make sure transaction has enough credits for this extent */
> 		needed_credits = ext4_chunk_trans_blocks(inode, 1);
> 		err = ext4_journal_ensure_credits(handle, needed_credits, 0);
> 
> No need to extend the transaction by more than we need for this current
> extent and also no need to propagate needed credits here.
> 
> If __ext4_journal_ensure_credits() cannot extend the transaction, we can
> just return -EAGAIN (or something like that) and make sure the retry logic
> on ENOSPC or similar transient errors in mpage_map_and_submit_extent()
> applies properly.

Yeah, that would be simpler.
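
Something like this at the start of mpage_map_one_extent(), I suppose
(untested sketch; the caller would treat -EAGAIN like other transient
errors and restart the handle):

	struct inode *inode = mpd->inode;
	int needed_credits, err;

	/* Make sure the transaction has enough credits for this extent. */
	needed_credits = ext4_chunk_trans_blocks(inode, 1);
	err = ext4_journal_ensure_credits(handle, needed_credits, 0);
	if (err < 0)
		return err;
	if (err > 0)
		/* The running transaction is full and could not be extended. */
		return -EAGAIN;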

> 
>> +		/*
>> +		 * The credits for the current handle and transaction have
>> +		 * reached their upper limit, stop the handle and initiate a
>> +		 * new transaction. Note that some blocks in this folio may
>> +		 * have been allocated, and these allocated extents are
>> +		 * submitted through the current transaction, but the folio
>> +		 * itself is not submitted. To prevent stale data and
>> +		 * potential deadlock in ordered mode, only the
>> +		 * dioread_nolock mode supports this.
>> +		 */
>> +		if (err > 0) {
>> +			WARN_ON_ONCE(!ext4_should_dioread_nolock(inode));
>> +			mpd->continue_map = 1;
>> +			err = 0;
>> +			goto update_disksize;
>> +		}
>>  	} while (map->m_len);
>>  
>>  update_disksize:
>> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>>  		if (!err)
>>  			err = err2;
>>  	}
>> +	if (!err && mpd->continue_map)
>> +		ext4_get_io_end(io_end);
>> +
> 
> IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in
> ext4_do_writepages() if we see we need to continue doing mapping for the
> current io_end.
> 
> That way it would also be more obvious that you've just reintroduced the
> deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page
> writeback"). This is actually a fundamental thing because for
> ext4_journal_stop() to complete, we may need IO on the folio to finish,
> which means we need the io_end to be processed. Even if we avoided the
> awkward case with a sync handle described in 646caa9c8e196, to start a
> new handle we may need a previous transaction commit to complete in
> order to make space in the journal.

Yeah, you are right, I missed the full folios that were attached to the
same io_end in previous rounds. If we continue with this solution, I
think we should split the io_end and submit the previous part, which
includes those full folios, before the previous transaction is
committed.
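
Roughly what I have in mind, reusing the existing page-io helpers
(untested sketch, error handling omitted):

	/*
	 * Submit and drop the io_end covering the fully mapped folios
	 * before stopping the handle; continue the partial folio with
	 * a fresh io_end in the next transaction.
	 */
	ext4_io_submit(&mpd->io_submit);
	ext4_put_io_end_defer(mpd->io_submit.io_end);
	mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);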

> 
> Thinking some more about this, holding the ioend for a folio with
> partially submitted IO is also deadlock-prone, because
> mpage_prepare_extent_to_map() can call folio_wait_writeback(), which will
> effectively wait for the last reference to the ioend to be dropped so that
> the underlying extents can be converted and the folio writeback bit
> cleared.

I don't understand this one. mpage_prepare_extent_to_map() should call
folio_wait_writeback() only for the partial folio currently being
processed, not for the previous full folios that were attached to the
io_end, because mpd->first_page is moved forward in mpage_folio_done()
once a previous full folio is completed. Besides, in my current
solution the partial folio is not submitted at all, so its writeback
flag is never set; how would this deadlock happen?

> 
> So what I think we need to do is: if we submit part of the folio and
> cannot submit it all, we just redirty the folio and bail out of the
> mapping loop (similarly to the ENOSPC case).

After looking at the ENOSPC case again, I found that the handling of
ENOSPC was already wrong before we enabled large folios; it may cause
stale data with 1K block size. Suppose we are processing four bhs on a
dirty page. We map the first bh, and the corresponding io_vec is added
to the io_end with the unwritten flag set. However, when we attempt to
map the second bh, we bail out of the loop due to ENOSPC. At this
point, ext4_do_writepages()->ext4_put_io_end() will convert the extent
of the first bh to written. However, since the folio has not been
submitted (mpage_submit_folio() only submits full folios), this
triggers a stale data issue. Is that right? I suppose we also have to
write the partial folio out in this case.
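
To illustrate with four 1K bhs on one dirty page (simplified timeline
of the scenario above):

	page:       [bh0][bh1][bh2][bh3]     all dirty
	round 1:    map bh0 -> io_vec added to io_end (unwritten)
	round 2:    map bh1 -> -ENOSPC, bail out, folio never submitted
	put_io_end: converts bh0's extent to written, exposing a block
	            to readers although its data was never written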

> Then once IO completes,
> mpage_prepare_extent_to_map() is able to start working on the folio again.
> Since we have cleared the dirty bits in the buffers, we should not be
> repeating the work we already did...
> 

Hmm, it looks like this solution should work. We should introduce a
partial-folio version of mpage_submit_folio(), call it, and redirty
the folio once we need to bail out of the loop due to insufficient
space or journal credits. However, ext4_bio_write_folio() also handles
the fscrypt logic; I'm not familiar with fscrypt, so I'm not sure it
can handle a partial page properly. I'll give it a try.
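
Concretely, the bail-out might look like this (untested;
mpage_submit_partial_folio() is a hypothetical helper that submits
only the buffers mapped so far and leaves the rest dirty):

	if (err == -EAGAIN || err == -ENOSPC) {
		int err2 = mpage_submit_partial_folio(mpd, folio);

		if (!err2)
			folio_redirty_for_writepage(mpd->wbc, folio);
	}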

Thanks,
Yi.

> 
>>  	return err;
>>  }
>>  
>> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  	handle_t *handle = NULL;
>>  	struct inode *inode = mpd->inode;
>>  	struct address_space *mapping = inode->i_mapping;
>> -	int needed_blocks, rsv_blocks = 0, ret = 0;
>> +	int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0;
>>  	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>>  	struct blk_plug plug;
>>  	bool give_up_on_write = false;
>> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  
>>  	while (!mpd->scanned_until_end && wbc->nr_to_write > 0) {
>>  		/* For each extent of pages we use new io_end */
>> -		mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>>  		if (!mpd->io_submit.io_end) {
>> -			ret = -ENOMEM;
>> -			break;
>> +			mpd->io_submit.io_end =
>> +					ext4_init_io_end(inode, GFP_KERNEL);
>> +			if (!mpd->io_submit.io_end) {
>> +				ret = -ENOMEM;
>> +				break;
>> +			}
>>  		}
>>  
>>  		WARN_ON_ONCE(!mpd->can_map);
>> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  		 */
>>  		BUG_ON(ext4_should_journal_data(inode));
>>  		needed_blocks = ext4_da_writepages_trans_blocks(inode);
>> +		check_blocks = ext4_chunk_trans_blocks(inode,
>> +				MAX_WRITEPAGES_EXTENT_LEN);
>>  
>>  		/* start a new transaction */
>>  		handle = ext4_journal_start_with_reserve(inode,
>> -				EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks);
>> +				EXT4_HT_WRITE_PAGE, needed_blocks,
>> +				mpd->continue_map ? 0 : rsv_blocks);
>>  		if (IS_ERR(handle)) {
>>  			ret = PTR_ERR(handle);
>>  			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
>> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  		ret = mpage_prepare_extent_to_map(mpd);
>>  		if (!ret && mpd->map.m_len)
>>  			ret = mpage_map_and_submit_extent(handle, mpd,
>> +					needed_blocks, check_blocks,
>>  					&give_up_on_write);
>>  		/*
>>  		 * Caution: If the handle is synchronous,
>> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
>>  			ext4_journal_stop(handle);
>>  		} else
>>  			ext4_put_io_end(mpd->io_submit.io_end);
>> -		mpd->io_submit.io_end = NULL;
>> +		if (ret || !mpd->continue_map)
>> +			mpd->io_submit.io_end = NULL;
>>  
>>  		if (ret == -ENOSPC && sbi->s_journal) {
>>  			/*
>> -- 
>> 2.46.1
>>

