lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <3aafd643-3655-420e-93fa-25d0d0ff4f32@huaweicloud.com> Date: Fri, 6 Jun 2025 14:54:21 +0800 From: Zhang Yi <yi.zhang@...weicloud.com> To: Jan Kara <jack@...e.cz> Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, yi.zhang@...wei.com, libaokun1@...wei.com, yukuai3@...wei.com, yangerkun@...wei.com Subject: Re: [PATCH 1/5] ext4: restart handle if credits are insufficient during writepages On 2025/6/5 22:04, Jan Kara wrote: > On Fri 30-05-25 14:28:54, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@...wei.com> >> >> After large folios are supported on ext4, writing back a sufficiently >> large and discontinuous folio may consume a significant number of >> journal credits, placing considerable strain on the journal. For >> example, in a 20GB filesystem with 1K block size and 1MB journal size, >> writing back a 2MB folio could require thousands of credits in the >> worst-case scenario (when each block is discontinuous and distributed >> across different block groups), potentially exceeding the journal size. >> >> Fix this by making the write-back process first reserves credits for one >> page and attempts to extend the transaction if the credits are >> insufficient. In particular, if the credits for a transaction reach >> their upper limit, stop the handle and initiate a new transaction. >> >> Note that since we do not support partial folio writeouts, some blocks >> within this folio may have been allocated. These allocated extents are >> submitted through the current transaction, but the folio itself is not >> submitted. To prevent stale data and potential deadlocks in ordered >> mode, only the dioread_nolock mode supports this solution, as it always >> allocate unwritten extents. >> >> Suggested-by: Jan Kara <jack@...e.cz> >> Signed-off-by: Zhang Yi <yi.zhang@...wei.com> > > Couple of simplification suggestions below and one bigger issue we need to > deal with. > Thanks a lot for your comments and suggestions. >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index be9a4cba35fd..5ef34c0c5633 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -1680,6 +1680,7 @@ struct mpage_da_data { >> unsigned int do_map:1; >> unsigned int scanned_until_end:1; >> unsigned int journalled_more_data:1; >> + unsigned int continue_map:1; >> }; >> >> static void mpage_release_unused_pages(struct mpage_da_data *mpd, >> @@ -2367,6 +2368,8 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd) >> * >> * @handle - handle for journal operations >> * @mpd - extent to map >> + * @needed_blocks - journal credits needed for one writepages iteration >> + * @check_blocks - journal credits needed for map one extent >> * @give_up_on_write - we set this to true iff there is a fatal error and there >> * is no hope of writing the data. The caller should discard >> * dirty pages to avoid infinite loops. >> @@ -2383,6 +2386,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd) >> */ >> static int mpage_map_and_submit_extent(handle_t *handle, >> struct mpage_da_data *mpd, >> + int needed_blocks, int check_blocks, >> bool *give_up_on_write) >> { >> struct inode *inode = mpd->inode; >> @@ -2393,6 +2397,8 @@ static int mpage_map_and_submit_extent(handle_t *handle, >> ext4_io_end_t *io_end = mpd->io_submit.io_end; >> struct ext4_io_end_vec *io_end_vec; >> >> + mpd->continue_map = 0; >> + >> io_end_vec = ext4_alloc_io_end_vec(io_end); >> if (IS_ERR(io_end_vec)) >> return PTR_ERR(io_end_vec); >> @@ -2439,6 +2445,34 @@ static int mpage_map_and_submit_extent(handle_t *handle, >> err = mpage_map_and_submit_buffers(mpd); >> if (err < 0) >> goto update_disksize; >> + if (!map->m_len) >> + goto update_disksize; >> + >> + /* >> + * For mapping a folio that is sufficiently large and >> + * discontinuous, the current handle credits may be >> + * insufficient, try to extend the handle. >> + */ >> + err = __ext4_journal_ensure_credits(handle, check_blocks, >> + needed_blocks, 0); >> + if (err < 0) >> + goto update_disksize; > > IMO it would be more logical to have __ext4_journal_ensure_credits() in > mpage_map_one_extent() where the handle is actually used. Also there it > would be pretty logical to do: > > /* Make sure transaction has enough credits for this extent */ > needed_credits = ext4_chunk_trans_blocks(inode, 1); > err = ext4_journal_ensure_credits(handle, needed_credits, 0); > > No need to extend the transaction by more than we need for this current > extent and also no need to propagate needed credits here. > > If __ext4_journal_ensure_credits() cannot extend the transaction, we can > just return -EAGAIN (or something like that) and make sure the retry logic > on ENOSPC or similar transient errors in mpage_map_and_submit_extent() > applies properly. Yeah, that would be simpler. > >> + /* >> + * The credits for the current handle and transaction have >> + * reached their upper limit, stop the handle and initiate a >> + * new transaction. Note that some blocks in this folio may >> + * have been allocated, and these allocated extents are >> + * submitted through the current transaction, but the folio >> + * itself is not submitted. To prevent stale data and >> + * potential deadlock in ordered mode, only the >> + * dioread_nolock mode supports this. >> + */ >> + if (err > 0) { >> + WARN_ON_ONCE(!ext4_should_dioread_nolock(inode)); >> + mpd->continue_map = 1; >> + err = 0; >> + goto update_disksize; >> + } >> } while (map->m_len); >> >> update_disksize: >> @@ -2467,6 +2501,9 @@ static int mpage_map_and_submit_extent(handle_t *handle, >> if (!err) >> err = err2; >> } >> + if (!err && mpd->continue_map) >> + ext4_get_io_end(io_end); >> + > > IMHO it would be more logical to not call ext4_put_io_end[_deferred]() in > ext4_do_writepages() if we see we need to continue doing mapping for the > current io_end. > > That way it would be also more obvious that you've just reintroduced > deadlock fixed by 646caa9c8e196 ("ext4: fix deadlock during page > writeback"). This is actually a fundamental thing because for > ext4_journal_stop() to complete, we may need IO on the folio to finish > which means we need io_end to be processed. Even if we avoided the awkward > case with sync handle described in 646caa9c8e196, to be able to start a new > handle we may need to complete a previous transaction commit to be able to > make space in the journal. Yeah, you are right, I missed the full folios that were attached to the same io_end in the previous rounds. If we continue to use this solution, I think we should split the io_end and submit the previous one which includes those full folios before the previous transaction is committed. > > Thinking some more about this holding ioend for a folio with partially > submitted IO is also deadlock prone because mpage_prepare_extent_to_map() > can call folio_wait_writeback() which will effectively wait for the last > reference to ioend to be dropped so that underlying extents can be > converted and folio_writeback bit cleared. I don't understand this one. The mpage_prepare_extent_to_map() should call folio_wait_writeback() for the current processing partial folio, not the previous full folios that were attached to the io_end. This is because mpd->first_page should be moved forward in mpage_folio_done() once we complete the previous full folio. Besides, in my current solution, the current partial folio will not be submitted, the folio_writeback flag will not be set, so how does this deadlock happen? > > So what I think we need to do is that if we submit part of the folio and > cannot submit it all, we just redirty the folio and bail out of the mapping > loop (similarly as in ENOSPC case). After looking at the ENOSPC case again, I found that the handling of ENOSPC before we enabling large folio is also wrong, it may case stale data on 1K block size. Suppose we are processing four bhs on a dirty page. We map the first bh, and the corresponding io_vec is added to the io_end, with the unwritten flag set. However, when we attempt to map the second bh, we bail out of the loop due to ENOSPC. At this point, ext4_do_writepages()->ext4_put_io_end() will convert the extent of the first bh to written. However, since the folio has not been committed (mpage_submit_folio() submit a full folio), it will trigger stale data issue. Is that right? I suppose we also have to write partial folio out in this case. > Then once IO completes > mpage_prepare_extent_to_map() is able to start working on the folio again. > Since we cleared dirty bits in the buffers we should not be repeating the > work we already did... > Hmm, it looks like this solution should work. We should introduce a partial folio version of mpage_submit_folio(), call it and redirty the folio once we need to bail out of the loop since insufficient space or journal credits. But ext4_bio_write_folio() will handle the the logic of fscrypt case, I'm not familiar with fscrypt, so I'm not sure it could handle the partial page properly. I'll give it a try. Thanks, Yi. > >> return err; >> } >> >> @@ -2703,7 +2740,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd) >> handle_t *handle = NULL; >> struct inode *inode = mpd->inode; >> struct address_space *mapping = inode->i_mapping; >> - int needed_blocks, rsv_blocks = 0, ret = 0; >> + int needed_blocks, check_blocks, rsv_blocks = 0, ret = 0; >> struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb); >> struct blk_plug plug; >> bool give_up_on_write = false; >> @@ -2825,10 +2862,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd) >> >> while (!mpd->scanned_until_end && wbc->nr_to_write > 0) { >> /* For each extent of pages we use new io_end */ >> - mpd->io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL); >> if (!mpd->io_submit.io_end) { >> - ret = -ENOMEM; >> - break; >> + mpd->io_submit.io_end = >> + ext4_init_io_end(inode, GFP_KERNEL); >> + if (!mpd->io_submit.io_end) { >> + ret = -ENOMEM; >> + break; >> + } >> } >> >> WARN_ON_ONCE(!mpd->can_map); >> @@ -2841,10 +2881,13 @@ static int ext4_do_writepages(struct mpage_da_data *mpd) >> */ >> BUG_ON(ext4_should_journal_data(inode)); >> needed_blocks = ext4_da_writepages_trans_blocks(inode); >> + check_blocks = ext4_chunk_trans_blocks(inode, >> + MAX_WRITEPAGES_EXTENT_LEN); >> >> /* start a new transaction */ >> handle = ext4_journal_start_with_reserve(inode, >> - EXT4_HT_WRITE_PAGE, needed_blocks, rsv_blocks); >> + EXT4_HT_WRITE_PAGE, needed_blocks, >> + mpd->continue_map ? 0 : rsv_blocks); >> if (IS_ERR(handle)) { >> ret = PTR_ERR(handle); >> ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: " >> @@ -2861,6 +2904,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd) >> ret = mpage_prepare_extent_to_map(mpd); >> if (!ret && mpd->map.m_len) >> ret = mpage_map_and_submit_extent(handle, mpd, >> + needed_blocks, check_blocks, >> &give_up_on_write); >> /* >> * Caution: If the handle is synchronous, >> @@ -2894,7 +2938,8 @@ static int ext4_do_writepages(struct mpage_da_data *mpd) >> ext4_journal_stop(handle); >> } else >> ext4_put_io_end(mpd->io_submit.io_end); >> - mpd->io_submit.io_end = NULL; >> + if (ret || !mpd->continue_map) >> + mpd->io_submit.io_end = NULL; >> >> if (ret == -ENOSPC && sbi->s_journal) { >> /* >> -- >> 2.46.1 >>
Powered by blists - more mailing lists