[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mugnb73mvlvclccatavdd2rwczz3wl3gs7rh4kwcnejkdh4t6b@na743yuuidlj>
Date: Mon, 15 Sep 2025 12:56:09 +0200
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Jan Kara <jack@...e.cz>,
Ext4 Developers List <linux-ext4@...r.kernel.org>, "Theodore Y. Ts'o" <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>, Zhang Yi <yi.zhang@...wei.com>,
"Li,Baokun" <libaokun1@...wei.com>, hsiangkao@...ux.alibaba.com, yangerkun <yangerkun@...wei.com>
Subject: Re: BUG report: an ext4 data corruption issue in nojournal mode
On Sat 13-09-25 11:36:27, Zhang Yi wrote:
> On 9/12/2025 10:42 PM, Jan Kara wrote:
> > Hello!
> >
> > On Fri 12-09-25 11:28:13, Zhang Yi wrote:
> >> Gao Xiang recently discovered a data corruption issue in **nojournal**
> >> mode. After analysis, we found that the problem is after a metadata
> >> block is freed, it can be immediately reallocated as a data block.
> >> However, the metadata on this block may still be in the process of being
> >> written back, which means the new data in this block could potentially
> >> be overwritten by the stale metadata.
> >>
> >> When releasing a metadata block, ext4_forget() calls bforget() in
> >> nojournal mode, which clears the dirty flag on the buffer_head. If the
> >> metadata has not yet started to be written back at this point, there is
> >> no issue. However, if the write-back has already begun but the I/O has
> >> not yet completed, ext4_forget() will have no effect, and the subsequent
> >> ext4_mb_clear_bb() will immediately return the block to the mb
> >> allocator. This block can then be immediately reallocated, potentially
> >> triggering a data loss issue.
> >
> > Yes, I agree this can be a problem.
> >
> >> This issue is somewhat related to this patch set[1] that have been
> >> merged. Before this patch set, clean_bdev_aliases() and
> >> clean_bdev_bh_alias() could ensure that the dirty flag of the block
> >> device buffer was cleared and the write-back was completed before using
> >> newly allocated blocks in most cases. However, this patch set have fixed
> >> a similar issues in journal mode and removed this safeguard because it's
> >> fragile and misses some corner cases[2], increasing the likelihood of
> >> triggering this issue.
> >
> > Right.
> >
> >> Furthermore, I found that this issue theoretically still appears to
> >> persist even in **ordered** journal mode. In the final else branch of
> >> jbd2_journal_forget(), if the metadata block to be released is also
> >> undergoing a write-back, jbd2_journal_forget() will add this buffer to
> >> the current transaction for forgetting. Once the current transaction is
> >> committed, the block can then be reallocated. However, there is no
> >> guarantee that the ongoing I/O will complete. Typically, the undergoing
> >> metadata writeback I/O does not take this long to complete, but it might
> >> be throttled by the block layer or delayed due to anomalies in some slow
> >> I/O processes in the underlying devices. Therefore, although it is
> >> difficult to trigger, it theoretically still exists.
> >
> > I don't think this can actually happen. For writeback to be happening on a
> > buffer it still has to be part of a checkpoint list of some transaction.
> > That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
> > the buffer and that's enough to make sure the buffer writeback has either
> > completed or not yet started. If I missed some case, please tell me.
> >
>
> Yes, jbd2_journal_try_remove_checkpoint() does lock the buffer and check
> the buffer's dirty status under the buffer lock. However. First, it returns
> immediately if the buffer is locked by the write-back process, which means
> that it does not wait the write-back to complete, thus, until the current
> transaction is committed, there is still no guarantee that the I/O will be
> completed.
Right, it will return with EBUSY for a buffer under IO, file the buffer to
BJ_forget list of the running transaction and in principle we're not
guaranteed IO completes before that transaction commits (although in
practice that's always true).
> Second, it unlocks the buffer once it finds the buffer is still
> dirty, if a concurrent write-back happens just after this unlocking and
> before clear_buffer_dirty() in jbd2_journal_forget(), the issue can still
> theoretically happen, right?
Hum, that as well.
> It seems that only the follow changes can make sure the buffer writeback has
> either completed or not yet started (and will never start again). What do
> you think?
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index c7867139af69..e4691e445106 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1772,23 +1772,26 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
> goto drop;
> }
>
> - /*
> - * Otherwise, if the buffer has been written to disk,
> - * it is safe to remove the checkpoint and drop it.
> - */
> - if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
> - spin_unlock(&journal->j_list_lock);
> - goto drop;
> + lock_buffer(bh);
We hold j_list_lock and b_state_lock here so you cannot lock the buffer...
I think we rather need something like:
/*
* Otherwise, if the buffer has been written to disk,
* it is safe to remove the checkpoint and drop it.
*/
if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
spin_unlock(&journal->j_list_lock);
goto drop;
}
/*
* The buffer is still not written to disk, we should
* attach this buffer to current transaction so that the
* buffer can be checkpointed only after the current
* transaction commits.
*/
clear_buffer_dirty(bh);
+ wait_for_writeback = 1;
__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
spin_unlock(&journal->j_list_lock);
}
drop:
__brelse(bh);
spin_unlock(&jh->b_state_lock);
+ if (wait_for_writeback)
+ wait_on_buffer(bh);
jbd2_journal_put_journal_head(jh);
if (drop_reserve) {
...
Honza
> + if (!buffer_dirty(bh)) {
> + /*
> + * If the buffer has been written to disk, it is safe
> + * to remove the checkpoint and drop it.
> + */
> + unlock_buffer(bh);
> + JBUFFER_TRACE(jh, "remove from checkpoint list");
> + __jbd2_journal_remove_checkpoint(jh);
> + } else {
> + /*
> + * Otherwise, the buffer is still not written to disk,
> + * we should attach this buffer to current transaction
> + * so that the buffer can be checkpointed only after
> + * the current transaction commits.
> + */
> + clear_buffer_dirty(bh);
> + unlock_buffer(bh);
> + __jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
> }
> -
> - /*
> - * The buffer is still not written to disk, we should
> - * attach this buffer to current transaction so that the
> - * buffer can be checkpointed only after the current
> - * transaction commits.
> - */
> - clear_buffer_dirty(bh);
> - __jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
> spin_unlock(&journal->j_list_lock);
> }
> drop:
>
>
> >> Consider the fix for now. In the **ordered** journal mode, I suppose we
> >> can add a wait_on_buffer() during the process of the freed buffer in
> >> jbd2_journal_commit_transaction(). This should not significantly impact
> >> performance. In **nojorunal** mode, I do not want to reintroduce
> >> clean_bdev_aliases(). One approach is to add wait_on_buffer() in
> >> __ext4_forget(), but I am concerned that this might impact performance.
> >> However, it seems reasonable to wait for ongoing I/O to complete before
> >> freeing the buffer.
> >
> > I agree calling wait_on_buffer() before calling __bforget() is the best fix
> > for the problem in nojournal mode. Yes, it can slow down some cases where
> > we free metadata blocks that we recently modified but I think it should be
> > relatively rare.
> >
>
> Sure, I will fix it in this way.
>
> Thanks,
> Yi.
>
> >> Otherwise, another solution is we may need to
> >> implement an asynchronous release process that returns the block to the
> >> buddy system only after the I/O operation has completed. However, since
> >> the write-back is triggered by bdev, it appears to be hard to implement
> >> this solution now. What do people think?
> >
> > Yes, that will get rather complicated.
> >
> > Honza
>
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists