Message-ID: <36785255-3f1c-46aa-9df3-f43f3b042e40@huaweicloud.com>
Date: Mon, 15 Sep 2025 20:03:47 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: Ext4 Developers List <linux-ext4@...r.kernel.org>,
"Theodore Y. Ts'o" <tytso@....edu>, Andreas Dilger
<adilger.kernel@...ger.ca>, Zhang Yi <yi.zhang@...wei.com>,
"Li,Baokun" <libaokun1@...wei.com>, hsiangkao@...ux.alibaba.com,
yangerkun <yangerkun@...wei.com>
Subject: Re: BUG report: an ext4 data corruption issue in nojournal mode

On 9/15/2025 6:56 PM, Jan Kara wrote:
> On Sat 13-09-25 11:36:27, Zhang Yi wrote:
>> On 9/12/2025 10:42 PM, Jan Kara wrote:
>>> Hello!
>>>
>>> On Fri 12-09-25 11:28:13, Zhang Yi wrote:
>>>> Gao Xiang recently discovered a data corruption issue in **nojournal**
>>>> mode. After analysis, we found that the problem is that after a
>>>> metadata block is freed, it can be immediately reallocated as a data
>>>> block.
>>>> However, the metadata on this block may still be in the process of being
>>>> written back, which means the new data in this block could potentially
>>>> be overwritten by the stale metadata.
>>>>
>>>> When releasing a metadata block, ext4_forget() calls bforget() in
>>>> nojournal mode, which clears the dirty flag on the buffer_head. If the
>>>> metadata has not yet started to be written back at this point, there is
>>>> no issue. However, if the write-back has already begun but the I/O has
>>>> not yet completed, ext4_forget() will have no effect, and the subsequent
>>>> ext4_mb_clear_bb() will immediately return the block to the mb
>>>> allocator. This block can then be immediately reallocated, potentially
>>>> triggering a data loss issue.
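Just to make the window concrete, here is a rough sketch of the race
(illustration only; the exact call chain on the reallocation side may
differ slightly from the function names used below):

    writeback side                      free + reallocate side
    --------------                      ----------------------
    lock_buffer(meta_bh)
    submit write of stale metadata
                                        ext4_forget()
                                          -> bforget(meta_bh)
                                             (clears dirty, does NOT wait
                                              for the in-flight write)
                                        ext4_mb_clear_bb()
                                          -> block returned to the allocator
                                        block reallocated as a data block,
                                        new data written to it
    stale metadata write completes and
    overwrites the new data on disk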
>>>
>>> Yes, I agree this can be a problem.
>>>
>>>> This issue is somewhat related to a patch set[1] that has already
>>>> been merged. Before that patch set, clean_bdev_aliases() and
>>>> clean_bdev_bh_alias() could, in most cases, ensure that the dirty
>>>> flag of the block device buffer was cleared and its write-back had
>>>> completed before a newly allocated block was used. However, that
>>>> patch set fixed similar issues in journal mode and removed this
>>>> safeguard because it was fragile and missed some corner cases[2],
>>>> which increases the likelihood of triggering this issue.
>>>
>>> Right.
>>>
>>>> Furthermore, I found that this issue theoretically still persists
>>>> even in **ordered** journal mode. In the final else branch of
>>>> jbd2_journal_forget(), if the metadata block to be released is also
>>>> under write-back, jbd2_journal_forget() will add this buffer to the
>>>> current transaction for forgetting. Once the current transaction is
>>>> committed, the block can then be reallocated. However, there is no
>>>> guarantee that the ongoing I/O will have completed by then.
>>>> Typically, the in-flight metadata writeback I/O does not take that
>>>> long to complete, but it might be throttled by the block layer or
>>>> delayed by anomalies in a slow underlying device. Therefore, although
>>>> this is difficult to trigger, it theoretically still exists.
>>>
>>> I don't think this can actually happen. For writeback to be happening on a
>>> buffer it still has to be part of a checkpoint list of some transaction.
>>> That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
>>> the buffer and that's enough to make sure the buffer writeback has either
>>> completed or not yet started. If I missed some case, please tell me.
>>>
>>
>> Yes, jbd2_journal_try_remove_checkpoint() does lock the buffer and check
>> the buffer's dirty status under the buffer lock. However, there are two
>> problems. First, it returns immediately if the buffer is locked by the
>> write-back process, which means it does not wait for the write-back to
>> complete; thus, even once the current transaction is committed, there is
>> still no guarantee that the I/O has completed.
>
> Right, it will return with EBUSY for a buffer under IO, file the buffer on
> the BJ_Forget list of the running transaction, and in principle we're not
> guaranteed the IO completes before that transaction commits (although in
> practice that's always true).
>
>> Second, it unlocks the buffer once it finds the buffer is still dirty.
>> If a concurrent write-back starts just after this unlocking and before
>> clear_buffer_dirty() in jbd2_journal_forget(), the issue can still
>> theoretically happen, right?
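For reference, this is my rough paraphrase of
jbd2_journal_try_remove_checkpoint() in fs/jbd2/checkpoint.c (details may
not match the tree exactly), with the two cases above marked:

    int jbd2_journal_try_remove_checkpoint(struct journal_head *jh)
    {
            struct buffer_head *bh = jh2bh(jh);

            if (jh->b_transaction)
                    return -EBUSY;
            if (!trylock_buffer(bh))
                    return -EBUSY;     /* first case: buffer under I/O,
                                        * bail out without waiting */
            if (buffer_dirty(bh)) {
                    unlock_buffer(bh); /* second case: still dirty, unlock;
                                        * writeback may start right here */
                    return -EBUSY;
            }
            unlock_buffer(bh);

            /* clean and I/O finished (we held the buffer lock): safe */
            return __jbd2_journal_remove_checkpoint(jh);
    }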
>
> Hum, that as well.
>
>> It seems that only the following changes can make sure the buffer
>> writeback has either completed or not yet started (and will never start
>> again). What do you think?
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index c7867139af69..e4691e445106 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1772,23 +1772,26 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
>>                          goto drop;
>>                  }
>>
>> -                /*
>> -                 * Otherwise, if the buffer has been written to disk,
>> -                 * it is safe to remove the checkpoint and drop it.
>> -                 */
>> -                if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
>> -                        spin_unlock(&journal->j_list_lock);
>> -                        goto drop;
>> +                lock_buffer(bh);
>
> We hold j_list_lock and b_state_lock here so you cannot lock the buffer...
> I think we rather need something like:
>
>                 /*
>                  * Otherwise, if the buffer has been written to disk,
>                  * it is safe to remove the checkpoint and drop it.
>                  */
>                 if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
>                         spin_unlock(&journal->j_list_lock);
>                         goto drop;
>                 }
>
>                 /*
>                  * The buffer is still not written to disk, we should
>                  * attach this buffer to current transaction so that the
>                  * buffer can be checkpointed only after the current
>                  * transaction commits.
>                  */
>                 clear_buffer_dirty(bh);
> +               wait_for_writeback = 1;
>                 __jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
>                 spin_unlock(&journal->j_list_lock);
>         }
> drop:
>         __brelse(bh);
>         spin_unlock(&jh->b_state_lock);
> +       if (wait_for_writeback)
> +               wait_on_buffer(bh);
>         jbd2_journal_put_journal_head(jh);
>         if (drop_reserve) {
>                 ...
>
Yes, I agree. This is the simplest way to provide a guarantee.
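If I read the sketch correctly, wait_for_writeback would just be a new
local variable in jbd2_journal_forget(), initialized to 0, and the key
point is that wait_on_buffer() only runs after both j_list_lock and
jh->b_state_lock have been dropped, so it is safe to sleep there. Roughly
(my reading only, not a tested patch):

    int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
    {
            ...
            int wait_for_writeback = 0;
            ...
    drop:
            __brelse(bh);
            spin_unlock(&jh->b_state_lock);
            if (wait_for_writeback)
                    wait_on_buffer(bh);  /* no spinlocks held, sleeping is fine */
            jbd2_journal_put_journal_head(jh);
            ...
    }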
Thanks,
Yi.