Message-ID: <36785255-3f1c-46aa-9df3-f43f3b042e40@huaweicloud.com>
Date: Mon, 15 Sep 2025 20:03:47 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: Ext4 Developers List <linux-ext4@...r.kernel.org>,
"Theodore Y. Ts'o" <tytso@....edu>, Andreas Dilger
<adilger.kernel@...ger.ca>, Zhang Yi <yi.zhang@...wei.com>,
"Li,Baokun" <libaokun1@...wei.com>, hsiangkao@...ux.alibaba.com,
yangerkun <yangerkun@...wei.com>
Subject: Re: BUG report: an ext4 data corruption issue in nojournal mode

On 9/15/2025 6:56 PM, Jan Kara wrote:
> On Sat 13-09-25 11:36:27, Zhang Yi wrote:
>> On 9/12/2025 10:42 PM, Jan Kara wrote:
>>> Hello!
>>>
>>> On Fri 12-09-25 11:28:13, Zhang Yi wrote:
>>>> Gao Xiang recently discovered a data corruption issue in **nojournal**
>>>> mode. After analysis, we found that the problem is that after a
>>>> metadata block is freed, it can be immediately reallocated as a data
>>>> block.
>>>> However, the metadata on this block may still be in the process of being
>>>> written back, which means the new data in this block could potentially
>>>> be overwritten by the stale metadata.
>>>>
>>>> When releasing a metadata block, ext4_forget() calls bforget() in
>>>> nojournal mode, which clears the dirty flag on the buffer_head. If the
>>>> metadata has not yet started to be written back at this point, there is
>>>> no issue. However, if the write-back has already begun but the I/O has
>>>> not yet completed, ext4_forget() will have no effect, and the subsequent
>>>> ext4_mb_clear_bb() will immediately return the block to the mb
>>>> allocator. This block can then be immediately reallocated, potentially
>>>> triggering a data loss issue.
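Just to make the window concrete, here is a rough sketch of the race
(illustration only; the exact call chain on the reallocation side may
differ slightly from the function names used below):

    writeback side                      free + reallocate side
    --------------                      ----------------------
    lock_buffer(meta_bh)
    submit write of stale metadata
                                        ext4_forget()
                                          -> bforget(meta_bh)
                                             (clears dirty, does NOT wait
                                              for the in-flight write)
                                        ext4_mb_clear_bb()
                                          -> block returned to the allocator
                                        block reallocated as a data block,
                                        new data written to it
    stale metadata write completes and
    overwrites the new data on disk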
>>>
>>> Yes, I agree this can be a problem.
>>>
>>>> This issue is somewhat related to a patch set[1] that has already
>>>> been merged. Before that patch set, clean_bdev_aliases() and
>>>> clean_bdev_bh_alias() could, in most cases, ensure that the dirty
>>>> flag of the block device buffer was cleared and its write-back had
>>>> completed before a newly allocated block was used. However, that
>>>> patch set fixed similar issues in journal mode and removed this
>>>> safeguard because it was fragile and missed some corner cases[2],
>>>> which increases the likelihood of triggering this issue.
>>>
>>> Right.
>>>
>>>> Furthermore, I found that this issue theoretically still persists
>>>> even in **ordered** journal mode. In the final else branch of
>>>> jbd2_journal_forget(), if the metadata block to be released is also
>>>> under write-back, jbd2_journal_forget() will add this buffer to the
>>>> current transaction for forgetting. Once the current transaction is
>>>> committed, the block can then be reallocated. However, there is no
>>>> guarantee that the ongoing I/O will have completed by then.
>>>> Typically, the in-flight metadata writeback I/O does not take that
>>>> long to complete, but it might be throttled by the block layer or
>>>> delayed by anomalies in a slow underlying device. Therefore, although
>>>> this is difficult to trigger, it theoretically still exists.
>>>
>>> I don't think this can actually happen. For writeback to be happening on a
>>> buffer it still has to be part of a checkpoint list of some transaction.
>>> That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
>>> the buffer and that's enough to make sure the buffer writeback has either
>>> completed or not yet started. If I missed some case, please tell me.
>>>
>>
>> Yes, jbd2_journal_try_remove_checkpoint() does lock the buffer and check
>> the buffer's dirty status under the buffer lock. However, there are two
>> problems. First, it returns immediately if the buffer is locked by the
>> write-back process, which means it does not wait for the write-back to
>> complete; thus, even once the current transaction is committed, there is
>> still no guarantee that the I/O has completed.
>
> Right, it will return with EBUSY for a buffer under IO, file the buffer on
> the BJ_Forget list of the running transaction, and in principle we're not
> guaranteed the IO completes before that transaction commits (although in
> practice that's always true).
>
>> Second, it unlocks the buffer once it finds the buffer is still dirty.
>> If a concurrent write-back starts just after this unlocking and before
>> clear_buffer_dirty() in jbd2_journal_forget(), the issue can still
>> theoretically happen, right?
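For reference, this is my rough paraphrase of
jbd2_journal_try_remove_checkpoint() in fs/jbd2/checkpoint.c (details may
not match the tree exactly), with the two cases above marked:

    int jbd2_journal_try_remove_checkpoint(struct journal_head *jh)
    {
            struct buffer_head *bh = jh2bh(jh);

            if (jh->b_transaction)
                    return -EBUSY;
            if (!trylock_buffer(bh))
                    return -EBUSY;     /* first case: buffer under I/O,
                                        * bail out without waiting */
            if (buffer_dirty(bh)) {
                    unlock_buffer(bh); /* second case: still dirty, unlock;
                                        * writeback may start right here */
                    return -EBUSY;
            }
            unlock_buffer(bh);

            /* clean and I/O finished (we held the buffer lock): safe */
            return __jbd2_journal_remove_checkpoint(jh);
    }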
>
> Hum, that as well.
>
>> It seems that only the following changes can make sure the buffer
>> writeback has either completed or not yet started (and will never start
>> again). What do you think?
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index c7867139af69..e4691e445106 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1772,23 +1772,26 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
>>                          goto drop;
>>                  }
>>
>> -                /*
>> -                 * Otherwise, if the buffer has been written to disk,
>> -                 * it is safe to remove the checkpoint and drop it.
>> -                 */
>> -                if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
>> -                        spin_unlock(&journal->j_list_lock);
>> -                        goto drop;
>> +                lock_buffer(bh);
>
> We hold j_list_lock and b_state_lock here so you cannot lock the buffer...
> I think we rather need something like:
>
>                 /*
>                  * Otherwise, if the buffer has been written to disk,
>                  * it is safe to remove the checkpoint and drop it.
>                  */
>                 if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
>                         spin_unlock(&journal->j_list_lock);
>                         goto drop;
>                 }
>
>                 /*
>                  * The buffer is still not written to disk, we should
>                  * attach this buffer to current transaction so that the
>                  * buffer can be checkpointed only after the current
>                  * transaction commits.
>                  */
>                 clear_buffer_dirty(bh);
> +               wait_for_writeback = 1;
>                 __jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
>                 spin_unlock(&journal->j_list_lock);
>         }
> drop:
>         __brelse(bh);
>         spin_unlock(&jh->b_state_lock);
> +       if (wait_for_writeback)
> +               wait_on_buffer(bh);
>         jbd2_journal_put_journal_head(jh);
>         if (drop_reserve) {
>                 ...
>
Yes, I agree. This is the simplest way to provide a guarantee.
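If I read the sketch correctly, wait_for_writeback would just be a new
local variable in jbd2_journal_forget(), initialized to 0, and the key
point is that wait_on_buffer() only runs after both j_list_lock and
jh->b_state_lock have been dropped, so it is safe to sleep there. Roughly
(my reading only, not a tested patch):

    int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
    {
            ...
            int wait_for_writeback = 0;
            ...
    drop:
            __brelse(bh);
            spin_unlock(&jh->b_state_lock);
            if (wait_for_writeback)
                    wait_on_buffer(bh);  /* no spinlocks held, sleeping is fine */
            jbd2_journal_put_journal_head(jh);
            ...
    }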
Thanks,
Yi.