linux-ext4 - Re: BUG report: an ext4 data corruption issue in nojournal mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7b0ccbbc-edc1-4123-94b9-fa19f79ea968@huaweicloud.com>
Date: Sat, 13 Sep 2025 11:36:27 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: Ext4 Developers List <linux-ext4@...r.kernel.org>,
 "Theodore Y. Ts'o" <tytso@....edu>, Andreas Dilger
 <adilger.kernel@...ger.ca>, Zhang Yi <yi.zhang@...wei.com>,
 "Li,Baokun" <libaokun1@...wei.com>, hsiangkao@...ux.alibaba.com,
 yangerkun <yangerkun@...wei.com>
Subject: Re: BUG report: an ext4 data corruption issue in nojournal mode

On 9/12/2025 10:42 PM, Jan Kara wrote:
> Hello!
> 
> On Fri 12-09-25 11:28:13, Zhang Yi wrote:
>> Gao Xiang recently discovered a data corruption issue in **nojournal**
>> mode. After analysis, we found that the problem is after a metadata
>> block is freed, it can be immediately reallocated as a data block.
>> However, the metadata on this block may still be in the process of being
>> written back, which means the new data in this block could potentially
>> be overwritten by the stale metadata.
>>
>> When releasing a metadata block, ext4_forget() calls bforget() in
>> nojournal mode, which clears the dirty flag on the buffer_head. If the
>> metadata has not yet started to be written back at this point, there is
>> no issue. However, if the write-back has already begun but the I/O has
>> not yet completed, ext4_forget() will have no effect, and the subsequent
>> ext4_mb_clear_bb() will immediately return the block to the mb
>> allocator. This block can then be immediately reallocated, potentially
>> triggering a data loss issue.
> 
> Yes, I agree this can be a problem.
> 
>> This issue is somewhat related to this patch set[1] that have been
>> merged. Before this patch set, clean_bdev_aliases() and
>> clean_bdev_bh_alias() could ensure that the dirty flag of the block
>> device buffer was cleared and the write-back was completed before using
>> newly allocated blocks in most cases. However, this patch set have fixed
>> a similar issues in journal mode and removed this safeguard because it's
>> fragile and misses some corner cases[2], increasing the likelihood of
>> triggering this issue.
> 
> Right.
> 
>> Furthermore, I found that this issue theoretically still appears to
>> persist even in **ordered** journal mode. In the final else branch of
>> jbd2_journal_forget(), if the metadata block to be released is also
>> undergoing a write-back, jbd2_journal_forget() will add this buffer to
>> the current transaction for forgetting. Once the current transaction is
>> committed, the block can then be reallocated. However, there is no
>> guarantee that the ongoing I/O will complete. Typically, the undergoing
>> metadata writeback I/O does not take this long to complete, but it might
>> be throttled by the block layer or delayed due to anomalies in some slow
>> I/O processes in the underlying devices. Therefore, although it is
>> difficult to trigger, it theoretically still exists.
> 
> I don't think this can actually happen. For writeback to be happening on a
> buffer it still has to be part of a checkpoint list of some transaction.
> That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
> the buffer and that's enough to make sure the buffer writeback has either
> completed or not yet started. If I missed some case, please tell me.
> 

Yes, jbd2_journal_try_remove_checkpoint() does lock the buffer and check
the buffer's dirty status under the buffer lock. However. First, it returns
immediately if the buffer is locked by the write-back process, which means
that it does not wait the write-back to complete, thus, until the current
transaction is committed, there is still no guarantee that the I/O will be
completed. Second, it unlocks the buffer once it finds the buffer is still
dirty, if a concurrent write-back happens just after this unlocking and
before clear_buffer_dirty() in jbd2_journal_forget(), the issue can still
theoretically happen, right?

It seems that only the follow changes can make sure the buffer writeback has
either completed or not yet started (and will never start again). What do
you think?

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c7867139af69..e4691e445106 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1772,23 +1772,26 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
 			goto drop;
 		}

-		/*
-		 * Otherwise, if the buffer has been written to disk,
-		 * it is safe to remove the checkpoint and drop it.
-		 */
-		if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
-			spin_unlock(&journal->j_list_lock);
-			goto drop;
+		lock_buffer(bh);
+		if (!buffer_dirty(bh)) {
+			/*
+			 * If the buffer has been written to disk, it is safe
+			 * to remove the checkpoint and drop it.
+			 */
+			unlock_buffer(bh);
+			JBUFFER_TRACE(jh, "remove from checkpoint list");
+			__jbd2_journal_remove_checkpoint(jh);
+		} else {
+			/*
+			 * Otherwise, the buffer is still not written to disk,
+			 * we should attach this buffer to current transaction
+			 * so that the buffer can be checkpointed only after
+			 * the current transaction commits.
+			 */
+			clear_buffer_dirty(bh);
+			unlock_buffer(bh);
+			__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
 		}
-
-		/*
-		 * The buffer is still not written to disk, we should
-		 * attach this buffer to current transaction so that the
-		 * buffer can be checkpointed only after the current
-		 * transaction commits.
-		 */
-		clear_buffer_dirty(bh);
-		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
 		spin_unlock(&journal->j_list_lock);
 	}
 drop:


>> Consider the fix for now. In the **ordered** journal mode, I suppose we
>> can add a wait_on_buffer() during the process of the freed buffer in
>> jbd2_journal_commit_transaction(). This should not significantly impact
>> performance. In **nojorunal** mode, I do not want to reintroduce
>> clean_bdev_aliases(). One approach is to add wait_on_buffer() in
>> __ext4_forget(), but I am concerned that this might impact performance.
>> However, it seems reasonable to wait for ongoing I/O to complete before
>> freeing the buffer.
> 
> I agree calling wait_on_buffer() before calling __bforget() is the best fix
> for the problem in nojournal mode. Yes, it can slow down some cases where
> we free metadata blocks that we recently modified but I think it should be
> relatively rare.
> 

Sure, I will fix it in this way.

Thanks,
Yi.

>> Otherwise, another solution is we may need to
>> implement an asynchronous release process that returns the block to the
>> buddy system only after the I/O operation has completed. However, since
>> the write-back is triggered by bdev, it appears to be hard to implement
>> this solution now. What do people think?
> 
> Yes, that will get rather complicated.
> 
> 								Honza