linux-ext4 - Re: BUG report: an ext4 data corruption issue in nojournal mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mugnb73mvlvclccatavdd2rwczz3wl3gs7rh4kwcnejkdh4t6b@na743yuuidlj>
Date: Mon, 15 Sep 2025 12:56:09 +0200
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Jan Kara <jack@...e.cz>, 
	Ext4 Developers List <linux-ext4@...r.kernel.org>, "Theodore Y. Ts'o" <tytso@....edu>, 
	Andreas Dilger <adilger.kernel@...ger.ca>, Zhang Yi <yi.zhang@...wei.com>, 
	"Li,Baokun" <libaokun1@...wei.com>, hsiangkao@...ux.alibaba.com, yangerkun <yangerkun@...wei.com>
Subject: Re: BUG report: an ext4 data corruption issue in nojournal mode

On Sat 13-09-25 11:36:27, Zhang Yi wrote:
> On 9/12/2025 10:42 PM, Jan Kara wrote:
> > Hello!
> > 
> > On Fri 12-09-25 11:28:13, Zhang Yi wrote:
> >> Gao Xiang recently discovered a data corruption issue in **nojournal**
> >> mode. After analysis, we found that the problem is after a metadata
> >> block is freed, it can be immediately reallocated as a data block.
> >> However, the metadata on this block may still be in the process of being
> >> written back, which means the new data in this block could potentially
> >> be overwritten by the stale metadata.
> >>
> >> When releasing a metadata block, ext4_forget() calls bforget() in
> >> nojournal mode, which clears the dirty flag on the buffer_head. If the
> >> metadata has not yet started to be written back at this point, there is
> >> no issue. However, if the write-back has already begun but the I/O has
> >> not yet completed, ext4_forget() will have no effect, and the subsequent
> >> ext4_mb_clear_bb() will immediately return the block to the mb
> >> allocator. This block can then be immediately reallocated, potentially
> >> triggering a data loss issue.
> > 
> > Yes, I agree this can be a problem.
> > 
> >> This issue is somewhat related to this patch set[1] that have been
> >> merged. Before this patch set, clean_bdev_aliases() and
> >> clean_bdev_bh_alias() could ensure that the dirty flag of the block
> >> device buffer was cleared and the write-back was completed before using
> >> newly allocated blocks in most cases. However, this patch set have fixed
> >> a similar issues in journal mode and removed this safeguard because it's
> >> fragile and misses some corner cases[2], increasing the likelihood of
> >> triggering this issue.
> > 
> > Right.
> > 
> >> Furthermore, I found that this issue theoretically still appears to
> >> persist even in **ordered** journal mode. In the final else branch of
> >> jbd2_journal_forget(), if the metadata block to be released is also
> >> undergoing a write-back, jbd2_journal_forget() will add this buffer to
> >> the current transaction for forgetting. Once the current transaction is
> >> committed, the block can then be reallocated. However, there is no
> >> guarantee that the ongoing I/O will complete. Typically, the undergoing
> >> metadata writeback I/O does not take this long to complete, but it might
> >> be throttled by the block layer or delayed due to anomalies in some slow
> >> I/O processes in the underlying devices. Therefore, although it is
> >> difficult to trigger, it theoretically still exists.
> > 
> > I don't think this can actually happen. For writeback to be happening on a
> > buffer it still has to be part of a checkpoint list of some transaction.
> > That means we'll call jbd2_journal_try_remove_checkpoint() which will lock
> > the buffer and that's enough to make sure the buffer writeback has either
> > completed or not yet started. If I missed some case, please tell me.
> > 
> 
> Yes, jbd2_journal_try_remove_checkpoint() does lock the buffer and check
> the buffer's dirty status under the buffer lock. However. First, it returns
> immediately if the buffer is locked by the write-back process, which means
> that it does not wait the write-back to complete, thus, until the current
> transaction is committed, there is still no guarantee that the I/O will be
> completed.

Right, it will return with EBUSY for a buffer under IO, file the buffer to
BJ_forget list of the running transaction and in principle we're not
guaranteed IO completes before that transaction commits (although in
practice that's always true).

> Second, it unlocks the buffer once it finds the buffer is still
> dirty, if a concurrent write-back happens just after this unlocking and
> before clear_buffer_dirty() in jbd2_journal_forget(), the issue can still
> theoretically happen, right?

Hum, that as well.

> It seems that only the follow changes can make sure the buffer writeback has
> either completed or not yet started (and will never start again). What do
> you think?
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index c7867139af69..e4691e445106 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1772,23 +1772,26 @@ int jbd2_journal_forget(handle_t *handle, struct buffer_head *bh)
>  			goto drop;
>  		}
> 
> -		/*
> -		 * Otherwise, if the buffer has been written to disk,
> -		 * it is safe to remove the checkpoint and drop it.
> -		 */
> -		if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
> -			spin_unlock(&journal->j_list_lock);
> -			goto drop;
> +		lock_buffer(bh);

We hold j_list_lock and b_state_lock here so you cannot lock the buffer...
I think we rather need something like:

                /*
                 * Otherwise, if the buffer has been written to disk,
                 * it is safe to remove the checkpoint and drop it.
                 */     
                if (jbd2_journal_try_remove_checkpoint(jh) >= 0) {
                        spin_unlock(&journal->j_list_lock);
                        goto drop;
                }

                /*
                 * The buffer is still not written to disk, we should
                 * attach this buffer to current transaction so that the
                 * buffer can be checkpointed only after the current
                 * transaction commits.
                 */
                clear_buffer_dirty(bh);
+		wait_for_writeback = 1;
		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
		spin_unlock(&journal->j_list_lock);
	}
drop:
	__brelse(bh);
	spin_unlock(&jh->b_state_lock);
+	if (wait_for_writeback)
+		wait_on_buffer(bh);
	jbd2_journal_put_journal_head(jh);
	if (drop_reserve) {
...

								Honza

> +		if (!buffer_dirty(bh)) {
> +			/*
> +			 * If the buffer has been written to disk, it is safe
> +			 * to remove the checkpoint and drop it.
> +			 */
> +			unlock_buffer(bh);
> +			JBUFFER_TRACE(jh, "remove from checkpoint list");
> +			__jbd2_journal_remove_checkpoint(jh);
> +		} else {
> +			/*
> +			 * Otherwise, the buffer is still not written to disk,
> +			 * we should attach this buffer to current transaction
> +			 * so that the buffer can be checkpointed only after
> +			 * the current transaction commits.
> +			 */
> +			clear_buffer_dirty(bh);
> +			unlock_buffer(bh);
> +			__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
>  		}
> -
> -		/*
> -		 * The buffer is still not written to disk, we should
> -		 * attach this buffer to current transaction so that the
> -		 * buffer can be checkpointed only after the current
> -		 * transaction commits.
> -		 */
> -		clear_buffer_dirty(bh);
> -		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
>  		spin_unlock(&journal->j_list_lock);
>  	}
>  drop:
> 
> 
> >> Consider the fix for now. In the **ordered** journal mode, I suppose we
> >> can add a wait_on_buffer() during the process of the freed buffer in
> >> jbd2_journal_commit_transaction(). This should not significantly impact
> >> performance. In **nojorunal** mode, I do not want to reintroduce
> >> clean_bdev_aliases(). One approach is to add wait_on_buffer() in
> >> __ext4_forget(), but I am concerned that this might impact performance.
> >> However, it seems reasonable to wait for ongoing I/O to complete before
> >> freeing the buffer.
> > 
> > I agree calling wait_on_buffer() before calling __bforget() is the best fix
> > for the problem in nojournal mode. Yes, it can slow down some cases where
> > we free metadata blocks that we recently modified but I think it should be
> > relatively rare.
> > 
> 
> Sure, I will fix it in this way.
> 
> Thanks,
> Yi.
> 
> >> Otherwise, another solution is we may need to
> >> implement an asynchronous release process that returns the block to the
> >> buddy system only after the I/O operation has completed. However, since
> >> the write-back is triggered by bdev, it appears to be hard to implement
> >> this solution now. What do people think?
> > 
> > Yes, that will get rather complicated.
> > 
> > 								Honza
> 
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR