linux-kernel - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aYrYwhO5LvIYbxWg@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Tue, 10 Feb 2026 12:35:38 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yi.zhang@...wei.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
        jack@...e.cz, ritesh.list@...il.com, hch@...radead.org,
        djwong@...nel.org, yi.zhang@...weicloud.com, yizhang089@...il.com,
        libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially
 block truncating down

On Tue, Feb 03, 2026 at 02:25:03PM +0800, Zhang Yi wrote:
> Currently, __ext4_block_zero_page_range() is called in the following
> four cases to zero out the data in partial blocks:
> 
> 1. Truncate down.
> 2. Truncate up.
> 3. Perform block allocation (e.g., fallocate) or append writes across a
>    range extending beyond the end of the file (EOF).
> 4. Partial block punch hole.
> 
> If the default ordered data mode is used, __ext4_block_zero_page_range()
> will write back the zeroed data to the disk through the order mode after
> zeroing out.
> 
> Among the cases 1,2 and 3 described above, only case 1 actually requires
> this ordered write. Assuming no one intentionally bypasses the file
> system to write directly to the disk. When performing a truncate down
> operation, ensuring that the data beyond the EOF is zeroed out before
> updating i_disksize is sufficient to prevent old data from being exposed
> when the file is later extended. In other words, as long as the on-disk
> data in case 1 can be properly zeroed out, only the data in memory needs
> to be zeroed out in cases 2 and 3, without requiring ordered data.
> 
> Case 4 does not require ordered data because the entire punch hole
> operation does not provide atomicity guarantees. Therefore, it's safe to
> move the ordered data operation from __ext4_block_zero_page_range() to
> ext4_truncate().
> 
> It should be noted that after this change, we can only determine whether
> to perform ordered data operations based on whether the target block has
> been zeroed, rather than on the state of the buffer head. Consequently,
> unnecessary ordered data operations may occur when truncating an
> unwritten dirty block. However, this scenario is relatively rare, so the
> overall impact is minimal.
> 
> This is prepared for the conversion to the iomap infrastructure since it
> doesn't use ordered data mode and requires active writeback, which
> reduces the complexity of the conversion.

Hi Yi,

Took me quite some time to understand what we are doing here, I'll
just add my understanding here to confirm/document :) 

So your argument is that currently all paths that change the i_size take
care of zeroing the (newsize, eof block boundary) before i_size change
is seen by users:
  - dio does it in iomap_dio_bio_iter if IOMAP_UNWRITTEN (true for first allocation)
	- buffered IO/mmap write does it in ext4_da_write_begin() ->
		ext4_block_write_begin() for buffer_new (true for first allocation)
	- falloc doesn't zero the new eof block but it allocates an unwrit
		extent so no stale data issue. When an allocation happens from the
		above 2 methods then we anyways will zero it.
	- truncate down also takes care of this via ext4_truncate() ->
		ext4_block_truncate_page()

Now, parallely there are also codepaths that say grow the i_size but
then also zero the (old_size, block boundary) range before the i_size
commits. This is so that they want to be sure the newly visible range
doesn't expose stale data.
For example:
  - truncate up from 2kb to 8kb will zero (2kb,4kb) via ext4_block_truncate_page()
  - with i_size = 2kb, buffered IO at 6kb would zero 2kb,4kb in ext4_da_write_end()
  - I'm unable to see if/where we do it via dio path.

You originally proposed that we can remove the logic to zeroout
(old_size, block_boundary) in data=ordered fashion, ie we don't need to
trigger the zeroout IO before the i_size change commits, we can just zero the
range in memory because we would have already zeroed them earlier when
we had allocated at old_isize, or truncated down to old_isize.

To this Jan pointed out that although we take care to zeroout (new_size,
block_boundary) its not enough because we could still end up with data
past eof:

1. race of buffered write vs mmap write past eof. i_size = 2kb,
   we write (2kb, 3kb). 
2. The write goes through but we crash before i_size=3kb txn can commit.
   Again we have data past 2kb ie the eof block.

Now, Im still looking into this part but the reason we want to get rid of
this data=ordered IO is so that we don't trigger a writeback due to
journal commit which tries to acquire folio_lock of a folio already
locked by iomap. However we will now try an alternate way to get past
this.

Is my understanding correct?

Regards,
ojaswin

PS: -g auto tests are passing (no regressions) with 64k and 4k bs on
powerpc 64k pagesize box so thats nice :D 

> 
> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> ---
>  fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f856ea015263..20b60abcf777 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  	folio_zero_range(folio, offset, length);
>  	BUFFER_TRACE(bh, "zeroed end of block");
>  
> -	if (ext4_should_journal_data(inode)) {
> +	if (ext4_should_journal_data(inode))
>  		err = ext4_dirty_journalled_data(handle, bh);
> -	} else {
> +	else
>  		mark_buffer_dirty(bh);
> -		/*
> -		 * Only the written block requires ordered data to prevent
> -		 * exposing stale data.
> -		 */
> -		if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
> -		    ext4_should_order_data(inode))
> -			err = ext4_jbd2_inode_add_write(handle, inode, from,
> -					length);
> -	}
>  	if (!err && did_zero)
>  		*did_zero = true;
>  
> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>  		goto out_trace;
>  	}
>  
> -	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
> -		ext4_block_truncate_page(handle, mapping, inode->i_size);
> +	if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
> +		unsigned int zero_len;
> +
> +		zero_len = ext4_block_truncate_page(handle, mapping,
> +						    inode->i_size);
> +		if (zero_len < 0) {
> +			err = zero_len;
> +			goto out_stop;
> +		}
> +		if (zero_len && !IS_DAX(inode) &&
> +		    ext4_should_order_data(inode)) {
> +			err = ext4_jbd2_inode_add_write(handle, inode,
> +					inode->i_size, zero_len);
> +			if (err)
> +				goto out_stop;
> +		}
> +	}
>  
>  	/*
>  	 * We add the inode to the orphan list, so that if this
> -- 
> 2.52.0
>