linux-kernel - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <04b0a510-0a97-464f-a6d3-8410fff9243d@gmail.com>
Date: Tue, 10 Feb 2026 23:57:03 +0800
From: Zhang Yi <yizhang089@...il.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Zhang Yi <yi.zhang@...wei.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 jack@...e.cz, ritesh.list@...il.com, hch@...radead.org, djwong@...nel.org,
 yi.zhang@...weicloud.com, libaokun1@...wei.com, yangerkun@...wei.com,
 yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
 truncating down

On 2/10/2026 3:05 PM, Ojaswin Mujoo wrote:
> On Tue, Feb 03, 2026 at 02:25:03PM +0800, Zhang Yi wrote:
>> Currently, __ext4_block_zero_page_range() is called in the following
>> four cases to zero out the data in partial blocks:
>>
>> 1. Truncate down.
>> 2. Truncate up.
>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>     range extending beyond the end of the file (EOF).
>> 4. Partial block punch hole.
>>
>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>> will write back the zeroed data to the disk through the order mode after
>> zeroing out.
>>
>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>> this ordered write. Assuming no one intentionally bypasses the file
>> system to write directly to the disk. When performing a truncate down
>> operation, ensuring that the data beyond the EOF is zeroed out before
>> updating i_disksize is sufficient to prevent old data from being exposed
>> when the file is later extended. In other words, as long as the on-disk
>> data in case 1 can be properly zeroed out, only the data in memory needs
>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>
>> Case 4 does not require ordered data because the entire punch hole
>> operation does not provide atomicity guarantees. Therefore, it's safe to
>> move the ordered data operation from __ext4_block_zero_page_range() to
>> ext4_truncate().
>>
>> It should be noted that after this change, we can only determine whether
>> to perform ordered data operations based on whether the target block has
>> been zeroed, rather than on the state of the buffer head. Consequently,
>> unnecessary ordered data operations may occur when truncating an
>> unwritten dirty block. However, this scenario is relatively rare, so the
>> overall impact is minimal.
>>
>> This is prepared for the conversion to the iomap infrastructure since it
>> doesn't use ordered data mode and requires active writeback, which
>> reduces the complexity of the conversion.
> 
> Hi Yi,
> 
> Took me quite some time to understand what we are doing here, I'll
> just add my understanding here to confirm/document :)

Hi, Ojaswin!

Thank you for review and test this series.

> 
> So your argument is that currently all paths that change the i_size take
> care of zeroing the (newsize, eof block boundary) before i_size change
> is seen by users:
>    - dio does it in iomap_dio_bio_iter if IOMAP_UNWRITTEN (true for first allocation)
> 	- buffered IO/mmap write does it in ext4_da_write_begin() ->
> 		ext4_block_write_begin() for buffer_new (true for first allocation)
> 	- falloc doesn't zero the new eof block but it allocates an unwrit
> 		extent so no stale data issue. When an allocation happens from the
> 		above 2 methods then we anyways will zero it.

These two zeroing operations mentioned above are mainly used to 
initialize newly allocated blocks, which is not the main focus of this 
discussion.

The focus of this discussion is how to clear the portions of allocated 
blocks that extend beyond the EOF.

> 	- truncate down also takes care of this via ext4_truncate() ->
> 		ext4_block_truncate_page()
> 
> Now, parallely there are also codepaths that say grow the i_size but
> then also zero the (old_size, block boundary) range before the i_size
> commits. This is so that they want to be sure the newly visible range
> doesn't expose stale data.
> For example:
>    - truncate up from 2kb to 8kb will zero (2kb,4kb) via ext4_block_truncate_page()
>    - with i_size = 2kb, buffered IO at 6kb would zero 2kb,4kb in ext4_da_write_end()

Yes, you are right.

>    - I'm unable to see if/where we do it via dio path.

I don't see it too, so I think this is also a problem.

> 
> You originally proposed that we can remove the logic to zeroout
> (old_size, block_boundary) in data=ordered fashion, ie we don't need to
> trigger the zeroout IO before the i_size change commits, we can just zero the
> range in memory because we would have already zeroed them earlier when
> we had allocated at old_isize, or truncated down to old_isize.

Yes.

> 
> To this Jan pointed out that although we take care to zeroout (new_size,
> block_boundary) its not enough because we could still end up with data
> past eof:
> 
> 1. race of buffered write vs mmap write past eof. i_size = 2kb,
>     we write (2kb, 3kb).
> 2. The write goes through but we crash before i_size=3kb txn can commit.
>     Again we have data past 2kb ie the eof block.
> 

Yes.

> Now, Im still looking into this part but the reason we want to get rid of
> this data=ordered IO is so that we don't trigger a writeback due to
> journal commit which tries to acquire folio_lock of a folio already
> locked by iomap.

Yes, and iomap will start a new transaction under the folio lock, which 
may also wait the current committing transaction to finish.

> However we will now try an alternate way to get past
> this.
> 
> Is my understanding correct?

Yes.

Cheers,
Yi.

> 
> Regards,
> ojaswin
> 
> PS: -g auto tests are passing (no regressions) with 64k and 4k bs on
> powerpc 64k pagesize box so thats nice :D
> 
>>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>> ---
>>   fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>>   1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f856ea015263..20b60abcf777 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>   	folio_zero_range(folio, offset, length);
>>   	BUFFER_TRACE(bh, "zeroed end of block");
>>   
>> -	if (ext4_should_journal_data(inode)) {
>> +	if (ext4_should_journal_data(inode))
>>   		err = ext4_dirty_journalled_data(handle, bh);
>> -	} else {
>> +	else
>>   		mark_buffer_dirty(bh);
>> -		/*
>> -		 * Only the written block requires ordered data to prevent
>> -		 * exposing stale data.
>> -		 */
>> -		if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
>> -		    ext4_should_order_data(inode))
>> -			err = ext4_jbd2_inode_add_write(handle, inode, from,
>> -					length);
>> -	}
>>   	if (!err && did_zero)
>>   		*did_zero = true;
>>   
>> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>>   		goto out_trace;
>>   	}
>>   
>> -	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
>> -		ext4_block_truncate_page(handle, mapping, inode->i_size);
>> +	if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
>> +		unsigned int zero_len;
>> +
>> +		zero_len = ext4_block_truncate_page(handle, mapping,
>> +						    inode->i_size);
>> +		if (zero_len < 0) {
>> +			err = zero_len;
>> +			goto out_stop;
>> +		}
>> +		if (zero_len && !IS_DAX(inode) &&
>> +		    ext4_should_order_data(inode)) {
>> +			err = ext4_jbd2_inode_add_write(handle, inode,
>> +					inode->i_size, zero_len);
>> +			if (err)
>> +				goto out_stop;
>> +		}
>> +	}
>>   
>>   	/*
>>   	 * We add the inode to the orphan list, so that if this
>> -- 
>> 2.52.0
>>