linux-kernel - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

Message-ID: <b889332b-9c0c-46d1-af61-1f2426c8c305@huaweicloud.com>
Date: Wed, 4 Feb 2026 14:42:46 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org,
 djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yi.zhang@...weicloud.com,
 yizhang089@...il.com, libaokun1@...wei.com, yangerkun@...wei.com,
 yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
 truncating down

Hi, Jan!

On 2/3/2026 5:59 PM, Jan Kara wrote:
> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>> Currently, __ext4_block_zero_page_range() is called in the following
>> four cases to zero out the data in partial blocks:
>>
>> 1. Truncate down.
>> 2. Truncate up.
>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>    range extending beyond the end of the file (EOF).
>> 4. Partial block punch hole.
>>
>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>> will write back the zeroed data to the disk through ordered mode after
>> zeroing out.
>>
>> Among cases 1, 2 and 3 described above, only case 1 actually requires
>> this ordered write, assuming no one intentionally bypasses the file
>> system to write directly to the disk. When performing a truncate down
>> operation, ensuring that the data beyond the EOF is zeroed out before
>> updating i_disksize is sufficient to prevent old data from being exposed
>> when the file is later extended. In other words, as long as the on-disk
>> data in case 1 can be properly zeroed out, only the data in memory needs
>> to be zeroed out in cases 2 and 3, without requiring ordered data.
> 
> Hum, I'm not sure this is correct. The tail block of the file is not
> necessarily zeroed out beyond EOF (as mmap writes can race with page
> writeback and modify the tail block contents beyond EOF before we really
> submit it to the device). Thus after this commit if you truncate up, just
> zero out the newly exposed contents in the page cache and dirty it, then
> the transaction with the i_disksize update commits (I see nothing
> preventing it) and then you crash, you can observe file with the new size
> but non-zero content in the newly exposed area. Am I missing something?
> 

Well, I think you are right! I missed the race between mmap writes and
writeback I/O submission. Thanks a lot for pointing this out. I can
think of two possible solutions:

1. We also add explicit writeback to the truncate-up and post-EOF
   append write paths. This solution is the most straightforward but
   may cause some performance overhead. However, since at most one
   block is written, the impact is likely limited. Additionally, I
   observed that XFS adopts a similar approach in its truncate up and
   down operations. (But it is somewhat strange that XFS also appears
   to have the same issue with post-EOF append writes; it only zeroes
   out the partial block in xfs_file_write_checks(), but it neither
   explicitly writes back the zeroed data nor employs any other
   mechanism to ensure that the zeroed data is written back before the
   metadata is written to disk.)

2. Resolve this race condition and ensure that there is no non-zero
   data in the post-EOF partial blocks on the disk. I observed that
   after writeback takes the folio lock and calls
   folio_clear_dirty_for_io(), mmap writes will re-trigger the page
   fault. Perhaps we can filter out the EOF folio based on i_size in
   ext4_page_mkwrite(), block_page_mkwrite() and iomap_page_mkwrite(),
   and then call folio_wait_writeback() to wait for the writeback of
   this partial folio to complete. This seems to break the race
   condition without introducing too much overhead (no?).
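
To make the failure and option 1 concrete, here is a rough user-space toy
model (not kernel code). All names (toy_file, toy_truncate_up, toy_run and
so on) are made up for illustration, the "disk" and "page cache" are plain
arrays, and the "transaction commit" is reduced to a single assignment. It
only models the ordering between writing back the zeroed tail and updating
the on-disk size; it does not model the locking in option 2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Hypothetical model of a file whose data fits in one tail block. */
struct toy_file {
	unsigned char disk[TOY_BLOCK_SIZE];  /* what is persisted on disk */
	unsigned char cache[TOY_BLOCK_SIZE]; /* page cache copy of the block */
	size_t disk_size;                    /* stands in for i_disksize */
	size_t mem_size;                     /* stands in for i_size */
};

/*
 * Build a 4-byte file whose on-disk tail block already holds a non-zero
 * byte beyond EOF, as if an mmap write had raced with writeback.
 */
static struct toy_file toy_make_raced_file(void)
{
	struct toy_file f;

	memset(&f, 0, sizeof(f));
	memset(f.disk, 'A', 4);
	memset(f.cache, 'A', 4);
	f.disk_size = f.mem_size = 4;
	f.disk[5] = 'X';  /* stale post-EOF byte that reached the disk */
	f.cache[5] = 'X'; /* same byte is still in the page cache */
	return f;
}

/*
 * Truncate up within the tail block (new_size <= TOY_BLOCK_SIZE assumed).
 * The newly exposed range is always zeroed in the page cache. Only when
 * write_back_first is set is the zeroed block flushed before the on-disk
 * size is updated.
 */
static void toy_truncate_up(struct toy_file *f, size_t new_size,
			    bool write_back_first)
{
	memset(f->cache + f->mem_size, 0, new_size - f->mem_size);
	f->mem_size = new_size;
	if (write_back_first)
		memcpy(f->disk, f->cache, TOY_BLOCK_SIZE);
	f->disk_size = new_size; /* models the size update committing */
}

/* After a crash only disk[] and disk_size survive; check the new range. */
static bool toy_crash_exposes_stale(const struct toy_file *f, size_t old_size)
{
	for (size_t i = old_size; i < f->disk_size; i++)
		if (f->disk[i] != 0)
			return true;
	return false;
}

/* Run the scenario: truncate up from 4 to 12 bytes, then "crash". */
static bool toy_run(bool write_back_first)
{
	struct toy_file f = toy_make_raced_file();
	size_t old_size = f.mem_size;

	toy_truncate_up(&f, 12, write_back_first);
	return toy_crash_exposes_stale(&f, old_size);
}
```

In this model toy_run(false) reports stale data in the newly exposed
range, while toy_run(true), which flushes the zeroed block before the size
update, does not.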

What do you think? Any other suggestions are also welcome.

Thanks,
Yi.

>> Case 4 does not require ordered data because the entire punch hole
>> operation does not provide atomicity guarantees. Therefore, it's safe to
>> move the ordered data operation from __ext4_block_zero_page_range() to
>> ext4_truncate().
> 
> I agree hole punching can already expose intermediate results in case of
> crash so there removing the ordered mode handling is safe.
> 
> 								Honza
> 
>> It should be noted that after this change, we can only determine whether
>> to perform ordered data operations based on whether the target block has
>> been zeroed, rather than on the state of the buffer head. Consequently,
>> unnecessary ordered data operations may occur when truncating an
>> unwritten dirty block. However, this scenario is relatively rare, so the
>> overall impact is minimal.
>>
>> This is prepared for the conversion to the iomap infrastructure since it
>> doesn't use ordered data mode and requires active writeback, which
>> reduces the complexity of the conversion.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
>> ---
>>  fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>>  1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f856ea015263..20b60abcf777 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>>  	folio_zero_range(folio, offset, length);
>>  	BUFFER_TRACE(bh, "zeroed end of block");
>>  
>> -	if (ext4_should_journal_data(inode)) {
>> +	if (ext4_should_journal_data(inode))
>>  		err = ext4_dirty_journalled_data(handle, bh);
>> -	} else {
>> +	else
>>  		mark_buffer_dirty(bh);
>> -		/*
>> -		 * Only the written block requires ordered data to prevent
>> -		 * exposing stale data.
>> -		 */
>> -		if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
>> -		    ext4_should_order_data(inode))
>> -			err = ext4_jbd2_inode_add_write(handle, inode, from,
>> -					length);
>> -	}
>>  	if (!err && did_zero)
>>  		*did_zero = true;
>>  
>> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>>  		goto out_trace;
>>  	}
>>  
>> -	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
>> -		ext4_block_truncate_page(handle, mapping, inode->i_size);
>> +	if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
>> +		int zero_len;
>> +
>> +		zero_len = ext4_block_truncate_page(handle, mapping,
>> +						    inode->i_size);
>> +		if (zero_len < 0) {
>> +			err = zero_len;
>> +			goto out_stop;
>> +		}
>> +		if (zero_len && !IS_DAX(inode) &&
>> +		    ext4_should_order_data(inode)) {
>> +			err = ext4_jbd2_inode_add_write(handle, inode,
>> +					inode->i_size, zero_len);
>> +			if (err)
>> +				goto out_stop;
>> +		}
>> +	}
>>  
>>  	/*
>>  	 * We add the inode to the orphan list, so that if this
>> -- 
>> 2.52.0
>>
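
For reference, the partial-block test in the hunk above,
inode->i_size & (inode->i_sb->s_blocksize - 1), relies on the block size
being a power of two. A small stand-alone sketch of that arithmetic
follows; tail_zero_len() is a hypothetical helper written for this mail,
not a function in fs/ext4, and it assumes the zeroed range runs from the
new size to the end of its block.

```c
#include <stdint.h>

/*
 * Number of bytes between `size` and the end of the block containing it,
 * or 0 if `size` is block aligned. `blocksize` must be a power of two.
 */
static unsigned int tail_zero_len(uint64_t size, unsigned int blocksize)
{
	/* Offset of `size` within its block; valid for power-of-two sizes. */
	unsigned int offset = (unsigned int)(size & (blocksize - 1));

	return offset ? blocksize - offset : 0;
}
```

For example, with a 4096-byte block size, a new size of 5000 leaves an
offset of 904 in the tail block, so 3192 bytes would need zeroing, while a
size of exactly 4096 needs none.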