Message-ID: <1dad3113-7b84-40a0-8c7e-da30ae5cba8e@huaweicloud.com>
Date: Thu, 5 Feb 2026 15:50:38 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org,
djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com,
libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
truncating down
On 2/4/2026 10:18 PM, Jan Kara wrote:
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>> four cases to zero out the data in partial blocks:
>>>>
>>>> 1. Truncate down.
>>>> 2. Truncate up.
>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>> range extending beyond the end of the file (EOF).
>>>> 4. Partial block punch hole.
>>>>
>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>> will write back the zeroed data to the disk through the ordered mode
>>>> after zeroing it out.
>>>>
>>>> Among cases 1, 2 and 3 described above, only case 1 actually requires
>>>> this ordered write, assuming no one intentionally bypasses the file
>>>> system and writes directly to the disk. When performing a truncate down
>>>> operation, ensuring that the data beyond EOF is zeroed out before
>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>> when the file is later extended. In other words, as long as the on-disk
>>>> data in case 1 is properly zeroed out, only the data in memory needs to
>>>> be zeroed out in cases 2 and 3, without requiring ordered data.
>>>
>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>> writeback and modify the tail block contents beyond EOF before we really
>>> submit it to the device). Thus after this commit if you truncate up, just
>>> zero out the newly exposed contents in the page cache and dirty it, then
>>> the transaction with the i_disksize update commits (I see nothing
>>> preventing it) and then you crash, you can observe a file with the new
>>> size but non-zero content in the newly exposed area. Am I missing
>>> something?
>>>
>>
>> Well, I think you are right! I missed the mmap write race condition that
>> happens while writeback is submitting I/O. Thanks a lot for pointing this
>> out. I thought of two possible solutions:
>>
>> 1. We also add explicit writeback operations to the truncate-up and
>> post-EOF append write paths. This solution is the most straightforward
>> but may cause some performance overhead. However, since at most one
>> block is written, the impact is likely limited. Additionally, I
>> observed that XFS also adopts a similar approach in its truncate up
>> and down operations. (But it is somewhat strange that XFS also appears
>> to have the same issue with post-EOF append writes; it only zeroes out
>> the partial block in xfs_file_write_checks(), but it neither explicitly
>> writes back the zeroed data nor employs any other mechanism to ensure
>> that the zeroed data reaches the disk before the metadata does.)
>>
>> 2. Resolve this race condition and ensure that there is no non-zero
>> data in the post-EOF partial blocks on disk. I observed that after
>> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>> mmap writes will re-trigger the page fault. Perhaps we can filter out
>> the EOF folio based on i_size in ext4_page_mkwrite(),
>> block_page_mkwrite() and iomap_page_mkwrite(), and then call
>> folio_wait_writeback() to wait for this partial folio's writeback to
>> complete. This seems to break the race condition without introducing
>> too much overhead (no?).
>>
>> What do you think? Any other suggestions are also welcome.
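
(To make option 2 above concrete, this is only a rough sketch of what I
had in mind; the helper name is made up, folio_contains() and
folio_wait_writeback() are the existing generic helpers, and it would be
called from the ->page_mkwrite() paths with the folio locked.)

static void wait_eof_folio_writeback(struct inode *inode,
				     struct folio *folio)
{
	loff_t size = i_size_read(inode);

	/*
	 * Only the folio that straddles EOF can race with the tail
	 * block zeroing, so wait for any writeback in flight before
	 * the mmap write is allowed to dirty it again.
	 */
	if (size && folio_contains(folio, (size - 1) >> PAGE_SHIFT))
		folio_wait_writeback(folio);
}
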
>
> Hum, I like option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do an appending write inside a block. The page is written back but
> before the transaction with the i_disksize update commits we crash. Then
> again we have non-zero content inside the block beyond EOF.
Yes, indeed. From this perspective, it seems difficult to avoid non-zero
content within the block beyond the EOF.
>
> So the only realistic option I see is to ensure the tail of the block gets
> zeroed on disk before the transaction with the i_disksize update commits in
> the cases of truncate up or write beyond EOF. data=ordered mode machinery
> is an asynchronous way to achieve this. We could also just synchronously
> write back the block where needed, but the latency hit of such an operation
> is going to be significant, so I'm quite sure some workload somewhere will
> notice, although the truncate up / write beyond EOF operations triggering
> this are not too common.
Yes, I agree.
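
For reference, the synchronous variant could look roughly like the sketch
below (the helper name is made up; round_down(), i_blocksize() and
filemap_write_and_wait_range() are the existing generic helpers). The
single-block filemap_write_and_wait_range() call is where the latency you
mention would come from.

static int flush_partial_eof_block(struct inode *inode, loff_t new_size)
{
	loff_t start = round_down(new_size, i_blocksize(inode));

	/* Nothing to flush if the new size is block aligned. */
	if (start == new_size)
		return 0;

	/* Synchronously push the block straddling EOF to disk. */
	return filemap_write_and_wait_range(inode->i_mapping, start,
					    start + i_blocksize(inode) - 1);
}
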
> So why do you need to get rid of these data=ordered
> mode usages? I guess because with iomap keeping our transaction handle ->
> folio lock ordering is complicated? Last time I looked it seemed still
> possible to keep it though.
>
Yes, that's one reason. Another reason is that we would also need to
implement partial folio submission for iomap.
When the journal process is waiting for a folio that contains an ordered
block to be written back, and that folio also contains unmapped blocks
(the block size being smaller than the folio size), a deadlock may occur
while mapping the remaining unmapped blocks if the regular writeback
process has already started submitting this folio (and set the writeback
flag). This is because the writeback flag is cleared only after the
entire folio has been processed and submitted. If we want to support
partial folio submission for iomap, we need to be careful not to add
extra performance overhead in the case of severe fragmentation, so this
part of the logic is complicated and subtle. As we discussed in patch 0,
if we can avoid using the data=ordered mode for append writes and online
defrag, this would be the only remaining corner case, and I'm not sure
it is worth implementing partial folio submission and adjusting the lock
ordering just for it.
> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.
>
> Honza
IIUC, this solution can avoid adjusting the lock ordering, but partial
folio submission still needs to be implemented, is my understanding
right? This is because, even though we have already submitted the zeroed
partial EOF block, when the journal process waits for this folio, the
folio may be under writeback while other blocks in it still need to be
mapped.
Cheers,
Yi.