linux-kernel - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6b3fddd2-047b-4639-b54f-554b16a0ef36@huawei.com>
Date: Fri, 6 Feb 2026 09:14:53 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Jan Kara <jack@...e.cz>
CC: Zhang Yi <yi.zhang@...weicloud.com>, <linux-ext4@...r.kernel.org>,
	<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<tytso@....edu>, <adilger.kernel@...ger.ca>, <ojaswin@...ux.ibm.com>,
	<ritesh.list@...il.com>, <hch@...radead.org>, <djwong@...nel.org>, Zhang Yi
	<yi.zhang@...wei.com>, <yizhang089@...il.com>, <yangerkun@...wei.com>,
	<yukuai@...as.com>, <libaokun9@...il.com>, <libaokun9@...il.com>
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
 truncating down

On 2026-02-05 22:07, Jan Kara wrote:
> On Thu 05-02-26 11:27:09, Baokun Li wrote:
>> On 2026-02-04 22:18, Jan Kara wrote:
>>> Hi Zhang!
>>>
>>> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>>>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>>>> four cases to zero out the data in partial blocks:
>>>>>>
>>>>>> 1. Truncate down.
>>>>>> 2. Truncate up.
>>>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>>>>    range extending beyond the end of the file (EOF).
>>>>>> 4. Partial block punch hole.
>>>>>>
>>>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>>>> will write back the zeroed data to the disk through the order mode after
>>>>>> zeroing out.
>>>>>>
>>>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>>>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>>>> system to write directly to the disk. When performing a truncate down
>>>>>> operation, ensuring that the data beyond the EOF is zeroed out before
>>>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>>>> when the file is later extended. In other words, as long as the on-disk
>>>>>> data in case 1 can be properly zeroed out, only the data in memory needs
>>>>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>>>> writeback and modify the tail block contents beyond EOF before we really
>>>>> submit it to the device). Thus after this commit if you truncate up, just
>>>>> zero out the newly exposed contents in the page cache and dirty it, then
>>>>> the transaction with the i_disksize update commits (I see nothing
>>>>> preventing it) and then you crash, you can observe file with the new size
>>>>> but non-zero content in the newly exposed area. Am I missing something?
>>>>>
>>>> Well, I think you are right! I missed the mmap write race condition that
>>>> happens during the writeback submitting I/O. Thank you a lot for pointing
>>>> this out. I thought of two possible solutions:
>>>>
>>>> 1. We also add explicit writeback operations to the truncate-up and
>>>>    post-EOF append writes. This solution is the most straightforward but
>>>>    may cause some performance overhead. However, since at most only one
>>>>    block is written, the impact is likely limited. Additionally, I
>>>>    observed that the implementation of the XFS file system also adopts a
>>>>    similar approach in its truncate up and down operation. (But it is
>>>>    somewhat strange that XFS also appears to have the same issue with
>>>>    post-EOF append writes; it only zero out the partial block in
>>>>    xfs_file_write_checks(), but it neither explicitly writeback zeroed
>>>>    data nor employs any other mechanism to ensure that the zero data
>>>>    writebacks before the metadata is written to disk.)
>>>>
>>>> 2. Resolve this race condition, ensure that there are no non-zero data
>>>>    in the post-EOF partial blocks on the disk. I observed that after the
>>>>    writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>>>>    mmap writes will re-trigger the page fault. Perhaps we can filter out
>>>>    the EOF folio based on i_size in ext4_page_mkwrite(),
>>>>    block_page_mkwrite() and iomap_page_mkwrite(), and then call
>>>>    folio_wait_writeback() to wait for this partial folio writeback to
>>>>    complete. This seems can break the race condition without introducing
>>>>    too much overhead (no?).
>>>>
>>>> What do you think? Any other suggestions are also welcome.
>>> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
>>> corner-case quirk which unnecessarily complicates rather common paths. But
>>> I'm not sure we can easily get rid of it. It can happen for example when
>>> you do appending write inside a block. The page is written back but before
>>> the transaction with i_disksize update commits we crash. Then again we have
>>> a non-zero content inside the block beyond EOF.
>>>
>>> So the only realistic option I see is to ensure tail of the block gets
>>> zeroed on disk before the transaction with i_disksize update commits in the
>>> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
>>> asynchronous way how to achieve this. We could also just synchronously
>>> writeback the block where needed but the latency hit of such operation is
>>> going to be significant so I'm quite sure some workload somewhere will
>>> notice although the truncate up / write beyond EOF operations triggering this
>>> are not too common. So why do you need to get rid of these data=ordered
>>> mode usages? I guess because with iomap keeping our transaction handle ->
>>> folio lock ordering is complicated? Last time I looked it seemed still
>>> possible to keep it though.
>>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use data=ordered mode machinery only to wait for IO to complete before the
>>> transaction commits. That way it should be safe to start a transaction
>>> while holding folio lock and thus the iomap conversion would be easier.
>>>
>>> 								Honza
>> Can we treat EOF blocks as metadata and update them in the same
>> transaction as i_disksize? Although this would introduce some
>> management and journaling overhead, it could avoid the deadlock
>> of "writeback -> start handle -> trigger writeback".
> No, IMHO that would get too difficult. Just look at the hoops data=journal
> mode has to jump through to make page cache handling work with the
> journalling machinery. And you'd now have that for all the inodes. So I
> think some form of data=ordered machinery is much simpler to reason about.
>
> 								Honza

Indeed, this is a bit tricky.


Regards,
Baokun