Message-ID: <9b7e93da-65dd-4574-be7f-4ec88bce4da7@huawei.com>
Date: Thu, 5 Feb 2026 11:27:09 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Jan Kara <jack@...e.cz>, Zhang Yi <yi.zhang@...weicloud.com>
CC: <linux-ext4@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <tytso@....edu>, <adilger.kernel@...ger.ca>,
<ojaswin@...ux.ibm.com>, <ritesh.list@...il.com>, <hch@...radead.org>,
<djwong@...nel.org>, Zhang Yi <yi.zhang@...wei.com>, <yizhang089@...il.com>,
<yangerkun@...wei.com>, <yukuai@...as.com>, Baokun Li <libaokun1@...wei.com>,
<libaokun9@...il.com>
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
truncating down
On 2026-02-04 22:18, Jan Kara wrote:
> Hi Zhang!
>
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>> four cases to zero out the data in partial blocks:
>>>>
>>>> 1. Truncate down.
>>>> 2. Truncate up.
>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>> range extending beyond the end of the file (EOF).
>>>> 4. Partial block punch hole.
>>>>
>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>> will write back the zeroed data to the disk via the ordered mode after
>>>> zeroing.
>>>>
>>>> Among cases 1, 2 and 3 described above, only case 1 actually requires
>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>> system to write directly to the disk, when performing a truncate down
>>>> operation it is sufficient to ensure that the data beyond EOF is zeroed
>>>> out before i_disksize is updated; this prevents old data from being
>>>> exposed when the file is later extended. In other words, as long as the
>>>> on-disk data in case 1 is properly zeroed out, only the in-memory data
>>>> needs to be zeroed in cases 2 and 3, without requiring ordered data.
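
For readers following along: as far as I can tell, the ordered write being
discussed here is the ext4_jbd2_inode_add_write() call at the tail of the
zeroing helper, roughly (simplified from my reading of mainline, so details
may differ):

	/*
	 * Tail of __ext4_block_zero_page_range(): once the folio has been
	 * zeroed, data=ordered mode registers the byte range with the
	 * running handle so jbd2 flushes the data before the transaction
	 * commits.
	 */
	mark_buffer_dirty(bh);
	if (ext4_should_order_data(inode))
		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
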
>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>> writeback and modify the tail block contents beyond EOF before we really
>>> submit it to the device). Thus after this commit, if you truncate up, just
>>> zero out the newly exposed contents in the page cache and dirty them, then
>>> the transaction with the i_disksize update commits (I see nothing
>>> preventing it) and then you crash, you can observe a file with the new size
>>> but non-zero content in the newly exposed area. Am I missing something?
>>>
>> Well, I think you are right! I missed the mmap write race that can happen
>> while writeback is submitting the I/O. Thank you a lot for pointing this
>> out. I thought of two possible solutions:
>>
>> 1. Also add explicit writeback to the truncate-up and post-EOF append
>> write paths. This solution is the most straightforward but may cause
>> some performance overhead. However, since at most one block is
>> written, the impact is likely limited. Additionally, I observed that
>> XFS adopts a similar approach in its truncate up and down operations.
>> (But it is somewhat strange that XFS also appears to have the same
>> issue with post-EOF append writes; it only zeroes out the partial
>> block in xfs_file_write_checks(), but it neither explicitly writes
>> back the zeroed data nor employs any other mechanism to ensure that
>> the zeroed data reaches disk before the metadata does.)
>>
>> 2. Resolve the race itself and ensure that there is no non-zero data
>> in the post-EOF partial block on disk. I observed that after writeback
>> takes the folio lock and calls folio_clear_dirty_for_io(), mmap writes
>> will re-trigger the page fault. Perhaps we can filter out the EOF
>> folio based on i_size in ext4_page_mkwrite(), block_page_mkwrite()
>> and iomap_page_mkwrite(), and then call folio_wait_writeback() to
>> wait for the writeback of this partial folio to complete. This seems
>> to break the race without introducing too much overhead (no?).
>>
>> What do you think? Any other suggestions are also welcome.
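
Just to make sure I follow the two options, here is roughly how I read
them; both sketches are untested and the exact call sites may differ.

Option 1, explicit writeback of the zeroed partial block, e.g. after the
truncate-up path has dirtied and unlocked it (zero_start/zero_len are
illustrative names):

	/*
	 * Synchronously flush the (at most one) partial block so the
	 * zeroes are on disk before the handle that updates i_disksize
	 * commits.
	 */
	err = filemap_write_and_wait_range(inode->i_mapping, zero_start,
					   zero_start + zero_len - 1);

Option 2, filtering the EOF folio in the ->page_mkwrite() paths:

	/* With the folio already locked in ext4_page_mkwrite(): */
	if (folio_pos(folio) + folio_size(folio) > i_size_read(inode)) {
		/*
		 * The folio straddles EOF: wait for any writeback in
		 * flight so an mmap store cannot dirty the tail of a
		 * block that is currently being written out.
		 */
		folio_wait_writeback(folio);
	}
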
> Hum, I like option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do an appending write inside a block. The page is written back but
> before the transaction with the i_disksize update commits we crash. Then
> again we have non-zero content inside the block beyond EOF.
>
> So the only realistic option I see is to ensure the tail of the block gets
> zeroed on disk before the transaction with the i_disksize update commits in
> the cases of truncate up or write beyond EOF. The data=ordered mode
> machinery is an asynchronous way to achieve this. We could also just
> synchronously write back the block where needed, but the latency hit of
> such an operation is going to be significant, so I'm quite sure some
> workload somewhere will notice, although the truncate up / write beyond EOF
> operations triggering this are not too common. So why do you need to get
> rid of these data=ordered mode usages? I guess because with iomap keeping
> our transaction handle -> folio lock ordering is complicated? Last time I
> looked it seemed still possible to keep it though.
>
> Another possibility would be to just *submit* the write synchronously and
> use the data=ordered mode machinery only to wait for the IO to complete
> before the transaction commits. That way it should be safe to start a
> transaction while holding the folio lock, and thus the iomap conversion
> would be easier.
>
> Honza
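
If I understand that last suggestion correctly, the zeroing path would
submit the I/O itself and register the range only for waiting, something
like the sketch below (untested; ext4_jbd2_inode_add_wait() is the existing
wrapper around jbd2_journal_inode_ranged_wait(), the other names are just
illustrative):

	/*
	 * After the partial block has been zeroed, dirtied and the folio
	 * unlocked: kick off the write immediately and let the
	 * data=ordered commit path only wait for its completion instead
	 * of submitting it.
	 */
	err = filemap_fdatawrite_range(inode->i_mapping, from,
				       from + length - 1);
	if (!err)
		err = ext4_jbd2_inode_add_wait(handle, inode, from, length);
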
Another thought: can we treat EOF blocks as metadata and update them in
the same transaction as i_disksize? Although this would introduce some
management and journaling overhead, it could avoid the
"writeback -> start handle -> trigger writeback" deadlock.
Regards,
Baokun