linux-kernel - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <s434ifpengcthkmohmc6vvmvppx4o2k2ctk2p3it55ncgce3je@irbt7xpdnnzu>
Date: Thu, 5 Feb 2026 15:07:35 +0100
From: Jan Kara <jack@...e.cz>
To: Baokun Li <libaokun1@...wei.com>
Cc: Jan Kara <jack@...e.cz>, Zhang Yi <yi.zhang@...weicloud.com>, 
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, 
	tytso@....edu, adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, 
	ritesh.list@...il.com, hch@...radead.org, djwong@...nel.org, 
	Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com, yangerkun@...wei.com, yukuai@...as.com, 
	libaokun9@...il.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially
 block truncating down

On Thu 05-02-26 11:27:09, Baokun Li wrote:
> On 2026-02-04 22:18, Jan Kara wrote:
> > Hi Zhang!
> >
> > On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> >> On 2/3/2026 5:59 PM, Jan Kara wrote:
> >>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> >>>> Currently, __ext4_block_zero_page_range() is called in the following
> >>>> four cases to zero out the data in partial blocks:
> >>>>
> >>>> 1. Truncate down.
> >>>> 2. Truncate up.
> >>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
> >>>>    range extending beyond the end of the file (EOF).
> >>>> 4. Partial block punch hole.
> >>>>
> >>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
> >>>> will write back the zeroed data to the disk through the order mode after
> >>>> zeroing out.
> >>>>
> >>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
> >>>> this ordered write. Assuming no one intentionally bypasses the file
> >>>> system to write directly to the disk. When performing a truncate down
> >>>> operation, ensuring that the data beyond the EOF is zeroed out before
> >>>> updating i_disksize is sufficient to prevent old data from being exposed
> >>>> when the file is later extended. In other words, as long as the on-disk
> >>>> data in case 1 can be properly zeroed out, only the data in memory needs
> >>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
> >>> Hum, I'm not sure this is correct. The tail block of the file is not
> >>> necessarily zeroed out beyond EOF (as mmap writes can race with page
> >>> writeback and modify the tail block contents beyond EOF before we really
> >>> submit it to the device). Thus after this commit if you truncate up, just
> >>> zero out the newly exposed contents in the page cache and dirty it, then
> >>> the transaction with the i_disksize update commits (I see nothing
> >>> preventing it) and then you crash, you can observe file with the new size
> >>> but non-zero content in the newly exposed area. Am I missing something?
> >>>
> >> Well, I think you are right! I missed the mmap write race condition that
> >> happens during the writeback submitting I/O. Thank you a lot for pointing
> >> this out. I thought of two possible solutions:
> >>
> >> 1. We also add explicit writeback operations to the truncate-up and
> >>    post-EOF append writes. This solution is the most straightforward but
> >>    may cause some performance overhead. However, since at most only one
> >>    block is written, the impact is likely limited. Additionally, I
> >>    observed that the implementation of the XFS file system also adopts a
> >>    similar approach in its truncate up and down operation. (But it is
> >>    somewhat strange that XFS also appears to have the same issue with
> >>    post-EOF append writes; it only zero out the partial block in
> >>    xfs_file_write_checks(), but it neither explicitly writeback zeroed
> >>    data nor employs any other mechanism to ensure that the zero data
> >>    writebacks before the metadata is written to disk.)
> >>
> >> 2. Resolve this race condition, ensure that there are no non-zero data
> >>    in the post-EOF partial blocks on the disk. I observed that after the
> >>    writeback holds the folio lock and calls folio_clear_dirty_for_io(),
> >>    mmap writes will re-trigger the page fault. Perhaps we can filter out
> >>    the EOF folio based on i_size in ext4_page_mkwrite(),
> >>    block_page_mkwrite() and iomap_page_mkwrite(), and then call
> >>    folio_wait_writeback() to wait for this partial folio writeback to
> >>    complete. This seems can break the race condition without introducing
> >>    too much overhead (no?).
> >>
> >> What do you think? Any other suggestions are also welcome.
> > Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> > corner-case quirk which unnecessarily complicates rather common paths. But
> > I'm not sure we can easily get rid of it. It can happen for example when
> > you do appending write inside a block. The page is written back but before
> > the transaction with i_disksize update commits we crash. Then again we have
> > a non-zero content inside the block beyond EOF.
> >
> > So the only realistic option I see is to ensure tail of the block gets
> > zeroed on disk before the transaction with i_disksize update commits in the
> > cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> > asynchronous way how to achieve this. We could also just synchronously
> > writeback the block where needed but the latency hit of such operation is
> > going to be significant so I'm quite sure some workload somewhere will
> > notice although the truncate up / write beyond EOF operations triggering this
> > are not too common. So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
> >
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
> > while holding folio lock and thus the iomap conversion would be easier.
> >
> > 								Honza
> 
> Can we treat EOF blocks as metadata and update them in the same
> transaction as i_disksize? Although this would introduce some
> management and journaling overhead, it could avoid the deadlock
> of "writeback -> start handle -> trigger writeback".

No, IMHO that would get too difficult. Just look at the hoops data=journal
mode has to jump through to make page cache handling work with the
journalling machinery. And you'd now have that for all the inodes. So I
think some form of data=ordered machinery is much simpler to reason about.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR