linux-ext4 - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aYyetHqxs9leOLFM@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Wed, 11 Feb 2026 20:53:00 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Zhang Yi <yizhang089@...il.com>
Cc: Zhang Yi <yi.zhang@...wei.com>, linux-ext4@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz,
        ritesh.list@...il.com, hch@...radead.org, djwong@...nel.org,
        yi.zhang@...weicloud.com, libaokun1@...wei.com, yangerkun@...wei.com,
        yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially
 block truncating down

On Tue, Feb 10, 2026 at 11:57:03PM +0800, Zhang Yi wrote:
> On 2/10/2026 3:05 PM, Ojaswin Mujoo wrote:
> > On Tue, Feb 03, 2026 at 02:25:03PM +0800, Zhang Yi wrote:
> > > Currently, __ext4_block_zero_page_range() is called in the following
> > > four cases to zero out the data in partial blocks:
> > > 
> > > 1. Truncate down.
> > > 2. Truncate up.
> > > 3. Perform block allocation (e.g., fallocate) or append writes across a
> > >     range extending beyond the end of the file (EOF).
> > > 4. Partial block punch hole.
> > > 
> > > If the default ordered data mode is used, __ext4_block_zero_page_range()
> > > will write back the zeroed data to the disk through the order mode after
> > > zeroing out.
> > > 
> > > Among the cases 1,2 and 3 described above, only case 1 actually requires
> > > this ordered write. Assuming no one intentionally bypasses the file
> > > system to write directly to the disk. When performing a truncate down
> > > operation, ensuring that the data beyond the EOF is zeroed out before
> > > updating i_disksize is sufficient to prevent old data from being exposed
> > > when the file is later extended. In other words, as long as the on-disk
> > > data in case 1 can be properly zeroed out, only the data in memory needs
> > > to be zeroed out in cases 2 and 3, without requiring ordered data.
> > > 
> > > Case 4 does not require ordered data because the entire punch hole
> > > operation does not provide atomicity guarantees. Therefore, it's safe to
> > > move the ordered data operation from __ext4_block_zero_page_range() to
> > > ext4_truncate().
> > > 
> > > It should be noted that after this change, we can only determine whether
> > > to perform ordered data operations based on whether the target block has
> > > been zeroed, rather than on the state of the buffer head. Consequently,
> > > unnecessary ordered data operations may occur when truncating an
> > > unwritten dirty block. However, this scenario is relatively rare, so the
> > > overall impact is minimal.
> > > 
> > > This is prepared for the conversion to the iomap infrastructure since it
> > > doesn't use ordered data mode and requires active writeback, which
> > > reduces the complexity of the conversion.
> > 
> > Hi Yi,
> > 
> > Took me quite some time to understand what we are doing here, I'll
> > just add my understanding here to confirm/document :)
> 
> Hi, Ojaswin!
> 
> Thank you for review and test this series.
> 
> > 
> > So your argument is that currently all paths that change the i_size take
> > care of zeroing the (newsize, eof block boundary) before i_size change
> > is seen by users:
> >    - dio does it in iomap_dio_bio_iter if IOMAP_UNWRITTEN (true for first allocation)
> > 	- buffered IO/mmap write does it in ext4_da_write_begin() ->
> > 		ext4_block_write_begin() for buffer_new (true for first allocation)
> > 	- falloc doesn't zero the new eof block but it allocates an unwrit
> > 		extent so no stale data issue. When an allocation happens from the
> > 		above 2 methods then we anyways will zero it.
> 
> These two zeroing operations mentioned above are mainly used to initialize
> newly allocated blocks, which is not the main focus of this discussion.
> 
> The focus of this discussion is how to clear the portions of allocated
> blocks that extend beyond the EOF.
> 
> > 	- truncate down also takes care of this via ext4_truncate() ->
> > 		ext4_block_truncate_page()
> > 
> > Now, parallely there are also codepaths that say grow the i_size but
> > then also zero the (old_size, block boundary) range before the i_size
> > commits. This is so that they want to be sure the newly visible range
> > doesn't expose stale data.
> > For example:
> >    - truncate up from 2kb to 8kb will zero (2kb,4kb) via ext4_block_truncate_page()
> >    - with i_size = 2kb, buffered IO at 6kb would zero 2kb,4kb in ext4_da_write_end()
> 
> Yes, you are right.
> 
> >    - I'm unable to see if/where we do it via dio path.
> 
> I don't see it too, so I think this is also a problem.
> 
> > 
> > You originally proposed that we can remove the logic to zeroout
> > (old_size, block_boundary) in data=ordered fashion, ie we don't need to
> > trigger the zeroout IO before the i_size change commits, we can just zero the
> > range in memory because we would have already zeroed them earlier when
> > we had allocated at old_isize, or truncated down to old_isize.
> 
> Yes.
> 
> > 
> > To this Jan pointed out that although we take care to zeroout (new_size,
> > block_boundary) its not enough because we could still end up with data
> > past eof:
> > 
> > 1. race of buffered write vs mmap write past eof. i_size = 2kb,
> >     we write (2kb, 3kb).
> > 2. The write goes through but we crash before i_size=3kb txn can commit.
> >     Again we have data past 2kb ie the eof block.
> > 
> 
> Yes.
> 
> > Now, Im still looking into this part but the reason we want to get rid of
> > this data=ordered IO is so that we don't trigger a writeback due to
> > journal commit which tries to acquire folio_lock of a folio already
> > locked by iomap.
> 
> Yes, and iomap will start a new transaction under the folio lock, which may
> also wait the current committing transaction to finish.

Hi Yi,

Ahh okay got it, thanks for confirming.

Regards,
ojaswin

> 
> > However we will now try an alternate way to get past
> > this.
> > 
> > Is my understanding correct?
> 
> Yes.
> 
> Cheers,
> Yi.
> 
> > 
> > Regards,
> > ojaswin
> >