Message-ID: <7hy5g3bp5whis4was5mqg3u6t37lwayi6j7scvpbuoqsbe5adc@mh5zxvml3oe7>
Date: Thu, 5 Feb 2026 16:05:02 +0100
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, tytso@....edu, 
	adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org, 
	djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com, 
	libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially
 block truncating down

On Thu 05-02-26 15:50:38, Zhang Yi wrote:
> On 2/4/2026 10:18 PM, Jan Kara wrote:
> > So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
> 
> Yes, that's one reason. Another reason is that we also need to implement
> partial folio submission for iomap.
> 
> When the journal process is waiting for a folio that contains an ordered
> block to be written back, and that folio also contains unmapped blocks
> (block size smaller than the folio size), a deadlock may occur while
> mapping the remaining unmapped blocks if the regular writeback process
> has already started writing this folio back (and set the writeback
> flag). This is because the writeback flag is cleared only after the
> entire folio has been processed and committed. If we want to support
> partial folio submission for iomap, we need to be careful not to add
> extra performance overhead in the case of severe fragmentation.

Yeah, this logic is currently handled by ext4_bio_write_folio(), and the
deadlocks are avoided by grabbing the transaction handle before we go and
lock any page for writeback. But I agree that with iomap it may be tricky
to keep this scheme.
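
Just to make that ordering concrete, here is a rough sketch of the shape of
the current scheme (purely illustrative, the function name and arguments
are made up, this is not the real ext4_writepages() / ext4_bio_write_folio()
path):

static int ordered_writeback_sketch(struct inode *inode, pgoff_t index,
                                    int credits)
{
        handle_t *handle;
        struct folio *folio;

        /* 1) Reserve journal credits first... */
        handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
        if (IS_ERR(handle))
                return PTR_ERR(handle);

        /*
         * 2) ...and only then lock the folio and map / submit its blocks,
         * so a committing transaction never waits on a folio whose holder
         * is in turn waiting for journal space.
         */
        folio = filemap_lock_folio(inode->i_mapping, index);
        if (IS_ERR(folio)) {
                ext4_journal_stop(handle);
                return PTR_ERR(folio);
        }

        /* block mapping and bio submission would go here */

        folio_unlock(folio);
        folio_put(folio);
        ext4_journal_stop(handle);
        return 0;
}

With iomap driving the writeback loop (and locking folios itself), keeping
this order is the part that gets awkward.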

> Therefore, this aspect of the logic is complicated and subtle. As we
> discussed in patch 0, if we can avoid using data=ordered mode for append
> writes and online defrag, then this would be the only remaining corner
> case. I'm not sure it is worth implementing this and adjusting the lock
> ordering.
> 
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
> 
> IIUC, this solution can avoid adjusting the lock ordering, but partial
> folio submission still needs to be implemented, is my understanding right?
> This is because even though we have already submitted the zeroed partial
> EOF block, when the journal process waits for this folio, the folio is
> under writeback and there are other blocks in it that still need to be
> mapped.

That's a good question. If we submit the tail folio from the truncation
code, we could just submit the full folio write and there's no need to
restrict ourselves only to mapped blocks. But you are correct that if this
IO completes but the folio had holes in it, and a hole gets filled in by a
write before the transaction with the i_disksize update commits, jbd2
commit could still race with the flush worker writing this folio again and
the deadlock could happen. Hrm...
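
To spell out the interleaving I have in mind (purely illustrative, the
task names are just labels, not from an actual trace):

  truncate:     zeroes the tail, submits the full folio write, IO completes
  write:        fills a hole in the same folio and dirties it again
  flush worker: picks the folio up, sets the writeback flag, then blocks
                starting a handle against the committing transaction in
                order to map the newly filled blocks
  jbd2 commit:  the transaction carrying the i_disksize update waits for
                the folio under writeback -> deadlock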

So how about the following: We expand our io_end processing with the
ability to journal i_disksize updates after page writeback completes. Then
when doing truncate up or append writes, we keep i_disksize at the old
value and just zero the folio tails in the page cache, mark the folio dirty
and update i_size. When submitting writeback for a folio beyond the current
i_disksize, we make sure writepages submits IO for all the folios from the
current i_disksize upwards. When io_end processing happens after folio
writeback has completed, we update i_disksize to min(i_size, end of IO).
This should take care of the non-zero data exposure issues, and with the
"delay map" processing Baokun is working on, all the inode metadata updates
will happen after IO completion anyway, so they will be nicely batched up
in one transaction. It's a big change but so far I think it should work.
What do you think?
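
On the io_end side that would be roughly something like this (just a
sketch, the function name and calling convention are made up, the
writepages side and most error handling are omitted):

static int sketch_update_disksize_after_io(struct inode *inode,
                                           loff_t pos, size_t len)
{
        loff_t new_disksize = min(pos + (loff_t)len, i_size_read(inode));
        handle_t *handle;
        int err;

        /* Nothing to do if the IO did not reach past the old i_disksize. */
        if (new_disksize <= EXT4_I(inode)->i_disksize)
                return 0;

        handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
        if (IS_ERR(handle))
                return PTR_ERR(handle);

        down_write(&EXT4_I(inode)->i_data_sem);
        /* Recheck under i_data_sem, another io_end may have raced with us. */
        if (new_disksize > EXT4_I(inode)->i_disksize)
                EXT4_I(inode)->i_disksize = new_disksize;
        up_write(&EXT4_I(inode)->i_data_sem);

        err = ext4_mark_inode_dirty(handle, inode);
        ext4_journal_stop(handle);
        return err;
}

The key point is just that i_disksize moves forward only after the data it
covers has hit the disk, and never past i_size.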

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
