Message-ID: <3ea033c1-8d32-4c82-baea-c383fa1d9e2a@huaweicloud.com>
Date: Fri, 6 Feb 2026 19:09:53 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org,
 djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com,
 libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
 truncating down

On 2/5/2026 11:05 PM, Jan Kara wrote:
> On Thu 05-02-26 15:50:38, Zhang Yi wrote:
>> On 2/4/2026 10:18 PM, Jan Kara wrote:
>>> So why do you need to get rid of these data=ordered
>>> mode usages? I guess because with iomap keeping our transaction handle ->
>>> folio lock ordering is complicated? Last time I looked it seemed still
>>> possible to keep it though.
>>
>> Yes, that's one reason. Another reason is that we also need to
>> implement partial folio submission for iomap.
>>
>> When the journal process is waiting for a folio to be written back
>> (one that contains an ordered block), and the folio also contains
>> unmapped blocks (with the block size smaller than the folio size), a
>> deadlock may occur while mapping the remaining unmapped blocks if the
>> regular writeback process has already started writing back this folio
>> (and set the writeback flag). This is because the writeback flag is
>> cleared only after the entire folio has been processed and submitted.
>> If we want to support partial folio submission for iomap, we need to
>> be careful not to add extra performance overhead in the case of severe
>> fragmentation.
> 
> Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
> deadlocks are currently resolved by grabbing transaction handle before we
> go and lock any page for writeback. But I agree that with iomap it may be
> tricky to keep this scheme.
> 
>> Therefore, this aspect of the logic is complicated and subtle. As we
>> discussed in patch 0, if we can avoid using data=ordered mode for
>> append writes and online defrag, then this would be the only remaining
>> corner case. I'm not sure it is worth implementing this and adjusting
>> the lock ordering.
>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use data=ordered mode machinery only to wait for IO to complete before the
>>> transaction commits. That way it should be safe to start a transaction
>>
>> IIUC, this solution seems to avoid the need to adjust the lock
>> ordering, but partial folio submission still needs to be implemented.
>> Is my understanding right? This is because, although we have already
>> submitted the zeroed partial EOF block, the folio may still be under
>> writeback while the journal process is waiting for it, and there are
>> other blocks in this folio that still need to be mapped.
> 
> That's a good question. If we submit the tail folio from truncation code,
> we could just submit the full folio write and there's no need to restrict
> ourselves only to mapped blocks. But you are correct that if this IO
> completes but the folio had holes in it and the hole gets filled in by
> write before the transaction with i_disksize update commits, jbd2 commit
> could still race with flush worker writing this folio again and the
> deadlock could happen. Hrm...
> 
Yes!

> So how about the following:

Let me see. Please correct me if my understanding is wrong, and there are
also some points I don't get.

> We expand our io_end processing with the
> ability to journal i_disksize updates after page writeback completes. Then
> when doing truncate up or appending writes, we keep i_disksize at the old
> value and just zero folio tails in the page cache, mark the folio dirty and
> update i_size.

I think we need to submit this zeroed folio here as well, because:

1) In the case of truncate up, if we don't submit, the i_disksize update
   may have to wait a long time (until the folio writeback completes,
   which can take about 30 seconds by default), which is too long.
2) In the case of appending writes, if a folio written beyond this one is
   written back first, we have to wait for this zeroed folio to be
   written back before updating i_disksize, so we can't wait too long
   either.

Right?
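
What I have in mind is roughly the following (only a sketch;
ext4_submit_zeroed_eof_folio() is a made-up name, and I'm assuming a
single-page folio and ignoring error handling):

static int ext4_submit_zeroed_eof_folio(struct inode *inode,
					loff_t old_disksize)
{
	struct address_space *mapping = inode->i_mapping;
	loff_t start = round_down(old_disksize, PAGE_SIZE);
	loff_t end = start + PAGE_SIZE - 1;

	/*
	 * The caller is assumed to have already zeroed the tail of the
	 * EOF folio in the page cache and marked it dirty, while keeping
	 * i_disksize at the old value.  Here we only start IO on that
	 * single folio, so the deferred i_disksize update in io_end
	 * processing does not have to wait for background writeback.
	 */
	return filemap_fdatawrite_range(mapping, start, end);
}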

> When submitting writeback for a folio beyond current
> i_disksize we make sure writepages submits IO for all the folios from
> current i_disksize upwards.

Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.

> When io_end processing happens after completed
> folio writeback, we update i_disksize to min(i_size, end of IO).

Yeah, consider the append writeback case. Assume we append-write folio 2
and folio 3:

       old_idisksize  new_isize
       |             |
     [WWZZ][WWWW][WWWW]
       1  |  2     3
          A

If folio 1 completes writeback first, we update i_disksize to pos A when
its writeback completes. If folio 2 or 3 completes first, we should wait
(e.g. by calling filemap_fdatawait_range_keep_errors() or something
similar) for folio 1 to complete and then update i_disksize to new_isize.
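
In other words, the io_end-side update I imagine looks roughly like this
(ext4_update_disksize_on_ioend() is a made-up name; locking, journalling
and error handling are all omitted, so this is only a sketch of the idea):

/*
 * Hypothetical sketch of the deferred i_disksize update discussed above.
 * ioend_offset/ioend_size describe the file range the just-completed IO
 * covered.
 */
static void ext4_update_disksize_on_ioend(struct inode *inode,
					  loff_t ioend_offset,
					  loff_t ioend_size)
{
	loff_t old_disksize = EXT4_I(inode)->i_disksize;
	loff_t ioend_end = ioend_offset + ioend_size;
	loff_t new_disksize = min(i_size_read(inode), ioend_end);

	if (new_disksize <= old_disksize)
		return;

	/*
	 * If this IO starts beyond the current i_disksize (folio 2 or 3
	 * completing before folio 1 in the diagram above), wait for the
	 * gap [old_disksize, ioend_offset) to finish writeback first, so
	 * that we never expose unwritten data below the new i_disksize.
	 */
	if (ioend_offset > old_disksize)
		filemap_fdatawait_range_keep_errors(inode->i_mapping,
						    old_disksize,
						    ioend_offset - 1);

	/* Journal the i_disksize update; transaction handling omitted. */
	EXT4_I(inode)->i_disksize = new_disksize;
}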

But in the case of truncate up, we will only write back this zeroed folio.
If the new i_size exceeds the end of this folio, how should we update
i_disksize to the correct value?

For example, we truncate the file up from old_idisksize to new_isize, but
we only zero and write back folio 1. In the end_io processing of folio 1,
we can only update i_disksize to A; we can never update it to new_isize.
Am I missing something?

       old_idisksize new_isize
       |             |
     [WWZZ]...hole ...
       1  |
          A
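
Concretely, with the min(i_size, end of IO) rule above, completion of
folio 1 would compute (illustrative numbers only, assuming 4K blocks and a
16K folio 1 starting at offset 0):

	loff_t A = 16384;                        /* end of folio 1 */
	loff_t new_isize = 65536;                /* i_size after truncate up */
	loff_t new_disksize = min(new_isize, A); /* == A, never new_isize */

so i_disksize would be stuck at A unless the truncate path updates it by
some other means.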

> This
> should take care of non-zero data exposure issues, and with the "delay
> map" processing Baokun works on, all the inode metadata updates will
> happen after IO completion anyway, so it will be nicely batched up in
> one transaction.

Currently, my iomap conversion implementation always enables
dioread_nolock, so I feel that this solution can be achieved even without
the "delay map" feature. Once we have "delay map", we can extend this to
the buffer_head path.

Thanks,
Yi.

> It's a big change but so far I think it should work. What do you think?
> 
> 								Honza

