Message-ID: <ac1f8bd8-926e-4182-a5a3-a111b49ecafc@huaweicloud.com>
Date: Tue, 10 Feb 2026 20:02:51 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org,
 djwong@...nel.org, yizhang089@...il.com, libaokun1@...wei.com,
 yangerkun@...wei.com, yukuai@...as.com, Zhang Yi <yi.zhang@...weicloud.com>
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
 truncating down

On 2/9/2026 4:28 PM, Zhang Yi wrote:
> On 2/6/2026 11:35 PM, Jan Kara wrote:
>> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>>> So how about the following:
>>>
>>> Let me see, please correct me if my understanding is wrong, and there are
>>> also some points I don't get.
>>>
>>>> We expand our io_end processing with the
>>>> ability to journal i_disksize updates after page writeback completes. Then

While extending the buffer_head end_io path to support updating i_disksize,
I found another problem that needs discussion.

Supporting updates to i_disksize in end_io requires starting a handle, which
conflicts with data=ordered mode: folios that are written back by the journal
commit process must not start any handles, otherwise we may deadlock. This
limitation does not affect the iomap path, as it does not use data=ordered
mode at all. However, in the buffer_head path, online defragmentation (which,
if this change works out, should be the last user) still uses data=ordered
mode.
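
To illustrate the conflict, the io_end side would need to do roughly the
following (just a sketch to show where the handle comes in; the helper name
is made up and this is not what the actual patch looks like):

	static void ext4_end_io_update_disksize(struct inode *inode, loff_t end)
	{
		loff_t new_disksize = min(i_size_read(inode), end);
		handle_t *handle;

		if (new_disksize <= EXT4_I(inode)->i_disksize)
			return;

		/*
		 * Starting a handle here is the problem: it may block
		 * waiting for the running transaction to commit, while in
		 * data=ordered mode that commit is itself waiting for this
		 * very writeback to complete.
		 */
		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
		if (IS_ERR(handle))
			return;

		down_write(&EXT4_I(inode)->i_data_sem);
		if (new_disksize > EXT4_I(inode)->i_disksize)
			EXT4_I(inode)->i_disksize = new_disksize;
		up_write(&EXT4_I(inode)->i_data_sem);

		ext4_mark_inode_dirty(handle, inode);
		ext4_journal_stop(handle);
	}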

For example, during online defragmentation, after the partial EOF block is
copied and swapped, the process that submits the transaction can race with a
concurrent truncate-up operation. When the journal commit process then writes
back this block, i_disksize needs to be updated after the I/O completes, and
starting a new handle at that point may deadlock against the running commit.
Conversely, if we truncate up first and perform the EOF block swap right
after it, the same problem occurs.

Even performing synchronous writeback for the EOF block in mext_move_extent()
won't help, because swapped blocks that have already entered the ordered list
could become EOF blocks at any time before the transaction is committed (e.g.,
when a concurrent truncate down happens).

Therefore, I am thinking that, for now, we may have to keep the buffer_head
path as it is and only change the truncate up and append write handling for
the iomap path. What do you think?

Cheers,
Yi.

>>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>>> update i_size.
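
(Restating this part as a rough sketch, just to make sure I read the
proposal correctly; the helper name below is made up:)

	/*
	 * Truncate up / append write: zero the tail of the folio that
	 * straddles the old i_disksize, dirty it and only move i_size;
	 * i_disksize stays at the old value until this folio has been
	 * written back. (Ignores the folio-aligned case and assumes the
	 * folio is uptodate; if it is not cached there is nothing to zero.)
	 */
	static void ext4_zero_eof_folio_tail(struct inode *inode, loff_t newsize)
	{
		loff_t disksize = EXT4_I(inode)->i_disksize;
		struct folio *folio;
		size_t offset;

		folio = filemap_lock_folio(inode->i_mapping,
					   disksize >> PAGE_SHIFT);
		if (!IS_ERR(folio)) {
			offset = offset_in_folio(folio, disksize);
			folio_zero_range(folio, offset,
					 folio_size(folio) - offset);
			folio_mark_dirty(folio);
			folio_unlock(folio);
			folio_put(folio);
		}
		i_size_write(inode, newsize);
	}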
>>>
>>> I think we need to submit this zeroed folio here as well. Because,
>>>
>>> 1) In the case of truncate up, if we don't submit it, the i_disksize update
>>>    may have to wait a long time (until the folio writeback completes, which
>>>    takes about 30 seconds by default), which is too long.
>>
>> Correct but I'm not sure it matters. Current delalloc writes behave in the
>> same way already. For simplicity I'd thus prefer to not treat truncate up
>> in a special way, but if we decide this is indeed desirable, we can either
>> submit the tail folio immediately, or schedule work with earlier writeback.
>>
>>> 2) In the case of appending writes, assume that a folio beyond this one is
>>>    written back first; we have to wait for this zeroed folio to be written
>>>    back before updating i_disksize, so we can't wait too long either.
>>
>> Correct, update of i_disksize after writeback of folios beyond current
>> i_disksize is blocked by the writeback of the tail folio.
>>
>>>> When submitting writeback for a folio beyond current
>>>> i_disksize we make sure writepages submits IO for all the folios from
>>>> current i_disksize upwards.
>>>
>>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
>>
>> I was worried about a case like:
>>
>> We have 4k blocksize, file is i_disksize 2k. Now you do:
>> pwrite(file, buf, 1, 6k);
>> pwrite(file, buf, 1, 10k);
>> pwrite(file, buf, 1, 14k);
>>
>> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
>> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
>> for us to safely advance i_disksize to 14k+1, I thought all the folios (and
>> zeroed out tails) need to be written out. But that's actually not the case.
>> We need to make sure the zeroed tail is written out only if the underlying
>> block is already allocated and marked as written at the time of zeroing.
>> And the blocks underlying intermediate i_size values will never be allocated
>> and written without advancing i_disksize to them. So I think you're
>> correct, we always have at most one tail folio - the one surrounding
>> current i_disksize - which needs to be written out to safely advance
>> i_disksize, and we don't care about the folios in between.
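
In other words, before advancing i_disksize past the current tail folio,
waiting for that single folio should be enough, e.g. something like the
following (sketch only, the helper name is made up):

	/* Wait for writeback of the folio straddling the current i_disksize. */
	static int ext4_wait_eof_folio(struct inode *inode)
	{
		loff_t start = round_down(EXT4_I(inode)->i_disksize, PAGE_SIZE);

		return filemap_fdatawait_range_keep_errors(inode->i_mapping,
						start, start + PAGE_SIZE - 1);
	}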
>>
>>>> When io_end processing happens after completed
>>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>>
>>> Yeah, in the case of append writeback. Assume we append-write folio 2
>>> and folio 3,
>>>
>>>        old_idisksize  new_isize
>>>        |             |
>>>      [WWZZ][WWWW][WWWW]
>>>        1  |  2     3
>>>           A
>>>
>>> If folio 1 completes writeback first, we update i_disksize to pos A when its
>>> writeback is complete. If folio 2 or 3 completes first, we should wait (e.g.
>>> by calling filemap_fdatawait_range_keep_errors() or something similar) for
>>> folio 1 to complete and then update i_disksize to new_isize.
>>>
>>> But in the case of truncate up, we will only write back this zeroed folio. If
>>> the new i_size exceeds the end of this folio, how should we update i_disksize
>>> to the correct value?
>>>
>>> For example, we truncate the file from old_idisksize to new_isize, but we
>>> only zero and write back folio 1. In the end_io processing of folio 1, we can
>>> only update i_disksize to A, but we can never update it to new_isize. Am
>>> I missing something?
>>>
>>>        old_idisksize new_isize
>>>        |             |
>>>      [WWZZ]...hole ...
>>>        1  |
>>>           A
>>
>> Good question. Based on the analysis above, one option would be to set up
>> writeback of the page straddling the current i_disksize to update i_disksize
>> to the current i_size on completion. That would be simple but would have the
>> unpleasant side effect that, in case of a crash after an append write, we
>> could see an increased i_disksize but zeros instead of the written data.
>> Another option would be to update i_disksize on completion to the beginning
>> of the first dirty folio behind the written-back range, or to i_size if
>> there's no such folio. This would still be relatively simple and would mostly
>> deal with the "zeros instead of data" problem.
> 
> Ha, good idea! I think it should work. I will try the second option, thank
> you a lot for this suggestion. :)
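
For the completion side of the second option, I am thinking of something
roughly like the following (the helper name is made up and corner cases
such as large folios straddling the written-back range are ignored):

	/*
	 * After writeback of the range ending at 'end' completes, return
	 * the new i_disksize: the start of the first dirty folio behind
	 * the written-back range, or i_size if there is no such folio.
	 * The caller only advances i_disksize if this is larger than the
	 * current value.
	 */
	static loff_t ext4_end_io_new_disksize(struct inode *inode, loff_t end)
	{
		loff_t new_disksize = i_size_read(inode);
		pgoff_t index = (end + 1) >> PAGE_SHIFT;
		struct folio_batch fbatch;

		folio_batch_init(&fbatch);
		if (filemap_get_folios_tag(inode->i_mapping, &index,
					   (pgoff_t)-1, PAGECACHE_TAG_DIRTY,
					   &fbatch)) {
			new_disksize = min(new_disksize,
					   folio_pos(fbatch.folios[0]));
			folio_batch_release(&fbatch);
		}
		return new_disksize;
	}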
> 
>>
>>>> This
>>>> should take care of non-zero data exposure issues and with "delay map"
>>>> processing Baokun works on all the inode metadata updates will happen after
>>>> IO completion anyway so it will be nicely batched up in one transaction.
>>>
>>> Currently, my iomap conversion implementation always enables dioread_nolock,
>>
>> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
>> conversion to simplify matters a bit. I don't think it's seriously used
>> anywhere anymore.
>>
> 
> Sure. After removing the no-dioread_nolock paths, the behavior of the
> buffer_head path (extents-based and no-journal data mode) and the iomap path
> in append write and truncate operations can be made consistent.
> 
> Cheers,
> Yi.
> 
>>> so I feel that this solution can be achieved even without the "delay map"
>>> feature. After we have the "delay map", we can extend this to the
>>> buffer_head path.
>>
>> I agree, delay map is not necessary for this to work. But it will likely make
>> things faster.
>>
>> 								Honza
> 

