linux-kernel - Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <665b8293-60a2-4d4d-aef5-cb1f9c3c0c13@huaweicloud.com>
Date: Mon, 9 Feb 2026 16:28:06 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
 ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org,
 djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com,
 libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
 truncating down

On 2/6/2026 11:35 PM, Jan Kara wrote:
> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>> So how about the following:
>>
>> Let me see, please correct me if my understanding is wrong, ana there are
>> also some points I don't get.
>>
>>> We expand our io_end processing with the
>>> ability to journal i_disksize updates after page writeback completes. Then
>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>> update i_size.
>>
>> I think we need to submit this zeroed folio here as well. Because,
>>
>> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>>    wait a long time (until the folio writeback is complete, which takes about
>>    30 seconds by default) before being updated, which is too long.
> 
> Correct but I'm not sure it matters. Current delalloc writes behave in the
> same way already. For simplicity I'd thus prefer to not treat truncate up
> in a special way but if we decide this indeed desirable, we can either
> submit the tail folio immediately, or schedule work with earlier writeback.
> 
>> 2) In the case of appending writes. Assume that the folio written beyond this
>>    one is written back first, we have to wait this zeroed folio to be write
>>    back and then update i_disksize, so we can't wait too long either.
> 
> Correct, update of i_disksize after writeback of folios beyond current
> i_disksize is blocked by the writeback of the tail folio.
> 
>>> When submitting writeback for a folio beyond current
>>> i_disksize we make sure writepages submits IO for all the folios from
>>> current i_disksize upwards.
>>
>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
> 
> I was worried about a case like:
> 
> We have 4k blocksize, file is i_disksize 2k. Now you do:
> pwrite(file, buf, 1, 6k);
> pwrite(file, buf, 1, 10k);
> pwrite(file, buf, 1, 14k);
> 
> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
> for us to safely advance i_disksize to 14k+1, I though all the folios (and
> zeroed out tails) need to be written out. But that's actually not the case.
> We need to make sure the zeroed tail is written out only if the underlying
> block is already allocated and marked as written at the time of zeroing.
> And the blocks underlying intermediate i_size values will never be allocated
> and written without advancing i_disksize to them. So I think you're
> correct, we always have at most one tail folio - the one surrounding
> current i_disksize - which needs to be written out to safely advance
> i_disksize and we don't care about folios inbetween.
> 
>>> When io_end processing happens after completed
>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>
>> Yeah, in the case of append write back. Assume we append write the folio 2
>> and folio 3,
>>
>>        old_idisksize  new_isize
>>        |             |
>>      [WWZZ][WWWW][WWWW]
>>        1  |  2     3
>>           A
>>
>> Assume that folio 1 first completes the writeback, then we update i_disksize
>> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
>> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
>> something like) folio 1 to complete and then update i_disksize to new_isize.
>>
>> But in the case of truncate up, We will only write back this zeroed folio. If
>> the new i_size exceeds the end of this folio, how should we update i_disksize
>> to the correct value?
>>
>> For example, we truncate the file from old old_idisksize to new_isize, but we
>> only zero and writeback folio 1, in the end_io processing of folio 1, we can
>> only update the i_disksize to A, but we can never update it to new_isize. Am
>> I missing something ?
>>
>>        old_idisksize new_isize
>>        |             |
>>      [WWZZ]...hole ...
>>        1  |
>>           A
> 
> Good question. Based on the analysis above one option would be to setup
> writeback of page straddling current i_disksize to update i_disksize to
> current i_size on completion. That would be simple but would have an
> unpleasant side effect that in case of a crash after append write we could
> see increased i_disksize but zeros instead of written data. Another option
> would be to update i_disksize on completion to the beginning of the first
> dirty folio behind the written back range or i_size of there's not such
> folio. This would still be relatively simple and mostly deal with "zeros
> instead of data" problem.

Ha, good idea! I think it should work. I will try the second option, thank
you a lot for this suggestion. :)

> 
>>> This
>>> should take care of non-zero data exposure issues and with "delay map"
>>> processing Baokun works on all the inode metadata updates will happen after
>>> IO completion anyway so it will be nicely batched up in one transaction.
>>
>> Currently, my iomap convert implementation always enables dioread_nolock,
> 
> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
> conversion to simplify matters a bit. I don't think it's seriously used
> anywhere anymore.
> 

Sure. After removing the no-dioread_nolock paths, the behavior of the
buffer_head path (extents-based and no-journal data mode) and the iomap path
in append write and truncate operations can be made consistent.

Cheers,
Yi.

>> so I feel that this solution can be achieved even without the "delay map"
>> feature. After we have the "delay map", we can extend this to the
>> buffer_head path.
> 
> I agree, delay map is not necessary for this to work. But it will make
> things likely faster.
> 
> 								Honza