[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <665b8293-60a2-4d4d-aef5-cb1f9c3c0c13@huaweicloud.com>
Date: Mon, 9 Feb 2026 16:28:06 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org,
djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com,
libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially block
truncating down
On 2/6/2026 11:35 PM, Jan Kara wrote:
> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>> So how about the following:
>>
>> Let me see, please correct me if my understanding is wrong, ana there are
>> also some points I don't get.
>>
>>> We expand our io_end processing with the
>>> ability to journal i_disksize updates after page writeback completes. Then
>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>> update i_size.
>>
>> I think we need to submit this zeroed folio here as well. Because,
>>
>> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>> wait a long time (until the folio writeback is complete, which takes about
>> 30 seconds by default) before being updated, which is too long.
>
> Correct but I'm not sure it matters. Current delalloc writes behave in the
> same way already. For simplicity I'd thus prefer to not treat truncate up
> in a special way but if we decide this indeed desirable, we can either
> submit the tail folio immediately, or schedule work with earlier writeback.
>
>> 2) In the case of appending writes. Assume that the folio written beyond this
>> one is written back first, we have to wait this zeroed folio to be write
>> back and then update i_disksize, so we can't wait too long either.
>
> Correct, update of i_disksize after writeback of folios beyond current
> i_disksize is blocked by the writeback of the tail folio.
>
>>> When submitting writeback for a folio beyond current
>>> i_disksize we make sure writepages submits IO for all the folios from
>>> current i_disksize upwards.
>>
>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
>
> I was worried about a case like:
>
> We have 4k blocksize, file is i_disksize 2k. Now you do:
> pwrite(file, buf, 1, 6k);
> pwrite(file, buf, 1, 10k);
> pwrite(file, buf, 1, 14k);
>
> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
> for us to safely advance i_disksize to 14k+1, I though all the folios (and
> zeroed out tails) need to be written out. But that's actually not the case.
> We need to make sure the zeroed tail is written out only if the underlying
> block is already allocated and marked as written at the time of zeroing.
> And the blocks underlying intermediate i_size values will never be allocated
> and written without advancing i_disksize to them. So I think you're
> correct, we always have at most one tail folio - the one surrounding
> current i_disksize - which needs to be written out to safely advance
> i_disksize and we don't care about folios inbetween.
>
>>> When io_end processing happens after completed
>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>
>> Yeah, in the case of append write back. Assume we append write the folio 2
>> and folio 3,
>>
>> old_idisksize new_isize
>> | |
>> [WWZZ][WWWW][WWWW]
>> 1 | 2 3
>> A
>>
>> Assume that folio 1 first completes the writeback, then we update i_disksize
>> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
>> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
>> something like) folio 1 to complete and then update i_disksize to new_isize.
>>
>> But in the case of truncate up, We will only write back this zeroed folio. If
>> the new i_size exceeds the end of this folio, how should we update i_disksize
>> to the correct value?
>>
>> For example, we truncate the file from old old_idisksize to new_isize, but we
>> only zero and writeback folio 1, in the end_io processing of folio 1, we can
>> only update the i_disksize to A, but we can never update it to new_isize. Am
>> I missing something ?
>>
>> old_idisksize new_isize
>> | |
>> [WWZZ]...hole ...
>> 1 |
>> A
>
> Good question. Based on the analysis above one option would be to setup
> writeback of page straddling current i_disksize to update i_disksize to
> current i_size on completion. That would be simple but would have an
> unpleasant side effect that in case of a crash after append write we could
> see increased i_disksize but zeros instead of written data. Another option
> would be to update i_disksize on completion to the beginning of the first
> dirty folio behind the written back range or i_size of there's not such
> folio. This would still be relatively simple and mostly deal with "zeros
> instead of data" problem.
Ha, good idea! I think it should work. I will try the second option, thank
you a lot for this suggestion. :)
>
>>> This
>>> should take care of non-zero data exposure issues and with "delay map"
>>> processing Baokun works on all the inode metadata updates will happen after
>>> IO completion anyway so it will be nicely batched up in one transaction.
>>
>> Currently, my iomap convert implementation always enables dioread_nolock,
>
> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
> conversion to simplify matters a bit. I don't think it's seriously used
> anywhere anymore.
>
Sure. After removing the no-dioread_nolock paths, the behavior of the
buffer_head path (extents-based and no-journal data mode) and the iomap path
in append write and truncate operations can be made consistent.
Cheers,
Yi.
>> so I feel that this solution can be achieved even without the "delay map"
>> feature. After we have the "delay map", we can extend this to the
>> buffer_head path.
>
> I agree, delay map is not necessary for this to work. But it will make
> things likely faster.
>
> Honza
Powered by blists - more mailing lists