Message-ID: <yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm>
Date: Fri, 6 Feb 2026 16:35:50 +0100
From: Jan Kara <jack@...e.cz>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, tytso@....edu, 
	adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, ritesh.list@...il.com, hch@...radead.org, 
	djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com, 
	libaokun1@...wei.com, yangerkun@...wei.com, yukuai@...as.com
Subject: Re: [PATCH -next v2 03/22] ext4: only order data when partially
 block truncating down

On Fri 06-02-26 19:09:53, Zhang Yi wrote:
> On 2/5/2026 11:05 PM, Jan Kara wrote:
> > So how about the following:
> 
> Let me see, please correct me if my understanding is wrong, and there are
> also some points I don't get.
> 
> > We expand our io_end processing with the
> > ability to journal i_disksize updates after page writeback completes. Then
> > when doing truncate up or appending writes, we keep i_disksize at the old
> > value and just zero folio tails in the page cache, mark the folio dirty and
> > update i_size.
> 
> I think we need to submit this zeroed folio here as well. Because,
> 
> 1) In the case of truncate up, if we don't submit, the i_disksize update may
>    have to wait a long time (until the folio writeback completes, which takes
>    about 30 seconds by default), which is too long.

Correct, but I'm not sure it matters. Current delalloc writes already behave
in the same way. For simplicity I'd thus prefer not to treat truncate up in a
special way, but if we decide this is indeed desirable, we can either submit
the tail folio immediately or schedule work with earlier writeback.
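If we went for immediate submission, a completely untested sketch (assuming
it lives somewhere in fs/ext4; ext4_flush_eof_folio() is just an illustrative
name) could look like:

static int ext4_flush_eof_folio(struct inode *inode, loff_t old_disksize)
{
	struct address_space *mapping = inode->i_mapping;
	loff_t start = round_down(old_disksize, PAGE_SIZE);

	/*
	 * Kick off writeback of just the folio straddling the old
	 * i_disksize so that io_end completion can move i_disksize
	 * forward soon. We don't wait for it here.
	 */
	return filemap_fdatawrite_range(mapping, start, start + PAGE_SIZE - 1);
}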

> 2) In the case of appending writes. Assume that a folio written beyond this
>    one is written back first; we have to wait for this zeroed folio to be
>    written back before updating i_disksize, so we can't wait too long either.

Correct, update of i_disksize after writeback of folios beyond current
i_disksize is blocked by the writeback of the tail folio.
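
To spell that dependency out in code, a rough sketch (the hook into io_end
processing is hand-waved; ext4_wait_eof_folio_wb() is a made-up name) might
be:

static int ext4_wait_eof_folio_wb(struct inode *inode)
{
	loff_t disksize = EXT4_I(inode)->i_disksize;
	loff_t start = round_down(disksize, PAGE_SIZE);

	/*
	 * Before an io_end completing beyond the current i_disksize may
	 * advance it, wait for writeback of the folio whose tail we
	 * zeroed (the one straddling i_disksize).
	 */
	return filemap_fdatawait_range_keep_errors(inode->i_mapping, start,
						   start + PAGE_SIZE - 1);
}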

> > When submitting writeback for a folio beyond current
> > i_disksize we make sure writepages submits IO for all the folios from
> > current i_disksize upwards.
> 
> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.

I was worried about a case like:

We have 4k blocksize and the file's i_disksize is 2k. Now you do:
pwrite(file, buf, 1, 6k);
pwrite(file, buf, 1, 10k);
pwrite(file, buf, 1, 14k);

The pwrite at offset 6k needs to zero the tail of the folio with index 0, the
pwrite at 10k needs to zero the tail of the folio with index 1, etc. And for
us to safely advance i_disksize to 14k+1, I thought all the folios (and their
zeroed out tails) needed to be written out. But that's actually not the case.
We need to make sure the zeroed tail is written out only if the underlying
block is already allocated and marked as written at the time of zeroing.
And the blocks underlying intermediate i_size values will never be allocated
and written without advancing i_disksize to them. So I think you're
correct, we always have at most one tail folio - the one surrounding the
current i_disksize - which needs to be written out to safely advance
i_disksize, and we don't care about folios in between.
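
The check at zeroing time could then be something like the following
(untested sketch; ext4_zeroed_tail_needs_flush() is a made-up name and a
real version would have to be more careful about block alignment):

static bool ext4_zeroed_tail_needs_flush(struct inode *inode, loff_t disksize)
{
	struct ext4_map_blocks map = {
		.m_lblk = disksize >> inode->i_blkbits,
		.m_len = 1,
	};
	int ret;

	/* Plain lookup, no allocation. */
	ret = ext4_map_blocks(NULL, inode, &map, 0);

	/*
	 * The zeroed tail only needs to reach disk before i_disksize may
	 * advance past it if the underlying block is already allocated
	 * and marked written at the time of zeroing.
	 */
	return ret > 0 && (map.m_flags & EXT4_MAP_MAPPED);
}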

> > When io_end processing happens after completed
> > folio writeback, we update i_disksize to min(i_size, end of IO).
> 
> Yeah, in the case of append writeback. Assume we append-write folio 2
> and folio 3,
> 
>        old_idisksize  new_isize
>        |             |
>      [WWZZ][WWWW][WWWW]
>        1  |  2     3
>           A
> 
> Assume that folio 1 completes writeback first; then we update i_disksize to
> pos A when its writeback is complete. If folio 2 or 3 completes first, we
> should wait (e.g. call filemap_fdatawait_range_keep_errors() or something
> like that) for folio 1 to complete and then update i_disksize to new_isize.
> 
> But in the case of truncate up, we will only write back this zeroed folio. If
> the new i_size exceeds the end of this folio, how should we update i_disksize
> to the correct value?
> 
> For example, we truncate the file up from old_idisksize to new_isize, but we
> only zero and write back folio 1. In the end_io processing of folio 1, we can
> only update i_disksize to A, but we can never update it to new_isize. Am I
> missing something?
> 
>        old_idisksize new_isize
>        |             |
>      [WWZZ]...hole ...
>        1  |
>           A

Good question. Based on the analysis above, one option would be to set up
writeback of the page straddling the current i_disksize so that it updates
i_disksize to the current i_size on completion. That would be simple but
would have an unpleasant side effect: in case of a crash after an append
write we could see an increased i_disksize but zeros instead of the written
data. Another option would be to update i_disksize on completion to the
beginning of the first dirty folio behind the written back range, or to
i_size if there's no such folio. This would still be relatively simple and
mostly deal with the "zeros instead of data" problem.

> > This
> > should take care of non-zero data exposure issues, and with the "delay map"
> > processing Baokun is working on, all the inode metadata updates will happen
> > after IO completion anyway, so it will be nicely batched up in one
> > transaction.
> 
> Currently, my iomap conversion implementation always enables dioread_nolock,

Yes, BTW I think you could remove the no-dioread_nolock paths before doing
the conversion to simplify matters a bit. I don't think that mode is
seriously used anywhere anymore.

> so I feel that this solution can be achieved even without the "delay map"
> feature. After we have the "delay map", we can extend this to the
> buffer_head path.

I agree, delay map is not necessary for this to work. But it will likely
make things faster.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
