Message-Id: <DDB4CC13-C509-478E-81C3-F37240016A69@dilger.ca>
Date: Fri, 28 Nov 2025 12:52:30 -0700
From: Andreas Dilger <adilger@...ger.ca>
To: Jan Kara <jack@...e.cz>
Cc: Ojaswin Mujoo <ojaswin@...ux.ibm.com>,
 Zhang Yi <yi.zhang@...weicloud.com>,
 linux-ext4@...r.kernel.org,
 linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org,
 tytso@....edu,
 yi.zhang@...wei.com,
 yizhang089@...il.com,
 libaokun1@...wei.com,
 yangerkun@...wei.com
Subject: Re: [PATCH v2 03/13] ext4: don't zero the entire extent if
 EXT4_EXT_DATA_PARTIAL_VALID1



> On Nov 28, 2025, at 4:14 AM, Jan Kara <jack@...e.cz> wrote:
> 
> On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
>> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
>>> Good catch on the data exposure issue! First I'd like to discuss whether
>>> there isn't a way to fix these problems in a way that doesn't make the
>>> already complex code even more complex. My observation is that
>>> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
>>> in ext4_split_convert_extents(), which both call ext4_split_extent(). The
>>> actual extent zeroing happens in ext4_split_extent_at() and in
>>> ext4_ext_convert_to_initialized(). I think the code would be much clearer
>>> if we just centralized all the zeroing in ext4_split_extent(). At that
>>> point the situation is actually pretty simple:
>> 
>> This is exactly what I was playing with in my local tree to refactor this
>> particular part of the code :). I agree that ext4_split_extent() is a much
>> better place to do the zeroout and it looks much cleaner, but I also agree
>> with Yi that it might be better to do it after fixing the stale data
>> exposures so that backports are straightforward.
>> 
>> Am I correct in understanding that you are suggesting to zero out
>> proactively if we are below max_zeroout, before even trying to split the
>> extent (which seems to be done in ext4_ext_convert_to_initialized() as
>> well)?
> 
> Yes. I was suggesting to effectively keep the behavior from
> ext4_ext_convert_to_initialized().
> 
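For context, the behavior being kept is roughly the following (a
simplified sketch of the logic in ext4_ext_convert_to_initialized(),
not the verbatim kernel code; error handling and the partial-zeroout
cases are elided):

	/* s_extent_max_zeroout_kb is in KiB; convert it to fs blocks */
	max_zeroout = sbi->s_extent_max_zeroout_kb >>
			(inode->i_sb->s_blocksize_bits - 10);

	if (max_zeroout && ee_len <= max_zeroout) {
		/* Small extent: write zeros and mark the whole extent
		 * initialized, avoiding the extent split entirely. */
		err = ext4_ext_zeroout(inode, ex);
		if (!err)
			ext4_ext_mark_initialized(ex);
	}
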
>> In this case, I have 2 concerns:
>> 
>>> 
>>> 1) 'ex' is unwritten, 'map' describes a part with already-written data
>>> which we want to convert to initialized (generally the IO completion
>>> situation) => we can zero out the boundaries if they are smaller than
>>> max_zeroout or if the extent split fails.
>> 
>> Firstly, I know you mentioned in another email that zeroout of small
>> ranges gives us a performance win, but is it really faster on average
>> than extent manipulation?
> 
> I guess it depends on the storage and the details of the extent tree. But
> it definitely does help in cases like when you have a large unwritten
> extent and then start randomly writing 4k blocks into it, because this
> zeroout logic effectively limits the fragmentation of the extent tree.
> Overall, sequentially writing a few extra blocks of zeros is very cheap on
> practically any storage, while fragmenting the extent tree becomes
> expensive rather quickly (you generally get a deeper extent tree due to
> smaller extents, etc.).

The benefit of the zeroout logic is not primarily about extent tree
complexity.  I agree with Ojaswin that in the common case the extent split
would not cause a new index block to be written, though it can become
unwieldy in the extreme case.

As Jan wrote, the main performance win is to avoid writing a bunch of
small discontiguous blocks.  For HDD *and* flash, the overhead of writing
several separate small blocks is much higher than writing a single 32KiB
or 64KiB block to the storage.  Multiple separate blocks mean more items
in the queue and submitted to storage, separate seeks on an HDD and/or
read-modify-write on a RAID controller, or more erase blocks touched on
a flash device.

Leaving the blocks unwritten also just defers the conversion of those
unwritten extents to a later time, when they would need to be processed
again anyway once the blocks are written.

I was also considering whether the unwritten blocks would save on reads,
but I suspect that is not the case either.  A sequential read would need
to submit multiple small reads to the device and then zero-fill the
unwritten blocks, instead of issuing a single 32KiB or 64KiB read (which
is basically free once the request is processed).

> 
>> For example, for case 1 where both zeroout and splitting need
>> journalling, I understand that splitting has high journal overhead in the
>> worst case, where the tree might grow, but more often than not we would
>> be manipulating within the same leaf and so journalling only 1 bh (the
>> same as zeroout). In that case it seems like zeroout might be slower no
>> matter how fast the IO can be done. So proactive zeroout might be more
>> beneficial for case 3 than for case 1.
> 
> I agree that initially, while the split extents still fit into the same
> leaf block, zeroout is likely to be somewhat slower, but over the longer
> term the gains from less extent fragmentation win.

I doubt that writing a single contiguous 32KiB chunk is ever going to be
slower than writing 2 or 3 separate 4KiB chunks to the storage.  _Maybe_
if it were NVRAM, but I don't think that would be the common case?
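
As a rough back-of-the-envelope illustration (the numbers here are made
up for the example, not measured): on an HDD with an ~8ms average seek,
three discontiguous 4KiB writes cost on the order of 24ms of head
movement, while a single contiguous 32KiB write at ~150MB/s of
sequential bandwidth finishes in roughly 0.2ms after one seek.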

>>> 2) 'ex' is unwritten, 'map' describes a part we are preparing for write
>>> (IO submission) => the split is opportunistic here; if we cannot split
>>> due to ENOSPC, just go on and deal with it at IO completion time. No
>>> zeroing needed.
>>> 
>>> 3) 'ex' is written, 'map' describes a part that should be converted to
>>> unwritten => we can zero out the 'map' part if it is smaller than
>>> max_zeroout or if the extent split fails.
>> 
>> Proactive zeroout before trying the split does seem beneficial to help
>> us avoid the journal overhead of the split. However, judging from
>> ext4_ext_convert_to_initialized(), the max zeroout comes from
>> sbi->s_extent_max_zeroout_kb, which is hardcoded to 32 irrespective of
>> the IO device, so there's a chance a zeroout might be pretty slow if,
>> say, we are doing it on a device that doesn't support accelerated
>> zeroout operations. Maybe we need to be more intelligent in setting
>> s_extent_max_zeroout_kb?
> 
> You can also tune the value in sysfs. I'm not 100% sure how the kernel
> could make a better guess. Also, I think 32k works mostly because it is
> small enough to be cheap to write but already large enough to noticeably
> reduce fragmentation for some pathological workloads (you can easily end
> up with 1/4 as many extents as without this logic). But I'm open to ideas
> if you have some.

Aligning this size with the flash erase block size might be a win?
It may be that 32KiB is still large enough today (I've heard of 16KiB
sector flash devices arriving soon, and IIRC 64KiB sectors are the
norm for HDDs, if anyone still cares).  Having this tuned automatically
from the physical device characteristics (like max(32KiB, sector size),
or similar if the flash erase block size is available somehow in the
kernel) would future-proof this as device block sizes continue to grow.
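
A minimal sketch of what that could look like at mount time (purely
hypothetical; ext4 does not do this today, and there is no generic
interface for querying the erase block size, so this only uses the
physical block size that the block layer already exposes):

	/* Hypothetical tuning in ext4_fill_super(): keep the current
	 * 32KiB floor, but grow it to the device's physical block size
	 * so a zeroout always covers at least one device sector. */
	unsigned int pbs_kb = bdev_physical_block_size(sb->s_bdev) >> 10;

	sbi->s_extent_max_zeroout_kb = max_t(unsigned int, 32, pbs_kb);

The auto-tuned value would still remain adjustable through the existing
sysfs knob.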


Cheers, Andreas