Message-ID: <aSs-YExXqLQ0k49M@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Sun, 30 Nov 2025 00:11:36 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Andreas Dilger <adilger@...ger.ca>
Cc: Jan Kara <jack@...e.cz>, Zhang Yi <yi.zhang@...weicloud.com>,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, tytso@....edu, yi.zhang@...wei.com,
        yizhang089@...il.com, libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 03/13] ext4: don't zero the entire extent if
 EXT4_EXT_DATA_PARTIAL_VALID1

On Fri, Nov 28, 2025 at 12:52:30PM -0700, Andreas Dilger wrote:
> 
> 
> > On Nov 28, 2025, at 4:14 AM, Jan Kara <jack@...e.cz> wrote:
> > 
> > On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> >> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> >>> Good catch on the data exposure issue! First I'd like to discuss whether
> >>> there isn't a way to fix these problems in a way that doesn't make the
> >>> already complex code even more complex. My observation is that
> >>> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> >>> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> >>> actual extent zeroing happens in ext4_split_extent_at() and in
> >>> ext4_ext_convert_to_initialized(). I think the code would be much clearer
> >>> if we just centralized all the zeroing in ext4_split_extent(). At that
> >>> place the situation is actually pretty simple:
> >> 
> >> This is exactly what I was playing with in my local tree to refactor this
> >> particular part of the code :). I agree that ext4_split_extent() is a much
> >> better place to do the zeroout and it looks much cleaner, but I agree
> >> with Yi that it might be better to do it after fixing the stale
> >> exposures so that backports are straightforward.
> >> 
> >> Am I correct in understanding that you are suggesting to zero out
> >> proactively if we are below max_zeroout, before even trying an extent
> >> split (which seems to be done in ext4_ext_convert_to_initialized() as well)?
> > 
> > Yes. I was suggesting to effectively keep the behavior from
> > ext4_ext_convert_to_initialized().
> > 
> >> In this case, I have 2 concerns:
> >> 
> >>> 
> >>> 1) 'ex' is unwritten, 'map' describes part with already written data which
> >>> we want to convert to initialized (generally IO completion situation) => we
> >>> can zero out boundaries if they are smaller than max_zeroout or if extent
> >>> split fails.
> >> 
> >> Firstly, I know you mentioned in another email that zeroout of small ranges
> >> gives us a performance win but is it really faster on average than
> >> extent manipulation?
> > 
> > I guess it depends on the storage and the details of the extent tree. But
> > it definitely does help in cases like when you have a large unwritten
> > extent and then start randomly writing 4k blocks into it, because this
> > zeroout logic effectively limits the fragmentation of the extent tree.
> > Overall, sequentially writing a few more blocks of zeros is very cheap
> > with practically any storage, while fragmenting the extent tree becomes
> > expensive rather quickly (you generally get a deeper extent tree due to
> > smaller extents etc.).
> 
> The benefit of the zeroout logic is not primarily about extent tree complexity.
> I agree with Ojaswin that in the common case the extent split would not
> cause a new index block to be written, though it can become unwieldy in the
> extreme case.
> 
> As Jan wrote, the main performance win is to avoid writing a bunch of
> small discontiguous blocks.  For HDD *and* flash, the overhead of writing
> several separate small blocks is much higher than writing a single 32KiB
> or 64KiB block to the storage.  Multiple separate blocks mean more items
> in the queue and submitted to storage, separate seeks on an HDD and/or read-
> modify-write on a RAID controller, or erase blocks on a flash device.

Okay, that makes sense, so basically zeroout helps in the long run by
reducing fragmentation.
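
Just to convince myself with rough numbers: IIUC, n isolated 4k writes
into one large unwritten extent would otherwise carve it into roughly
2n+1 extents (n written pieces plus the unwritten gaps between them),
whereas with the 32KiB zeroout limit, if the unwritten remainder around
the written range is small it just gets zeroed and the whole extent
becomes initialized instead of being split, so nearby writes stop
multiplying the extent count.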

> 
> It also defers the conversion of those unwritten extents to a later time,
> when they would need to be processed again anyway if the blocks were written.
> 
> I was also considering whether the unwritten blocks would save on reads,
> but I suspect that would not be the case either.  Doing sequential reads
> would need to submit multiple small reads to the device and then zero out
> the unwritten blocks, instead of a single 32KiB or 64KiB read (which is
> basically free once the request is processed).
> 
> > 
> >> For example, for case 1, where both zeroout and splitting need
> >> journalling, I understand that splitting has a high journal overhead in
> >> the worst case, where the tree might grow, but more often than not we
> >> would be manipulating within the same leaf, so journalling only 1 bh
> >> (same as zeroout). In that case it seems like zeroout might be slower no
> >> matter how fast the IO can be done. So proactive zeroout might be more
> >> beneficial for case 3 than for case 1.
> > 
> > I agree that initially, while the split extents still fit into the same leaf
> > block, zeroout is likely to be somewhat slower, but over the longer term
> > the gains from less extent fragmentation win.
> 
> I doubt that writing a single contiguous 32KiB chunk is ever going to be
> slower than writing 2 or 3 separate 4KiB chunks to the storage.  _Maybe_
> if it was NVRAM, but I don't think that would be the common case?
> 
> >>> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> >>> submission) => the split is opportunistic here, if we cannot split due to
> >>> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> >>> needed.
> >>> 
> >>> 3) 'ex' is written, 'map' describes part that should be converted to
> >>> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> >>> if extent split fails.
> >> 
> >> Proactive zeroout before trying a split does seem beneficial to help us
> >> avoid the journal overhead of the split. However, judging from
> >> ext4_ext_convert_to_initialized(), the max zeroout comes from
> >> sbi->s_extent_max_zeroout_kb, which is hardcoded to 32 irrespective of
> >> the IO device, so that means there's a chance a zeroout might be pretty
> >> slow if, say, we are doing it on a device that doesn't support
> >> accelerated zeroout operations. Maybe we need to be more intelligent in
> >> setting s_extent_max_zeroout_kb?
> > 
> > You can also tune the value in sysfs. I'm not 100% sure how the kernel
> > could do a better guess. Also I think 32k works mostly because it is small
> > enough to be cheap to write but already large enough to noticeably reduce
> > fragmentation for some pathological workloads (you can easily end up with
> > 1/4 of the extents compared to without this logic). But I'm open to ideas
> > if you have some.
> 
> Aligning this size with the flash erase block size might be a win?
> It may be that 32KiB is still large enough today (I've heard of 16KiB
> sector flash devices arriving soon, and IIRC 64KiB sectors are the
> norm for HDDs if anyone still cares).  Having this tuned automatically
> by the physical device characteristics (like max(32KiB, sector size) or
> similar if the flash erase block size is available somehow in the kernel)
> would future-proof this as device sizes continue to grow.

Okay, so do you think it makes sense to consider the
max_write_zeroes_sectors() value of the underlying device? Enterprise
NVMe and SCSI disks can use WRITE_ZEROES or WRITE_SAME, which could
allow us to zero out much larger ranges with minimal data-movement
overhead. Maybe we can leverage these to zero out more aggressively
without much of a performance penalty.
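
Something like the rough sketch below is what I have in mind. This is
completely untested and hand-waved: the helper name and the 1MiB cap are
made up for illustration, and only bdev_write_zeroes_sectors() (from
<linux/blkdev.h>), EXT4_SB() and s_extent_max_zeroout_kb are existing
names:

/*
 * Untested sketch: derive the zeroout limit from the device's
 * WRITE_ZEROES capability at mount time, keeping the current 32KiB
 * default as the floor.
 */
static void ext4_tune_extent_max_zeroout(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        unsigned int wz_sectors = bdev_write_zeroes_sectors(sb->s_bdev);

        /* Today's hardcoded default stays as the fallback. */
        sbi->s_extent_max_zeroout_kb = 32;

        /*
         * If the device offloads zeroing (WRITE_ZEROES / WRITE_SAME),
         * zeroing a larger range costs little actual data movement, so
         * allow more, capped at an arbitrary 1MiB so we don't zero out
         * huge holes for no reason. wz_sectors is in 512-byte sectors,
         * so >> 1 converts it to KiB.
         */
        if (wz_sectors)
                sbi->s_extent_max_zeroout_kb =
                        clamp_t(unsigned int, wz_sectors >> 1, 32, 1024);
}

The existing sysfs knob (extent_max_zeroout_kb) would of course remain
available as an override for whatever we compute here.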

Let me see if I can find a disk and try a POC for this.

Regards,
ojaswin

> 
> 
> Cheers, Andreas
