[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <i3voptrv4rm3q3by7gksrgmgy2n5flchuveugjll5cchustm4z@qvixahynpize>
Date: Fri, 28 Nov 2025 12:14:56 +0100
From: Jan Kara <jack@...e.cz>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Cc: Jan Kara <jack@...e.cz>, Zhang Yi <yi.zhang@...weicloud.com>,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
tytso@....edu, adilger.kernel@...ger.ca, yi.zhang@...wei.com,
yizhang089@...il.com, libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 03/13] ext4: don't zero the entire extent if
EXT4_EXT_DATA_PARTIAL_VALID1
On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> > Good catch on the data exposure issue! First I'd like to discuss whether
> > there isn't a way to fix these problems in a way that doesn't make the
> > already complex code even more complex. My observation is that
> > EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> > in ext4_split_convert_extents() which both call ext4_split_extent(). The
> > actual extent zeroing happens in ext4_split_extent_at() and in
> > ext4_ext_convert_to_initialized(). I think the code would be much clearer
> > if we just centralized all the zeroing in ext4_split_extent(). At that
> > place the situation is actually pretty simple:
>
> This is exactly what I was playing with in my local tree to refactor this
> particular part of code :). I agree that ext4_split_extent() is a much
> better place to do the zeroout and it looks much cleaner but I agree
> with Yi that it might be better to do it after fixing the stale
> exposures so backports are straight forward.
>
> Am I correct in understanding that you are suggesting to zeroout
> proactively if we are below max_zeroout before even trying to extent
> split (which seems be done in ext4_ext_convert_to_initialized() as well)?
Yes. I was suggesting to effectively keep the behavior from
ext4_ext_convert_to_initialized().
> In this case, I have 2 concerns:
>
> >
> > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > we want to convert to initialized (generally IO completion situation) => we
> > can zero out boundaries if they are smaller than max_zeroout or if extent
> > split fails.
>
> Firstly, I know you mentioned in another email that zeroout of small ranges
> gives us a performance win but is it really faster on average than
> extent manipulation?
I guess it depends on the storage and the details of the extent tree. But
it definitely does help in cases like when you have large unwritten extent
and then start writing randomly 4k blocks into it because this zeroout
logic effectively limits the fragmentation of the extent tree. Overall
sequentially writing a few blocks more of zeros is very cheap practically
with any storage while fragmenting the extent tree becomes expensive rather
quickly (you generally get deeper extent tree due to smaller extents etc.).
> For example, for case 1 where both zeroout and splitting need
> journalling, I understand that splitting has high journal overhead in worst case,
> where tree might grow, but more often than not we would be manipulating
> within the same leaf so journalling only 1 bh (same as zeroout). In which case
> seems like zeroout might be slower no matter how fast the IO can be
> done. So proactive zeroout might be for beneficial for case 3 than case
> 1.
I agree that initially while the split extents still fit into the same leaf
block, zero out is likely to be somewhat slower but over the longer term
the gains from less extent fragmentation win.
> > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > submission) => the split is opportunistic here, if we cannot split due to
> > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > needed.
> >
> > 3) 'ex' is written, 'map' describes part that should be converted to
> > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > if extent split fails.
>
> Proactive zeroout before trying split does seem benficial to help us
> avoid journal overhead for split. However, judging from
> ext4_ext_convert_to_initialized(), max zeroout comes from
> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> the IO device, so that means theres a chance a zeroout might be pretty
> slow if say we are doing it on a device than doesn't support accelerated
> zeroout operations. Maybe we need to be more intelligent in setting
> s_extent_max_zeroout_kb?
You can also tune the value in sysfs. I'm not 100% sure how the kernel
could do a better guess. Also I think 32k works mostly because it is small
enough to be cheap to write but already large enough to noticeably reduce
fragmentation for some pathological workloads (you can easily get 1/4 of
the extents than without this logic). But I'm open to ideas if you have
some.
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
Powered by blists - more mailing lists