[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aSmvoGwGKo4bX8J8@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Fri, 28 Nov 2025 19:50:47 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Jan Kara <jack@...e.cz>
Cc: Zhang Yi <yi.zhang@...weicloud.com>, linux-ext4@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
tytso@....edu, adilger.kernel@...ger.ca, yi.zhang@...wei.com,
yizhang089@...il.com, libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 03/13] ext4: don't zero the entire extent if
EXT4_EXT_DATA_PARTIAL_VALID1
On Fri, Nov 28, 2025 at 12:14:56PM +0100, Jan Kara wrote:
> On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> > On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> > > Good catch on the data exposure issue! First I'd like to discuss whether
> > > there isn't a way to fix these problems in a way that doesn't make the
> > > already complex code even more complex. My observation is that
> > > EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> > > in ext4_split_convert_extents() which both call ext4_split_extent(). The
> > > actual extent zeroing happens in ext4_split_extent_at() and in
> > > ext4_ext_convert_to_initialized(). I think the code would be much clearer
> > > if we just centralized all the zeroing in ext4_split_extent(). At that
> > > place the situation is actually pretty simple:
> >
> > This is exactly what I was playing with in my local tree to refactor this
> > particular part of code :). I agree that ext4_split_extent() is a much
> > better place to do the zeroout and it looks much cleaner but I agree
> > with Yi that it might be better to do it after fixing the stale
> > exposures so backports are straight forward.
> >
> > Am I correct in understanding that you are suggesting to zeroout
> > proactively if we are below max_zeroout before even trying to extent
> > split (which seems be done in ext4_ext_convert_to_initialized() as well)?
>
> Yes. I was suggesting to effectively keep the behavior from
> ext4_ext_convert_to_initialized().
>
> > In this case, I have 2 concerns:
> >
> > >
> > > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > > we want to convert to initialized (generally IO completion situation) => we
> > > can zero out boundaries if they are smaller than max_zeroout or if extent
> > > split fails.
> >
> > Firstly, I know you mentioned in another email that zeroout of small ranges
> > gives us a performance win but is it really faster on average than
> > extent manipulation?
>
> I guess it depends on the storage and the details of the extent tree. But
> it definitely does help in cases like when you have large unwritten extent
> and then start writing randomly 4k blocks into it because this zeroout
> logic effectively limits the fragmentation of the extent tree. Overall
> sequentially writing a few blocks more of zeros is very cheap practically
> with any storage while fragmenting the extent tree becomes expensive rather
> quickly (you generally get deeper extent tree due to smaller extents etc.).
Got it, makes sense. This approach is definitely worth a try and then
maybe we can run a few benchmarks and see how proactive zeroout works.
>
> > For example, for case 1 where both zeroout and splitting need
> > journalling, I understand that splitting has high journal overhead in worst case,
> > where tree might grow, but more often than not we would be manipulating
> > within the same leaf so journalling only 1 bh (same as zeroout). In which case
> > seems like zeroout might be slower no matter how fast the IO can be
> > done. So proactive zeroout might be for beneficial for case 3 than case
> > 1.
>
> I agree that initially while the split extents still fit into the same leaf
> block, zero out is likely to be somewhat slower but over the longer term
> the gains from less extent fragmentation win.
>
> > > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > > submission) => the split is opportunistic here, if we cannot split due to
> > > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > > needed.
> > >
> > > 3) 'ex' is written, 'map' describes part that should be converted to
> > > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > > if extent split fails.
> >
> > Proactive zeroout before trying split does seem benficial to help us
> > avoid journal overhead for split. However, judging from
> > ext4_ext_convert_to_initialized(), max zeroout comes from
> > sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> > the IO device, so that means theres a chance a zeroout might be pretty
> > slow if say we are doing it on a device than doesn't support accelerated
> > zeroout operations. Maybe we need to be more intelligent in setting
> > s_extent_max_zeroout_kb?
>
> You can also tune the value in sysfs. I'm not 100% sure how the kernel
Yeah but I feel the average users dont usually tune sysfs values
> could do a better guess. Also I think 32k works mostly because it is small
> enough to be cheap to write but already large enough to noticeably reduce
> fragmentation for some pathological workloads (you can easily get 1/4 of
> the extents than without this logic). But I'm open to ideas if you have
> some.
Yes, Im yet to check in depth about it but the block layer does advertise a
max_write_zeroes_sectors() which might help us tune the value a better,
and perhaps even support higher zeroout without losing performance. I'll
look into it a bit more.
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR
Powered by blists - more mailing lists