linux-ext4 - Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aSlPFohdm8IfB7r7@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Fri, 28 Nov 2025 12:58:22 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Jan Kara <jack@...e.cz>
Cc: Zhang Yi <yi.zhang@...weicloud.com>, linux-ext4@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        tytso@....edu, adilger.kernel@...ger.ca, yi.zhang@...wei.com,
        yizhang089@...il.com, libaokun1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v2 03/13] ext4: don't zero the entire extent if
 EXT4_EXT_DATA_PARTIAL_VALID1

On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> On Fri 21-11-25 14:08:01, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@...wei.com>
> > 
> > When allocating initialized blocks from a large unwritten extent, or
> > when splitting an unwritten extent during end I/O and converting it to
> > initialized, there is currently a potential issue of stale data if the
> > extent needs to be split in the middle.
> > 
> >        0  A      B  N
> >        [UUUUUUUUUUUU]    U: unwritten extent
> >        [--DDDDDDDD--]    D: valid data
> >           |<-  ->| ----> this range needs to be initialized
> > 
> > ext4_split_extent() first try to split this extent at B with
> > EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> > ext4_split_extent_at() failed to split this extent due to temporary lack
> > of space. It zeroout B to N and mark the entire extent from 0 to N
> > as written.
> > 
> >        0  A      B  N
> >        [WWWWWWWWWWWW]    W: written extent
> >        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> > 
> > ext4_split_extent() then try to split this extent at A with
> > EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> > a stale written extent from 0 to A.
> > 
> >        0  A      B   N
> >        [WW|WWWWWWWWWW]
> >        [SS|DDDDDDDDZZ]
> > 
> > Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> > when splitting at B, don't convert the entire extent to written and left
> > it as unwritten after zeroing out B to N. The remaining work is just
> > like the standard two-part split. ext4_split_extent() will pass the
> > EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> > second time, allowing it to properly handle the split. If the split is
> > successful, it will keep extent from 0 to A as unwritten.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@...wei.com>
> 
> Good catch on the data exposure issue! First I'd like to discuss whether
> there isn't a way to fix these problems in a way that doesn't make the
> already complex code even more complex. My observation is that
> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> actual extent zeroing happens in ext4_split_extent_at() and in
> ext4_ext_convert_to_initialized(). I think the code would be much clearer
> if we just centralized all the zeroing in ext4_split_extent(). At that
> place the situation is actually pretty simple:

This is exactly what I was playing with in my local tree to refactor this
particular part of code :). I agree that ext4_split_extent() is a much
better place to do the zeroout and it looks much cleaner but I agree
with Yi that it might be better to do it after fixing the stale
exposures so backports are straight forward. 

Am I correct in understanding that you are suggesting to zeroout
proactively if we are below max_zeroout before even trying to extent
split (which seems be done in ext4_ext_convert_to_initialized() as well)?

In this case, I have 2 concerns:

> 
> 1) 'ex' is unwritten, 'map' describes part with already written data which
> we want to convert to initialized (generally IO completion situation) => we
> can zero out boundaries if they are smaller than max_zeroout or if extent
> split fails.

Firstly, I know you mentioned in another email that zeroout of small ranges
gives us a performance win but is it really faster on average than
extent manipulation?

For example, for case 1 where both zeroout and splitting need
journalling, I understand that splitting has high journal overhead in worst case,
where tree might grow, but more often than not we would be manipulating
within the same leaf so journalling only 1 bh (same as zeroout). In which case
seems like zeroout might be slower no matter how fast the IO can be
done. So proactive zeroout might be for beneficial for case 3 than case
1.

> 
> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> submission) => the split is opportunistic here, if we cannot split due to
> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> needed.
> 
> 3) 'ex' is written, 'map' describes part that should be converted to
> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> if extent split fails.

Proactive zeroout before trying split does seem benficial to help us
avoid journal overhead for split. However, judging from
ext4_ext_convert_to_initialized(), max zeroout comes from
sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
the IO device, so that means theres a chance a zeroout might be pretty
slow if say we are doing it on a device than doesn't support accelerated
zeroout operations. Maybe we need to be more intelligent in setting
s_extent_max_zeroout_kb?

> 
> This should all result in a relatively straightforward code where we can
> distinguish the three cases based on 'ex' and passed flags, we should be
> able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> the data exposure issues at the same time. What do you think? Am I missing
> some case?
> 
> 								Honza
> 
> > ---
> >  fs/ext4/extents.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index f7aa497e5d6c..cafe66cb562f 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
> >  		err = ext4_ext_zeroout(inode, &zero_ex);
> >  		if (err)
> >  			goto fix_extent_len;
> > +		/*
> > +		 * The first half contains partially valid data, the splitting
> > +		 * of this extent has not been completed, fix extent length
> > +		 * and ext4_split_extent() split will the first half again.
> > +		 */
> > +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> > +			goto fix_extent_len;
> >  
> >  		/* update the extent length and mark as initialized */
> >  		ex->ee_len = cpu_to_le16(ee_len);
> > @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
> >  				       EXT4_EXT_MARK_UNWRIT2;
> >  		if (split_flag & EXT4_EXT_DATA_VALID2)
> > -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> > +			split_flag1 |= map->m_lblk > ee_block ?
> > +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> > +				       EXT4_EXT_DATA_ENTIRE_VALID1;
> >  		path = ext4_split_extent_at(handle, inode, path,
> >  				map->m_lblk + map->m_len, split_flag1, flags1);
> >  		if (IS_ERR(path))
> > -- 
> > 2.46.1
> > 
> -- 
> Jan Kara <jack@...e.com>
> SUSE Labs, CR