linux-kernel - Re: [PATCH v3 1/3] ocfs2: give ocfs2 the ability to reclaim suballoc free bg

Message-ID: <6bt4hapd3fij4nlekl5dj2g2bxgcmfujuxfzb37lpjfypbzda6@ahgdnhjablpi>
Date: Wed, 12 Nov 2025 14:59:38 +0800
From: Heming Zhao <heming.zhao@...e.com>
To: Joseph Qi <joseph.qi@...ux.alibaba.com>, glass.su@...e.com
Cc: ocfs2-devel@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/3] ocfs2: give ocfs2 the ability to reclaim suballoc
 free bg

Hi Joseph,

Sorry for waking up this thread. Please see my comments below.

On Wed, Oct 09, 2024 at 10:56:47AM +0800, Heming Zhao wrote:
> On 10/8/24 15:16, Joseph Qi wrote:
> > 
> > 
> > On 9/8/24 10:07 PM, Heming Zhao wrote:
> > > The current ocfs2 code can't reclaim suballocator block group space.
> > > This cause ocfs2 to hold onto a lot of space in some cases. for example,
> > > when creating lots of small files, the space is held/managed by
> > > '//inode_alloc'. After the user deletes all the small files, the space
> > > never returns to '//global_bitmap'. This issue prevents ocfs2 from
> > > providing the needed space even when there is enough free space in a
> > > small ocfs2 volume.
> > > This patch gives ocfs2 the ability to reclaim suballoc free space when
> > > the block group is freed. For performance reasons, this patch keeps
> > > the first suballocator block group.
> > > 
> > > Signed-off-by: Heming Zhao <heming.zhao@...e.com>
> > > Reviewed-by: Su Yue <glass.su@...e.com>
> > > ---
> > >   fs/ocfs2/suballoc.c | 302 ++++++++++++++++++++++++++++++++++++++++++--
> > >   1 file changed, 292 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> > > index f7b483f0de2a..d62010166c34 100644
> > > --- a/fs/ocfs2/suballoc.c
> > > +++ b/fs/ocfs2/suballoc.c
> > > @@ -294,6 +294,68 @@ static int ocfs2_validate_group_descriptor(struct super_block *sb,
> > >   	return ocfs2_validate_gd_self(sb, bh, 0);
> > >   }
> > > +/*
> > > + * hint gd may already be released in _ocfs2_free_suballoc_bits(),
> > > + * we first check gd descriptor signature, then do the
> > > + * ocfs2_read_group_descriptor() jobs.
> > > + *
> > > + * When the group descriptor is invalid, we return 'rc=0' and
> > > + * '*released=1'. The caller should handle this case. Otherwise,
> > > + * we return the real error code.
> > > + */
> > > +static int ocfs2_read_hint_group_descriptor(struct inode *inode,
> > > +			struct ocfs2_dinode *di, u64 gd_blkno,
> > > +			struct buffer_head **bh, int *released)
> > > +{
> > > +	int rc;
> > > +	struct buffer_head *tmp = *bh;
> > > +	struct ocfs2_group_desc *gd;
> > > +
> > > +	*released = 0;
> > 
> > I'd like the caller is responsible for the initialization.
> 
> OK.
> 
> > 
> > > +
> > > +	rc = ocfs2_read_block(INODE_CACHE(inode), gd_blkno, &tmp, NULL);
> > > +	if (rc)
> > > +		goto out;
> > > +
> > > +	gd = (struct ocfs2_group_desc *) tmp->b_data;
> > > +	if (!OCFS2_IS_VALID_GROUP_DESC(gd)) {
> > 
> > How to distinguish the release case or a bug?
> 

I rechecked the code and withdraw my previous comments: the release gap
doesn't exist.
The reason: ocfs2 locks alloc_inode before running the code paths of
threads A and B (quoted below).

Regarding the question: How to distinguish the release case or a bug?

(I hope I understood your question correctly.) My answer:
ocfs2_read_hint_group_descriptor is derived from ocfs2_read_group_descriptor.
The difference is:
- ocfs2_read_group_descriptor calls ocfs2_read_block() and passes
  'validate': ocfs2_validate_group_descriptor.
- ocfs2_read_hint_group_descriptor calls ocfs2_read_block() and passes
  'validate': NULL.
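
To illustrate the difference, here is a minimal userspace sketch, not kernel
code. All toy_* names, the two-block "disk" and the return values are
hypothetical; only the idea of an optional validate callback is taken from the
description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NR_BLOCKS  2

/* Hypothetical validate hook, loosely modeled on the 'validate' argument. */
typedef int (*toy_validate_fn)(const unsigned char *block);

/* Toy "disk": block 0 carries a signature, block 1 is all zeros. */
static unsigned char toy_disk[TOY_NR_BLOCKS][TOY_BLOCK_SIZE] = {
	{ 'G', 'R', 'O', 'U', 'P', '0', '1', 0 },
	{ 0 },
};

/* Returns 0 when the block starts with the toy signature, -1 otherwise. */
static int toy_validate_signature(const unsigned char *block)
{
	return memcmp(block, "GROUP01", 7) == 0 ? 0 : -1;
}

/*
 * Copy one block into 'out'. If 'validate' is non-NULL, its result becomes
 * the return value; if it is NULL, the data is handed back unchecked and
 * the caller has to inspect it.
 */
static int toy_read_block(size_t blkno, unsigned char *out,
			  toy_validate_fn validate)
{
	assert(out != NULL);

	if (blkno >= TOY_NR_BLOCKS)
		return -2;

	memcpy(out, toy_disk[blkno], TOY_BLOCK_SIZE);

	if (validate)
		return validate(out);

	return 0;
}
```

With a validator, reading the zeroed block fails inside the read itself. With
NULL, the read succeeds and the decision is left to the caller.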

Distinguishing a release case from a bug works much as it does in
ocfs2_read_group_descriptor. Once the GD signature is confirmed correct, I
directly call ocfs2_validate_group_descriptor and ocfs2_validate_gd_parent.
If we encounter a bug case, these validation functions will handle it.
Btw, in the new function _reclaim_to_main_bm(), the
memset(group, 0, sizeof(struct ocfs2_group_desc)) clears all group info.
Therefore, after applying this patch, the GD area is filled with zeros after
being released.

Why can't we reuse the existing ocfs2_validate_group_descriptor()?
It calls ocfs2_validate_gd_self(), which triggers do_error() and makes the volume
read-only.
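
The flow above can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the patch itself: the toy_* types
and functions, the read_only flag and the free/total consistency check are all
hypothetical stand-ins.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily simplified group descriptor. */
struct toy_group_desc {
	char signature[8];
	unsigned int free_bits;
	unsigned int total_bits;
};

/* Hypothetical volume state; read_only stands in for the error path. */
struct toy_volume {
	int read_only;
};

static int toy_signature_ok(const struct toy_group_desc *gd)
{
	return memcmp(gd->signature, "GROUP01", 8) == 0;
}

/* Strict validator: any inconsistency is treated as corruption. */
static int toy_validate_strict(struct toy_volume *vol,
			       const struct toy_group_desc *gd)
{
	if (!toy_signature_ok(gd) || gd->free_bits > gd->total_bits) {
		vol->read_only = 1;
		return -1;
	}
	return 0;
}

/* Releasing a group zeroes the whole descriptor, signature included. */
static void toy_release(struct toy_group_desc *gd)
{
	memset(gd, 0, sizeof(*gd));
}

/*
 * Hint read: a missing signature means "released" (rc = 0, *released = 1).
 * Only a descriptor with a good signature goes through strict validation.
 * The caller initializes *released before calling.
 */
static int toy_read_hint(struct toy_volume *vol,
			 const struct toy_group_desc *gd, int *released)
{
	assert(vol != NULL && gd != NULL && released != NULL);

	if (!toy_signature_ok(gd)) {
		*released = 1;
		return 0;
	}
	return toy_validate_strict(vol, gd);
}
```

In this sketch a released (zeroed) descriptor never reaches the strict
validator, so the toy volume stays writable, while a descriptor with a good
signature and inconsistent counters is still reported as an error.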

Thanks,
Heming

> Good question.
> 
> Before this patch, OCFS2 never releases suballocator space.
> The ocfs2_read_group_descriptor() doesn't need to handle the
> case of reading a bad 'struct ocfs2_group_desc'.
> 
> After this patch, there is a gap between
> _ocfs2_free_suballoc_bits() and ocfs2_read_hint_group_descriptor().
> 
> 
>      thread A                          thread B
> -------------------------------------------------------------
> ocfs2_claim_suballoc_bits
>  hint is not zero
>   ocfs2_search_one_group
>    + ocfs2_read_hint_group_descriptor
>    | + OCFS2_IS_VALID_GROUP_DESC(gd)
>    |    returns true
>    |                                _ocfs2_free_suballoc_bits
>    + ...                             + free the last bit of gd
>    |                                    + release gd
>    + ocfs2_block_group_set_bits
>       uses released gd, data corruption
> --------------------------------------------------------------
> 
> I plan to introduce a new cache_info flag 'OCFS2_CACHE_CLEAN_GD' to protect this case.
> e.g. (just demo, not tested)
> 
> 
>      thread A                          thread B
> -------------------------------------------------------------
> ocfs2_read_hint_group_descriptor()
>   ocfs2_read_block()
> 
>   //protect code begin
>   ci = INODE_CACHE(inode);
>   ocfs2_metadata_cache_io_lock(ci);
>   if (ci->ci_flags & OCFS2_CACHE_CLEAN_GD)
>       goto free_bh;
>   ocfs2_metadata_cache_io_unlock(ci);
>   //protect code end
> 
>   gd = (struct ocfs2_group_desc *) tmp->b_data;
>   if (!OCFS2_IS_VALID_GROUP_DESC(gd)) {
>      ... ...
>   }
> 
>                               _ocfs2_free_suballoc_bits()
>                                 ... ...
>                                 if (ocfs2_is_cluster_bitmap(alloc_inode) ||
>                                     (le32_to_cpu(rec->c_free) != (le32_to_cpu(rec->c_total) - 1)) ||
>                                     (le16_to_cpu(cl->cl_next_free_rec) == 1)) {
>                                         goto bail;
>                                 }
> 
>                                 //protect code begin
>                                 ci = INODE_CACHE(alloc_inode);
>                                 ocfs2_metadata_cache_io_lock(ci);
>                                 if (ci->ci_num_cached > 1) {
>                                         goto bail;
>                                 }
>                                 ci->ci_flags |= OCFS2_CACHE_CLEAN_GD;
>                                 ocfs2_metadata_cache_io_unlock(ci);
>                                 //protect code end
> 
>                                 _ocfs2_reclaim_suballoc_to_main(handle, alloc_inode, alloc_bh, group_bh);
> --------------------------------------------------------------
> 
> > 
> > > +		/*
> > > +		 * Invalid gd cache was set in ocfs2_read_block(),
> > > ... ...
> > > +/*
> > > + * Reclaim the suballocator managed space to main bitmap.
> > > + * This function first works on the suballocator then switch to the
> > > + * main bitmap.
> > > + *
> > > + * handle: The transaction handle
> > > + * alloc_inode: The suballoc inode
> > > + * alloc_bh: The buffer_head of suballoc inode
> > > + * group_bh: The group descriptor buffer_head of suballocator managed.
> > > + *           Caller should release the input group_bh.
> > > + */
> > > +static int _reclaim_to_main_bm(handle_t *handle,
> > 
> > Better to rename it to _ocfs2_reclaim_suballoc_to_main().
> 
> OK.
> > 
> > > +			struct inode *alloc_inode,
> > > +			struct buffer_head *alloc_bh,
> > > +			struct buffer_head *group_bh)
> > > +{
> > > +	int idx, status = 0;
> > > +	int i, next_free_rec, len = 0;
> > > +	__le16 old_bg_contig_free_bits = 0;
> > > ... ...
> > > +	le32_add_cpu(&rec->c_free, count);
> > >   	tmp_used = le32_to_cpu(fe->id1.bitmap1.i_used);
> > >   	fe->id1.bitmap1.i_used = cpu_to_le32(tmp_used - count);
> > >   	ocfs2_journal_dirty(handle, alloc_bh);
> > > +	/*
> > > +	 * Reclaim suballocator free space.
> > > +	 * Bypass: global_bitmap, not empty rec, first rec in cl_recs[]
> > 
> > s/not empty rec/non empty rec
> 
> OK.
> 
> /Heming