[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6bt4hapd3fij4nlekl5dj2g2bxgcmfujuxfzb37lpjfypbzda6@ahgdnhjablpi>
Date: Wed, 12 Nov 2025 14:59:38 +0800
From: Heming Zhao <heming.zhao@...e.com>
To: Joseph Qi <joseph.qi@...ux.alibaba.com>, glass.su@...e.com
Cc: ocfs2-devel@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/3] ocfs2: give ocfs2 the ability to reclaim suballoc
free bg
Hi Joseph,
Sorry for waking up this thread. Please see my comments below.
On Wed, Oct 09, 2024 at 10:56:47AM +0800, Heming Zhao wrote:
> On 10/8/24 15:16, Joseph Qi wrote:
> >
> >
> > On 9/8/24 10:07 PM, Heming Zhao wrote:
> > > The current ocfs2 code can't reclaim suballocator block group space.
> > > This cause ocfs2 to hold onto a lot of space in some cases. for example,
> > > when creating lots of small files, the space is held/managed by
> > > '//inode_alloc'. After the user deletes all the small files, the space
> > > never returns to '//global_bitmap'. This issue prevents ocfs2 from
> > > providing the needed space even when there is enough free space in a
> > > small ocfs2 volume.
> > > This patch gives ocfs2 the ability to reclaim suballoc free space when
> > > the block group is freed. For performance reasons, this patch keeps
> > > the first suballocator block group.
> > >
> > > Signed-off-by: Heming Zhao <heming.zhao@...e.com>
> > > Reviewed-by: Su Yue <glass.su@...e.com>
> > > ---
> > > fs/ocfs2/suballoc.c | 302 ++++++++++++++++++++++++++++++++++++++++++--
> > > 1 file changed, 292 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> > > index f7b483f0de2a..d62010166c34 100644
> > > --- a/fs/ocfs2/suballoc.c
> > > +++ b/fs/ocfs2/suballoc.c
> > > @@ -294,6 +294,68 @@ static int ocfs2_validate_group_descriptor(struct super_block *sb,
> > > return ocfs2_validate_gd_self(sb, bh, 0);
> > > }
> > > +/*
> > > + * hint gd may already be released in _ocfs2_free_suballoc_bits(),
> > > + * we first check gd descriptor signature, then do the
> > > + * ocfs2_read_group_descriptor() jobs.
> > > + *
> > > + * When the group descriptor is invalid, we return 'rc=0' and
> > > + * '*released=1'. The caller should handle this case. Otherwise,
> > > + * we return the real error code.
> > > + */
> > > +static int ocfs2_read_hint_group_descriptor(struct inode *inode,
> > > + struct ocfs2_dinode *di, u64 gd_blkno,
> > > + struct buffer_head **bh, int *released)
> > > +{
> > > + int rc;
> > > + struct buffer_head *tmp = *bh;
> > > + struct ocfs2_group_desc *gd;
> > > +
> > > + *released = 0;
> >
> > I'd like the caller is responsible for the initialization.
>
> OK.
>
> >
> > > +
> > > + rc = ocfs2_read_block(INODE_CACHE(inode), gd_blkno, &tmp, NULL);
> > > + if (rc)
> > > + goto out;
> > > +
> > > + gd = (struct ocfs2_group_desc *) tmp->b_data;
> > > + if (!OCFS2_IS_VALID_GROUP_DESC(gd)) {
> >
> > How to distinguish the release case or a bug?
>
I rechecked the code and withdraw my previous comments, as the release gap
doesn't exist.
The reason: ocfs2 locks alloc_inode before calling threads A and B.
Redarding the question: How to distinguish the release case or a bug?
(Hope I correctly understand your question) my answer:
ocfs2_read_hint_group_descriptor is derived from ocfs2_read_group_descriptor.
The differ is:
- ocfs2_read_group_descriptor calls ocfs2_read_block() and passes the
'validate':ocfs2_validate_group_descriptor.
- ocfs2_read_hint_group_descriptor calls ocfs2_read_block and passes the
'validate':null
The job of distinguishing between a release case or a bug is similar to that of
ocfs2_read_group_descriptor. I directly call ocfs2_validate_group_descriptor
and ocfs2_validate_gd_parent after the GD signature is correct. If we
encounter a bug case, the subsequent validation functions will handle it.
Btw, in the new function _reclaim_to_main_bm(), the
memset(group, 0, sizeof(struct ocfs2_group_desc)) will clean up all group info.
Therefore, after applying this patch, the GD area is full with ZEROs after being
released.
Why can't we reuse the existing ocfs2_validate_group_descriptor()?
It calls ocfs2_validate_gd_self(), which triggers do_error() and makes the volume
read-only.
Thanks,
Heming
> Good question.
>
> Before this patch, OCFS2 never releases suballocator space.
> The ocfs2_read_group_descriptor() doesn't need to handle the
> case of reading a bad 'struct ocfs2_group_desc'.
>
> After this patch, there is a gap between
> _ocfs2_free_suballoc_bits() and ocfs2_read_hint_group_descriptor().
>
>
> thread A thread B
> -------------------------------------------------------------
> ocfs2_claim_suballoc_bits
> hint is not zero
> ocfs2_search_one_group
> + ocfs2_read_hint_group_descriptor
> | + OCFS2_IS_VALID_GROUP_DESC(gd)
> | returns true
> | _ocfs2_free_suballoc_bits
> + ... + free the last bit of gd
> | + release gd
> + ocfs2_block_group_set_bits
> uses released gd, data corruption
> --------------------------------------------------------------
>
> I plan to introduce a new cache_info flag 'OCFS2_CACHE_CLEAN_GD' to protect this case.
> e.g. (just demo, not tested)
>
>
> thread A thread B
> -------------------------------------------------------------
> ocfs2_read_hint_group_descriptor()
> ocfs2_read_block()
>
> //protect code begin
> ci = INODE_CACHE(inode);
> ocfs2_metadata_cache_io_lock(ci);
> if (ci->ci_flags & OCFS2_CACHE_CLEAN_GD)
> goto free_bh;
> ocfs2_metadata_cache_io_unlock(ci);
> //protect code end
>
> gd = (struct ocfs2_group_desc *) tmp->b_data;
> if (!OCFS2_IS_VALID_GROUP_DESC(gd)) {
> ... ...
> }
>
> _ocfs2_free_suballoc_bits()
> ... ...
> if (ocfs2_is_cluster_bitmap(alloc_inode) ||
> (le32_to_cpu(rec->c_free) != (le32_to_cpu(rec->c_total) - 1)) ||
> (le16_to_cpu(cl->cl_next_free_rec) == 1)) {
> goto bail;
> }
>
> //protect code begin
> ci = INODE_CACHE(alloc_inode);
> ocfs2_metadata_cache_io_lock(ci);
> if (ci->ci_num_cached > 1) {
> goto bail;
> }
> ci->ci_flags |= OCFS2_CACHE_CLEAN_GD;
> ocfs2_metadata_cache_io_unlock(ci);
> //protect code end
>
> _ocfs2_reclaim_suballoc_to_main(handle, alloc_inode, alloc_bh, group_bh);
> --------------------------------------------------------------
>
> >
> > > + /*
> > > + * Invalid gd cache was set in ocfs2_read_block(),
> > > ... ...
> > > +/*
> > > + * Reclaim the suballocator managed space to main bitmap.
> > > + * This function first works on the suballocator then switch to the
> > > + * main bitmap.
> > > + *
> > > + * handle: The transaction handle
> > > + * alloc_inode: The suballoc inode
> > > + * alloc_bh: The buffer_head of suballoc inode
> > > + * group_bh: The group descriptor buffer_head of suballocator managed.
> > > + * Caller should release the input group_bh.
> > > + */
> > > +static int _reclaim_to_main_bm(handle_t *handle,
> >
> > Better to rename it to _ocfs2_reclaim_suballoc_to_main().
>
> OK.
> >
> > > + struct inode *alloc_inode,
> > > + struct buffer_head *alloc_bh,
> > > + struct buffer_head *group_bh)
> > > +{
> > > + int idx, status = 0;
> > > + int i, next_free_rec, len = 0;
> > > + __le16 old_bg_contig_free_bits = 0;
> > > ... ...
> > > + le32_add_cpu(&rec->c_free, count);
> > > tmp_used = le32_to_cpu(fe->id1.bitmap1.i_used);
> > > fe->id1.bitmap1.i_used = cpu_to_le32(tmp_used - count);
> > > ocfs2_journal_dirty(handle, alloc_bh);
> > > + /*
> > > + * Reclaim suballocator free space.
> > > + * Bypass: global_bitmap, not empty rec, first rec in cl_recs[]
> >
> > s/not empty rec/non empty rec
>
> OK.
>
> /Heming
Powered by blists - more mailing lists