Message-ID: <aHj8WmfJvB3h6lOZ@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Thu, 17 Jul 2025 19:06:26 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Baokun Li <libaokun1@...wei.com>
Cc: linux-ext4@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, linux-kernel@...r.kernel.org, julia.lawall@...ia.fr,
yi.zhang@...wei.com, yangerkun@...wei.com, libaokun@...weicloud.com
Subject: Re: [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update
s_mb_last_group
On Mon, Jul 14, 2025 at 09:03:14PM +0800, Baokun Li wrote:
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (s_mb_last_group) when allocating.
>
> Now we only need to track s_mb_last_group and no longer need to track
> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> two are consistent. Since s_mb_last_group is merely a hint and doesn't
> require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
Hi Baokun,
So I got curious about the difference between smp_load_acquire and
READ_ONCE on PowerPC, another architecture with weak memory ordering.
Interestingly, I didn't see much of a single-threaded drop.
The numbers are as follows (mb_optimize_scan=1):
100 threads
  w/ smp_load_acquire   1668 MB/s
  w/ READ_ONCE          1599 MB/s

1 thread pinned to 1 cpu
  w/ smp_load_acquire    292 MB/s
  w/ READ_ONCE           296 MB/s
Either way, this is much better than the base, which is around 500 MB/s,
but I just thought I'd share it here.
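
For anyone reading along, the two access patterns compared above can be
sketched in userspace C11 roughly as follows. This is only an illustration
under stated assumptions, not the kernel code: relaxed atomics stand in for
READ_ONCE/WRITE_ONCE and acquire/release atomics stand in for
smp_load_acquire/smp_store_release, which is an approximation of the kernel
primitives rather than an exact equivalent. The struct alloc_hint type, the
hint_* helpers and pick_start_group() are hypothetical names made up for
this sketch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the s_mb_last_group hint in ext4_sb_info. */
struct alloc_hint {
	_Atomic unsigned int last_group;
};

/* Approximates WRITE_ONCE(): one untorn store, no ordering implied. */
static void hint_store_relaxed(struct alloc_hint *h, unsigned int group)
{
	atomic_store_explicit(&h->last_group, group, memory_order_relaxed);
}

/* Approximates READ_ONCE(): one untorn load, no ordering implied. */
static unsigned int hint_load_relaxed(struct alloc_hint *h)
{
	return atomic_load_explicit(&h->last_group, memory_order_relaxed);
}

/* Approximates smp_store_release(): earlier writes are ordered before it. */
static void hint_store_release(struct alloc_hint *h, unsigned int group)
{
	atomic_store_explicit(&h->last_group, group, memory_order_release);
}

/* Approximates smp_load_acquire(): later accesses are ordered after it. */
static unsigned int hint_load_acquire(struct alloc_hint *h)
{
	return atomic_load_explicit(&h->last_group, memory_order_acquire);
}

/*
 * Hypothetical consumer: choose the group where a scan would start.
 * The value is only a hint, so a stale read changes where the scan
 * begins, not whether it is correct. An out-of-range value falls
 * back to group 0 (an assumption of this sketch).
 */
static unsigned int pick_start_group(struct alloc_hint *h, unsigned int ngroups)
{
	assert(ngroups > 0);
	unsigned int goal = hint_load_relaxed(h);

	return goal < ngroups ? goal : 0;
}
```

Since nothing else is published together with the hint, the relaxed pair is
enough for correctness here; the acquire/release pair would only matter if
readers had to observe other data written before the hint was updated.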
Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
Regards,
ojaswin
>
> Besides, the s_mb_last_group data type only requires ext4_group_t
> (i.e., unsigned int), rendering unsigned long superfluous.
>
> Performance test data follows:
>
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
>
> |CPU: Kunpeng 920   |          P80           |            P1           |
> |Memory: 512GB      |------------------------|-------------------------|
> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
> |mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |
>
> |CPU: AMD 9654 * 2  |          P96           |            P1           |
> |Memory: 1536GB     |------------------------|-------------------------|
> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
> |mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |
>
> Suggested-by: Jan Kara <jack@...e.cz>
> Signed-off-by: Baokun Li <libaokun1@...wei.com>
> ---
> fs/ext4/ext4.h | 2 +-
> fs/ext4/mballoc.c | 12 +++---------
> 2 files changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b83095541c98..7f5c070de0fb 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1630,7 +1630,7 @@ struct ext4_sb_info {
>  	unsigned int s_mb_group_prealloc;
>  	unsigned int s_max_dir_size_kb;
>  	/* where last allocation was done - for stream allocation */
> -	unsigned long s_mb_last_group;
> +	ext4_group_t s_mb_last_group;
>  	unsigned int s_mb_prefetch;
>  	unsigned int s_mb_prefetch_limit;
>  	unsigned int s_mb_best_avail_max_trim_order;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index e3a5103e1620..025b759ca643 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>  	/*
>  	 * As we've just preallocated more space than
>  	 * user requested originally, we store allocated
> @@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>
>  	/* if stream allocation is enabled, use global goal */
>  	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		spin_unlock(&sbi->s_md_lock);
> +		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
>  		ac->ac_g_ex.fe_start = -1;
>  		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>  	}
> --
> 2.46.1
>