Message-ID: <aWi9nAbIkTfYFoMM@intel.com>
Date: Thu, 15 Jan 2026 18:12:44 +0800
From: Zhao Liu <zhao1.liu@...el.com>
To: Vlastimil Babka <vbabka@...e.cz>, Hao Li <haolee.swjtu@...il.com>
Cc: akpm@...ux-foundation.org, harry.yoo@...cle.com, cl@...two.org,
rientjes@...gle.com, roman.gushchin@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, tim.c.chen@...el.com,
yu.c.chen@...el.com, zhao1.liu@...el.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in
__pcs_replace_empty_main()
Hi Vlastimil & Hao,
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
> pcs->main = full;
> return pcs;
> }
I noticed the previous lkp regression report and tested this fix:

* will-it-scale.per_process_ops

Compared with v6.19-rc4 (f0b9d8eb98df), with this fix I have the
following results:
nr_tasks Delta
1 + 3.593%
8 + 3.094%
64 +60.247%
128 +49.344%
192 +27.500%
256 -12.077%
For the cases with nr_tasks 1-192, there are improvements. I think
this is expected, since the pre-cached spare sheaf reduces the spinlock
contention: it avoids round-trips through barn_put_empty_sheaf() &
barn_get_empty_sheaf().
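To make sure I understand the mechanism, here is a simplified
user-space model of the affected path (the pcs/barn/sheaf names follow
mm/slub.c, but the structures and locking below are my illustrative
assumptions, not the kernel code; bounds checks are omitted):

#include <pthread.h>

struct sheaf { unsigned int size; };

struct barn {
	pthread_mutex_t lock;		/* models the contended barn spinlock */
	struct sheaf *empty[16];
	unsigned int nr_empty;
};

struct pcs {				/* per-CPU sheaves */
	struct sheaf *main;
	struct sheaf *spare;
};

/* Before the fix: every empty main sheaf takes the barn lock. */
static void replace_empty_main_old(struct pcs *pcs, struct barn *barn,
				   struct sheaf *full)
{
	pthread_mutex_lock(&barn->lock);
	barn->empty[barn->nr_empty++] = pcs->main;
	pthread_mutex_unlock(&barn->lock);
	pcs->main = full;
}

/* After the fix: with no spare, keep the empty sheaf locally and skip
 * the barn lock; a later path that needs an empty sheaf can then
 * consume pcs->spare instead of calling barn_get_empty_sheaf(),
 * saving a second lock round-trip. */
static void replace_empty_main_new(struct pcs *pcs, struct barn *barn,
				   struct sheaf *full)
{
	if (!pcs->spare) {
		pcs->spare = pcs->main;	/* lock-free, per-CPU */
	} else {
		pthread_mutex_lock(&barn->lock);
		barn->empty[barn->nr_empty++] = pcs->main;
		pthread_mutex_unlock(&barn->lock);
	}
	pcs->main = full;
}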
So (maybe too late),
Tested-by: Zhao Liu <zhao1.liu@...el.com>
But I have two more questions that might need consideration.
# Question 1: Regression for 256 tasks
In the above test, the nr_tasks: 256 case shows a "slight" regression.
I did more testing around that point (note this is a single-round
test, so the data around 256 tasks has some jitter):
nr_tasks	Delta
244		+ 0.308%
248		- 0.805%
252		+12.070%
256		-11.441%
258		+ 2.070%
260		+ 1.252%
264		+ 2.369%
268		-11.479%
272		+ 2.130%
292		+ 8.714%
296		+10.905%
298		+17.196%
300		+11.783%
302		+ 6.620%
304		+ 3.112%
308		- 5.924%
Most cases show improvement, though a few show a slight regression.
For reference, the configuration of my machine: GNR (Granite Rapids),
2 sockets, with the following NUMA topology:
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-42,172-214
NUMA node1 CPU(s): 43-85,215-257
NUMA node2 CPU(s): 86-128,258-300
NUMA node3 CPU(s): 129-171,301-343
Since I set the CPU affinity per core, the 256-task case roughly
corresponds to the point where node 0 and node 1 are fully occupied.
The following is the perf diff data comparing the two runs, w/o fix
vs. with this fix:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ....................... ....................................
#
61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
0.93% -0.32% [kernel.vmlinux] [k] __slab_free
0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
0.26% -0.07% [kernel.vmlinux] [k] down_write
0.53% -0.06% libc.so.6 [.] __mmap
0.66% -0.06% [kernel.vmlinux] [k] mas_walk
0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
0.45% -0.06% [kernel.vmlinux] [k] mas_find
0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
0.41% -0.05% [kernel.vmlinux] [k] memcpy
0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
0.14% +0.04% [kernel.vmlinux] [k] __put_partials
0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
0.49% -0.04% libc.so.6 [.] __munmap
0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
0.27% -0.03% [kernel.vmlinux] [k] up_write
0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
I think the interesting item is "get_partial_node": it seems this fix
makes get_partial_node() slightly more frequent. However, I still
can't figure out why this is happening. Do you have any thoughts on it?
# Question 2: sheaf capacity
Going back to the original commit that triggered the lkp regression, I
did more testing to check whether this fix could fully close the
regression gap.

The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
vm_area_struct cache"); its successor commit 59faa4da7cd4 ("maple_tree:
use percpu sheaves for maple_node_cache") introduced the regression.

I compared v6.19-rc4 (f0b9d8eb98df), w/o fix & with fix, against the
baseline:
nr_tasks	w/o fix		with fix
1		- 3.643%	- 0.181%
8		-12.523%	- 9.816%
64		-50.378%	-20.482%
128		-36.736%	- 5.518%
192		-22.963%	- 1.777%
256		-32.926%	-41.026%
It appears that under extreme conditions the regression remains
significant.
I remembered your suggestion about a larger sheaf capacity and did the
following testing:
nr_tasks  59faa4da7cd4  59faa4da7cd4     59faa4da7cd4   59faa4da7cd4    59faa4da7cd4
          (w/o fix)     (with this fix)  (cap: 32->64)  (cap: 32->128)  (cap: 32->256)
1         - 8.789%      - 8.805%         - 8.185%       - 9.912%        - 8.673%
8         -12.256%      - 9.219%         -10.460%       -10.070%        - 8.819%
64        -38.915%      - 8.172%         - 4.700%       + 4.571%        + 8.793%
128       - 8.032%      +11.377%         +23.232%       +26.940%        +30.573%
192       - 1.220%      + 9.758%         +20.573%       +22.645%        +25.768%
256       - 6.570%      + 9.967%         +21.663%       +30.103%        +33.876%
Compared with the baseline (3accabda4), a larger capacity could
significantly improve the sheaves' scalability.
So, I'd like to know if you think dynamically or adaptively adjusting
capacity is a worthwhile idea.
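For instance, one strawman policy could look like the sketch below.
This is purely hypothetical: all names here are invented for
illustration and nothing is based on existing code. The idea is to
periodically grow a cache's sheaf capacity while the barn lock shows
significant contention, stepping through the same 32->64->128->256
sizes I tested above.

#define SHEAF_CAP_MAX	256

struct sheaf_tuning {
	unsigned long contended;	/* barn lock slowpath entries */
	unsigned long total;		/* all barn lock acquisitions */
	unsigned int capacity;		/* current sheaf capacity */
};

/* Hypothetical periodic callback: double the capacity while more than
 * 10% of barn lock acquisitions were contended. */
static void sheaf_tune_capacity(struct sheaf_tuning *t)
{
	if (t->total < 1000)
		return;			/* too few samples to judge */

	if (t->contended * 10 > t->total && t->capacity < SHEAF_CAP_MAX)
		t->capacity *= 2;

	t->contended = 0;
	t->total = 0;
}

Of course, growing the capacity also grows per-CPU memory usage, so
any real policy would presumably need a shrink path as well.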
Thanks for your patience.
Regards,
Zhao