Message-ID: <6be60100-e94c-4c06-9542-29ac8bf8f013@suse.cz>
Date: Thu, 15 Jan 2026 17:19:42 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: Zhao Liu <zhao1.liu@...el.com>, Hao Li <haolee.swjtu@...il.com>
Cc: akpm@...ux-foundation.org, harry.yoo@...cle.com, cl@...two.org,
rientjes@...gle.com, roman.gushchin@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, tim.c.chen@...el.com, yu.c.chen@...el.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in
__pcs_replace_empty_main()
On 1/15/26 11:12, Zhao Liu wrote:
> Hi Babka & Hao,
>
>> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
>> adjusted like this:
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index f21b2f0c6f5a..ad71f01571f0 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>> */
>>
>> if (pcs->main->size == 0) {
>> - barn_put_empty_sheaf(barn, pcs->main);
>> + if (!pcs->spare) {
>> + pcs->spare = pcs->main;
>> + } else {
>> + barn_put_empty_sheaf(barn, pcs->main);
>> + }
>> pcs->main = full;
>> return pcs;
>> }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4 (f0b9d8eb98df), I have these results with
> this fix applied:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases with nr_tasks 1-192, there are clear improvements. I
> think this is expected, since the pre-cached spare sheaf reduces
> contention on the barn spinlock: it avoids paired calls to
> barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@...el.com>
Thanks!
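
To spell out the effect of the smaller version: when main is found
empty, the sheaf is now parked in the per-cpu spare slot if that slot
is free, and only otherwise returned to the barn. The resulting check
in __pcs_replace_empty_main() (comments below are mine):

	if (pcs->main->size == 0) {
		if (!pcs->spare) {
			/* Keep the empty main sheaf cached per-cpu so a
			 * later operation can reuse it without taking
			 * the barn's lock. */
			pcs->spare = pcs->main;
		} else {
			/* Spare slot already occupied: return the empty
			 * sheaf to the per-node barn as before. */
			barn_put_empty_sheaf(barn, pcs->main);
		}
		pcs->main = full;
		return pcs;
	}

So when the spare slot is free, the empty main sheaf stays per-cpu
instead of doing a round trip through the shared barn.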
> But I have two more questions that might need consideration.
>
> # Question 1: Regression for 256 tasks
>
> For the above test, the case with nr_tasks = 256 shows a regression.
> I did more testing around that point:
>
> (This is a single-round test; the data around 256 tasks has jitter.)
>
> nr_tasks      Delta
>      244    + 0.308%
>      248    - 0.805%
>      252    +12.070%
>      256    -11.441%
>      258    + 2.070%
>      260    + 1.252%
>      264    + 2.369%
>      268    -11.479%
>      272    + 2.130%
>      292    + 8.714%
>      296    +10.905%
>      298    +17.196%
>      300    +11.783%
>      302    + 6.620%
>      304    + 3.112%
>      308    - 5.924%
>
> Most cases show an improvement, though a few show a slight
> regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I pin the tasks to cores, the 256-task case roughly corresponds
> to the point where node 0 and node 1 are fully occupied.
>
> The following is the perf data comparing two runs, without & with
> this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the interesting item is "get_partial_node". It seems this fix
> makes get_partial_node() slightly more frequent, but I still can't
> figure out why this is happening. Do you have any thoughts on it?
I'm not sure if it's statistically significant; +0.09% could just be noise?
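
The clearer signal is that the barn_*() entries shrink: those are the
paths serialized by the shared per-node barn lock. Roughly (a sketch
with names recalled from the sheaves code, so details may differ):

	static void barn_put_empty_sheaf(struct node_barn *barn,
					 struct slab_sheaf *sheaf)
	{
		unsigned long flags;

		/* One shared lock per node: every put/get of a sheaf
		 * to/from the barn contends on it. */
		spin_lock_irqsave(&barn->lock, flags);
		list_add(&sheaf->barn_list, &barn->sheaves_empty);
		barn->nr_empty++;
		spin_unlock_irqrestore(&barn->lock, flags);
	}

Keeping an empty sheaf in pcs->spare skips exactly this lock round
trip, which is why barn_get_empty_sheaf() and barn_put_empty_sheaf()
shrink in the profile above.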
> # Question 2: sheaf capacity
>
> Back to the original commit which triggered the lkp regression. I did
> more testing to check whether this fix fully closes the regression gap.
>
> The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache"); its successor commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") introduced the regression.
>
> I compared v6.19-rc4 (f0b9d8eb98df), without & with the fix, against
> that baseline:
>
> nr_tasks      w/o fix     with fix
>        1     - 3.643%     - 0.181%
>        8     -12.523%     - 9.816%
>       64     -50.378%     -20.482%
>      128     -36.736%     - 5.518%
>      192     -22.963%     - 1.777%
>      256     -32.926%     -41.026%
>
> It appears that under extreme conditions the regression remains
> significant. I remembered your suggestion about a larger capacity and
> did the following testing:
>
> nr_tasks   59faa4da7cd4   59faa4da7cd4      59faa4da7cd4    59faa4da7cd4     59faa4da7cd4
>                           (with this fix)   (cap: 32->64)   (cap: 32->128)   (cap: 32->256)
>        1   - 8.789%       - 8.805%          - 8.185%        - 9.912%         - 8.673%
>        8   -12.256%       - 9.219%          -10.460%        -10.070%         - 8.819%
>       64   -38.915%       - 8.172%          - 4.700%        + 4.571%         + 8.793%
>      128   - 8.032%       +11.377%          +23.232%        +26.940%         +30.573%
>      192   - 1.220%       + 9.758%          +20.573%        +22.645%         +25.768%
>      256   - 6.570%       + 9.967%          +21.663%        +30.103%         +33.876%
>
> Compared with the baseline (3accabda4), a larger capacity could
> significantly improve the sheaves' scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
In the followup series, the capacity will be determined automatically to
roughly match the current capacity of cpu partial slabs:
https://lore.kernel.org/all/20260112-sheaves-for-all-v2-4-98225cfb50cf@suse.cz/
We can use that as a starting point for further tuning. But I suspect making
it adjust dynamically would be complicated.
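
For illustration only, a size-based heuristic in that spirit could look
like the hypothetical sketch below; the actual heuristic in the linked
series may well differ:

	/* Hypothetical example, not the code from the linked series:
	 * pick a capacity so that a full sheaf covers a bounded number
	 * of bytes of objects, clamped to a sane range. */
	static unsigned int sheaf_capacity(unsigned int object_size)
	{
		unsigned int cap = SZ_8K / object_size;

		return clamp(cap, 32U, 256U);
	}

A static size-based formula like this keeps small-object caches from
paying the barn lock too often, without the complexity of adjusting the
capacity at runtime.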
> Thanks for your patience.
>
> Regards,
> Zhao
>