Message-ID: <3ozekmmsscrarwoa7vcytwjn5rxsiyxjrcsirlu3bhmlwtdxzn@s7a6rcxnqadc>
Date: Mon, 19 Jan 2026 14:07:59 +0800
From: Hao Li <hao.li@...ux.dev>
To: Zhao Liu <zhao1.liu@...el.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, Hao Li <haolee.swjtu@...il.com>,
akpm@...ux-foundation.org, harry.yoo@...cle.com, cl@...two.org, rientjes@...gle.com,
roman.gushchin@...ux.dev, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
tim.c.chen@...el.com, yu.c.chen@...el.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in
__pcs_replace_empty_main()
On Thu, Jan 15, 2026 at 06:12:44PM +0800, Zhao Liu wrote:
> Hi Babka & Hao,
>
> > Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> > adjusted like this:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f21b2f0c6f5a..ad71f01571f0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > - barn_put_empty_sheaf(barn, pcs->main);
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + } else {
> > + barn_put_empty_sheaf(barn, pcs->main);
> > + }
> > pcs->main = full;
> > return pcs;
> > }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4 (f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases with nr_tasks 1-192, there are improvements. I think
> this is expected, since the pre-cached spare sheaf reduces spinlock
> contention by cutting down calls to barn_put_empty_sheaf() and
> barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@...el.com>
>
>
>
> But I find there are two more questions that might need consideration.
>
> # Question 1: Regression for 256 tasks
>
> For the above test, the case with nr_tasks=256 shows a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-task data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity per core, the 256-task case roughly
> corresponds to the point when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing two tests, without and with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the interesting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. Hmm, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
Hello, Zhao,
I tested the performance degradation issue we discussed concerning nr_tasks=256.
However, my results differ from yours, so I'd like to share my setup and
findings for clarity and comparison:
1. Machine Configuration
The topology of my machine is as follows:
CPU(s): 384
On-line CPU(s) list: 0-383
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Since my machine has only 192 physical cores, I had to enable SMT to support
the higher task counts in the LKP test cases. My configuration was as follows:
will-it-scale:
mode: process
test: mmap2
no_affinity: 0
smt: 1
The sequence of test cases I used was: nr_tasks = 1, 8, 64, 128, 192, 256, 384.
I noticed that your test command did not enable SMT, but I believe this
difference should not significantly affect the results; I mention it only so
that we account for any potential impact it might have on our results.
2. Kernel Configuration
I conducted tests using commit f0b9d8eb98dfee8d00419aa07543bdc2c1a44fb1
first, then applied the patch and tested again.
Each test was run 10 times, and I took the average of the results.
3. Test Results (Without Patch vs. With Patch)
will-it-scale.1.processes -1.27%
will-it-scale.8.processes +0.19%
will-it-scale.64.processes +25.81%
will-it-scale.128.processes +112.88%
will-it-scale.192.processes +157.42%
will-it-scale.256.processes +70.63%
will-it-scale.384.processes +132.12%
will-it-scale.per_process_ops +27.21%
will-it-scale.scalability +135.10%
will-it-scale.time.involuntary_context_switches +127.54%
will-it-scale.time.voluntary_context_switches +0.01%
will-it-scale.workload +94.47%
From the above results, it appears that the patch improved performance across
the board.
4. Further Analysis
I conducted additional tests by running "./mmap2_processes -t 384 -s 25 -m" both
without and with the patch, and sampled the results using perf.
Here's the "perf report --no-children -g" output without the patch:
```
- 65.72% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- 55.33% testcase
- 55.33% __mmap
- 55.32% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 55.30% ksys_mmap_pgoff
- 55.30% vm_mmap_pgoff
- 55.28% do_mmap
- 55.24% __mmap_region
- 44.35% mas_preallocate
- 44.34% mas_alloc_nodes
- 44.34% kmem_cache_alloc_noprof
- 44.33% __pcs_replace_empty_main
+ 21.23% barn_put_empty_sheaf
+ 15.95% barn_get_empty_sheaf
+ 5.50% barn_replace_empty_sheaf
+ 1.33% _raw_spin_unlock_irqrestore
+ 10.24% mas_store_prealloc
+ 0.56% perf_event_mmap
- 10.38% __munmap
- 10.38% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 10.36% __x64_sys_munmap
- 10.36% __vm_munmap
- 10.36% do_vmi_munmap
- 10.35% do_vmi_align_munmap
- 10.14% mas_store_gfp
- 10.13% mas_wr_node_store
- 10.09% kvfree_call_rcu
- 10.09% __kfree_rcu_sheaf
- 10.08% barn_get_empty_sheaf
+ 9.17% _raw_spin_lock_irqsave
+ 0.90% _raw_spin_unlock_irqrestore
```
Here's the "perf report --no-children -g" output with the patch:
```
+ 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
- 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- 24.72% testcase
- 24.71% __mmap
- 24.68% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 24.61% ksys_mmap_pgoff
- 24.57% vm_mmap_pgoff
- 24.51% do_mmap
- 24.30% __mmap_region
- 18.33% mas_preallocate
- 18.30% mas_alloc_nodes
- 18.30% kmem_cache_alloc_noprof
- 18.28% __pcs_replace_empty_main
+ 9.06% barn_replace_empty_sheaf
+ 6.12% barn_get_empty_sheaf
+ 3.09% refill_sheaf
+ 2.94% mas_store_prealloc
+ 2.64% perf_event_mmap
- 4.07% __munmap
- 4.04% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 3.98% __x64_sys_munmap
- 3.98% __vm_munmap
- 3.95% do_vmi_munmap
- 3.91% do_vmi_align_munmap
- 2.98% mas_store_gfp
- 2.90% mas_wr_node_store
- 2.75% kvfree_call_rcu
- 2.73% __kfree_rcu_sheaf
- 2.71% barn_get_empty_sheaf
+ 1.68% _raw_spin_lock_irqsave
+ 1.03% _raw_spin_unlock_irqrestore
- 0.76% vms_complete_munmap_vmas
0.67% vms_clear_ptes.part.41
```
Using perf diff, I compared the results before and after applying the patch:
```
# Event 'cycles:P'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... .................... ..................................................
#
65.72% -36.92% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
14.65% +15.70% [kernel.kallsyms] [k] perf_iterate_ctx
2.10% +2.45% [kernel.kallsyms] [k] unmap_page_range
1.09% +1.26% [kernel.kallsyms] [k] mas_wr_node_store
1.01% +1.14% [kernel.kallsyms] [k] free_pgd_range
0.84% +0.92% [kernel.kallsyms] [k] __mmap_region
0.50% +0.76% [kernel.kallsyms] [k] memcpy
0.62% +0.63% [kernel.kallsyms] [k] __cond_resched
0.49% +0.51% [kernel.kallsyms] [k] mas_walk
0.39% +0.42% [kernel.kallsyms] [k] mas_empty_area_rev
0.32% +0.40% [kernel.kallsyms] [k] mas_next_slot
0.34% +0.39% [kernel.kallsyms] [k] refill_sheaf
0.26% +0.36% [kernel.kallsyms] [k] mas_prev_slot
0.24% +0.29% [kernel.kallsyms] [k] do_syscall_64
0.25% +0.28% [kernel.kallsyms] [k] mas_find
0.20% +0.28% [kernel.kallsyms] [k] kmem_cache_alloc_noprof
0.24% +0.27% [kernel.kallsyms] [k] strlen
0.26% +0.27% [kernel.kallsyms] [k] perf_event_mmap
0.25% +0.26% [kernel.kallsyms] [k] do_mmap
0.22% +0.25% [kernel.kallsyms] [k] mas_store_gfp
0.25% +0.24% [kernel.kallsyms] [k] mas_leaf_max_gap
```
I also sampled the execution counts of several key functions using bpftrace.
Without Patch:
```
@cnt[barn_put_empty_sheaf]: 38833037
@cnt[barn_replace_empty_sheaf]: 41883891
@cnt[__pcs_replace_empty_main]: 41884885
@cnt[barn_get_empty_sheaf]: 75422518
@cnt[mmap]: 489634255
```
With Patch:
```
@cnt[barn_put_empty_sheaf]: 2382910
@cnt[barn_replace_empty_sheaf]: 90681637
@cnt[__pcs_replace_empty_main]: 90683656
@cnt[barn_get_empty_sheaf]: 82710919
@cnt[mmap]: 1113853385
```
From the above results, I found that the execution count of the
barn_put_empty_sheaf function dropped by an order of magnitude after applying
the patch. This is likely due to the patch's effect: when pcs->spare is NULL,
the empty sheaf is cached in pcs->spare instead of calling barn_put_empty_sheaf.
This reduces contention on the barn spinlock significantly.
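To make this concrete for readers without the surrounding mm/slub.c context,
here is a minimal standalone sketch of the before/after behavior. This is not
the actual kernel code: the structures are reduced to the fields that matter
here, and the helper names replace_empty_main_old()/replace_empty_main_new()
are hypothetical stand-ins for the relevant branch of __pcs_replace_empty_main().
```
/*
 * Simplified model, not the real mm/slub.c code.  The barn is a shared,
 * spinlock-protected pool of sheaves; main and spare are per-CPU and can
 * be accessed without taking the barn lock.
 */
struct sheaf {
	unsigned int size;		/* objects currently cached in the sheaf */
};

struct barn;				/* shared pool, protected by a spinlock */

void barn_put_empty_sheaf(struct barn *barn, struct sheaf *sheaf);

struct percpu_sheaves {
	struct sheaf *main;		/* per-CPU, no barn lock needed */
	struct sheaf *spare;		/* per-CPU, no barn lock needed */
};

/*
 * Before the fix (hypothetical helper name): the empty main sheaf is always
 * handed back to the barn, taking the barn spinlock, even though an empty
 * sheaf is likely to be needed again soon (e.g. by __kfree_rcu_sheaf()).
 */
static void replace_empty_main_old(struct percpu_sheaves *pcs,
				   struct barn *barn, struct sheaf *full)
{
	barn_put_empty_sheaf(barn, pcs->main);	/* barn lock taken here */
	pcs->main = full;
}

/*
 * After the fix: when there is no spare, keep the empty sheaf locally.  The
 * barn lock is skipped here, and a later need for an empty sheaf can be
 * satisfied from pcs->spare without touching the barn either.
 */
static void replace_empty_main_new(struct percpu_sheaves *pcs,
				   struct barn *barn, struct sheaf *full)
{
	if (!pcs->spare)
		pcs->spare = pcs->main;		/* no barn lock needed */
	else
		barn_put_empty_sheaf(barn, pcs->main);
	pcs->main = full;
}
```
Under contention this removes barn-lock round trips from the common refill
path, which matches both the drop in barn_put_empty_sheaf() calls in the
bpftrace counts above and Zhao's observation that the spare also reduces
barn_get_empty_sheaf() pressure per operation.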
At the same time, I noticed that the execution counts for
barn_replace_empty_sheaf and __pcs_replace_empty_main increased, but their
proportion in the perf sampling decreased. This suggests that the average
execution time for these functions has decreased.
Moreover, the total number of mmap executions after applying the patch
(1113853385) is more than double that of the unpatched kernel (489634255). This
further supports our analysis: since the test case duration is fixed at 25
seconds, the patched kernel runs faster, resulting in more iterations of the
test case and more mmap executions, which in turn increases the frequency of
these functions being called.
Based on my tests, everything appears reasonable and explainable. However, I
couldn't reproduce the performance drop for nr_tasks=256, and it's unclear why
our results differ. I'd appreciate it if you could share any additional insights
or thoughts on what might be causing this discrepancy. If needed, we could also
consult Vlastimil for further suggestions to better understand the issue or
explore other potential factors.
Thanks!
--
Thanks,
Hao
>
> # Question 2: sheaf capacity
>
> Back to the original commit which triggered the lkp regression. I did more
> testing to check whether this fix could fully close the regression gap.
>
> The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache"), and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4 (f0b9d8eb98df), without and with the fix, against the
> baseline:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, the regression remains significant.
> I remembered your suggestion about a larger capacity and did the following
> testing:
>
> nr_tasks  59faa4da7cd4  59faa4da7cd4     59faa4da7cd4    59faa4da7cd4     59faa4da7cd4
>                         (with this fix)  (cap: 32->64)   (cap: 32->128)   (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Compared with the baseline (3accabda4), a larger capacity could
> significantly improve the sheaves' scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
>
> Thanks for your patience.
>
> Regards,
> Zhao
>
>