Message-ID: <aWi9nAbIkTfYFoMM@intel.com>
Date: Thu, 15 Jan 2026 18:12:44 +0800
From: Zhao Liu <zhao1.liu@...el.com>
To: Vlastimil Babka <vbabka@...e.cz>, Hao Li <haolee.swjtu@...il.com>
Cc: akpm@...ux-foundation.org, harry.yoo@...cle.com, cl@...two.org,
rientjes@...gle.com, roman.gushchin@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, tim.c.chen@...el.com,
yu.c.chen@...el.com, zhao1.liu@...el.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in
__pcs_replace_empty_main()
Hi Vlastimil & Hao,
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
> pcs->main = full;
> return pcs;
> }
I noticed the previous lkp regression report and tested this fix:

* will-it-scale.per_process_ops

Compared with v6.19-rc4 (f0b9d8eb98df), with this fix I have the
following results:
nr_tasks Delta
1 + 3.593%
8 + 3.094%
64 +60.247%
128 +49.344%
192 +27.500%
256 -12.077%
For the cases with nr_tasks 1-192, there are improvements. I think
this is expected, since the pre-cached spare sheaf reduces the spinlock
contention: it avoids round-trips through barn_put_empty_sheaf() &
barn_get_empty_sheaf().
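To make sure I understand the mechanism, here is a simplified
user-space model of the affected path (the pcs/barn/sheaf names follow
mm/slub.c, but the structures and locking below are my illustrative
assumptions, not the kernel code; bounds checks are omitted):

#include <pthread.h>

struct sheaf { unsigned int size; };

struct barn {
	pthread_mutex_t lock;		/* models the contended barn spinlock */
	struct sheaf *empty[16];
	unsigned int nr_empty;
};

struct pcs {				/* per-CPU sheaves */
	struct sheaf *main;
	struct sheaf *spare;
};

/* Before the fix: every empty main sheaf takes the barn lock. */
static void replace_empty_main_old(struct pcs *pcs, struct barn *barn,
				   struct sheaf *full)
{
	pthread_mutex_lock(&barn->lock);
	barn->empty[barn->nr_empty++] = pcs->main;
	pthread_mutex_unlock(&barn->lock);
	pcs->main = full;
}

/* After the fix: with no spare, keep the empty sheaf locally and skip
 * the barn lock; a later path that needs an empty sheaf can then
 * consume pcs->spare instead of calling barn_get_empty_sheaf(),
 * saving a second lock round-trip. */
static void replace_empty_main_new(struct pcs *pcs, struct barn *barn,
				   struct sheaf *full)
{
	if (!pcs->spare) {
		pcs->spare = pcs->main;	/* lock-free, per-CPU */
	} else {
		pthread_mutex_lock(&barn->lock);
		barn->empty[barn->nr_empty++] = pcs->main;
		pthread_mutex_unlock(&barn->lock);
	}
	pcs->main = full;
}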
So (maybe too late),
Tested-by: Zhao Liu <zhao1.liu@...el.com>
But I have two more questions that might need consideration.
# Question 1: Regression for 256 tasks
In the above test, the nr_tasks: 256 case shows a "slight" regression.
I did more testing around that point (note this is a single-round
test, so the data around 256 tasks has some jitter):
nr_tasks	Delta
244		+ 0.308%
248		- 0.805%
252		+12.070%
256		-11.441%
258		+ 2.070%
260		+ 1.252%
264		+ 2.369%
268		-11.479%
272		+ 2.130%
292		+ 8.714%
296		+10.905%
298		+17.196%
300		+11.783%
302		+ 6.620%
304		+ 3.112%
308		- 5.924%
Most cases show improvement, though a few show a slight regression.
For reference, the configuration of my machine: GNR (Granite Rapids),
2 sockets, with the following NUMA topology:
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-42,172-214
NUMA node1 CPU(s): 43-85,215-257
NUMA node2 CPU(s): 86-128,258-300
NUMA node3 CPU(s): 129-171,301-343
Since I set the CPU affinity per core, the 256-task case roughly
corresponds to the point where node 0 and node 1 are fully occupied.
The following is the perf diff data comparing the two runs, w/o fix
vs. with this fix:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ....................... ....................................
#
61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
0.93% -0.32% [kernel.vmlinux] [k] __slab_free
0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
0.26% -0.07% [kernel.vmlinux] [k] down_write
0.53% -0.06% libc.so.6 [.] __mmap
0.66% -0.06% [kernel.vmlinux] [k] mas_walk
0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
0.45% -0.06% [kernel.vmlinux] [k] mas_find
0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
0.41% -0.05% [kernel.vmlinux] [k] memcpy
0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
0.14% +0.04% [kernel.vmlinux] [k] __put_partials
0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
0.49% -0.04% libc.so.6 [.] __munmap
0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
0.27% -0.03% [kernel.vmlinux] [k] up_write
0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
I think the interesting item is "get_partial_node": it seems this fix
makes get_partial_node() slightly more frequent. However, I still
can't figure out why this is happening. Do you have any thoughts on it?
# Question 2: sheaf capacity
Going back to the original commit that triggered the lkp regression, I
did more testing to check whether this fix could fully close the
regression gap.

The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
vm_area_struct cache"); its successor commit 59faa4da7cd4 ("maple_tree:
use percpu sheaves for maple_node_cache") introduced the regression.

I compared v6.19-rc4 (f0b9d8eb98df), w/o fix & with fix, against the
baseline:
nr_tasks	w/o fix		with fix
1		- 3.643%	- 0.181%
8		-12.523%	- 9.816%
64		-50.378%	-20.482%
128		-36.736%	- 5.518%
192		-22.963%	- 1.777%
256		-32.926%	-41.026%
It appears that under extreme conditions the regression remains
significant.
I remembered your suggestion about a larger sheaf capacity and did the
following testing:
nr_tasks  59faa4da7cd4  59faa4da7cd4     59faa4da7cd4   59faa4da7cd4    59faa4da7cd4
          (w/o fix)     (with this fix)  (cap: 32->64)  (cap: 32->128)  (cap: 32->256)
1         - 8.789%      - 8.805%         - 8.185%       - 9.912%        - 8.673%
8         -12.256%      - 9.219%         -10.460%       -10.070%        - 8.819%
64        -38.915%      - 8.172%         - 4.700%       + 4.571%        + 8.793%
128       - 8.032%      +11.377%         +23.232%       +26.940%        +30.573%
192       - 1.220%      + 9.758%         +20.573%       +22.645%        +25.768%
256       - 6.570%      + 9.967%         +21.663%       +30.103%        +33.876%
Compared with the baseline (3accabda4), a larger capacity could
significantly improve the sheaves' scalability.
So, I'd like to know if you think dynamically or adaptively adjusting
capacity is a worthwhile idea.
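For instance, one strawman policy could look like the sketch below.
This is purely hypothetical: all names here are invented for
illustration and nothing is based on existing code. The idea is to
periodically grow a cache's sheaf capacity while the barn lock shows
significant contention, stepping through the same 32->64->128->256
sizes I tested above.

#define SHEAF_CAP_MAX	256

struct sheaf_tuning {
	unsigned long contended;	/* barn lock slowpath entries */
	unsigned long total;		/* all barn lock acquisitions */
	unsigned int capacity;		/* current sheaf capacity */
};

/* Hypothetical periodic callback: double the capacity while more than
 * 10% of barn lock acquisitions were contended. */
static void sheaf_tune_capacity(struct sheaf_tuning *t)
{
	if (t->total < 1000)
		return;			/* too few samples to judge */

	if (t->contended * 10 > t->total && t->capacity < SHEAF_CAP_MAX)
		t->capacity *= 2;

	t->contended = 0;
	t->total = 0;
}

Of course, growing the capacity also grows per-CPU memory usage, so
any real policy would presumably need a shrink path as well.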
Thanks for your patience.
Regards,
Zhao