Message-ID: <6be60100-e94c-4c06-9542-29ac8bf8f013@suse.cz>
Date: Thu, 15 Jan 2026 17:19:42 +0100
From: Vlastimil Babka <vbabka@...e.cz>
To: Zhao Liu <zhao1.liu@...el.com>, Hao Li <haolee.swjtu@...il.com>
Cc: akpm@...ux-foundation.org, harry.yoo@...cle.com, cl@...two.org,
 rientjes@...gle.com, roman.gushchin@...ux.dev, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, tim.c.chen@...el.com, yu.c.chen@...el.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in
 __pcs_replace_empty_main()

On 1/15/26 11:12, Zhao Liu wrote:
> Hi Babka & Hao,
> 
>> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
>> adjusted like this:
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index f21b2f0c6f5a..ad71f01571f0 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>>          */
>>  
>>         if (pcs->main->size == 0) {
>> -               barn_put_empty_sheaf(barn, pcs->main);
>> +               if (!pcs->spare) {
>> +                       pcs->spare = pcs->main;
>> +               } else {
>> +                       barn_put_empty_sheaf(barn, pcs->main);
>> +               }
>>                 pcs->main = full;
>>                 return pcs;
>>         }
> 
> I noticed the previous lkp regression report and tested this fix:
> 
> * will-it-scale.per_process_ops
> 
> Compared with v6.19-rc4 (f0b9d8eb98df), with this fix I got these
> results:
> 
> nr_tasks   Delta
> 1          + 3.593%
> 8          + 3.094%
> 64         +60.247%
> 128        +49.344%
> 192        +27.500%
> 256        -12.077%
> 
> For the cases with nr_tasks 1-192, there are improvements. I think this
> is expected, since the pre-cached spare sheaf reduces spinlock
> contention: fewer barn_put_empty_sheaf() and barn_get_empty_sheaf() calls.
> 
> So (maybe too late),
> 
> Tested-by: Zhao Liu <zhao1.liu@...el.com>

Thanks!
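
To make the contention argument above concrete, here is a minimal userspace
model (not the kernel code; the names mirror the slub.c functions quoted
above and all counts are illustrative) of the two refill paths. Without a
spare slot, every refill of an empty main sheaf costs a barn lock operation
now and another one later; the patched path parks the empty sheaf locally
and skips the barn entirely:

/*
 * Minimal model of __pcs_replace_empty_main()'s choice: keep the empty
 * main sheaf as a local spare vs. returning it to the barn. Counts how
 * often the barn lock would be taken in each mode.
 */
#include <stdio.h>
#include <stdbool.h>

struct model_pcs {
	bool have_spare;	/* models pcs->spare != NULL */
	long barn_lock_ops;	/* barn_{put,get}_empty_sheaf() calls */
};

static void refill_empty_main(struct model_pcs *pcs, bool keep_spare)
{
	if (keep_spare && !pcs->have_spare) {
		/* patched path: pcs->spare = pcs->main, no barn access */
		pcs->have_spare = true;
	} else {
		/* old path: barn_put_empty_sheaf() under barn->lock */
		pcs->barn_lock_ops++;
	}
	/* pcs->main = full; is common to both paths and not modelled */
}

static void need_empty_sheaf_later(struct model_pcs *pcs, bool keep_spare)
{
	if (keep_spare && pcs->have_spare) {
		/* the parked spare is reused locally */
		pcs->have_spare = false;
	} else {
		/* old path: barn_get_empty_sheaf() under barn->lock */
		pcs->barn_lock_ops++;
	}
}

int main(void)
{
	for (int keep_spare = 0; keep_spare <= 1; keep_spare++) {
		struct model_pcs pcs = { 0 };

		for (int i = 0; i < 1000; i++) {
			refill_empty_main(&pcs, keep_spare);
			need_empty_sheaf_later(&pcs, keep_spare);
		}
		printf("%s barn lock operations: %ld\n",
		       keep_spare ? "with spare:   " : "without spare:",
		       pcs.barn_lock_ops);
	}
	return 0;
}

Built with a plain cc, the model reports 2000 barn lock operations without
the spare and none with it, which matches the direction of the nr_tasks
1-192 improvements above.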

> But I have two more questions that might need consideration.
> 
> # Question 1: Regression for 256 tasks
> 
> For the above test, the case with nr_tasks: 256 shows a "slight"
> regression. I did more testing:
> 
> (This is a single-round test; the 256-task data has jitter.)
> 
> nr_tasks   Delta
> 244        + 0.308%
> 248        - 0.805%
> 252        +12.070%
> 256        -11.441%
> 258        + 2.070%
> 260        + 1.252%
> 264        + 2.369%
> 268        -11.479%
> 272        + 2.130%
> 292        + 8.714%
> 296        +10.905%
> 298        +17.196%
> 300        +11.783%
> 302        + 6.620%
> 304        + 3.112%
> 308        - 5.924%
> 
> Most cases show improvement, though a few show slight regressions.
> 
> Based on the configuration of my machine:
> 
>     GNR - 2 sockets with the following NUMA topology:
> 
>     NUMA:
>       NUMA node(s):              4
>       NUMA node0 CPU(s):         0-42,172-214
>       NUMA node1 CPU(s):         43-85,215-257
>       NUMA node2 CPU(s):         86-128,258-300
>       NUMA node3 CPU(s):         129-171,301-343
> 
> Since I set CPU affinity per core, the 256-task case roughly corresponds
> to the point where Node 0 and Node 1 are fully occupied.
> 
> The following perf data compares two runs, without and with this fix:
> 
> # Baseline  Delta Abs  Shared Object            Symbol
> # ........  .........  .......................  ....................................
> #
>     61.76%     +4.78%  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>      0.93%     -0.32%  [kernel.vmlinux]         [k] __slab_free
>      0.39%     -0.31%  [kernel.vmlinux]         [k] barn_get_empty_sheaf
>      1.35%     -0.30%  [kernel.vmlinux]         [k] mas_leaf_max_gap
>      3.22%     -0.30%  [kernel.vmlinux]         [k] __kmem_cache_alloc_bulk
>      1.73%     -0.20%  [kernel.vmlinux]         [k] __cond_resched
>      0.52%     -0.19%  [kernel.vmlinux]         [k] _raw_spin_lock_irqsave
>      0.92%     +0.18%  [kernel.vmlinux]         [k] _raw_spin_lock
>      1.91%     -0.15%  [kernel.vmlinux]         [k] zap_pmd_range.isra.0
>      1.37%     -0.13%  [kernel.vmlinux]         [k] mas_wr_node_store
>      1.29%     -0.12%  [kernel.vmlinux]         [k] free_pud_range
>      0.92%     -0.11%  [kernel.vmlinux]         [k] __mmap_region
>      0.12%     -0.11%  [kernel.vmlinux]         [k] barn_put_empty_sheaf
>      0.20%     -0.09%  [kernel.vmlinux]         [k] barn_replace_empty_sheaf
>      0.31%     +0.09%  [kernel.vmlinux]         [k] get_partial_node
>      0.29%     -0.07%  [kernel.vmlinux]         [k] __rcu_free_sheaf_prepare
>      0.12%     -0.07%  [kernel.vmlinux]         [k] intel_idle_xstate
>      0.21%     -0.07%  [kernel.vmlinux]         [k] __kfree_rcu_sheaf
>      0.26%     -0.07%  [kernel.vmlinux]         [k] down_write
>      0.53%     -0.06%  libc.so.6                [.] __mmap
>      0.66%     -0.06%  [kernel.vmlinux]         [k] mas_walk
>      0.48%     -0.06%  [kernel.vmlinux]         [k] mas_prev_slot
>      0.45%     -0.06%  [kernel.vmlinux]         [k] mas_find
>      0.38%     -0.06%  [kernel.vmlinux]         [k] mas_wr_store_type
>      0.23%     -0.06%  [kernel.vmlinux]         [k] do_vmi_align_munmap
>      0.21%     -0.05%  [kernel.vmlinux]         [k] perf_event_mmap_event
>      0.32%     -0.05%  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
>      0.19%     -0.05%  [kernel.vmlinux]         [k] downgrade_write
>      0.59%     -0.05%  [kernel.vmlinux]         [k] mas_next_slot
>      0.31%     -0.05%  [kernel.vmlinux]         [k] __mmap_new_vma
>      0.44%     -0.05%  [kernel.vmlinux]         [k] kmem_cache_alloc_noprof
>      0.28%     -0.05%  [kernel.vmlinux]         [k] __vma_enter_locked
>      0.41%     -0.05%  [kernel.vmlinux]         [k] memcpy
>      0.48%     -0.04%  [kernel.vmlinux]         [k] mas_store_gfp
>      0.14%     +0.04%  [kernel.vmlinux]         [k] __put_partials
>      0.19%     -0.04%  [kernel.vmlinux]         [k] mas_empty_area_rev
>      0.30%     -0.04%  [kernel.vmlinux]         [k] do_syscall_64
>      0.25%     -0.04%  [kernel.vmlinux]         [k] mas_preallocate
>      0.15%     -0.04%  [kernel.vmlinux]         [k] rcu_free_sheaf
>      0.22%     -0.04%  [kernel.vmlinux]         [k] entry_SYSCALL_64
>      0.49%     -0.04%  libc.so.6                [.] __munmap
>      0.91%     -0.04%  [kernel.vmlinux]         [k] rcu_all_qs
>      0.21%     -0.04%  [kernel.vmlinux]         [k] __vm_munmap
>      0.24%     -0.04%  [kernel.vmlinux]         [k] mas_store_prealloc
>      0.19%     -0.04%  [kernel.vmlinux]         [k] __kmalloc_cache_noprof
>      0.34%     -0.04%  [kernel.vmlinux]         [k] build_detached_freelist
>      0.19%     -0.03%  [kernel.vmlinux]         [k] vms_complete_munmap_vmas
>      0.36%     -0.03%  [kernel.vmlinux]         [k] mas_rev_awalk
>      0.05%     -0.03%  [kernel.vmlinux]         [k] shuffle_freelist
>      0.19%     -0.03%  [kernel.vmlinux]         [k] down_write_killable
>      0.19%     -0.03%  [kernel.vmlinux]         [k] kmem_cache_free
>      0.27%     -0.03%  [kernel.vmlinux]         [k] up_write
>      0.13%     -0.03%  [kernel.vmlinux]         [k] vm_area_alloc
>      0.18%     -0.03%  [kernel.vmlinux]         [k] arch_get_unmapped_area_topdown
>      0.08%     -0.03%  [kernel.vmlinux]         [k] userfaultfd_unmap_complete
>      0.10%     -0.03%  [kernel.vmlinux]         [k] tlb_gather_mmu
>      0.30%     -0.02%  [kernel.vmlinux]         [k] ___slab_alloc
> 
> I think the interesting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. However, I still can't
> figure out why this is happening. Do you have any thoughts on it?

I'm not sure whether it's statistically significant; +0.09% could just be
noise?
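
We don't have per-run samples here, but one rough way to decide whether a
delta of that size is noise is to compare it against the run-to-run spread
with a simple Welch t-statistic; |t| well below ~2 would suggest noise.
A sketch with made-up illustrative samples (not data from the report above):

/*
 * Hypothetical per-run shares (%) of get_partial_node, for illustration
 * only. Build with: cc -O2 welch.c -lm  (hypothetical file name)
 */
#include <stdio.h>
#include <math.h>

static void mean_var(const double *x, int n, double *mean, double *var)
{
	double sum = 0.0, sq = 0.0;

	for (int i = 0; i < n; i++)
		sum += x[i];
	*mean = sum / n;
	for (int i = 0; i < n; i++)
		sq += (x[i] - *mean) * (x[i] - *mean);
	*var = sq / (n - 1);	/* sample variance */
}

int main(void)
{
	double base[]  = { 0.22, 0.38, 0.27, 0.40, 0.28 };	/* w/o fix */
	double fixed[] = { 0.31, 0.47, 0.36, 0.49, 0.37 };	/* with fix */
	int n = 5;
	double mb, vb, mf, vf;

	mean_var(base, n, &mb, &vb);
	mean_var(fixed, n, &mf, &vf);

	double t = (mf - mb) / sqrt(vb / n + vf / n);
	printf("delta = %+.3f%%, Welch t = %.2f\n", mf - mb, t);
	return 0;
}

With spread like the invented samples above, the t-statistic stays below 2,
so a +0.09% shift would be indistinguishable from noise; only much tighter
repeated runs would make the same delta look significant.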

> # Question 2: sheaf capacity
> 
> Back to the original commit that triggered the lkp regression. I did
> more testing to check whether this fix could completely close the
> regression gap.
> 
> The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache"); its successor, commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache"), introduces the regression.
> 
> I compared v6.19-rc4 (f0b9d8eb98df), without and with the fix, against
> the baseline:
> 
> nr_tasks   w/o fix    with fix
> 1          - 3.643%   - 0.181%
> 8          -12.523%   - 9.816%
> 64         -50.378%   -20.482%
> 128        -36.736%   - 5.518%
> 192        -22.963%   - 1.777%
> 256        -32.926%   -41.026%
> 
> It appears that under extreme conditions the regression remains
> significant. I remembered your suggestion about a larger capacity and
> did the following testing:
> 
> nr_tasks  59faa4da7cd4   59faa4da7cd4     59faa4da7cd4   59faa4da7cd4    59faa4da7cd4
>           (unmodified)   (with this fix)  (cap: 32->64)  (cap: 32->128)  (cap: 32->256)
> 1         - 8.789%       - 8.805%         - 8.185%       - 9.912%        - 8.673%
> 8         -12.256%       - 9.219%         -10.460%       -10.070%        - 8.819%
> 64        -38.915%       - 8.172%         - 4.700%       + 4.571%        + 8.793%
> 128       - 8.032%       +11.377%         +23.232%       +26.940%        +30.573%
> 192       - 1.220%       + 9.758%         +20.573%       +22.645%        +25.768%
> 256       - 6.570%       + 9.967%         +21.663%       +30.103%        +33.876%
> 
> Compared with the baseline (3accabda4), a larger capacity significantly
> improves sheaf scalability.
> 
> So, I'd like to know whether you think dynamically or adaptively
> adjusting the capacity is a worthwhile idea.

In the follow-up series, the capacity will be determined automatically to
roughly match the current capacity of cpu partial slabs:

https://lore.kernel.org/all/20260112-sheaves-for-all-v2-4-98225cfb50cf@suse.cz/

We can use that as a starting point for further tuning. But I suspect
making it adjust dynamically would be complicated.
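
As a purely hypothetical sketch of the idea (this is not the logic from the
linked series; the thresholds and values below are invented for
illustration), an automatically determined capacity could simply shrink as
the object size grows, similar in spirit to how the cpu partial slab limits
scale with object size:

/*
 * Invented example: pick a sheaf capacity from the object size, so small
 * objects get large sheaves and page-sized objects get small ones.
 */
#include <stdio.h>

static unsigned int guess_sheaf_capacity(unsigned int object_size)
{
	if (object_size >= 4096)
		return 8;
	if (object_size >= 1024)
		return 32;
	if (object_size >= 256)
		return 64;
	return 128;
}

int main(void)
{
	unsigned int sizes[] = { 64, 192, 256, 1024, 4096 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("object size %4u -> capacity %u\n",
		       sizes[i], guess_sheaf_capacity(sizes[i]));
	return 0;
}

Making it adapt at runtime would additionally need a safe way to resize or
replace already-allocated sheaves on each cpu, which is presumably part of
the complication mentioned above.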

> Thanks for your patience.
> 
> Regards,
> Zhao
> 

