Message-ID: <ad9864db-a297-44d9-ab1a-61e0285eac5f@suse.cz>
Date: Thu, 16 Oct 2025 18:15:30 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: "D, Suneeth" <Suneeth.D@....com>, Suren Baghdasaryan <surenb@...gle.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Christoph Lameter <cl@...two.org>, David Rientjes <rientjes@...gle.com>
Cc: Roman Gushchin <roman.gushchin@...ux.dev>,
Harry Yoo <harry.yoo@...cle.com>, Uladzislau Rezki <urezki@...il.com>,
Sidhartha Kumar <sidhartha.kumar@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, rcu@...r.kernel.org,
maple-tree@...ts.infradead.org
Subject: Re: [PATCH v8 15/23] maple_tree: use percpu sheaves for
maple_node_cache
On 10/16/25 17:16, D, Suneeth wrote:
> Hi Vlastimil Babka,
>
> On 9/10/2025 1:31 PM, Vlastimil Babka wrote:
>> Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
>> improve its performance. Note this will not immediately take advantage
>> of sheaf batching of kfree_rcu() operations due to the maple tree using
>> call_rcu with custom callbacks. The followup changes to maple tree will
>> change that and also make use of the prefilled sheaves functionality.
>>
>
>
> We run the will-it-scale-process-mmap2 micro-benchmark as part of our weekly
> CI for kernel performance regression testing between a stable and an rc
> kernel. In this week's run we observed a severe regression on AMD platforms
> (Turin and Bergamo) when running the micro-benchmark between kernels v6.17
> and v6.18-rc1: 12-13% on Turin and 22-26% on Bergamo. Bisecting landed me
> on commit 59faa4da7cd4565cbce25358495556b75bb37022 as the first bad commit.
> The machine configurations and test parameters used were as follows:
>
> Model name: AMD EPYC 128-Core Processor [Bergamo]
> Thread(s) per core: 2
> Core(s) per socket: 128
> Socket(s): 1
> Total online memory: 258G
>
> Model name: AMD EPYC 64-Core Processor [Turin]
> Thread(s) per core: 2
> Core(s) per socket: 64
> Socket(s): 1
> Total online memory: 258G
>
> Test params:
>
> nr_task: [1 8 64 128 192 256]
> mode: process
> test: mmap2
> kpi: per_process_ops
> cpufreq_governor: performance
>
> The following are the stats after bisection
> (the KPI used here is per_process_ops):
>
> kernel_version                                             per_process_ops
> --------------                                             ---------------
> v6.17.0                                                    258291
> v6.18.0-rc1                                                225839
> v6.17.0-rc3-59faa4da7 (first bad commit)                   212152
> v6.17.0-rc3-3accabda4da1 (one commit before bad commit)    265054
Thanks for the info. Is there any difference if you increase the
sheaf_capacity in the commit from 32 to a higher value, for example 120, to
match the automatically calculated cpu partial slabs target? I think lock
contention on the barn lock is causing the regression. By matching the cpu
partial slabs value we should get the same batching factor for the barn lock
as there was on the node list_lock before sheaves. Thanks.
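
For reference, that experiment is a one-value tweak to the hunk quoted
further below (a minimal sketch; only the .sheaf_capacity value changes
from the original patch):

void __init maple_tree_init(void)
{
	struct kmem_cache_args args = {
		.align = sizeof(struct maple_node),
		/* bumped from 32 to match the cpu partial slabs target */
		.sheaf_capacity = 120,
	};

	maple_node_cache = kmem_cache_create("maple_node",
			sizeof(struct maple_node), &args,
			SLAB_PANIC);
}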
> Recreation steps:
>
> 1) git clone https://github.com/antonblanchard/will-it-scale.git
> 2) git clone https://github.com/intel/lkp-tests.git
> 3) cd will-it-scale && git apply lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
> 4) make
> 5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256
>
> NOTE: step [5] is specific to the machine's architecture. The arguments
> starting from 1 are the array of task counts to run the testcase with;
> here they correspond to the number of cores per CCX, per NUMA node, per
> socket, and nr_threads.
>
> I also ran the micro-benchmark with tools/testing/perf record, and the
> following is the collected data:
>
> # perf diff perf.data.old perf.data
> No kallsyms or vmlinux with build-id 0fc9c7b62ade1502af5d6a060914732523f367ef was found
> Warning:
> 43 out of order events recorded.
> Warning:
> 54 out of order events recorded.
> # Event 'cycles:P'
> #
> # Baseline  Delta Abs  Shared Object      Symbol
> # ........  .........  .................  ....................................
> #
>             +51.51%    [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
>             +14.39%    [kernel.kallsyms]  [k] perf_iterate_ctx
>              +2.52%    [kernel.kallsyms]  [k] unmap_page_range
>              +1.75%    [kernel.kallsyms]  [k] mas_wr_node_store
>              +1.47%    [kernel.kallsyms]  [k] __pi_memset
>              +1.38%    [kernel.kallsyms]  [k] mt_free_rcu
>              +1.36%    [kernel.kallsyms]  [k] free_pgd_range
>              +1.10%    [kernel.kallsyms]  [k] __pi_memcpy
>              +0.96%    [kernel.kallsyms]  [k] __kmem_cache_alloc_bulk
>              +0.92%    [kernel.kallsyms]  [k] __mmap_region
>              +0.79%    [kernel.kallsyms]  [k] mas_empty_area_rev
>              +0.74%    [kernel.kallsyms]  [k] __cond_resched
>              +0.73%    [kernel.kallsyms]  [k] mas_walk
>              +0.59%    [kernel.kallsyms]  [k] mas_pop_node
>              +0.57%    [kernel.kallsyms]  [k] perf_event_mmap_output
>              +0.49%    [kernel.kallsyms]  [k] mas_find
>              +0.48%    [kernel.kallsyms]  [k] mas_next_slot
>              +0.46%    [kernel.kallsyms]  [k] kmem_cache_free
>              +0.42%    [kernel.kallsyms]  [k] mas_leaf_max_gap
>              +0.42%    [kernel.kallsyms]  [k] __call_rcu_common.constprop.0
>              +0.39%    [kernel.kallsyms]  [k] entry_SYSCALL_64
>              +0.38%    [kernel.kallsyms]  [k] mas_prev_slot
>              +0.38%    [kernel.kallsyms]  [k] kmem_cache_alloc_noprof
>              +0.37%    [kernel.kallsyms]  [k] mas_store_gfp
>
>
>> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@...cle.com>
>> Reviewed-by: Suren Baghdasaryan <surenb@...gle.com>
>> Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
>> ---
>> lib/maple_tree.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>>  
>>  void __init maple_tree_init(void)
>>  {
>> +	struct kmem_cache_args args = {
>> +		.align = sizeof(struct maple_node),
>> +		.sheaf_capacity = 32,
>> +	};
>> +
>>  	maple_node_cache = kmem_cache_create("maple_node",
>> -			sizeof(struct maple_node), sizeof(struct maple_node),
>> -			SLAB_PANIC, NULL);
>> +			sizeof(struct maple_node), &args,
>> +			SLAB_PANIC);
>>  }
>>
>> /**
>>
>
> ---
> Thanks and Regards
> Suneeth D
>