Message-ID: <aPEubI4kWvzSC5RN@fedora>
Date: Thu, 16 Oct 2025 10:42:04 -0700
From: "Vishal Moola (Oracle)" <vishal.moola@...il.com>
To: Uladzislau Rezki <urezki@...il.com>
Cc: Matthew Wilcox <willy@...radead.org>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator

On Thu, Oct 16, 2025 at 06:12:36PM +0200, Uladzislau Rezki wrote:
> On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> > On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > > Running 1000 iterations of allocations on a small 4GB system finds:
> > > > 
> > > > 1000 2mb allocations:
> > > > 	[Baseline]			[This patch]
> > > > 	real    46.310s			real    34.380s
> > > > 	user    0.001s			user    0.008s
> > > > 	sys     46.058s			sys     34.152s
> > > > 
> > > > 10000 200kb allocations:
> > > > 	[Baseline]			[This patch]
> > > > 	real    56.104s			real    43.946s
> > > > 	user    0.001s			user    0.003s
> > > > 	sys     55.375s			sys     43.259s
> > > > 
> > > > 10000 20kb allocations:
> > > > 	[Baseline]			[This patch]
> > > > 	real    0m8.438s		real    0m9.160s
> > > > 	user    0m0.001s		user    0m0.002s
> > > > 	sys     0m7.936s		sys     0m8.671s
> > > 
> > > I'd be more confident in the 20kB numbers if you'd done 10x more
> > > iterations.
> > 
> > I actually ran my tests a number of times to mitigate the effects of
> > possibly too-small sample sizes, so I do have that number for you too:
> > 
> > [Baseline]			[This patch]
> > real    1m28.119s		real    1m32.630s
> > user    0m0.012s		user    0m0.011s
> > sys     1m23.270s		sys     1m28.529s
> > 
> I have just had a look at the performance figures for this patch. The test
> case is a 16K allocation by a single thread, 1,000,000 loops, 10 runs:
> 
> sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4
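
For reference, run_test_mask=1 selects fix_size_alloc_test (as the summaries
below show) and nr_pages=4 makes each allocation 16K. A rough sketch of that
test's loop, from memory rather than the exact lib/test_vmalloc.c source
(test_loop_count and nr_pages are the module parameters set on the command
line above):

	/* Sketch of fix_size_alloc_test, not the exact upstream code. */
	#include <linux/vmalloc.h>

	static int fix_size_alloc_test(void)
	{
		void *ptr;
		int i;

		for (i = 0; i < test_loop_count; i++) {
			ptr = vmalloc(nr_pages * PAGE_SIZE); /* 4 pages == 16K */
			if (!ptr)
				return -1;

			*((__u8 *)ptr) = 0;	/* touch the mapping */
			vfree(ptr);		/* freed again right away */
		}

		return 0;
	}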

The reason I didn't use this test module is the same concern Matthew
brought up earlier about testing the PCP list rather than the buddy
allocator. The test module allocates and then immediately frees, over and
over again, making it incredibly prone to reusing the same pages.
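
The shape I'm testing instead keeps every allocation live and only frees
them at the end, so each vmalloc() has to pull fresh pages from the buddy
allocator rather than recycling what the previous iteration just returned
to the per-CPU lists. Roughly something like this (illustrative only, not
my exact harness; the sizes mirror the 2MB case above):

	#include <linux/vmalloc.h>
	#include <linux/slab.h>

	#define NR_ALLOCS	1000
	#define ALLOC_SIZE	(2UL << 20)	/* 2MB */

	static int batch_alloc_then_free(void)
	{
		void **ptrs;
		int i;

		ptrs = kcalloc(NR_ALLOCS, sizeof(*ptrs), GFP_KERNEL);
		if (!ptrs)
			return -ENOMEM;

		/* Nothing is freed until all allocations are done. */
		for (i = 0; i < NR_ALLOCS; i++) {
			ptrs[i] = vmalloc(ALLOC_SIZE);
			if (!ptrs[i])
				break;
		}

		while (i--)
			vfree(ptrs[i]);

		kfree(ptrs);
		return 0;
	}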

> BOX: AMD Milan, 256 CPUs, 512GB of memory
> 
> # default 16K alloc
> [   15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
> [   17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
> [   19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
> [   21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
> [   22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
> [   24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
> [   25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
> [   27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
> [   28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
> [   30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
> 
> # the patch 16K alloc
> [   44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
> [   47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
> [   50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
> [   52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
> [   55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
> [   57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
> [   60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
> [   63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
> [   65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
> [   68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
> 
> 2X slower.
> 
> perf-cycles, same test but on 64 CPUs:
> 
> +   97.02%     0.13%  [test_vmalloc]    [k] fix_size_alloc_test
> -   82.11%    82.10%  [kernel]          [k] native_queued_spin_lock_slowpath
>      26.19% ret_from_fork_asm
>         ret_from_fork
>       - kthread
>          - 25.96% test_func
>             - fix_size_alloc_test
>                - 23.49% __vmalloc_node_noprof
>                   - __vmalloc_node_range_noprof
>                      - 54.70% alloc_pages_noprof
>                           alloc_pages_mpol
>                           __alloc_frozen_pages_noprof
>                           get_page_from_freelist
>                           __rmqueue_pcplist
>                      - 5.58% __get_vm_area_node
>                           alloc_vmap_area
>                - 20.54% vfree.part.0
>                   - 20.43% __free_frozen_pages
>                        free_frozen_page_commit
>                        free_pcppages_bulk
>                        _raw_spin_lock_irqsave
>                        native_queued_spin_lock_slowpath
>          - 0.77% worker_thread
>             - process_one_work
>                - 0.76% vmstat_update
>                     refresh_cpu_vm_stats
>                     decay_pcp_high
>                     free_pcppages_bulk
>                     _raw_spin_lock_irqsave
>                     native_queued_spin_lock_slowpath
> +   76.57%     0.16%  [kernel]          [k] _raw_spin_lock_irqsave
> +   71.62%     0.00%  [kernel]          [k] __vmalloc_node_noprof
> +   71.61%     0.58%  [kernel]          [k] __vmalloc_node_range_noprof
> +   62.35%     0.06%  [kernel]          [k] alloc_pages_mpol
> +   62.27%     0.17%  [kernel]          [k] __alloc_frozen_pages_noprof
> +   62.20%     0.02%  [kernel]          [k] alloc_pages_noprof
> +   62.10%     0.05%  [kernel]          [k] get_page_from_freelist
> +   55.63%     0.19%  [kernel]          [k] __rmqueue_pcplist
> +   32.11%     0.00%  [kernel]          [k] ret_from_fork_asm
> +   32.11%     0.00%  [kernel]          [k] ret_from_fork
> +   32.11%     0.00%  [kernel]          [k] kthread
> 
> I would say the bottleneck is the page allocator. It seems high-order
> allocations are not good for it.
> 
> --
> Uladzislau Rezki
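
For anyone skimming the thread: the idea in the subject line is roughly to
ask the buddy allocator for one high-order block, split it into the 0-order
pages vmalloc needs, and fall back to single-page allocations when the
high-order request fails. An illustrative sketch of that shape (not the
actual patch):

	#include <linux/gfp.h>
	#include <linux/mm.h>

	static unsigned int bulk_pages_highorder(gfp_t gfp, unsigned int order,
						 struct page **pages)
	{
		unsigned int i, nr = 1U << order;
		struct page *page = alloc_pages(gfp, order);

		if (page) {
			/* Hand out the individual 0-order pages. */
			split_page(page, order);
			for (i = 0; i < nr; i++)
				pages[i] = page + i;
			return nr;
		}

		/* High-order request failed: fall back to order-0 pages. */
		for (i = 0; i < nr; i++) {
			pages[i] = alloc_pages(gfp, 0);
			if (!pages[i])
				break;
		}
		return i;
	}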
