Message-ID: <aPEZdHJlNOofy5tm@milan>
Date: Thu, 16 Oct 2025 18:12:36 +0200
From: Uladzislau Rezki <urezki@...il.com>
To: "Vishal Moola (Oracle)" <vishal.moola@...il.com>
Cc: Matthew Wilcox <willy@...radead.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Uladzislau Rezki <urezki@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH] mm/vmalloc: request large order pages from buddy allocator

On Wed, Oct 15, 2025 at 02:28:49AM -0700, Vishal Moola (Oracle) wrote:
> On Wed, Oct 15, 2025 at 04:56:42AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 14, 2025 at 11:27:54AM -0700, Vishal Moola (Oracle) wrote:
> > > Running 1000 iterations of allocations on a small 4GB system finds:
> > >
> > > 1000 2MB allocations:
> > > [Baseline] [This patch]
> > > real 46.310s real 34.380s
> > > user 0.001s user 0.008s
> > > sys 46.058s sys 34.152s
> > >
> > > 10000 200KB allocations:
> > > [Baseline] [This patch]
> > > real 56.104s real 43.946s
> > > user 0.001s user 0.003s
> > > sys 55.375s sys 43.259s
> > >
> > > 10000 20KB allocations:
> > > [Baseline] [This patch]
> > > real 0m8.438s real 0m9.160s
> > > user 0m0.001s user 0m0.002s
> > > sys 0m7.936s sys 0m8.671s
> >
> > I'd be more confident in the 20kB numbers if you'd done 10x more
> > iterations.
>
> I actually ran mine a number of times to mitigate the effects of possibly
> too-small sample sizes, so I do have that number for you too:
>
> [Baseline] [This patch]
> real 1m28.119s real 1m32.630s
> user 0m0.012s user 0m0.011s
> sys 1m23.270s sys 1m28.529s
>
I have just had a look at the performance figures for this patch. The test
case is a 16K allocation by a single thread, 1,000,000 loops, 10 runs:
sudo ./test_vmalloc.sh run_test_mask=1 nr_threads=1 nr_pages=4
BOX: AMD Milan, 256 CPUs, 512GB of memory
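
For context, run_test_mask=1 above drives fix_size_alloc_test from
lib/test_vmalloc.c, which is essentially the loop below. This is a
simplified reading rather than the verbatim function, so the parameter
declarations and the write that touches the area are approximations:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * test_loop_count and nr_pages are module parameters of test_vmalloc;
 * shown here as plain variables to keep the sketch self-contained.
 */
static unsigned int test_loop_count = 1000000;
static unsigned int nr_pages = 4;

static int fix_size_alloc_test(void)
{
	void *ptr;
	unsigned int i;

	for (i = 0; i < test_loop_count; i++) {
		/* nr_pages=4 -> one 16K request per iteration. */
		ptr = vmalloc(nr_pages * PAGE_SIZE);
		if (!ptr)
			return -1;

		/* Touch the area so the mapping is actually used. */
		*((unsigned char *)ptr) = 0;

		vfree(ptr);
	}

	return 0;
}

With nr_pages=4 every iteration maps and unmaps a 16K area, so the
averages below are dominated by the vmalloc()/vfree() path itself.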
# default 16K alloc
[ 15.823704] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955334 usec
[ 17.751685] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1158739 usec
[ 19.443759] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1016522 usec
[ 21.035701] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 911381 usec
[ 22.727688] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 987286 usec
[ 24.199694] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 955112 usec
[ 25.755675] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 926393 usec
[ 27.355670] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 937875 usec
[ 28.979671] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1006985 usec
[ 30.531674] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 941088 usec
# the patch 16K alloc
[ 44.343380] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2296849 usec
[ 47.171290] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2014678 usec
[ 50.007258] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2094184 usec
[ 52.651141] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1953046 usec
[ 55.455089] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2209423 usec
[ 57.943153] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1941747 usec
[ 60.799043] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2038504 usec
[ 63.299007] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 1788588 usec
[ 65.843011] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2137055 usec
[ 68.647031] Summary: fix_size_alloc_test passed: 1 failed: 0 xfailed: 0 repeat: 1 loops: 1000000 avg: 2193022 usec
2X slower.
perf-cycles, same test but on 64 CPUs:
+ 97.02% 0.13% [test_vmalloc] [k] fix_size_alloc_test
- 82.11% 82.10% [kernel] [k] native_queued_spin_lock_slowpath
26.19% ret_from_fork_asm
ret_from_fork
- kthread
- 25.96% test_func
- fix_size_alloc_test
- 23.49% __vmalloc_node_noprof
- __vmalloc_node_range_noprof
- 54.70% alloc_pages_noprof
alloc_pages_mpol
__alloc_frozen_pages_noprof
get_page_from_freelist
__rmqueue_pcplist
- 5.58% __get_vm_area_node
alloc_vmap_area
- 20.54% vfree.part.0
- 20.43% __free_frozen_pages
free_frozen_page_commit
free_pcppages_bulk
_raw_spin_lock_irqsave
native_queued_spin_lock_slowpath
- 0.77% worker_thread
- process_one_work
- 0.76% vmstat_update
refresh_cpu_vm_stats
decay_pcp_high
free_pcppages_bulk
_raw_spin_lock_irqsave
native_queued_spin_lock_slowpath
+ 76.57% 0.16% [kernel] [k] _raw_spin_lock_irqsave
+ 71.62% 0.00% [kernel] [k] __vmalloc_node_noprof
+ 71.61% 0.58% [kernel] [k] __vmalloc_node_range_noprof
+ 62.35% 0.06% [kernel] [k] alloc_pages_mpol
+ 62.27% 0.17% [kernel] [k] __alloc_frozen_pages_noprof
+ 62.20% 0.02% [kernel] [k] alloc_pages_noprof
+ 62.10% 0.05% [kernel] [k] get_page_from_freelist
+ 55.63% 0.19% [kernel] [k] __rmqueue_pcplist
+ 32.11% 0.00% [kernel] [k] ret_from_fork_asm
+ 32.11% 0.00% [kernel] [k] ret_from_fork
+ 32.11% 0.00% [kernel] [k] kthread
I would say the bottleneck is the page allocator. It seems high-order
allocations are not a good fit for it.
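
To make it concrete what "high-order allocations" means here, below is my
own minimal sketch of the "request a large order and split it" idea from
the RFC's subject line. It is not the patch code: the helper name, the
clamp to PAGE_ALLOC_COSTLY_ORDER and the order-0 fallback are assumptions
made purely for illustration:

#include <linux/gfp.h>
#include <linux/log2.h>
#include <linux/minmax.h>
#include <linux/mm.h>

static unsigned int
alloc_pages_high_order_split(gfp_t gfp, struct page **pages,
			     unsigned int nr_needed)
{
	unsigned int allocated = 0;

	while (allocated < nr_needed) {
		unsigned int order, i;
		struct page *page;

		/* Largest order that still fits the remaining pages. */
		order = min_t(unsigned int, ilog2(nr_needed - allocated),
			      PAGE_ALLOC_COSTLY_ORDER);

		/*
		 * gfp must not contain __GFP_COMP: split_page() below
		 * expects a non-compound page.
		 */
		page = alloc_pages(gfp, order);
		if (!page && order) {
			/* High-order request failed, retry with order-0. */
			order = 0;
			page = alloc_pages(gfp, 0);
		}
		if (!page)
			break;

		/*
		 * Turn the block into independent order-0 pages for the
		 * vmalloc page array.
		 */
		if (order)
			split_page(page, order);

		for (i = 0; i < (1U << order); i++)
			pages[allocated++] = page + i;
	}

	return allocated;
}

With a pattern like this, each 16K request becomes an order-2 buddy
request rather than four order-0 ones, which lines up with the zone->lock
contention visible under __rmqueue_pcplist and free_pcppages_bulk in the
profile above.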
--
Uladzislau Rezki