Message-ID: <ZdcHXFMq0cwmVKfM@pc636>
Date: Thu, 22 Feb 2024 09:35:40 +0100
From: Uladzislau Rezki <urezki@...il.com>
To: Matthew Wilcox <willy@...radead.org>, Mel Gorman <mgorman@...e.de>,
kirill.shutemov@...ux.intel.com,
Vishal Moola <vishal.moola@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>, Baoquan He <bhe@...hat.com>,
Lorenzo Stoakes <lstoakes@...il.com>,
Christoph Hellwig <hch@...radead.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Dave Chinner <david@...morbit.com>,
"Paul E . McKenney" <paulmck@...nel.org>,
Joel Fernandes <joel@...lfernandes.org>,
Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
linux-mm@...ck.org
Subject: Re: [PATCH v3 00/11] Mitigate a vmap lock contention v3
Hello, folks!
> This is v3. It is based on the 6.7.0-rc8.
>
> 1. Motivation
>
> - Offload the global vmap locks so the code scales with the number of CPUs;
> - If possible, and if there is an agreement, remove the "per-cpu kva allocator"
>   to make the vmap code simpler;
> - There were complaints from XFS folks that vmalloc can be contended on
>   their workloads.
>
> 2. Design (high-level overview)
>
> We introduce an effective vmap node logic. A node behaves as an independent
> entity that serves an allocation request directly (if possible) from its pool.
> That way it bypasses the global vmap space, which is protected by its own lock.
>
> Access to the pools is serialized per CPU. The number of nodes is equal to
> the number of CPUs in a system. Please note the upper threshold is bound to
> 128 nodes.
>
> Pools are size-segregated and populated based on system demand. The maximum
> alloc request that can be stored in the segregated storage is 256 pages. The
> lazy drain path decays a pool by 25% as a first step, and as a second step
> repopulates it with freshly freed VAs for reuse instead of returning them to
> the global space.
>
> When a VA is obtained (alloc path), it is stored in a separate node. The
> va->va_start address is converted to the node where it should be placed and
> reside. Doing so balances VAs across the nodes, so access becomes scalable.
> The addr_to_node() function converts an address to the correct node.
>
> The vmap space is divided into fixed-size segments of 16 pages. That way
> any address can be associated with a segment number. The number of segments
> is equal to num_possible_cpus(), but not greater than 128. The numbering
> starts from 0. See below how an address is converted:
>
> static inline unsigned int
> addr_to_node_id(unsigned long addr)
> {
> 	return (addr / zone_size) % nr_nodes;
> }
>
> On the free path, a VA is easily found by converting its "va_start" address
> to the node where it resides. It is moved from the "busy" to the "lazy" data
> structure. Later on, as noted earlier, the lazy kworker decays each node pool
> and populates it with fresh incoming VAs. Please note, a VA is returned to
> the node that served the alloc request.
>
> 3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
>
> sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
>
> <default perf>
> 94.41% 0.89% [kernel] [k] _raw_spin_lock
> 93.35% 93.07% [kernel] [k] native_queued_spin_lock_slowpath
> 76.13% 0.28% [kernel] [k] __vmalloc_node_range
> 72.96% 0.81% [kernel] [k] alloc_vmap_area
> 56.94% 0.00% [kernel] [k] __get_vm_area_node
> 41.95% 0.00% [kernel] [k] vmalloc
> 37.15% 0.01% [test_vmalloc] [k] full_fit_alloc_test
> 35.17% 0.00% [kernel] [k] ret_from_fork_asm
> 35.17% 0.00% [kernel] [k] ret_from_fork
> 35.17% 0.00% [kernel] [k] kthread
> 35.08% 0.00% [test_vmalloc] [k] test_func
> 34.45% 0.00% [test_vmalloc] [k] fix_size_alloc_test
> 28.09% 0.01% [test_vmalloc] [k] long_busy_list_alloc_test
> 23.53% 0.25% [kernel] [k] vfree.part.0
> 21.72% 0.00% [kernel] [k] remove_vm_area
> 20.08% 0.21% [kernel] [k] find_unlink_vmap_area
> 2.34% 0.61% [kernel] [k] free_vmap_area_noflush
> <default perf>
> vs
> <patch-series perf>
> 82.32% 0.22% [test_vmalloc] [k] long_busy_list_alloc_test
> 63.36% 0.02% [kernel] [k] vmalloc
> 63.34% 2.64% [kernel] [k] __vmalloc_node_range
> 30.42% 4.46% [kernel] [k] vfree.part.0
> 28.98% 2.51% [kernel] [k] __alloc_pages_bulk
> 27.28% 0.19% [kernel] [k] __get_vm_area_node
> 26.13% 1.50% [kernel] [k] alloc_vmap_area
> 21.72% 21.67% [kernel] [k] clear_page_rep
> 19.51% 2.43% [kernel] [k] _raw_spin_lock
> 16.61% 16.51% [kernel] [k] native_queued_spin_lock_slowpath
> 13.40% 2.07% [kernel] [k] free_unref_page
> 10.62% 0.01% [kernel] [k] remove_vm_area
> 9.02% 8.73% [kernel] [k] insert_vmap_area
> 8.94% 0.00% [kernel] [k] ret_from_fork_asm
> 8.94% 0.00% [kernel] [k] ret_from_fork
> 8.94% 0.00% [kernel] [k] kthread
> 8.29% 0.00% [test_vmalloc] [k] test_func
> 7.81% 0.05% [test_vmalloc] [k] full_fit_alloc_test
> 5.30% 4.73% [kernel] [k] purge_vmap_node
> 4.47% 2.65% [kernel] [k] free_vmap_area_noflush
> <patch-series perf>
>
> This confirms that native_queued_spin_lock_slowpath goes down to
> 16.51% from 93.07%.
>
> The throughput is ~12x higher (default kernel first, patched kernel second):
>
> urezki@...38:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 10m51.271s
> user 0m0.013s
> sys 0m0.187s
> urezki@...38:~$
>
> urezki@...38:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
> Run the test with following parameters: run_test_mask=7 nr_threads=64
> Done.
> Check the kernel ring buffer to see the summary.
>
> real 0m51.301s
> user 0m0.015s
> sys 0m0.040s
> urezki@...38:~$
>
> 4. Changelog
>
> v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/
>
> Delta v2 -> v3:
> - fix comments from v2 feedback;
> - switch from pre-fetch chunk logic to a less complex size based pools.
>
> Baoquan He (1):
> mm/vmalloc: remove vmap_area_list
>
> Uladzislau Rezki (Sony) (10):
> mm: vmalloc: Add va_alloc() helper
> mm: vmalloc: Rename adjust_va_to_fit_type() function
> mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
> mm: vmalloc: Remove global vmap_area_root rb-tree
> mm: vmalloc: Remove global purge_vmap_area_root rb-tree
> mm: vmalloc: Offload free_vmap_area_lock lock
> mm: vmalloc: Support multiple nodes in vread_iter
> mm: vmalloc: Support multiple nodes in vmallocinfo
> mm: vmalloc: Set nr_nodes based on CPUs in a system
> mm: vmalloc: Add a shrinker to drain vmap pools
>
> .../admin-guide/kdump/vmcoreinfo.rst | 8 +-
> arch/arm64/kernel/crash_core.c | 1 -
> arch/riscv/kernel/crash_core.c | 1 -
> include/linux/vmalloc.h | 1 -
> kernel/crash_core.c | 4 +-
> kernel/kallsyms_selftest.c | 1 -
> mm/nommu.c | 2 -
> mm/vmalloc.c | 1049 ++++++++++++-----
> 8 files changed, 786 insertions(+), 281 deletions(-)
>
> --
> 2.39.2
>
There is one thing I still have to clarify, and it is an open question for me.

Test machine:
  QEMU x86_64 system
  64 CPUs
  64G of memory

Test suite:
  test_vmalloc.sh

Environment:
  mm-unstable, branch: next-20240220, where this series is located. On top
  of it I locally added Suren Baghdasaryan's "Memory allocation profiling"
  v3 for a better understanding of memory usage.
Before running the tests, the state is as below:
urezki@...38:~$ sort -h /proc/allocinfo
27.2MiB 6970 mm/memory.c:1122 module:memory func:folio_prealloc
79.1MiB 20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
112MiB 8689 mm/slub.c:2202 module:slub func:alloc_slab_page
122MiB 31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@...38:~$ free -m
total used free shared buff/cache available
Mem: 64172 936 63618 0 134 63236
Swap: 0 0 0
urezki@...38:~$
The test suite stresses the vmap/vmalloc layer by creating workers which do
alloc/free in a tight loop, i.e. it can be considered extreme. Below, three
identical tests were done with only one difference: the number of kworkers
(64, 128 and 256):
1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
urezki@...38:~$ sort -h /proc/allocinfo
80.1MiB 20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
122MiB 31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
153MiB 39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
178MiB 13259 mm/slub.c:2202 module:slub func:alloc_slab_page
350MiB 89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@...38:~$ free -m
total used free shared buff/cache available
Mem: 64172 1417 63054 0 298 62755
Swap: 0 0 0
urezki@...38:~$
2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
urezki@...38:~$ sort -h /proc/allocinfo
122MiB 31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
154MiB 39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
196MiB 14038 mm/slub.c:2202 module:slub func:alloc_slab_page
1.20GiB 315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@...38:~$ free -m
total used free shared buff/cache available
Mem: 64172 2556 61914 0 302 61616
Swap: 0 0 0
urezki@...38:~$
3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
urezki@...38:~$ sort -h /proc/allocinfo
127MiB 32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
197MiB 50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
278MiB 18519 mm/slub.c:2202 module:slub func:alloc_slab_page
5.36GiB 1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@...38:~$ free -m
total used free shared buff/cache available
Mem: 64172 6741 57652 0 394 57431
Swap: 0 0 0
urezki@...38:~$
pagetable_alloc usage increases as soon as higher pressure is applied by
raising the number of workers. Running the same number of jobs on a
subsequent run does not increase it further; it stays at the same level as
on the previous run.
/**
 * pagetable_alloc - Allocate pagetables
 * @gfp: GFP flags
 * @order: desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}
Could you please comment on this? Do you have any thoughts? Is it expected?
Do page tables ever shrink?
/proc/slabinfo does not show an unusually high "active" or "number" of
objects used by any cache.
/proc/meminfo - "VmallocUsed" stays low after those three tests.
I have checked with KASAN and KMEMLEAK and I do not see any issues.
Thank you for the help!
--
Uladzislau Rezki