Message-ID: <Z2B9AMOSgTPboo2U@pc636>
Date: Mon, 16 Dec 2024 20:18:24 +0100
From: Uladzislau Rezki <urezki@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Uladzislau Rezki <urezki@...il.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>, zuoze <zuoze1@...wei.com>,
gustavoars@...nel.org, akpm@...ux-foundation.org,
linux-hardening@...r.kernel.org, linux-mm@...ck.org,
keescook@...omium.org
Subject: Re: [PATCH -next] mm: usercopy: add a debugfs interface to bypass
the vmalloc check.
Hello, Matthew!
> On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> > I think, when i have more free cycles, i will check it from performance
> > point of view. Because i do not know how much a maple tree is efficient
> > when it comes to lookups, insert and removing.
>
> Maple tree has a fanout of around 8-12 at each level, while an rbtree has
> a fanout of two (arguably 3, since we might find the node). Let's say you
> have 1000 vmalloc areas. A perfectly balanced rbtree would have 9 levels
> (and might well be 11+ levels if imperfectly balanced -- and part of the
> advantage of rbtrees over AVL trees is that they can be less balanced
> so need fewer rotations). A perfectly balanced maple tree would have
> only 3 levels.
>
Thank you for your explanation and input on this topic. The density,
the height of the tree and the branching factor should indeed make this
work better :)
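As a back-of-the-envelope check of the depth argument (just a sketch,
assuming perfectly balanced trees and an average maple-tree fanout of
around 10), something like below gives ~10 levels vs 3 for 1000 areas,
which is in line with your numbers:
<snip>
#include <stdio.h>

/* Minimal number of levels needed to index n entries with a given fanout. */
static unsigned int levels(unsigned long n, unsigned int fanout)
{
	unsigned int depth = 0;
	unsigned long capacity = 1;

	while (capacity < n) {
		capacity *= fanout;
		depth++;
	}

	return depth;
}

int main(void)
{
	/* ~1000 vmalloc areas, as in the example above. */
	printf("fanout 2 (rb-tree-like): %u levels\n", levels(1000, 2));
	printf("fanout ~10 (maple tree): %u levels\n", levels(1000, 10));

	return 0;
}
<snip>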
>
> Addition/removal is more expensive. We biased the implementation heavily
> towards lookup, so we chose to keep it very compact. Most users (and
> particularly the VMA tree which was our first client) do more lookups
> than modifications; a real application takes many more pagefaults than
> it does calls to mmap/munmap/mprotect/etc.
>
This is what I see: some use cases are degraded. For example, the
stress-ng forking benchmark is worse, and test_vmalloc.sh also reports
a degradation. See the figures below:
<snip>
# Default
urezki@...38:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+ 59.52% 7.15% [kernel] [k] __vmalloc_node_range_noprof
+ 37.98% 0.22% [test_vmalloc] [k] fix_size_alloc_test
+ 37.32% 8.56% [kernel] [k] vfree.part.0
+ 35.31% 0.00% [kernel] [k] ret_from_fork_asm
+ 35.31% 0.00% [kernel] [k] ret_from_fork
+ 35.31% 0.00% [kernel] [k] kthread
+ 35.05% 0.00% [test_vmalloc] [k] test_func
+ 34.16% 0.06% [test_vmalloc] [k] long_busy_list_alloc_test
+ 32.10% 0.12% [kernel] [k] __get_vm_area_node
+ 31.69% 1.82% [kernel] [k] alloc_vmap_area
+ 27.24% 5.01% [kernel] [k] _raw_spin_lock
+ 25.45% 0.15% [test_vmalloc] [k] full_fit_alloc_test
+ 23.57% 0.03% [kernel] [k] remove_vm_area
+ 22.23% 22.23% [kernel] [k] native_queued_spin_lock_slowpath
+ 14.34% 0.94% [kernel] [k] alloc_pages_bulk_noprof
+ 10.80% 7.51% [kernel] [k] free_vmap_area_noflush
+ 10.59% 10.59% [kernel] [k] clear_page_rep
+ 9.52% 8.96% [kernel] [k] insert_vmap_area
+ 7.39% 2.82% [kernel] [k] find_unlink_vmap_area
# Maple-tree
time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+ 74.33% 1.50% [kernel] [k] __vmalloc_node_range_noprof
+ 55.73% 0.06% [kernel] [k] __get_vm_area_node
+ 55.53% 1.07% [kernel] [k] alloc_vmap_area
+ 53.78% 0.09% [test_vmalloc] [k] long_busy_list_alloc_test
+ 53.75% 1.76% [kernel] [k] _raw_spin_lock
+ 52.81% 51.80% [kernel] [k] native_queued_spin_lock_slowpath
+ 28.93% 0.09% [test_vmalloc] [k] full_fit_alloc_test
+ 23.29% 2.43% [kernel] [k] vfree.part.0
+ 20.29% 0.01% [kernel] [k] mt_insert_vmap_area
+ 20.27% 0.34% [kernel] [k] mtree_insert_range
+ 15.30% 0.05% [test_vmalloc] [k] fix_size_alloc_test
+ 14.06% 0.05% [kernel] [k] remove_vm_area
+ 13.73% 0.00% [kernel] [k] ret_from_fork_asm
+ 13.73% 0.00% [kernel] [k] ret_from_fork
+ 13.73% 0.00% [kernel] [k] kthread
+ 13.51% 0.00% [test_vmalloc] [k] test_func
+ 13.15% 0.87% [kernel] [k] alloc_pages_bulk_noprof
+ 9.92% 9.54% [kernel] [k] clear_page_rep
+ 9.62% 0.07% [kernel] [k] find_unlink_vmap_area
+ 9.55% 0.04% [kernel] [k] mtree_erase
+ 5.92% 1.44% [kernel] [k] free_unref_page
+ 4.92% 0.24% [kernel] [k] mas_insert.isra.0
+ 4.69% 0.93% [kernel] [k] mas_erase
+ 4.47% 0.02% [kernel] [k] rcu_do_batch
+ 3.35% 2.10% [kernel] [k] __vmap_pages_range_noflush
+ 3.00% 2.81% [kernel] [k] mas_wr_store_type
<snip>
i.e. insert/remove are more expensive, at least my tests show this.
It looks like mtree_insert() uses a range variant, which implies a
tree update after the insert operation is completed, and that is
probably where the overhead comes from. If I use a B+tree (my own
implementation), it is, as expected, better than the rb-tree because
of B+tree properties. I have put together some data; you can find
more benchmark data here:
wget ftp://vps418301.ovh.net/incoming/Maple_tree_comparison_with_rb_tree_in_vmalloc.pdf
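For reference, this is how I read lib/maple_tree.c (please correct me
if I am wrong): a point insert is only a thin wrapper over the range
variant, so the range bookkeeping still runs:
<snip>
/* lib/maple_tree.c: a point insert collapses to a one-slot range. */
int mtree_insert(struct maple_tree *mt, unsigned long index,
		 void *entry, gfp_t gfp)
{
	return mtree_insert_range(mt, index, index, entry, gfp);
}
<snip>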
>> That's what maple trees do; they store non-overlapping ranges. So you
>> can look up any address in a range and it will return you the pointer
>> associated with that range. Just like you'd want for a page fault ;-)
Thank you, I see. I thought it could also work as a regular B+ or
B-tree, so we would not spend cycles on updates to track ranges, with
code like below:

int ret = mtree_insert(t, va->va_start, va, GFP_KERNEL);

I do not store a range here, I store a key -> value pair, but the
maple tree still treats it as the range [va_start:va_start]. Maybe we
can improve this case when no range is passed?

These are just my thoughts :)
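For completeness, below is roughly how the range-based variant could
look in my test code. This is only a sketch; mt_insert_va(),
mt_find_va() and mt_remove_va() are hypothetical helpers, not names
from my series:
<snip>
#include <linux/maple_tree.h>
#include <linux/vmalloc.h>

static DEFINE_MTREE(vmap_area_mtree);

/* Store the area over its whole range; va_end is exclusive. */
static int mt_insert_va(struct vmap_area *va, gfp_t gfp)
{
	return mtree_insert_range(&vmap_area_mtree, va->va_start,
				  va->va_end - 1, va, gfp);
}

/* Any address within [va_start, va_end) resolves to the same va. */
static struct vmap_area *mt_find_va(unsigned long addr)
{
	return mtree_load(&vmap_area_mtree, addr);
}

static void mt_remove_va(struct vmap_area *va)
{
	mtree_erase(&vmap_area_mtree, va->va_start);
}
<snip>
With that, any address inside a vmap area can be resolved with a single
mtree_load(), at the price of the extra range maintenance on the
insert/remove path discussed above.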
--
Uladzislau Rezki