Message-ID: <Z2B9AMOSgTPboo2U@pc636>
Date: Mon, 16 Dec 2024 20:18:24 +0100
From: Uladzislau Rezki <urezki@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Uladzislau Rezki <urezki@...il.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>, zuoze <zuoze1@...wei.com>,
gustavoars@...nel.org, akpm@...ux-foundation.org,
linux-hardening@...r.kernel.org, linux-mm@...ck.org,
keescook@...omium.org
Subject: Re: [PATCH -next] mm: usercopy: add a debugfs interface to bypass
the vmalloc check.
Hello, Matthew!
> On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> > I think, when i have more free cycles, i will check it from performance
> > point of view. Because i do not know how much a maple tree is efficient
> > when it comes to lookups, insert and removing.
>
> Maple tree has a fanout of around 8-12 at each level, while an rbtree has
> a fanout of two (arguably 3, since we might find the node). Let's say you
> have 1000 vmalloc areas. A perfectly balanced rbtree would have 9 levels
> (and might well be 11+ levels if imperfectly balanced -- and part of the
> advantage of rbtrees over AVL trees is that they can be less balanced
> so need fewer rotations). A perfectly balanced maple tree would have
> only 3 levels.
>
Thank you for your explanation and input on this topic. The density,
the height of the tree and the branching factor should indeed make this
work better :)
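As a back-of-the-envelope check of the depth argument (just a sketch,
assuming perfectly balanced trees and an average maple-tree fanout of
around 10), something like below gives ~10 levels vs 3 for 1000 areas,
which is in line with your numbers:
<snip>
#include <stdio.h>

/* Minimal number of levels needed to index n entries with a given fanout. */
static unsigned int levels(unsigned long n, unsigned int fanout)
{
	unsigned int depth = 0;
	unsigned long capacity = 1;

	while (capacity < n) {
		capacity *= fanout;
		depth++;
	}

	return depth;
}

int main(void)
{
	/* ~1000 vmalloc areas, as in the example above. */
	printf("fanout 2 (rb-tree-like): %u levels\n", levels(1000, 2));
	printf("fanout ~10 (maple tree): %u levels\n", levels(1000, 10));

	return 0;
}
<snip>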
>
> Addition/removal is more expensive. We biased the implementation heavily
> towards lookup, so we chose to keep it very compact. Most users (and
> particularly the VMA tree which was our first client) do more lookups
> than modifications; a real application takes many more pagefaults than
> it does calls to mmap/munmap/mprotect/etc.
>
This is what I see: some use cases are degraded. For example, the
stress-ng forking benchmark is worse, and test_vmalloc.sh also reports
a degradation. See the figures below:
<snip>
# Default
urezki@...38:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+ 59.52% 7.15% [kernel] [k] __vmalloc_node_range_noprof
+ 37.98% 0.22% [test_vmalloc] [k] fix_size_alloc_test
+ 37.32% 8.56% [kernel] [k] vfree.part.0
+ 35.31% 0.00% [kernel] [k] ret_from_fork_asm
+ 35.31% 0.00% [kernel] [k] ret_from_fork
+ 35.31% 0.00% [kernel] [k] kthread
+ 35.05% 0.00% [test_vmalloc] [k] test_func
+ 34.16% 0.06% [test_vmalloc] [k] long_busy_list_alloc_test
+ 32.10% 0.12% [kernel] [k] __get_vm_area_node
+ 31.69% 1.82% [kernel] [k] alloc_vmap_area
+ 27.24% 5.01% [kernel] [k] _raw_spin_lock
+ 25.45% 0.15% [test_vmalloc] [k] full_fit_alloc_test
+ 23.57% 0.03% [kernel] [k] remove_vm_area
+ 22.23% 22.23% [kernel] [k] native_queued_spin_lock_slowpath
+ 14.34% 0.94% [kernel] [k] alloc_pages_bulk_noprof
+ 10.80% 7.51% [kernel] [k] free_vmap_area_noflush
+ 10.59% 10.59% [kernel] [k] clear_page_rep
+ 9.52% 8.96% [kernel] [k] insert_vmap_area
+ 7.39% 2.82% [kernel] [k] find_unlink_vmap_area
# Maple-tree
time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+ 74.33% 1.50% [kernel] [k] __vmalloc_node_range_noprof
+ 55.73% 0.06% [kernel] [k] __get_vm_area_node
+ 55.53% 1.07% [kernel] [k] alloc_vmap_area
+ 53.78% 0.09% [test_vmalloc] [k] long_busy_list_alloc_test
+ 53.75% 1.76% [kernel] [k] _raw_spin_lock
+ 52.81% 51.80% [kernel] [k] native_queued_spin_lock_slowpath
+ 28.93% 0.09% [test_vmalloc] [k] full_fit_alloc_test
+ 23.29% 2.43% [kernel] [k] vfree.part.0
+ 20.29% 0.01% [kernel] [k] mt_insert_vmap_area
+ 20.27% 0.34% [kernel] [k] mtree_insert_range
+ 15.30% 0.05% [test_vmalloc] [k] fix_size_alloc_test
+ 14.06% 0.05% [kernel] [k] remove_vm_area
+ 13.73% 0.00% [kernel] [k] ret_from_fork_asm
+ 13.73% 0.00% [kernel] [k] ret_from_fork
+ 13.73% 0.00% [kernel] [k] kthread
+ 13.51% 0.00% [test_vmalloc] [k] test_func
+ 13.15% 0.87% [kernel] [k] alloc_pages_bulk_noprof
+ 9.92% 9.54% [kernel] [k] clear_page_rep
+ 9.62% 0.07% [kernel] [k] find_unlink_vmap_area
+ 9.55% 0.04% [kernel] [k] mtree_erase
+ 5.92% 1.44% [kernel] [k] free_unref_page
+ 4.92% 0.24% [kernel] [k] mas_insert.isra.0
+ 4.69% 0.93% [kernel] [k] mas_erase
+ 4.47% 0.02% [kernel] [k] rcu_do_batch
+ 3.35% 2.10% [kernel] [k] __vmap_pages_range_noflush
+ 3.00% 2.81% [kernel] [k] mas_wr_store_type
<snip>
i.e. insert/remove are more expensive, at least my tests show this.
It looks like mtree_insert() uses a range variant, which implies a
tree update after the insert operation is completed, and that is
probably where the overhead comes from. If I use a B+tree (my own
implementation), it is, as expected, better than the rb-tree because
of B+tree properties. I have put together some data; you can find
more benchmark data here:
wget ftp://vps418301.ovh.net/incoming/Maple_tree_comparison_with_rb_tree_in_vmalloc.pdf
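For reference, this is how I read lib/maple_tree.c (please correct me
if I am wrong): a point insert is only a thin wrapper over the range
variant, so the range bookkeeping still runs:
<snip>
/* lib/maple_tree.c: a point insert collapses to a one-slot range. */
int mtree_insert(struct maple_tree *mt, unsigned long index,
		 void *entry, gfp_t gfp)
{
	return mtree_insert_range(mt, index, index, entry, gfp);
}
<snip>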
>> That's what maple trees do; they store non-overlapping ranges. So you
>> can look up any address in a range and it will return you the pointer
>> associated with that range. Just like you'd want for a page fault ;-)
Thank you, I see. I thought it could also work as a regular B+ or
B-tree, so we would not spend cycles on updates to track ranges, with
code like below:

int ret = mtree_insert(t, va->va_start, va, GFP_KERNEL);

I do not store a range here, I store a key -> value pair, but the
maple tree still treats it as the range [va_start:va_start]. Maybe we
can improve this case when no range is passed?

These are just my thoughts :)
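For completeness, below is roughly how the range-based variant could
look in my test code. This is only a sketch; mt_insert_va(),
mt_find_va() and mt_remove_va() are hypothetical helpers, not names
from my series:
<snip>
#include <linux/maple_tree.h>
#include <linux/vmalloc.h>

static DEFINE_MTREE(vmap_area_mtree);

/* Store the area over its whole range; va_end is exclusive. */
static int mt_insert_va(struct vmap_area *va, gfp_t gfp)
{
	return mtree_insert_range(&vmap_area_mtree, va->va_start,
				  va->va_end - 1, va, gfp);
}

/* Any address within [va_start, va_end) resolves to the same va. */
static struct vmap_area *mt_find_va(unsigned long addr)
{
	return mtree_load(&vmap_area_mtree, addr);
}

static void mt_remove_va(struct vmap_area *va)
{
	mtree_erase(&vmap_area_mtree, va->va_start);
}
<snip>
With that, any address inside a vmap area can be resolved with a single
mtree_load(), at the price of the extra range maintenance on the
insert/remove path discussed above.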
--
Uladzislau Rezki