Message-ID: <CAEf4BzYbXzt7RXB962OLEd3xoQcPfT1MFw5JcHSmRzPx-Etm_A@mail.gmail.com>
Date: Wed, 7 Aug 2024 10:14:31 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Andrii Nakryiko <andrii@...nel.org>
Cc: linux-trace-kernel@...r.kernel.org, peterz@...radead.org, oleg@...hat.com,
rostedt@...dmis.org, mhiramat@...nel.org, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org, jolsa@...nel.org, paulmck@...nel.org
Subject: Re: [PATCH 7/8] uprobes: perform lockless SRCU-protected uprobes_tree lookup
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@...nel.org> wrote:
>
> Another big scalability bottleneck is uprobes_treelock, which is taken
> in a very hot path in handle_swbp(). Now that uprobes are
> SRCU-protected, take advantage of that and make the uprobes_tree
> RB-tree lookup lockless.
>
> To make the RCU-protected lockless RB-tree lookup correct, we need to
> take into account that such a lookup can return false negatives if
> there are parallel RB-tree modifications (rotations) going on. We use a
> seqcount to detect whether the RB-tree changed, and if we find nothing
> while the tree was modified in between, we just retry. If a uprobe was
> found, the lookup is guaranteed to be correct.
>
> With all the lock-avoiding changes done, we get a pretty decent
> improvement in performance and scalability of uprobes with the number
> of CPUs, even though we are still nowhere near linear scalability. This
> is due to SRCU not really scaling very well with the number of CPUs on
> the particular hardware that was used for testing (80-core Intel Xeon
> Gold 6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock,
> which is currently taken to resolve the interrupted address to
> inode+offset and then to a uprobe instance. And, of course, uretprobes
> still need a similar RCU approach to avoid refcounting in the hot path,
> which will be addressed in follow-up patches.
>
> Nevertheless, the improvement is good. We used the BPF selftest-based
> uprobe-nop and uretprobe-nop benchmarks to get the numbers below,
> varying the number of CPUs on which uprobes and uretprobes are
> triggered.
>
> BASELINE
> ========
> uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu)
> uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu)
> uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu)
> uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu)
> uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu)
> uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu)
> uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu)
>
> uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu)
> uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu)
> uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu)
> uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu)
> uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu)
> uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu)
> uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu)
>
> SRCU CHANGES
> ============
> uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu)
> uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu)
> uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu)
> uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu)
> uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu)
> uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu)
> uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu)
>
> uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu)
> uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu)
> uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu)
> uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu)
> uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu)
> uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu)
> uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu)
>
> Peak throughput increased from 3.7 mln/s (uprobe triggerings) to about
> 8 mln/s. For uretprobes the improvement is a bit more modest, with a
> bump from 2.4 mln/s to 5 mln/s.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Signed-off-by: Andrii Nakryiko <andrii@...nel.org>
> ---
> kernel/events/uprobes.c | 30 ++++++++++++++++++++++++------
> 1 file changed, 24 insertions(+), 6 deletions(-)
>
Ok, so it seems like rb_find_rcu() and rb_find_add_rcu() are not enough
or are buggy. I managed to start reproducing a crash more or less
reliably, and it bisects to exactly this change. My wild guess is that
we'd need an rb_erase_rcu() variant or something, because what I'm
seeing is a corrupted rb_root node during a lockless rb_find_rcu()
operation.
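For context, rb_find_rcu() (from include/linux/rbtree.h; I'm quoting it
from memory, so double-check against the actual tree) walks the tree via
rcu_dereference_raw() on every pointer it follows, meaning every pointer
a reader can observe -- including tree->rb_node itself -- has to be
published with an RCU-safe store on the write side:

  static __always_inline struct rb_node *
  rb_find_rcu(const void *key, const struct rb_root *tree,
              int (*cmp)(const void *key, const struct rb_node *))
  {
          struct rb_node *node = rcu_dereference_raw(tree->rb_node);

          while (node) {
                  int c = cmp(key, node);

                  if (c < 0)
                          node = rcu_dereference_raw(node->rb_left);
                  else if (c > 0)
                          node = rcu_dereference_raw(node->rb_right);
                  else
                          return node;
          }

          return NULL;
  }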
The debugging info below (which Gmail will butcher here) is also
available in [0], for convenience.
So, I got the below crash when running `sudo ./uprobe-stress -a20 -f3
-m5 -t4` on a 16-core QEMU VM.
[ 179.375551] BUG: unable to handle page fault for address: 0000441f0f660097
[ 179.376612] #PF: supervisor read access in kernel mode
[ 179.377334] #PF: error_code(0x0000) - not-present page
[ 179.378037] PGD 0 P4D 0
[ 179.378391] Oops: Oops: 0000 [#1] PREEMPT SMP
[ 179.378992] CPU: 5 UID: 0 PID: 2292 Comm: uprobe-stress Tainted: G E 6.11.0-rc1-00025-g6f8e0d8d5b55-dirty #181
[ 179.380514] Tainted: [E]=UNSIGNED_MODULE
[ 179.381022] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 179.382475] RIP: 0010:uprobe_notify_resume+0x3db/0xc40
[ 179.383148] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
[ 179.385639] RSP: 0000:ffffc90000a93e78 EFLAGS: 00010202
[ 179.386338] RAX: 74c2394808478d48 RBX: ffff888105619800 RCX: 0000000000004118
[ 179.387480] RDX: ffff888109c18eb0 RSI: 00000000000a6130 RDI: ffff888105619800
[ 179.388677] RBP: 0000441f0f660007 R08: ffff8881098f1300 R09: 0000000000000000
[ 179.389729] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888105680fc0
[ 179.390694] R13: 0000000000000000 R14: ffff888105681068 R15: ffff888105681000
[ 179.391717] FS: 00007f10690006c0(0000) GS:ffff88881fb40000(0000) knlGS:0000000000000000
[ 179.392800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 179.393582] CR2: 0000441f0f660097 CR3: 00000001049bb004 CR4: 0000000000370ef0
[ 179.394536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 179.395485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 179.396464] Call Trace:
[ 179.396796] <TASK>
[ 179.397133] ? __die+0x1f/0x60
[ 179.397640] ? page_fault_oops+0x14c/0x440
[ 179.398316] ? do_user_addr_fault+0x5f/0x6c0
[ 179.398899] ? kvm_read_and_reset_apf_flags+0x3c/0x50
[ 179.399730] ? exc_page_fault+0x66/0x130
[ 179.400361] ? asm_exc_page_fault+0x22/0x30
[ 179.400912] ? uprobe_notify_resume+0x3db/0xc40
[ 179.401500] ? uprobe_notify_resume+0x11a/0xc40
[ 179.402089] ? arch_uprobe_exception_notify+0x39/0x40
[ 179.402736] ? notifier_call_chain+0x55/0xc0
[ 179.403293] irqentry_exit_to_user_mode+0x98/0x140
[ 179.403917] asm_exc_int3+0x35/0x40
[ 179.404394] RIP: 0033:0x404119
[ 179.404831] Code: 8b 45 fc 89 c7 e8 43 09 00 00 83 c0 01 c9 c3 55 48 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 29 09 00 00 83 c0 01 c9 c3 cc <48> 89 e5 48 83 ec 10 89 7d fc 8b 45 fc 89 c7 e8 0f 09 00 00 83 c0
[ 179.407350] RSP: 002b:00007f1068fffab8 EFLAGS: 00000206
[ 179.408083] RAX: 0000000000404118 RBX: 00007f1069000cdc RCX: 000000000000000a
[ 179.409020] RDX: 0000000000000012 RSI: 0000000000000064 RDI: 0000000000000012
[ 179.409965] RBP: 00007f1068fffae0 R08: 00007f106a84403c R09: 00007f106a8440a0
[ 179.410864] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffff80
[ 179.411768] R13: 0000000000000016 R14: 00007ffc3dc8f070 R15: 00007f1068800000
[ 179.412672] </TASK>
[ 179.412965] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) i2c_piix4(E) i2c_smbus(E) i2c_core(E) i6300esb(E) crc32c_intel(E) floppy(E) pcspkr(E) button(E) serio_raw(E)
[ 179.415227] CR2: 0000441f0f660097
[ 179.415683] ---[ end trace 0000000000000000 ]---
Note RBP: 0000441f0f660007 and RIP: 0010:uprobe_notify_resume+0x3db/0xc40
Decoding:
[ 179.417075] Code: c1 e1 0c 48 2b 08 48 8b 44 24 08 48 01 c1 8b 35 bb 36 0d 03 40 f6 c6 01 0f 85 be 00 00 00 48 8b 2d ba 36 0d 03 48 85 ed 74 29 <48> 8b 85 90 00 00 00 48 39 c2 72 2d 48 39 d0 72 0f 48 3b 8d 98 00
All code
========
0: c1 e1 0c shl $0xc,%ecx
3: 48 2b 08 sub (%rax),%rcx
6: 48 8b 44 24 08 mov 0x8(%rsp),%rax
b: 48 01 c1 add %rax,%rcx
e: 8b 35 bb 36 0d 03 mov 0x30d36bb(%rip),%esi # 0x30d36cf
14: 40 f6 c6 01 test $0x1,%sil
18: 0f 85 be 00 00 00 jne 0xdc
1e: 48 8b 2d ba 36 0d 03 mov 0x30d36ba(%rip),%rbp # 0x30d36df
25: 48 85 ed test %rbp,%rbp
28: 74 29 je 0x53
2a:* 48 8b 85 90 00 00 00 mov 0x90(%rbp),%rax <-- trapping instruction
31: 48 39 c2 cmp %rax,%rdx
34: 72 2d jb 0x63
36: 48 39 d0 cmp %rdx,%rax
39: 72 0f jb 0x4a
3b: 48 rex.W
3c: 3b .byte 0x3b
3d: 8d .byte 0x8d
3e: 98 cwtl
...
Let's also look at uprobe_notify_resume+0x3db:
(gdb) list *(uprobe_notify_resume+0x3db)
0xffffffff81242f5b is in uprobe_notify_resume (/data/users/andriin/linux/kernel/events/uprobes.c:662).
657
658 static __always_inline
659 int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset,
660 const struct uprobe *r)
661 {
662 if (l_inode < r->inode)
663 return -1;
664
665 if (l_inode > r->inode)
666 return 1;
So this is most probably the access to the uprobe through r->inode.
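For completeness, here is the comparator chain that rb_find_rcu() ends
up calling on each visited node (as refactored earlier in this series;
again quoting from memory, so verify against the actual code):

  #define __node_2_uprobe(node) rb_entry((node), struct uprobe, rb_node)

  static inline int __uprobe_cmp_key(const void *key, const struct rb_node *b)
  {
          const struct __uprobe_key *a = key;

          return uprobe_cmp(a->inode, a->offset, __node_2_uprobe(b));
  }

Since rb_node is the first field of struct uprobe, r is numerically the
same pointer as the rb_node we walked to, and r->inode is the load at
offset 0x90 that faults.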
Looking at uprobe_notify_resume disassembly:
0xffffffff81242f2c <+940>: mov 0x78(%rax),%rcx
0xffffffff81242f30 <+944>: shl $0xc,%rcx
0xffffffff81242f34 <+948>: sub (%rax),%rcx
0xffffffff81242f37 <+951>: mov 0x8(%rsp),%rax
0xffffffff81242f3c <+956>: add %rax,%rcx
0xffffffff81242f3f <+959>: mov 0x30d36bb(%rip),%esi # 0xffffffff84316600 <uprobes_seqcount>
0xffffffff81242f45 <+965>: test $0x1,%sil
0xffffffff81242f49 <+969>: jne 0xffffffff8124300d <uprobe_notify_resume+1165>
0xffffffff81242f4f <+975>: mov 0x30d36ba(%rip),%rbp # 0xffffffff84316610 <uprobes_tree>
0xffffffff81242f56 <+982>: test %rbp,%rbp
0xffffffff81242f59 <+985>: je 0xffffffff81242f84 <uprobe_notify_resume+1028>
0xffffffff81242f5b <+987>: mov 0x90(%rbp),%rax
0xffffffff81242f62 <+994>: cmp %rax,%rdx
0xffffffff81242f65 <+997>: jb 0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f67 <+999>: cmp %rdx,%rax
0xffffffff81242f6a <+1002>: jb 0xffffffff81242f7b <uprobe_notify_resume+1019>
0xffffffff81242f6c <+1004>: cmp 0x98(%rbp),%rcx
0xffffffff81242f73 <+1011>: jl 0xffffffff81242f94 <uprobe_notify_resume+1044>
0xffffffff81242f75 <+1013>: jle 0xffffffff81242d0e <uprobe_notify_resume+398>
0xffffffff81242f7b <+1019>: mov 0x8(%rbp),%rbp
0xffffffff81242f7f <+1023>: test %rbp,%rbp
And, for reference, %rdx (the l_inode side of the comparison, i.e. the
lookup key) is loaded earlier:
0xffffffff81242f25 <+933>: mov 0xa8(%rcx),%rdx
Note how, in "mov 0x90(%rbp),%rax", %rbp comes from:
mov 0x30d36ba(%rip),%rbp # 0xffffffff84316610 <uprobes_tree>
So this is the inlined uprobe_cmp() on uprobes_tree.rb_node (i.e., the
root of the RB-tree). And we know that %rbp was set to
0x0000441f0f660007, which doesn't look like a valid kernel address to
me. 0x90 is offsetof(struct uprobe, inode), and offsetof(struct uprobe,
rb_node) == 0, which also matches the faulting address:
0x0000441f0f660007 + 0x90 == 0x0000441f0f660097 (CR2). So we load the
rb_node/uprobe pointer from uprobes_tree (struct rb_root), get
0x0000441f0f660007, and then r->inode crashes the kernel because r
looks like a user-space address.
Note that we never modify the RB-tree root (or any node) directly
anywhere. We only mutate the tree with rb_find_add_rcu() and rb_erase(),
both under uprobes_treelock.
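The insert side should be fine in this respect: rb_find_add_rcu()
publishes the new node with rb_link_node_rcu(), which (quoting
include/linux/rbtree.h from memory) uses rcu_assign_pointer() for the
final, visibility-granting store:

  static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node *parent,
                                      struct rb_node **rb_link)
  {
          node->__rb_parent_color = (unsigned long)parent;
          node->rb_left = node->rb_right = NULL;

          rcu_assign_pointer(*rb_link, node);
  }

That leaves the rb_insert_color() rebalancing and rb_erase() as the
places where a non-single-copy-atomic store could leak out to lockless
readers.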
So, any ideas how we can end up with a "corrupted" root during a
lockless lookup with rb_find_rcu()? This seems to be the very first
lockless RB-tree lookup use case in the tree, so perhaps it's not that
safe after all (certainly rb_erase() is non-trivial enough not to be
"obviously correct" w.r.t. RCU, no?).
[0] https://gist.github.com/anakryiko/df31ab75f25544af93cd41273056ee88
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index b0488d356399..d03962cc96de 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
>
> static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
> DEFINE_STATIC_SRCU(uprobes_srcu);
>
> @@ -629,8 +630,11 @@ static void put_uprobe(struct uprobe *uprobe)
>
> write_lock(&uprobes_treelock);
>
> - if (uprobe_is_active(uprobe))
> + if (uprobe_is_active(uprobe)) {
> + write_seqcount_begin(&uprobes_seqcount);
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> + write_seqcount_end(&uprobes_seqcount);
> + }
>
> write_unlock(&uprobes_treelock);
>
> @@ -696,14 +700,26 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
> .offset = offset,
> };
> struct rb_node *node;
> + unsigned int seq;
>
> lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
>
> - read_lock(&uprobes_treelock);
> - node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
> - read_unlock(&uprobes_treelock);
> + do {
> + seq = read_seqcount_begin(&uprobes_seqcount);
> + node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
> + /*
> + * Lockless RB-tree lookups can result only in false negatives.
> + * If the element is found, it is correct and can be returned
> + * under RCU protection. If we find nothing, we need to
> + * validate that seqcount didn't change. If it did, we have to
> + * try again as we might have missed the element (false
> + * negative). If seqcount is unchanged, search truly failed.
> + */
> + if (node)
> + return __node_2_uprobe(node);
> + } while (read_seqcount_retry(&uprobes_seqcount, seq));
>
> - return node ? __node_2_uprobe(node) : NULL;
> + return NULL;
> }
>
> /*
> @@ -725,7 +741,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> {
> struct rb_node *node;
> again:
> - node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> + node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
> if (node) {
> struct uprobe *u = __node_2_uprobe(node);
>
> @@ -750,7 +766,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> struct uprobe *u;
>
> write_lock(&uprobes_treelock);
> + write_seqcount_begin(&uprobes_seqcount);
> u = __insert_uprobe(uprobe);
> + write_seqcount_end(&uprobes_seqcount);
> write_unlock(&uprobes_treelock);
>
> return u;
> --
> 2.43.0
>