Message-ID: <CAEf4BzYFXmCU83mr9YHy2JtF35WmYBvKpyrmBV4QxFuqubk_6A@mail.gmail.com>
Date: Mon, 19 Aug 2024 13:34:57 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Oleg Nesterov <oleg@...hat.com>
Cc: Andrii Nakryiko <andrii@...nel.org>, linux-trace-kernel@...r.kernel.org,
peterz@...radead.org, rostedt@...dmis.org, mhiramat@...nel.org,
bpf@...r.kernel.org, linux-kernel@...r.kernel.org, jolsa@...nel.org,
paulmck@...nel.org, willy@...radead.org, surenb@...gle.com,
akpm@...ux-foundation.org, linux-mm@...ck.org
Subject: Re: [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime
(with timeout)
On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov <oleg@...hat.com> wrote:
>
> On 08/12, Andrii Nakryiko wrote:
> >
> > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take
> > uretprobe-specific SRCU lock and keep it active as kernel transfers
> > control back to user space.
> ...
> > include/linux/uprobes.h | 49 ++++++-
> > kernel/events/uprobes.c | 294 ++++++++++++++++++++++++++++++++++------
> > 2 files changed, 301 insertions(+), 42 deletions(-)
>
> Oh. To be honest I don't like this patch.
>
> I would like to know what other reviewers think, but to me it adds too many
> complications that I can't even fully understand...
Which parts? The atomic xchg() and cmpxchg() parts? What exactly do
you feel like you don't fully understand?
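To restate the gist (a heavily simplified sketch, not the actual patch
code; all identifiers below are made up for illustration):
prepare_uretprobe() enters an SRCU read-side section instead of taking
a refcount, a timer upgrades long-running frames to a proper refcount
so they can't pin the SRCU grace period indefinitely, and the return
path drops whichever protection is still held. The xchg()/cmpxchg()
parts coordinate the race between the timer callback and the return
path, which this sketch deliberately glosses over:

struct ri_sketch {
	struct uprobe *uprobe;
	int srcu_idx;
	bool has_ref;		/* set once we fell back to a refcount */
};

/* uretprobe entry: no refcount, just an SRCU read-side section */
static void sketch_prepare(struct ri_sketch *ri, struct uprobe *u)
{
	ri->uprobe = u;
	ri->has_ref = false;
	ri->srcu_idx = srcu_read_lock(&uretprobes_srcu);
	/* also arm a timer so a long-running function can't pin SRCU forever */
}

/* timer fired: upgrade to a refcount and leave the SRCU section */
static void sketch_timeout(struct ri_sketch *ri)
{
	get_uprobe(ri->uprobe);
	ri->has_ref = true;
	srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
}

/* uretprobe return: drop whichever protection is still held */
static void sketch_return(struct ri_sketch *ri)
{
	if (ri->has_ref)
		put_uprobe(ri->uprobe);
	else
		srcu_read_unlock(&uretprobes_srcu, ri->srcu_idx);
}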
>
> And how much does it help performance-wise?
A lot, and the gap grows as uprobe parallelism increases. Here's a
subset of benchmarks for 1-4, 8, 16, 32, 64, and 80 threads firing
uretprobes, with and without this SRCU change but with all the other
changes applied, including the lockless VMA lookup. The difference is
noticeable already with just two competing CPUs/threads, and it gets
much worse from there.
Of course in production you shouldn't come close to such rates of
uprobe/uretprobe firing, so this is definitely a microbenchmark
emphasizing the sharing between CPUs, but it still adds up. And we do
have production use cases that would like to fire uprobes at 100K+ per
second rates.
WITH SRCU for uretprobes
========================
uretprobe-nop ( 1 cpus): 1.968 ± 0.001M/s ( 1.968M/s/cpu)
uretprobe-nop ( 2 cpus): 3.739 ± 0.003M/s ( 1.869M/s/cpu)
uretprobe-nop ( 3 cpus): 5.616 ± 0.003M/s ( 1.872M/s/cpu)
uretprobe-nop ( 4 cpus): 7.286 ± 0.002M/s ( 1.822M/s/cpu)
uretprobe-nop ( 8 cpus): 13.657 ± 0.007M/s ( 1.707M/s/cpu)
uretprobe-nop (32 cpus): 45.305 ± 0.066M/s ( 1.416M/s/cpu)
uretprobe-nop (64 cpus): 42.390 ± 0.922M/s ( 0.662M/s/cpu)
uretprobe-nop (80 cpus): 47.554 ± 2.411M/s ( 0.594M/s/cpu)
WITHOUT SRCU for uretprobes
===========================
uretprobe-nop ( 1 cpus): 2.197 ± 0.002M/s ( 2.197M/s/cpu)
uretprobe-nop ( 2 cpus): 3.325 ± 0.001M/s ( 1.662M/s/cpu)
uretprobe-nop ( 3 cpus): 4.129 ± 0.002M/s ( 1.376M/s/cpu)
uretprobe-nop ( 4 cpus): 6.180 ± 0.003M/s ( 1.545M/s/cpu)
uretprobe-nop ( 8 cpus): 7.323 ± 0.005M/s ( 0.915M/s/cpu)
uretprobe-nop (16 cpus): 6.943 ± 0.005M/s ( 0.434M/s/cpu)
uretprobe-nop (32 cpus): 5.931 ± 0.014M/s ( 0.185M/s/cpu)
uretprobe-nop (64 cpus): 5.145 ± 0.003M/s ( 0.080M/s/cpu)
uretprobe-nop (80 cpus): 4.925 ± 0.005M/s ( 0.062M/s/cpu)
>
> I'll try to take another look, and I'll try to think about other approaches,
> not that I have something better in mind...
Ok.
>
>
> But lets forgets this patch for the moment. The next one adds even more
> complications, and I think it doesn't make sense.
>
"Even more complications" is a bit of an overstatement. It just
applies everything we do for uretprobes in this patch to a very
straightforward single-stepped case.
> As I have already mentioned in the previous discussions, we can simply kill
> utask->active_uprobe. And utask->auprobe.
I don't have anything against that in principle, but let's benchmark
and test it thoroughly. I'm a bit uneasy about the possibility that
some arch-specific code does container_of() on this arch_uprobe to get
back to the containing uprobe; we'd need to audit all the code to make
sure that can't happen. It's also a bit unfortunate that we have to
assume that struct arch_uprobe is small on all architectures and that
no code assumes it can't be moved, etc, etc. (I also don't get why you
need the memcpy there.)
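To make that concern concrete, this is the kind of (purely
hypothetical) pattern such an audit would have to rule out; struct
uprobe embeds the arch-specific part, so recovering the container from
a copy of the arch_uprobe would silently break:

/* uprobes.c: struct uprobe embeds the arch-specific part */
struct uprobe {
	/* ... */
	struct arch_uprobe	arch;
};

/*
 * Hypothetical arch-side helper (not real code): if anything like this
 * existed, passing it &utask->uarch (a copy) instead of &uprobe->arch
 * would produce a garbage uprobe pointer.
 */
static struct uprobe *uprobe_from_arch(struct arch_uprobe *auprobe)
{
	return container_of(auprobe, struct uprobe, arch);
}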
>
> So can't we start with the patch below? On top of your 08/13. It doesn't kill
> utask->auprobe yet, this needs a bit more trivial changes.
>
> What do you think?
I think that the single-stepped case isn't the main use case (typically
a uprobe/uretprobe is installed on a nop or push %rbp, both of which
are emulated). uretprobes, though, are the main use case (along with
optimized entry uprobes). So what we do about the single-stepped case
is a bit secondary (for me, looking at production use cases).
But we do need to do something with uretprobes first and foremost.
>
> Oleg.
>
> -------------------------------------------------------------------------------
> From d7cb674eb6f7bb891408b2b6a5fb872a6c2f0f6c Mon Sep 17 00:00:00 2001
> From: Oleg Nesterov <oleg@...hat.com>
> Date: Mon, 19 Aug 2024 15:34:55 +0200
> Subject: [RFC PATCH] uprobe: kill uprobe_task->active_uprobe
>
> Untested, not for inclusion yet, and I need to split it into 2 changes.
> It does 2 simple things:
>
> 1. active_uprobe != NULL is possible if and only if utask->state != 0,
> so it turns the active_uprobe checks into the utask->state checks.
>
> 2. handle_singlestep() doesn't really need ->active_uprobe, it only
> needs uprobe->arch which is "const" after prepare_uprobe().
>
> So this patch adds the new "arch_uprobe uarch" member into utask
> and changes pre_ssout() to do memcpy(&utask->uarch, &uprobe->arch).
> ---
> include/linux/uprobes.h | 2 +-
> kernel/events/uprobes.c | 37 +++++++++++--------------------------
> 2 files changed, 12 insertions(+), 27 deletions(-)
[...]