Message-ID: <CAEf4BzbxLMB8RJWWZjtg6NkumHHZA=vhWZfHqZBf90O=aJVC+A@mail.gmail.com>
Date: Fri, 28 Feb 2025 15:00:22 -0800
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Jiri Olsa <olsajiri@...il.com>
Cc: Oleg Nesterov <oleg@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Andrii Nakryiko <andrii@...nel.org>, bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, x86@...nel.org,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>, Hao Luo <haoluo@...gle.com>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
Alan Maguire <alan.maguire@...cle.com>, David Laight <David.Laight@...lab.com>,
Thomas Weißschuh <thomas@...ch.de>
Subject: Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
On Fri, Feb 28, 2025 at 2:55 PM Jiri Olsa <olsajiri@...il.com> wrote:
>
> On Fri, Feb 28, 2025 at 10:55:24AM -0800, Andrii Nakryiko wrote:
> > On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@...nel.org> wrote:
> > >
> > > Putting together all the previously added pieces to support optimized
> > > uprobes on top of 5-byte nop instruction.
> > >
> > > The current uprobe execution goes through the following steps:
> > > - install the breakpoint instruction over the original instruction
> > > - the exception handler is hit and calls the related uprobe consumers
> > > - either simulate the original instruction or do out-of-line single-step
> > >   execution of it
> > > - return to user space
> > >
> > > The optimized uprobe path:
> > >
> > > - checks that the original instruction is a 5-byte nop (plus other checks)
> > > - adds (or reuses an existing) user space trampoline and overwrites the original
> > >   instruction (5-byte nop) with a call to the user space trampoline
> > > - the user space trampoline executes the uprobe syscall that calls the related
> > >   uprobe consumers
> > > - the trampoline returns back to the next instruction
> > >
> > > This approach won't speed up all uprobes, as it's limited to nop5 as the
> > > original instruction, but we could use nop5 as the USDT probe instruction
> > > (which uses a single-byte nop ATM) and speed up USDT probes.
> > >
> > > This patch overloads the related arch functions in uprobe_write_opcode and
> > > set_orig_insn so they can install the call instruction if needed.
> > >
> > > The arch_uprobe_optimize function triggers the uprobe optimization and is
> > > called after the first uprobe hit. I originally had it called on uprobe
> > > installation, but it clashed with the ELF loader, because the user space
> > > trampoline was added in a place where the loader might need to put ELF
> > > segments, so I decided to do it after the first uprobe hit, when loading
> > > is done.
> > >
> > > We do not unmap and release the uprobe trampoline when it's no longer
> > > needed, because there's no easy way to make sure none of the threads is
> > > still inside the trampoline. But we do not waste memory, because there's
> > > just a single page for all the uprobe trampoline mappings.
> > >
> > > We do waste a page frame on the mapping for every 4GB of address space by
> > > keeping the uprobe trampoline page mapped, but that seems ok.
> > >
> > > Attaching the speedup numbers from the benchs/run_bench_uprobes.sh script:
> > >
> > > current:
> > > usermode-count : 818.836 ± 2.842M/s
> > > syscall-count : 8.917 ± 0.003M/s
> > > uprobe-nop : 3.056 ± 0.013M/s
> > > uprobe-push : 2.903 ± 0.002M/s
> > > uprobe-ret : 1.533 ± 0.001M/s
> > > --> uprobe-nop5 : 1.492 ± 0.000M/s
> > > uretprobe-nop : 1.783 ± 0.000M/s
> > > uretprobe-push : 1.672 ± 0.001M/s
> > > uretprobe-ret : 1.067 ± 0.002M/s
> > > --> uretprobe-nop5 : 1.052 ± 0.000M/s
> > >
> > > after the change:
> > >
> > > usermode-count : 818.386 ± 1.886M/s
> > > syscall-count : 8.923 ± 0.003M/s
> > > uprobe-nop : 3.086 ± 0.005M/s
> > > uprobe-push : 2.751 ± 0.001M/s
> > > uprobe-ret : 1.481 ± 0.000M/s
> > > --> uprobe-nop5 : 4.016 ± 0.002M/s
> > > uretprobe-nop : 1.712 ± 0.008M/s
> > > uretprobe-push : 1.616 ± 0.001M/s
> > > uretprobe-ret : 1.052 ± 0.000M/s
> > > --> uretprobe-nop5 : 2.015 ± 0.000M/s
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> > > ---
> > > arch/x86/include/asm/uprobes.h | 6 ++
> > > arch/x86/kernel/uprobes.c | 191 ++++++++++++++++++++++++++++++++-
> > > include/linux/uprobes.h | 6 +-
> > > kernel/events/uprobes.c | 16 ++-
> > > 4 files changed, 209 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > > index 678fb546f0a7..7d4df920bb59 100644
> > > --- a/arch/x86/include/asm/uprobes.h
> > > +++ b/arch/x86/include/asm/uprobes.h
> > > @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
> > > #define UPROBE_SWBP_INSN 0xcc
> > > #define UPROBE_SWBP_INSN_SIZE 1
> > >
> > > +enum {
> > > + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> > > +};
> > > +
> > > struct uprobe_xol_ops;
> > >
> > > struct arch_uprobe {
> > > @@ -45,6 +49,8 @@ struct arch_uprobe {
> > > u8 ilen;
> > > } push;
> > > };
> > > +
> > > + unsigned long flags;
> > > };
> > >
> > > struct arch_uprobe_task {
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index e8aebbda83bc..73ddff823904 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -18,6 +18,7 @@
> > > #include <asm/processor.h>
> > > #include <asm/insn.h>
> > > #include <asm/mmu_context.h>
> > > +#include <asm/nops.h>
> > >
> > > /* Post-execution fixups. */
> > >
> > > @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > > return NULL;
> > > }
> > >
> > > -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > {
> > >     struct uprobes_state *state = &current->mm->uprobes_state;
> > > struct uprobe_trampoline *tramp = NULL;
> > > @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > > kfree(tramp);
> > > }
> > >
> > > -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > {
> > > if (tramp == NULL)
> > > return;
> > > @@ -807,6 +808,7 @@ struct mm_uprobe {
> > > struct rb_node rb_node;
> > > unsigned long auprobe;
> > > unsigned long vaddr;
> > > + bool optimized;
> > > };
> > >
> >
I'm trying to understand if this RB-tree based mm_uprobe is strictly
necessary. Is it? Sure, we keep the optimized flag, but that's more for
defensive checks, no? Is there any other reason we need this separate
lookup data structure?
>
> so the call instruction update is done in 2 locked steps:
> - first we write the breakpoint as part of normal uprobe registration
> - then, when the uprobe is hit, we overwrite the breakpoint with the call instruction
>
> in between we could race with another thread that could either unregister the
> uprobe or try to optimize it as well
>
> I think we either need to keep the state of the uprobe per process (mm_struct),
> or we would need to read the probed instruction each time we need to make a
> decision based on the state we are at (nop5, breakpoint, call)
This decision is only done in "slow path", right? Only when
registering/unregistering. And those operations are done under lock.
So reading those 5 bytes every time we register/unregister seems
completely acceptable, rather than now *also* having a per-mm uprobe
lookup tree.
>
> jirka