Message-ID: <Z8JEPdAHkkEL4x7k@krava>
Date: Sat, 1 Mar 2025 00:18:21 +0100
From: Jiri Olsa <olsajiri@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Jiri Olsa <olsajiri@...il.com>, Oleg Nesterov <oleg@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrii Nakryiko <andrii@...nel.org>, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
x86@...nel.org, Song Liu <songliubraving@...com>,
Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
Hao Luo <haoluo@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Alan Maguire <alan.maguire@...cle.com>,
David Laight <David.Laight@...lab.com>,
Thomas Weißschuh <thomas@...ch.de>
Subject: Re: [PATCH RFCv2 12/18] uprobes/x86: Add support to optimize uprobes
On Fri, Feb 28, 2025 at 03:00:22PM -0800, Andrii Nakryiko wrote:
> On Fri, Feb 28, 2025 at 2:55 PM Jiri Olsa <olsajiri@...il.com> wrote:
> >
> > On Fri, Feb 28, 2025 at 10:55:24AM -0800, Andrii Nakryiko wrote:
> > > On Mon, Feb 24, 2025 at 6:04 AM Jiri Olsa <jolsa@...nel.org> wrote:
> > > >
> > > > Putting together all the previously added pieces to support optimized
> > > > uprobes on top of 5-byte nop instruction.
> > > >
> > > > The current uprobe execution goes through the following steps:
> > > > - install breakpoint instruction over the original instruction
> > > > - exception handler is hit and calls related uprobe consumers
> > > > - either simulates the original instruction or does out of line single step
> > > > execution of it
> > > > - returns to user space
> > > >
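(For illustration of the breakpoint step above: UPROBE_SWBP_INSN is a single
0xcc byte, so installing it only touches the first byte of the probed
instruction. A minimal userspace sketch, with helper names of my own making,
the buffer standing in for the probed text page:)

```c
#include <assert.h>
#include <stdint.h>

#define UPROBE_SWBP_INSN 0xcc /* int3 */

/* Hypothetical helper: install a breakpoint over the first byte of the
 * probed instruction, returning the original byte so it can be restored
 * later (e.g. on unregister). */
static uint8_t install_swbp(uint8_t *insn)
{
	uint8_t orig = insn[0];

	insn[0] = UPROBE_SWBP_INSN;
	return orig;
}

/* usage:
 *   uint8_t text[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };  5-byte nop
 *   uint8_t orig = install_swbp(text);   text[0] is now 0xcc
 *   text[0] = orig;                      restore on unregister
 */
```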
> > > > The optimized uprobe path:
> > > >
> > > > - checks that the original instruction is a 5-byte nop (plus other checks)
> > > > - adds (or uses existing) user space trampoline and overwrites the original
> > > > instruction (5-byte nop) with a call to the user space trampoline
> > > > - the user space trampoline executes the uprobe syscall that calls related
> > > > uprobe consumers
> > > > - the trampoline returns to the next instruction
> > > >
> > > > This approach won't speed up all uprobes as it's limited to using nop5 as
> > > > the original instruction, but we could use nop5 as the USDT probe instruction
> > > > (which uses a single-byte nop ATM) and speed up the USDT probes.
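(Side note on why the rewrite works out: both nop5 and a near call are exactly
5 bytes, the call being 0xe8 followed by a 32-bit displacement relative to the
end of the instruction. A rough userspace sketch of that encoding, with
made-up addresses and a helper name of my own:)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CALL_INSN_OPCODE 0xe8
#define CALL_INSN_SIZE   5

/* Encode "call rel32" at vaddr targeting tramp; the displacement is
 * relative to the address of the *next* instruction. */
static void emit_call(uint8_t insn[CALL_INSN_SIZE],
		      uint64_t vaddr, uint64_t tramp)
{
	int32_t disp = (int32_t)(tramp - (vaddr + CALL_INSN_SIZE));

	insn[0] = CALL_INSN_OPCODE;
	memcpy(&insn[1], &disp, sizeof(disp));
}
```

The rel32 reach (±2GB from the probe site) is presumably also why the series
keeps one trampoline page per 4GB region of the address space.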
> > > >
> > > > This patch overloads the related arch functions in uprobe_write_opcode and
> > > > set_orig_insn so they can install a call instruction when needed.
> > > >
> > > > The arch_uprobe_optimize function triggers the uprobe optimization and is
> > > > called after the first uprobe hit. I originally had it called on uprobe
> > > > installation, but it clashed with the ELF loader, because the user space
> > > > trampoline was added in a place where the loader might need to put ELF
> > > > segments, so I decided to do it after the first uprobe hit, when loading is done.
> > > >
> > > > We do not unmap and release the uprobe trampoline when it's no longer needed,
> > > > because there's no easy way to make sure none of the threads is still
> > > > inside the trampoline. But we do not waste memory, because there's just a
> > > > single page for all the uprobe trampoline mappings.
> > > >
> > > > We do waste a page frame for every 4GB of address space by keeping the
> > > > uprobe trampoline page mapped, but that seems ok.
> > > >
> > > > Attaching the speed up from benchs/run_bench_uprobes.sh script:
> > > >
> > > > current:
> > > > usermode-count : 818.836 ± 2.842M/s
> > > > syscall-count : 8.917 ± 0.003M/s
> > > > uprobe-nop : 3.056 ± 0.013M/s
> > > > uprobe-push : 2.903 ± 0.002M/s
> > > > uprobe-ret : 1.533 ± 0.001M/s
> > > > --> uprobe-nop5 : 1.492 ± 0.000M/s
> > > > uretprobe-nop : 1.783 ± 0.000M/s
> > > > uretprobe-push : 1.672 ± 0.001M/s
> > > > uretprobe-ret : 1.067 ± 0.002M/s
> > > > --> uretprobe-nop5 : 1.052 ± 0.000M/s
> > > >
> > > > after the change:
> > > >
> > > > usermode-count : 818.386 ± 1.886M/s
> > > > syscall-count : 8.923 ± 0.003M/s
> > > > uprobe-nop : 3.086 ± 0.005M/s
> > > > uprobe-push : 2.751 ± 0.001M/s
> > > > uprobe-ret : 1.481 ± 0.000M/s
> > > > --> uprobe-nop5 : 4.016 ± 0.002M/s
> > > > uretprobe-nop : 1.712 ± 0.008M/s
> > > > uretprobe-push : 1.616 ± 0.001M/s
> > > > uretprobe-ret : 1.052 ± 0.000M/s
> > > > --> uretprobe-nop5 : 2.015 ± 0.000M/s
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> > > > ---
> > > > arch/x86/include/asm/uprobes.h | 6 ++
> > > > arch/x86/kernel/uprobes.c | 191 ++++++++++++++++++++++++++++++++-
> > > > include/linux/uprobes.h | 6 +-
> > > > kernel/events/uprobes.c | 16 ++-
> > > > 4 files changed, 209 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > > > index 678fb546f0a7..7d4df920bb59 100644
> > > > --- a/arch/x86/include/asm/uprobes.h
> > > > +++ b/arch/x86/include/asm/uprobes.h
> > > > @@ -20,6 +20,10 @@ typedef u8 uprobe_opcode_t;
> > > > #define UPROBE_SWBP_INSN 0xcc
> > > > #define UPROBE_SWBP_INSN_SIZE 1
> > > >
> > > > +enum {
> > > > + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> > > > +};
> > > > +
> > > > struct uprobe_xol_ops;
> > > >
> > > > struct arch_uprobe {
> > > > @@ -45,6 +49,8 @@ struct arch_uprobe {
> > > > u8 ilen;
> > > > } push;
> > > > };
> > > > +
> > > > + unsigned long flags;
> > > > };
> > > >
> > > > struct arch_uprobe_task {
> > > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > > index e8aebbda83bc..73ddff823904 100644
> > > > --- a/arch/x86/kernel/uprobes.c
> > > > +++ b/arch/x86/kernel/uprobes.c
> > > > @@ -18,6 +18,7 @@
> > > > #include <asm/processor.h>
> > > > #include <asm/insn.h>
> > > > #include <asm/mmu_context.h>
> > > > +#include <asm/nops.h>
> > > >
> > > > /* Post-execution fixups. */
> > > >
> > > > @@ -768,7 +769,7 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> > > > return NULL;
> > > > }
> > > >
> > > > -static __maybe_unused struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > > +static struct uprobe_trampoline *uprobe_trampoline_get(unsigned long vaddr)
> > > > {
> > > > struct uprobes_state *state = &current->mm->uprobes_state;
> > > > struct uprobe_trampoline *tramp = NULL;
> > > > @@ -794,7 +795,7 @@ static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
> > > > kfree(tramp);
> > > > }
> > > >
> > > > -static __maybe_unused void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > > +static void uprobe_trampoline_put(struct uprobe_trampoline *tramp)
> > > > {
> > > > if (tramp == NULL)
> > > > return;
> > > > @@ -807,6 +808,7 @@ struct mm_uprobe {
> > > > struct rb_node rb_node;
> > > > unsigned long auprobe;
> > > > unsigned long vaddr;
> > > > + bool optimized;
> > > > };
> > > >
> > >
> > > I'm trying to understand if this RB-tree based mm_uprobe is strictly
> > > necessary. Is it? Sure we keep optimized flag, but that's more for
> > > defensive checks, no? Is there any other reason we need this separate
> > > look up data structure?
> >
> > so the call instruction update is done in 2 locked steps:
> > - first we write breakpoint as part of normal uprobe registration
> > - then uprobe is hit, we overwrite breakpoint with call instruction
> >
> > in between we could race with another thread that could either unregister the
> > uprobe or try to optimize the uprobe as well
> >
> > I think we either need to keep the state of the uprobe per process (mm_struct),
> > or we would need to read the probed instruction each time we need to make a
> > decision based on what state we are at (nop5, breakpoint, call)
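(to make that concrete, the "read the probed instruction" alternative boils
down to classifying those 5 bytes.. a rough sketch, the enum names are mine,
not the patch's:)

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum insn_state {
	INSN_NOP5,		/* not traced (yet) */
	INSN_BREAKPOINT,	/* registered, not optimized */
	INSN_CALL,		/* optimized: call into the trampoline */
	INSN_OTHER,
};

/* Decide which stage the probed instruction is in from the 5 bytes
 * read out of the task's text mapping. */
static enum insn_state classify(const uint8_t insn[5])
{
	/* BYTES_NOP5: nopl 0x0(%rax,%rax,1) */
	static const uint8_t nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

	if (!memcmp(insn, nop5, sizeof(nop5)))
		return INSN_NOP5;
	if (insn[0] == 0xcc)		/* int3 */
		return INSN_BREAKPOINT;
	if (insn[0] == 0xe8)		/* call rel32 */
		return INSN_CALL;
	return INSN_OTHER;
}
```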
>
> This decision is only done in "slow path", right? Only when
> registering/unregistering. And those operations are done under lock.
> So reading those 5 bytes every time we register/unregister seems
> completely acceptable, rather than now *also* having a per-mm uprobe
> lookup tree.
true.. I was also thinking about having another flag in that tree for
when we fail to optimize the uprobe for a reason other than the page
alignment issue.. without such a flag we'd need to read the 5 bytes each
time we hit that uprobe.. but that might be a rare case
jirka