Message-Id: <20250714191304.a93c5398165bafc93827e716@kernel.org>
Date: Mon, 14 Jul 2025 19:13:04 +0900
From: Masami Hiramatsu (Google) <mhiramat@...nel.org>
To: Jiri Olsa <jolsa@...nel.org>
Cc: Oleg Nesterov <oleg@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Andrii Nakryiko <andrii@...nel.org>, bpf@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
x86@...nel.org, Song Liu <songliubraving@...com>, Yonghong Song
<yhs@...com>, John Fastabend <john.fastabend@...il.com>, Hao Luo
<haoluo@...gle.com>, Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Alan Maguire <alan.maguire@...cle.com>, David Laight
<David.Laight@...LAB.COM>, Thomas Weißschuh
<thomas@...ch.de>, Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCHv5 perf/core 10/22] uprobes/x86: Add support to optimize
uprobes
On Fri, 11 Jul 2025 10:29:18 +0200
Jiri Olsa <jolsa@...nel.org> wrote:
> Putting together all the previously added pieces to support optimized
> uprobes on top of a 5-byte nop instruction.
>
> The current uprobe execution goes through the following steps:
>
> - installs a breakpoint instruction over the original instruction
> - the exception handler is hit and calls the related uprobe consumers
> - and either simulates the original instruction or does out-of-line
>   single-step execution of it
> - returns to user space
>
> The optimized uprobe path does the following:
>
> - checks that the original instruction is a 5-byte nop (plus other checks)
> - adds (or reuses an existing) user space trampoline with the uprobe syscall
> - overwrites the original instruction (5-byte nop) with a call to the user
>   space trampoline
> - the user space trampoline executes the uprobe syscall, which calls the
>   related uprobe consumers
> - the trampoline returns to the next instruction
>
> This approach won't speed up all uprobes as it's limited to using nop5 as
> the original instruction, but we plan to use nop5 as the USDT probe
> instruction (which currently uses a single-byte nop) and speed up the USDT
> probes.
>
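(Just to illustrate the target use case: a minimal, hypothetical sketch of
such a nop5 probe site from user space. The PROBE_NOP5 macro below is made
up for this example - the real USDT macros additionally record the probe
location in an ELF note.)

	#include <stdio.h>

	/* Emit the 5-byte nop (0f 1f 44 00 00) that can later be rewritten
	 * into a call to the uprobe trampoline. */
	#define PROBE_NOP5() asm volatile (".byte 0x0f, 0x1f, 0x44, 0x00, 0x00")

	int main(void)
	{
		for (int i = 0; i < 3; i++) {
			PROBE_NOP5();	/* attach a uprobe/USDT here */
			printf("iteration %d\n", i);
		}
		return 0;
	}
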
> The arch_uprobe_optimize function triggers the uprobe optimization and is
> called after the first uprobe hit. I originally had it called on uprobe
> installation, but that clashed with the ELF loader, because the user space
> trampoline was added in a place where the loader might need to put ELF
> segments, so I decided to do it after the first uprobe hit, when loading is
> done.
>
> The uprobe is un-optimized in arch specific set_orig_insn call.
>
> The instruction overwrite is x86 arch specific and needs to go through 3
> updates (on top of the nop5 instruction):
>
> - write int3 into the 1st byte
> - write the last 4 bytes of the call instruction
> - update the call instruction opcode
>
> And cleanup goes through similar stages in reverse (concrete byte states
> for both sequences are sketched below):
>
> - overwrite the call opcode with a breakpoint (int3)
> - write the last 4 bytes of the nop5 instruction
> - write the first byte of the nop5 instruction
>
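To make the two write sequences above concrete - assuming the usual nop5
encoding 0f 1f 44 00 00, with e8 as the call opcode, cc as int3, and r0..r3
as the four relative-offset bytes of the call - the 5 bytes at the probe
address evolve as:

	optimize (nop5 site, already armed with int3):
	  start                       cc 1f 44 00 00
	  OPT_PART   (vaddr+1, 4b)    cc r0 r1 r2 r3
	  OPT_INSN   (vaddr,   1b)    e8 r0 r1 r2 r3

	un-optimize (back to nop5, int3 still armed):
	  start                       e8 r0 r1 r2 r3
	  UNOPT_INT3 (vaddr,   1b)    cc r0 r1 r2 r3
	  UNOPT_PART (vaddr+1, 4b)    cc 1f 44 00 00

with core syncs in between (see swbp_optimize()/swbp_unoptimize() below), so
no CPU can execute a partially updated instruction.
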
> We do not unmap and release the uprobe trampoline when it's no longer
> needed, because there's no easy way to make sure none of the threads is
> still inside the trampoline. But we do not waste memory, because there's
> just a single page backing all the uprobe trampoline mappings.
>
> We do waste a page frame on the trampoline mapping for every 4GB region by
> keeping the uprobe trampoline page mapped, but that seems ok.
>
> We benefit from the fact that set_swbp and set_orig_insn are called under
> mmap_write_lock(mm), so we can use the current instruction as the state the
> uprobe is in - nop5/breakpoint/call trampoline - and decide the needed
> action (optimize/un-optimize) based on that.
>
> Attaching the speedup numbers from the benchs/run_bench_uprobes.sh script:
>
> current:
> usermode-count : 152.604 ± 0.044M/s
> syscall-count : 13.359 ± 0.042M/s
> --> uprobe-nop : 3.229 ± 0.002M/s
> uprobe-push : 3.086 ± 0.004M/s
> uprobe-ret : 1.114 ± 0.004M/s
> uprobe-nop5 : 1.121 ± 0.005M/s
> uretprobe-nop : 2.145 ± 0.002M/s
> uretprobe-push : 2.070 ± 0.001M/s
> uretprobe-ret : 0.931 ± 0.001M/s
> uretprobe-nop5 : 0.957 ± 0.001M/s
>
> after the change:
> usermode-count : 152.448 ± 0.244M/s
> syscall-count : 14.321 ± 0.059M/s
> uprobe-nop : 3.148 ± 0.007M/s
> uprobe-push : 2.976 ± 0.004M/s
> uprobe-ret : 1.068 ± 0.003M/s
> --> uprobe-nop5 : 7.038 ± 0.007M/s
> uretprobe-nop : 2.109 ± 0.004M/s
> uretprobe-push : 2.035 ± 0.001M/s
> uretprobe-ret : 0.908 ± 0.001M/s
> uretprobe-nop5 : 3.377 ± 0.009M/s
>
> I see a bit more speedup on Intel (above) compared to AMD. The big nop5
> speedup is partly due to emulating nop5 and partly due to the optimization.
>
> The key speedup we do this for is the USDT switch from nop to nop5:
> uprobe-nop : 3.148 ± 0.007M/s
> uprobe-nop5 : 7.038 ± 0.007M/s
>
> Acked-by: Andrii Nakryiko <andrii@...nel.org>
> Acked-by: Oleg Nesterov <oleg@...hat.com>
> Signed-off-by: Jiri Olsa <jolsa@...nel.org>
> ---
> arch/x86/include/asm/uprobes.h | 7 +
> arch/x86/kernel/uprobes.c | 288 ++++++++++++++++++++++++++++++++-
> include/linux/uprobes.h | 6 +-
> kernel/events/uprobes.c | 16 +-
> 4 files changed, 310 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 678fb546f0a7..1ee2e5115955 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t;
> #define UPROBE_SWBP_INSN 0xcc
> #define UPROBE_SWBP_INSN_SIZE 1
>
> +enum {
> + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0,
> + ARCH_UPROBE_FLAG_OPTIMIZE_FAIL = 1,
> +};
> +
> struct uprobe_xol_ops;
>
> struct arch_uprobe {
> @@ -45,6 +50,8 @@ struct arch_uprobe {
> u8 ilen;
> } push;
> };
> +
> + unsigned long flags;
> };
>
> struct arch_uprobe_task {
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index 5eecab712376..b80942768f77 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -18,6 +18,7 @@
> #include <asm/processor.h>
> #include <asm/insn.h>
> #include <asm/mmu_context.h>
> +#include <asm/nops.h>
>
> /* Post-execution fixups. */
>
> @@ -702,7 +703,6 @@ static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
> return tramp;
> }
>
> -__maybe_unused
> static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
> {
> struct uprobes_state *state = &current->mm->uprobes_state;
> @@ -874,6 +874,285 @@ static int __init arch_uprobes_init(void)
>
> late_initcall(arch_uprobes_init);
>
> +enum {
> + OPT_PART,
> + OPT_INSN,
> + UNOPT_INT3,
> + UNOPT_PART,
> +};
> +
> +struct write_opcode_ctx {
> + unsigned long base;
> + int update;
> +};
> +
> +static int is_call_insn(uprobe_opcode_t *insn)
> +{
> + return *insn == CALL_INSN_OPCODE;
> +}
> +
nit: Maybe we need a comment on how this is verified, as below, or just say
"See swbp_optimize/unoptimize() for how it works":
/*
 * Verify that the old opcode starts with swbp or call before updating to the
 * new opcode. When optimizing from swbp -> call, write the 4 operand bytes
 * (OPT_PART) and then write the first opcode byte (OPT_INSN). In
 * unoptimizing, write the first opcode byte (UNOPT_INT3) and then write the
 * remaining bytes (UNOPT_PART).
 * Thus, the *old* `opcode` byte (not @vaddr[0], but ctx->base[0]) must be
 * INT3 (OPT_PART, OPT_INSN, and UNOPT_PART) or CALL (UNOPT_INT3).
 */
> +static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
> + int nbytes, void *data)
> +{
> + struct write_opcode_ctx *ctx = data;
> + uprobe_opcode_t old_opcode[5];
> +
> + uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
> +
> + switch (ctx->update) {
> + case OPT_PART:
> + case OPT_INSN:
> + if (is_swbp_insn(&old_opcode[0]))
> + return 1;
> + break;
> + case UNOPT_INT3:
> + if (is_call_insn(&old_opcode[0]))
> + return 1;
> + break;
> + case UNOPT_PART:
> + if (is_swbp_insn(&old_opcode[0]))
> + return 1;
> + break;
nit: Can we fold this case into the OPT_PART & OPT_INSN case?
It seems the same.
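i.e. something like (untested):

	switch (ctx->update) {
	case OPT_PART:
	case OPT_INSN:
	case UNOPT_PART:
		if (is_swbp_insn(&old_opcode[0]))
			return 1;
		break;
	case UNOPT_INT3:
		if (is_call_insn(&old_opcode[0]))
			return 1;
		break;
	}
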
Thanks,
> + }
> +
> + return -1;
> +}
> +
> +static int write_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr,
> + uprobe_opcode_t *insn, int nbytes, void *ctx)
> +{
> + return uprobe_write(auprobe, vma, vaddr, insn, nbytes, verify_insn,
> + true /* is_register */, false /* do_update_ref_ctr */, ctx);
> +}
> +
> +static void relative_call(void *dest, long from, long to)
> +{
> + struct __packed __arch_relative_insn {
> + u8 op;
> + s32 raddr;
> + } *insn;
> +
> + insn = (struct __arch_relative_insn *)dest;
> + insn->raddr = (s32)(to - (from + 5));
> + insn->op = CALL_INSN_OPCODE;
> +}
> +
> +static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr, unsigned long tramp)
> +{
> + struct write_opcode_ctx ctx = {
> + .base = vaddr,
> + };
> + char call[5];
> + int err;
> +
> + relative_call(call, vaddr, tramp);
> +
> + /*
> + * We are in state where breakpoint (int3) is installed on top of first
> + * byte of the nop5 instruction. We will do following steps to overwrite
> + * this to call instruction:
> + *
> + * - sync cores
> + * - write last 4 bytes of the call instruction
> + * - sync cores
> + * - update the call instruction opcode
> + */
> +
> + smp_text_poke_sync_each_cpu();
> +
> + ctx.update = OPT_PART;
> + err = write_insn(auprobe, vma, vaddr + 1, call + 1, 4, &ctx);
> + if (err)
> + return err;
> +
> + smp_text_poke_sync_each_cpu();
> +
> + ctx.update = OPT_INSN;
> + return write_insn(auprobe, vma, vaddr, call, 1, &ctx);
> +}
> +
> +static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr)
> +{
> + uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
> + struct write_opcode_ctx ctx = {
> + .base = vaddr,
> + };
> + int err;
> +
> + /*
> + * We need to overwrite call instruction into nop5 instruction with
> + * breakpoint (int3) installed on top of its first byte. We will:
> + *
> + * - overwrite call opcode with breakpoint (int3)
> + * - sync cores
> + * - write last 4 bytes of the nop5 instruction
> + * - sync cores
> + */
> +
> + ctx.update = UNOPT_INT3;
> + err = write_insn(auprobe, vma, vaddr, &int3, 1, &ctx);
> + if (err)
> + return err;
> +
> + smp_text_poke_sync_each_cpu();
> +
> + ctx.update = UNOPT_PART;
> + err = write_insn(auprobe, vma, vaddr + 1, (uprobe_opcode_t *) auprobe->insn + 1, 4, &ctx);
> +
> + smp_text_poke_sync_each_cpu();
> + return err;
> +}
> +
> +static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
> +{
> + unsigned int gup_flags = FOLL_FORCE|FOLL_SPLIT_PMD;
> + struct vm_area_struct *vma;
> + struct page *page;
> +
> + page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
> + if (IS_ERR(page))
> + return PTR_ERR(page);
> + uprobe_copy_from_page(page, vaddr, dst, len);
> + put_page(page);
> + return 0;
> +}
> +
> +static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
> +{
> + struct __packed __arch_relative_insn {
> + u8 op;
> + s32 raddr;
> + } *call = (struct __arch_relative_insn *) insn;
> +
> + if (!is_call_insn(insn))
> + return false;
> + return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
> +}
> +
> +static int is_optimized(struct mm_struct *mm, unsigned long vaddr, bool *optimized)
> +{
> + uprobe_opcode_t insn[5];
> + int err;
> +
> + err = copy_from_vaddr(mm, vaddr, &insn, 5);
> + if (err)
> + return err;
> + *optimized = __is_optimized((uprobe_opcode_t *)&insn, vaddr);
> + return 0;
> +}
> +
> +static bool should_optimize(struct arch_uprobe *auprobe)
> +{
> + return !test_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags) &&
> + test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> +}
> +
> +int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr)
> +{
> + if (should_optimize(auprobe)) {
> + bool optimized = false;
> + int err;
> +
> + /*
> + * We could race with another thread that already optimized the probe,
> + * so let's not overwrite it with int3 again in this case.
> + */
> + err = is_optimized(vma->vm_mm, vaddr, &optimized);
> + if (err)
> + return err;
> + if (optimized)
> + return 0;
> + }
> + return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN,
> + true /* is_register */);
> +}
> +
> +int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> + unsigned long vaddr)
> +{
> + if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
> + struct mm_struct *mm = vma->vm_mm;
> + bool optimized = false;
> + int err;
> +
> + err = is_optimized(mm, vaddr, &optimized);
> + if (err)
> + return err;
> + if (optimized)
> + WARN_ON_ONCE(swbp_unoptimize(auprobe, vma, vaddr));
> + }
> + return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
> + false /* is_register */);
> +}
> +
> +static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
> + unsigned long vaddr)
> +{
> + struct uprobe_trampoline *tramp;
> + struct vm_area_struct *vma;
> + bool new = false;
> + int err = 0;
> +
> + vma = find_vma(mm, vaddr);
> + if (!vma)
> + return -EINVAL;
> + tramp = get_uprobe_trampoline(vaddr, &new);
> + if (!tramp)
> + return -EINVAL;
> + err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
> + if (WARN_ON_ONCE(err) && new)
> + destroy_uprobe_trampoline(tramp);
> + return err;
> +}
> +
> +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + struct mm_struct *mm = current->mm;
> + uprobe_opcode_t insn[5];
> +
> + /*
> + * Do not optimize if shadow stack is enabled, the return address hijack
> + * code in arch_uretprobe_hijack_return_addr updates wrong frame when
> + * the entry uprobe is optimized and the shadow stack crashes the app.
> + */
> + if (shstk_is_enabled())
> + return;
> +
> + if (!should_optimize(auprobe))
> + return;
> +
> + mmap_write_lock(mm);
> +
> + /*
> + * Check if some other thread already optimized the uprobe for us,
> + * if it's the case just go away silently.
> + */
> + if (copy_from_vaddr(mm, vaddr, &insn, 5))
> + goto unlock;
> + if (!is_swbp_insn((uprobe_opcode_t*) &insn))
> + goto unlock;
> +
> + /*
> + * If we fail to optimize the uprobe we set the fail bit so the
> + * above should_optimize will fail from now on.
> + */
> + if (__arch_uprobe_optimize(auprobe, mm, vaddr))
> + set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
> +
> +unlock:
> + mmap_write_unlock(mm);
> +}
> +
> +static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + if (memcmp(&auprobe->insn, x86_nops[5], 5))
> + return false;
> + /* We can't do cross page atomic writes yet. */
> + return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
> +}
> #else /* 32-bit: */
> /*
> * No RIP-relative addressing on 32-bit
> @@ -887,6 +1166,10 @@ static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> {
> }
> +static bool can_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> + return false;
> +}
> #endif /* CONFIG_X86_64 */
>
> struct uprobe_xol_ops {
> @@ -1253,6 +1536,9 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
> if (ret)
> return ret;
>
> + if (can_optimize(auprobe, addr))
> + set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
> +
> ret = branch_setup_xol_ops(auprobe, &insn);
> if (ret != -ENOSYS)
> return ret;
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index b6b077cc7d0f..08ef78439d0d 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -192,7 +192,7 @@ struct uprobes_state {
> };
>
> typedef int (*uprobe_write_verify_t)(struct page *page, unsigned long vaddr,
> - uprobe_opcode_t *insn, int nbytes);
> + uprobe_opcode_t *insn, int nbytes, void *data);
>
> extern void __init uprobes_init(void);
> extern int set_swbp(struct arch_uprobe *aup, struct vm_area_struct *vma, unsigned long vaddr);
> @@ -204,7 +204,8 @@ extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
> extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma, unsigned long vaddr, uprobe_opcode_t,
> bool is_register);
> extern int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma, const unsigned long opcode_vaddr,
> - uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr);
> + uprobe_opcode_t *insn, int nbytes, uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr,
> + void *data);
> extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
> extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool);
> extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc);
> @@ -240,6 +241,7 @@ extern void uprobe_copy_from_page(struct page *page, unsigned long vaddr, void *
> extern void arch_uprobe_clear_state(struct mm_struct *mm);
> extern void arch_uprobe_init_state(struct mm_struct *mm);
> extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr);
> +extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr);
> #else /* !CONFIG_UPROBES */
> struct uprobes_state {
> };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cbba31c0495f..e54081beeab9 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -192,7 +192,7 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
> }
>
> static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *insn,
> - int nbytes)
> + int nbytes, void *data)
> {
> uprobe_opcode_t old_opcode;
> bool is_swbp;
> @@ -492,12 +492,13 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> bool is_register)
> {
> return uprobe_write(auprobe, vma, opcode_vaddr, &opcode, UPROBE_SWBP_INSN_SIZE,
> - verify_opcode, is_register, true /* do_update_ref_ctr */);
> + verify_opcode, is_register, true /* do_update_ref_ctr */, NULL);
> }
>
> int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> const unsigned long insn_vaddr, uprobe_opcode_t *insn, int nbytes,
> - uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr)
> + uprobe_write_verify_t verify, bool is_register, bool do_update_ref_ctr,
> + void *data)
> {
> const unsigned long vaddr = insn_vaddr & PAGE_MASK;
> struct mm_struct *mm = vma->vm_mm;
> @@ -531,7 +532,7 @@ int uprobe_write(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> goto out;
> folio = page_folio(page);
>
> - ret = verify(page, insn_vaddr, insn, nbytes);
> + ret = verify(page, insn_vaddr, insn, nbytes, data);
> if (ret <= 0) {
> folio_put(folio);
> goto out;
> @@ -2697,6 +2698,10 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
> return true;
> }
>
> +void __weak arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
> +{
> +}
> +
> /*
> * Run handler and ask thread to singlestep.
> * Ensure all non-fatal signals cannot interrupt thread while it singlesteps.
> @@ -2761,6 +2766,9 @@ static void handle_swbp(struct pt_regs *regs)
>
> handler_chain(uprobe, regs);
>
> + /* Try to optimize after first hit. */
> + arch_uprobe_optimize(&uprobe->arch, bp_vaddr);
> +
> if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> goto out;
>
> --
> 2.50.0
>
--
Masami Hiramatsu (Google) <mhiramat@...nel.org>