linux-kernel - Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize uprobes

Message-ID: <aIftAJg1hZGYp4NF@krava>
Date: Mon, 28 Jul 2025 23:34:56 +0200
From: Jiri Olsa <olsajiri@...il.com>
To: Masami Hiramatsu <mhiramat@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>
Cc: Oleg Nesterov <oleg@...hat.com>, Andrii Nakryiko <andrii@...nel.org>,
	bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org, x86@...nel.org,
	Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
	John Fastabend <john.fastabend@...il.com>,
	Hao Luo <haoluo@...gle.com>, Steven Rostedt <rostedt@...dmis.org>,
	Alan Maguire <alan.maguire@...cle.com>,
	David Laight <David.Laight@...lab.com>,
	Thomas Weißschuh <thomas@...ch.de>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCHv6 perf/core 10/22] uprobes/x86: Add support to optimize
 uprobes

On Fri, Jul 25, 2025 at 07:13:18PM +0900, Masami Hiramatsu wrote:
> On Sun, 20 Jul 2025 13:21:20 +0200
> Jiri Olsa <jolsa@...nel.org> wrote:
> 
> > Putting together all the previously added pieces to support optimized
> > uprobes on top of 5-byte nop instruction.
> > 
> > The current uprobe execution goes through the following steps:
> > 
> >   - installs breakpoint instruction over original instruction
> >   - exception handler is hit and calls related uprobe consumers
> >   - and either simulates original instruction or does out of line single step
> >     execution of it
> >   - returns to user space
> > 
> > The optimized uprobe path does the following:
> > 
> >   - checks the original instruction is 5-byte nop (plus other checks)
> >   - adds (or uses existing) user space trampoline with uprobe syscall
> >   - overwrites original instruction (5-byte nop) with call to user space
> >     trampoline
> >   - the user space trampoline executes uprobe syscall that calls related uprobe
> >     consumers
> >   - trampoline returns back to next instruction
> > 
> > This approach won't speed up all uprobes as it's limited to using nop5 as
> > original instruction, but we plan to use nop5 as USDT probe instruction
> > (which currently uses single byte nop) and speed up the USDT probes.
> > 
> > The arch_uprobe_optimize triggers the uprobe optimization and is called after
> > first uprobe hit. I originally had it called on uprobe installation but then
> > it clashed with elf loader, because the user space trampoline was added in a
> > place where loader might need to put elf segments, so I decided to do it after
> > first uprobe hit when loading is done.
> > 
> > The uprobe is un-optimized in arch specific set_orig_insn call.
> > 
> > The instruction overwrite is x86 arch specific and needs to go through 3 updates:
> > (on top of nop5 instruction)
> > 
> >   - write int3 into 1st byte
> >   - write last 4 bytes of the call instruction
> >   - update the call instruction opcode
> > 
> > And cleanup goes through similar reverse stages:
> > 
> >   - overwrite call opcode with breakpoint (int3)
> >   - write last 4 bytes of the nop5 instruction
> >   - write the nop5 first instruction byte
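
The two byte-level sequences above can be sketched in user-space C as follows. This is only an illustration under stated assumptions, not the patch's code: encode_call(), optimize_steps() and unoptimize_steps() are hypothetical names, the buffer stands in for the probed instruction, and any synchronization the kernel needs between the individual writes is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INSN_SIZE   5
#define INT3_OPCODE 0xcc  /* x86 breakpoint */
#define CALL_OPCODE 0xe8  /* x86 call rel32 */

/* The 5-byte nop encoding (nopl 0x0(%rax,%rax,1)). */
static const uint8_t nop5[INSN_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Hypothetical helper: encode "call rel32" located at 'from' targeting 'to'. */
static void encode_call(uint8_t out[INSN_SIZE], uint64_t from, uint64_t to)
{
	/* The displacement is relative to the end of the call instruction. */
	int64_t delta = (int64_t)to - (int64_t)(from + INSN_SIZE);
	uint32_t rel;

	assert(delta >= INT32_MIN && delta <= INT32_MAX);
	rel = (uint32_t)(int32_t)delta;

	out[0] = CALL_OPCODE;
	for (int i = 0; i < 4; i++)          /* little-endian rel32 */
		out[1 + i] = (uint8_t)(rel >> (8 * i));
}

/* nop5 -> call: the first byte is a breakpoint while the tail changes. */
static void optimize_steps(uint8_t insn[INSN_SIZE], const uint8_t call[INSN_SIZE])
{
	insn[0] = INT3_OPCODE;                     /* 1: int3 into 1st byte   */
	memcpy(&insn[1], &call[1], INSN_SIZE - 1); /* 2: last 4 bytes of call */
	insn[0] = call[0];                         /* 3: call opcode          */
}

/* call -> nop5: the same idea in reverse. */
static void unoptimize_steps(uint8_t insn[INSN_SIZE])
{
	insn[0] = INT3_OPCODE;                     /* 1: call opcode -> int3  */
	memcpy(&insn[1], &nop5[1], INSN_SIZE - 1); /* 2: last 4 bytes of nop5 */
	insn[0] = nop5[0];                         /* 3: first nop5 byte      */
}
```

In this sketch the first byte is int3 whenever the remaining four bytes are being changed, so a thread that reaches the instruction mid-update would hit a breakpoint rather than a partly written instruction.
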
> > 
> > We do not unmap and release uprobe trampoline when it's no longer needed,
> > because there's no easy way to make sure none of the threads is still
> > inside the trampoline. But we do not waste memory, because there's just
> > single page for all the uprobe trampoline mappings.
> > 
> > We do waste a frame on a page mapping for every 4GB by keeping the uprobe
> > trampoline page mapped, but that seems ok.
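
The "every 4GB" figure fits the reach of an x86 call with a 32-bit displacement, which covers roughly 2GB in either direction from the probed instruction. A minimal sketch of that reachability check, assuming user-space addresses that fit in a signed 64-bit value; call_reaches() is a hypothetical name, not a function from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CALL_INSN_SIZE 5

/*
 * Hypothetical helper: can a 5-byte "call rel32" placed at 'from'
 * reach a trampoline at 'to'? The displacement is measured from the
 * end of the call instruction and must fit in a signed 32-bit value.
 */
static bool call_reaches(uint64_t from, uint64_t to)
{
	int64_t delta;

	/* Assumption: both are user-space addresses, so signed math is safe. */
	assert(from <= (uint64_t)INT64_MAX - CALL_INSN_SIZE);
	assert(to <= (uint64_t)INT64_MAX);

	delta = (int64_t)to - (int64_t)(from + CALL_INSN_SIZE);
	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

A probe whose nearest trampoline fails this check would need another trampoline page mapped closer to it.
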
> > 
> > We benefit from the fact that set_swbp and set_orig_insn are
> > called under mmap_write_lock(mm), so we can use the current instruction
> > as the state the uprobe is in - nop5/breakpoint/call trampoline -
> > and decide the needed action (optimize/un-optimize) based on that.
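
Reading the state back from the instruction bytes could look roughly like the sketch below. The enum and classify() are hypothetical names used only for illustration; a real check would presumably also confirm that a call actually targets a uprobe trampoline, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INSN_SIZE   5
#define INT3_OPCODE 0xcc
#define CALL_OPCODE 0xe8

enum probe_state {
	STATE_NOP5,        /* original instruction, no probe installed */
	STATE_BREAKPOINT,  /* int3 installed, not optimized            */
	STATE_CALL,        /* optimized: call to trampoline            */
	STATE_UNKNOWN,
};

static const uint8_t nop5[INSN_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Hypothetical helper: derive the probe state from the current bytes. */
static enum probe_state classify(const uint8_t insn[INSN_SIZE])
{
	assert(insn != NULL);

	if (memcmp(insn, nop5, INSN_SIZE) == 0)
		return STATE_NOP5;
	if (insn[0] == INT3_OPCODE)
		return STATE_BREAKPOINT;
	if (insn[0] == CALL_OPCODE)
		return STATE_CALL;  /* target validation omitted in this sketch */
	return STATE_UNKNOWN;
}
```

Because the bytes themselves carry the state, no separate per-probe flag is needed in this sketch; that only works if, as the text says, updates are serialized by the lock.
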
> > 
> > Attaching the speed up from benchs/run_bench_uprobes.sh script:
> > 
> > current:
> >         usermode-count :  152.604 ± 0.044M/s
> >         syscall-count  :   13.359 ± 0.042M/s
> > -->     uprobe-nop     :    3.229 ± 0.002M/s
> >         uprobe-push    :    3.086 ± 0.004M/s
> >         uprobe-ret     :    1.114 ± 0.004M/s
> >         uprobe-nop5    :    1.121 ± 0.005M/s
> >         uretprobe-nop  :    2.145 ± 0.002M/s
> >         uretprobe-push :    2.070 ± 0.001M/s
> >         uretprobe-ret  :    0.931 ± 0.001M/s
> >         uretprobe-nop5 :    0.957 ± 0.001M/s
> > 
> > after the change:
> >         usermode-count :  152.448 ± 0.244M/s
> >         syscall-count  :   14.321 ± 0.059M/s
> >         uprobe-nop     :    3.148 ± 0.007M/s
> >         uprobe-push    :    2.976 ± 0.004M/s
> >         uprobe-ret     :    1.068 ± 0.003M/s
> > -->     uprobe-nop5    :    7.038 ± 0.007M/s
> >         uretprobe-nop  :    2.109 ± 0.004M/s
> >         uretprobe-push :    2.035 ± 0.001M/s
> >         uretprobe-ret  :    0.908 ± 0.001M/s
> >         uretprobe-nop5 :    3.377 ± 0.009M/s
> > 
> > I see a bit more speed up on Intel (above) compared to AMD. The big nop5
> > speed up is partly due to emulating nop5 and partly due to optimization.
> > 
> > The key speed up we do this for is the USDT switch from nop to nop5:
> >         uprobe-nop     :    3.148 ± 0.007M/s
> >         uprobe-nop5    :    7.038 ± 0.007M/s
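
For reference, the ratios behind these numbers can be worked out with a small helper. speedup() is a hypothetical name, and the inputs are the M/s figures quoted above:

```c
#include <assert.h>

/* Ratio of throughput after a change to throughput before it. */
static double speedup(double before_mps, double after_mps)
{
	assert(before_mps > 0.0);
	return after_mps / before_mps;
}
```

With the figures above, speedup(3.148, 7.038) is roughly 2.2 for the USDT switch from nop to nop5, and speedup(1.121, 7.038) is roughly 6.3 for uprobe-nop5 before and after the change.
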
> > 
> 
> This also looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) <mhiramat@...nel.org>

thanks!

Peter, do you have more comments?

thanks,
jirka