Message-ID: <CAEf4Bzbg2ROstG5+1XUoZre403n-B3CHuW9E0UECNY364giDcw@mail.gmail.com>
Date: Fri, 11 Jul 2025 10:17:50 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Jiri Olsa <jolsa@...nel.org>
Cc: Oleg Nesterov <oleg@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Andrii Nakryiko <andrii@...nel.org>, Alejandro Colomar <alx@...nel.org>, Eyal Birger <eyal.birger@...il.com>,
kees@...nel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, x86@...nel.org,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>, Hao Luo <haoluo@...gle.com>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>,
Alan Maguire <alan.maguire@...cle.com>, David Laight <David.Laight@...lab.com>,
Thomas Weißschuh <thomas@...ch.de>,
Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCHv5 perf/core 00/22] uprobes: Add support to optimize usdt
probes on x86_64
On Fri, Jul 11, 2025 at 1:29 AM Jiri Olsa <jolsa@...nel.org> wrote:
>
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
>
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
>
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 8.
>
> The run_bench_uprobes.sh benchmark triggers uprobe (on top of different
> original instructions) in a loop and counts how many of those happened
> per second (the unit below is million loops).
>
> There's big speed up if you consider current usdt implementation
> (uprobe-nop) compared to proposed usdt (uprobe-nop5):
>
> current:
>         usermode-count :  152.501 ± 0.012M/s
>         syscall-count  :   14.463 ± 0.062M/s
> -->     uprobe-nop     :    3.160 ± 0.005M/s
>         uprobe-push    :    3.003 ± 0.003M/s
>         uprobe-ret     :    1.100 ± 0.003M/s
>         uprobe-nop5    :    3.132 ± 0.012M/s
>         uretprobe-nop  :    2.103 ± 0.002M/s
>         uretprobe-push :    2.027 ± 0.004M/s
>         uretprobe-ret  :    0.914 ± 0.002M/s
>         uretprobe-nop5 :    2.115 ± 0.002M/s
>
> after the change:
>         usermode-count :  152.343 ± 0.400M/s
>         syscall-count  :   14.851 ± 0.033M/s
>         uprobe-nop     :    3.204 ± 0.005M/s
>         uprobe-push    :    3.040 ± 0.005M/s
>         uprobe-ret     :    1.098 ± 0.003M/s
> -->     uprobe-nop5    :    7.286 ± 0.017M/s
>         uretprobe-nop  :    2.144 ± 0.001M/s
>         uretprobe-push :    2.069 ± 0.002M/s
>         uretprobe-ret  :    0.922 ± 0.000M/s
>         uretprobe-nop5 :    3.487 ± 0.001M/s
>
> I see bit more speed up on Intel (above) compared to AMD. The big nop5
> speed up is partly due to emulating nop5 and partly due to optimization.
>
> The key speed up we do this for is the USDT switch from nop to nop5:
>         uprobe-nop     :    3.160 ± 0.005M/s
>         uprobe-nop5    :    7.286 ± 0.017M/s
>

We've been waiting for this to land for so long, I hope this gets
applied soon...

Once this lands, we can finally start implementing USDT support that
can take advantage of this transparently and with no performance
regression on old kernels.

For the series:

Acked-by: Andrii Nakryiko <andrii@...nel.org>

[...]