Message-ID: <CAEf4BzaUy+oxMk9guMX06z-MLeUJMmf8TvzoLveO7ukBFaJiqg@mail.gmail.com>
Date: Thu, 15 Aug 2024 10:07:02 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Mark Rutland <mark.rutland@....com>
Cc: Liao Chang <liaochang1@...wei.com>, catalin.marinas@....com, will@...nel.org,
mhiramat@...nel.org, oleg@...hat.com, peterz@...radead.org,
puranjay@...nel.org, ast@...nel.org, andrii@...nel.org, xukuohai@...wei.com,
revest@...omium.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
bpf@...r.kernel.org
Subject: Re: [PATCH] arm64: insn: Simulate nop and push instruction for better
uprobe performance
On Thu, Aug 15, 2024 at 2:58 AM Mark Rutland <mark.rutland@....com> wrote:
>
> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> > As Andrii pointed out, the uprobe/uretprobe selftest bench runs into a
> > counterintuitive result: the nop and push variants are much slower than
> > the ret variant [0]. The root cause lies in arch_probe_analyse_insn(),
> > which excludes 'nop' and 'stp' from the emulatable instructions list.
> > This forces the kernel to return to userspace and execute them
> > out-of-line, then trap back into the kernel to run the uprobe callback
> > functions. This leads to a significant performance overhead compared to
> > the 'ret' variant, which is already emulated.
>
> I appreciate this might be surprising, but does it actually matter
> outside of a microbenchmark?
I'll leave the ARM parts to Liao, but yes, it matters a lot. Admittedly,
my main focus right now is x86-64, but ARM64 keeps growing in
importance.
But on x86-64 we specifically added emulation of push/pop operations
(a while ago) so we could mitigate the performance degradation for the
common case of installing uprobes on (user space) function entry. That
was a significant speed-up, because we avoided one extra interrupt hop
between kernel and user space, which is a big chunk of the uprobe
activation cost. And in the BPF case, the uprobe program logic is
usually pretty lightweight, so the uprobe triggering overhead is still
very noticeable in practice.
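
To make the cost structure concrete, here is a rough C paraphrase of
the generic breakpoint path in kernel/events/uprobes.c (helpers marked
"hypothetical" are my names, not the kernel's; this is a sketch of the
flow, not the literal code):

/*
 * Sketch of the flow in kernel/events/uprobes.c:handle_swbp().
 * struct uprobe is private to that file, so imagine this living
 * there. Helpers marked "hypothetical" are illustrative names.
 */
static void handle_swbp_sketch(struct pt_regs *regs)
{
        struct uprobe *uprobe = find_uprobe_sketch(regs);  /* hypothetical */

        handler_chain(uprobe, regs);  /* run consumers, e.g. BPF programs */

        /* If the arch can emulate the probed insn right here, the
         * whole probe hit costs a single trap. */
        if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
                return;

        /* Otherwise: go back to user space, single-step an
         * out-of-line copy of the insn, and trap back into the
         * kernel a second time -- the extra hop discussed above. */
        setup_xol_step_sketch(uprobe, regs);               /* hypothetical */
}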
So if there is anything that can be done to improve performance on
ARM64 for similar function entry situations, that would be greatly
appreciated by many bpftrace and BPF users at the very least.
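
For the nop case specifically, the simulation is about as cheap as it
gets. A minimal sketch, along the lines of the existing simulate_*
helpers in arch/arm64/kernel/probes/simulate-insn.c (the name is mine,
this is not the merged code):

#include <linux/types.h>
#include <asm/insn.h>    /* AARCH64_INSN_SIZE */
#include <asm/ptrace.h>  /* struct pt_regs, instruction_pointer_set() */

/* A NOP has no architectural effect, so "simulating" it is just
 * stepping the PC past the probed instruction -- no round trip to
 * user space for an out-of-line single-step. */
static void simulate_nop_sketch(u32 opcode, long addr, struct pt_regs *regs)
{
        instruction_pointer_set(regs,
                                instruction_pointer(regs) + AARCH64_INSN_SIZE);
}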
>
> > Typically a uprobe is installed on a 'nop' for USDT, or on a function
> > entry that starts with the instruction 'stp x29, x30, [sp, #imm]!' to
> > push lr and fp onto the stack, in kernel and userspace binaries alike.
>
> Function entry doesn't always start with an STP; these days it's often a
> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> the compiler), it could be any arbitrary instruction. This might happen
> to be the common case today, but there are certainly codebases where it
> is not.
>
[...]
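
FWIW, for the 'stp x29, x30, [sp, #imm]!' prologue case, the emulation
conceptually boils down to doing the push from kernel context with a
user-space store. A minimal sketch of the idea (illustrative only, not
Liao's patch; per the A64 encoding, the signed imm7 field sits at bits
[21:15] and is scaled by 8 for 64-bit registers):

#include <linux/bitops.h>   /* sign_extend64() */
#include <linux/types.h>
#include <linux/uaccess.h>  /* copy_to_user() */
#include <asm/insn.h>       /* AARCH64_INSN_SIZE */
#include <asm/ptrace.h>     /* struct pt_regs, instruction_pointer_set() */

/* Emulate "stp x29, x30, [sp, #imm]!": write fp/lr to the user
 * stack and update sp/pc in the saved regs, instead of bouncing
 * out to user space to single-step the original instruction. */
static bool simulate_stp_fp_lr_sketch(u32 opcode, struct pt_regs *regs)
{
        long offset = sign_extend64((opcode >> 15) & 0x7f, 6) * 8;
        u64 new_sp = regs->sp + offset;  /* pre-index writeback */
        u64 frame[2] = { regs->regs[29], regs->regs[30] };

        if (copy_to_user((void __user *)new_sp, frame, sizeof(frame)))
                return false;            /* fault: fall back to XOL step */

        regs->sp = new_sp;
        instruction_pointer_set(regs,
                                instruction_pointer(regs) + AARCH64_INSN_SIZE);
        return true;
}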