Message-ID: <CAEf4BzarhiBHAQXECJzP5e-z0fbSaTpfQNPaSXwdgErz2f0vUA@mail.gmail.com>
Date: Wed, 16 Oct 2024 12:35:21 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Peter Zijlstra <peterz@...radead.org>, Will Deacon <will@...nel.org>,
Catalin Marinas <catalin.marinas@....com>, Mark Rutland <mark.rutland@....com>
Cc: Linux trace kernel <linux-trace-kernel@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
Jiri Olsa <jolsa@...nel.org>, Oleg Nesterov <oleg@...hat.com>,
Masami Hiramatsu <mhiramat@...nel.org>, Liao Chang <liaochang1@...wei.com>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
open list <linux-kernel@...r.kernel.org>,
"linux-perf-use." <linux-perf-users@...r.kernel.org>, Kernel Team <kernel-team@...a.com>
Subject: The state of uprobes work and logistics
Hello,
I wanted to provide a bit of context and tie together a few separate
work streams (across a few separate kernel trees), all revolving
around uprobe improvements. There are a bunch of them, and I'm sure
it's hard to keep track of them all. Hopefully I can also get input
from Peter and the arm64 maintainers on some specific questions I ask
below. Thank you in advance!
In short, over the last few months there has been a lot of activity
around fixing and improving uprobes. All of this is driven by
increased and more varied use of uprobes/uretprobes in production
settings. Uprobe performance is **very** important, and yes, we do
have real use cases that reach millions of uprobe/uretprobe triggers
per second, unfortunately. So every small bit of performance and
scalability improvement helps. No, this isn't just some nerdy perf
optimization work (I've been asked this a few times, so I thought I'd
emphasize it again).
So, we've already landed a bunch of work, mainly (not an exhaustive list):
- various cleanups, API improvements, and bug fixes from Oleg
Nesterov ([0], [1]). These simplified internal APIs and were a
prerequisite for the rest of the work;
- changes from me to refcounting and to RCU-protecting uprobe
lifetime ([2]). This improved single-threaded performance somewhat,
but mainly it significantly improved scalability when many CPUs
trigger lots of uprobes (a toy sketch of the pattern follows this
list);
- an ARM64-specific optimization from Liao Chang that emulates the
NOP instruction in uprobes ([3]). This change alone gives a 2x (!)
speedup for USDT tracing use cases *on ARM64* (we already have this
optimization on x86-64); the idea is also sketched after this list;
- slightly earlier work from Jiri Olsa ([4]) adding the uretprobe()
syscall, which gives a +30% speedup.
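To give a feel for the RCU-protected lifetime idea from [2], here is
a toy userspace model of the pattern using liburcu (compile with
-lurcu). It is only an illustration of the general approach, not the
kernel code; all names (current_probe, handle_hit, etc.) are made up.
The point is that the hot path avoids any atomic refcount on the
shared uprobe and only enters an RCU read-side section, while
unregistration waits for a grace period before freeing:

#include <urcu.h>       /* userspace RCU: rcu_read_lock() and friends */
#include <stdio.h>
#include <stdlib.h>

struct uprobe {                 /* stand-in for the kernel's struct uprobe */
	unsigned long offset;
};

static struct uprobe *current_probe;    /* made-up one-probe "registry" */

/* Hot path (uprobe hit): no atomic refcount, just an RCU read section. */
static void handle_hit(void)
{
	struct uprobe *u;

	rcu_read_lock();
	u = rcu_dereference(current_probe);
	if (u)
		printf("hit uprobe at offset 0x%lx\n", u->offset);
	rcu_read_unlock();
}

/* Slow path (unregistration): unpublish, wait a grace period, free. */
static void unregister_probe(void)
{
	struct uprobe *u = current_probe;  /* sole updater, plain read is OK */

	rcu_assign_pointer(current_probe, NULL);
	synchronize_rcu();      /* all in-flight handle_hit() sections done */
	free(u);
}

int main(void)
{
	struct uprobe *u = calloc(1, sizeof(*u));

	u->offset = 0x1234;
	rcu_register_thread();
	rcu_assign_pointer(current_probe, u);
	handle_hit();
	unregister_probe();
	rcu_unregister_thread();
	return 0;
}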
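And here is a similarly hedged toy model of the NOP emulation fast
path from [3]: if the probed instruction is a NOP, there is nothing to
execute, so the handler can just advance the PC past it instead of
doing the expensive out-of-line single-step round trip. The struct and
function names below are illustrative, not the kernel's; only the NOP
encoding is the real arm64 one:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define AARCH64_INSN_NOP 0xd503201fu    /* arm64 NOP encoding */

struct fake_regs { uint64_t pc; };

/* Returns true if the instruction was emulated (no single-step needed). */
static bool try_emulate(uint32_t insn, struct fake_regs *regs)
{
	if (insn == AARCH64_INSN_NOP) {
		regs->pc += 4;  /* NOP has no effect: just skip over it */
		return true;
	}
	return false;           /* fall back to the slow single-step path */
}

int main(void)
{
	struct fake_regs regs = { .pc = 0x400000 };

	if (try_emulate(AARCH64_INSN_NOP, &regs))
		printf("emulated, pc is now 0x%llx\n",
		       (unsigned long long)regs.pc);
	return 0;
}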
And there are a few more outstanding changes:
- Jiri Olsa's uprobe "session" support ([5]). This is less
performance-focused, but it is important functionality in its own
right. I'm calling it out here because the first two patches are pure
uprobe-internal changes, and I believe they should go into
tip/perf/core to avoid conflicts with the rest of the pending uprobe
changes.
Peter, do you mind applying those two and creating a stable tag for
bpf-next to pull? We'll apply the rest of Jiri's series to
bpf-next/master.
- Liao Chang's ARM64-specific STP instruction emulation support
([6]). This one gives a 2x (!) improvement for the common case where
an STP instruction is the first instruction of a traced user function
(similar to NOP for USDTs); a toy sketch of the idea follows this
item.
ARM64 maintainers (cc'ed Catalin, Will, and Mark), can you please
take another look? This one was a bit more controversial, but
hopefully there is a way to massage it into something acceptable that
doesn't introduce unnecessary slowdowns (there were some concerns
about memory ordering/visibility, which hopefully don't apply to the
uprobe case). It's an important improvement, and I'd really appreciate
it if we can make progress here, thank you!
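For illustration, a toy userspace model of what emulating the typical
prologue instruction "stp x29, x30, [sp, #-16]!" boils down to: do the
two stores where the instruction would have done them, pre-decrement
the stack pointer, and advance the PC. This is not the actual patch
([6] decodes the instruction and uses proper user-memory accessors);
all names here are made up:

#include <stdint.h>
#include <stdio.h>

struct fake_regs {
	uint64_t x29, x30;      /* frame pointer, link register */
	uint64_t sp, pc;
};

/* Emulate pre-indexed "stp x29, x30, [sp, imm]!" */
static void emulate_stp_pre(struct fake_regs *regs, int64_t imm)
{
	uint64_t *slot;

	regs->sp += imm;                        /* pre-index: sp -= 16 */
	slot = (uint64_t *)(uintptr_t)regs->sp; /* kernel would use put_user() */
	slot[0] = regs->x29;
	slot[1] = regs->x30;
	regs->pc += 4;                          /* instruction fully emulated */
}

int main(void)
{
	uint64_t stack[16] = { 0 };
	struct fake_regs regs = {
		.x29 = 0xdead, .x30 = 0xbeef,
		.sp = (uint64_t)(uintptr_t)&stack[16], .pc = 0x400000,
	};

	emulate_stp_pre(&regs, -16);
	printf("sp=0x%llx pc=0x%llx frame: 0x%llx 0x%llx\n",
	       (unsigned long long)regs.sp, (unsigned long long)regs.pc,
	       (unsigned long long)((uint64_t *)(uintptr_t)regs.sp)[0],
	       (unsigned long long)((uint64_t *)(uintptr_t)regs.sp)[1]);
	return 0;
}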
- my speculative VMA-to-uprobe lookup series ([7]). This makes entry
uprobe performance scale linearly with the number of CPUs (the
ultimate goal of the uprobe scalability work); the gist of the
speculation scheme is sketched after this item.
I think it's ready to go in. It has an **implicit** dependency on
Christian Brauner's recent FMODE_BACKING change, for which he provided
a stable tag. Peter, do you have any remaining concerns, or can this
also be merged soon?
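The gist, as a toy userspace model: read the mapping information
without taking the lock, then validate with a sequence counter that no
concurrent modification happened, falling back to the locked path on
any doubt. This is only a model of the general seqcount pattern, not
the series itself (which, among other things, needs data_race()-style
annotations for the lockless reads); all names (vma_seq,
locked_lookup, ...) are made up:

#include <stdatomic.h>
#include <stdio.h>

struct mapping { unsigned long start, end; int uprobe_id; };

static _Atomic unsigned int vma_seq;    /* even = stable, odd = updating */
static struct mapping the_mapping = { 0x1000, 0x2000, 42 };

/* Writer: bump to odd, modify, bump back to even (think: mmap writers). */
static void update_mapping(unsigned long s, unsigned long e, int id)
{
	atomic_fetch_add_explicit(&vma_seq, 1, memory_order_release);
	the_mapping = (struct mapping){ s, e, id };
	atomic_fetch_add_explicit(&vma_seq, 1, memory_order_release);
}

/* Stand-in for the slow path that would take mmap_lock. */
static int locked_lookup(unsigned long addr)
{
	return addr >= the_mapping.start && addr < the_mapping.end ?
	       the_mapping.uprobe_id : -1;
}

static int speculative_lookup(unsigned long addr)
{
	unsigned int seq = atomic_load_explicit(&vma_seq, memory_order_acquire);
	struct mapping m;

	if (seq & 1)
		return locked_lookup(addr);     /* update in flight */
	m = the_mapping;                        /* speculative lockless read */
	if (atomic_load_explicit(&vma_seq, memory_order_acquire) != seq)
		return locked_lookup(addr);     /* we raced, fall back */
	return addr >= m.start && addr < m.end ? m.uprobe_id : -1;
}

int main(void)
{
	printf("uprobe id: %d\n", speculative_lookup(0x1800));
	update_mapping(0x3000, 0x4000, 7);
	printf("uprobe id: %d\n", speculative_lookup(0x3800));
	return 0;
}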
- another patch set of mine, switching the uretprobe fast path to
SRCU (with a timeout) ([8]). This makes return uprobes (uretprobes)
linearly scalable in the common case (again, the ultimate scalability
goal); the hybrid protection scheme is sketched after this item.
I haven't gotten much feedback on this one and would love some
thorough review. It is an important counterpart to the speculative
VMA-to-uprobe lookup series; both are needed in practice.
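The hybrid scheme, again as a toy single-threaded model rather than
the actual patch: on function entry the uretprobe instance protects
its uprobe with a cheap (S)RCU-style read section; if the probed
function runs past a deadline, a timer upgrades that to a real
refcount so the grace period isn't pinned forever; on return we drop
whichever protection is still held. All names below are made up:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct uprobe {
	_Atomic int refcount;   /* slow-path protection */
};

struct ret_instance {
	struct uprobe *uprobe;
	_Atomic bool srcu_held; /* fast path: still inside (S)RCU? */
};

/* Function entry: cheap path, no atomics on the shared uprobe. */
static void ret_instance_enter(struct ret_instance *ri, struct uprobe *u)
{
	ri->uprobe = u;
	atomic_store(&ri->srcu_held, true);     /* models srcu_read_lock() */
}

/* Timer callback once the deadline passes: take a real reference and
 * let the (S)RCU grace period move on. */
static void ret_instance_timeout(struct ret_instance *ri)
{
	if (atomic_exchange(&ri->srcu_held, false)) {
		atomic_fetch_add(&ri->uprobe->refcount, 1);
		/* ... and models srcu_read_unlock() here */
	}
}

/* Function return: drop whichever protection we still hold. */
static void ret_instance_exit(struct ret_instance *ri)
{
	if (!atomic_exchange(&ri->srcu_held, false))
		atomic_fetch_sub(&ri->uprobe->refcount, 1);
	/* else: common case, models a plain srcu_read_unlock() */
}

int main(void)
{
	struct uprobe u = { .refcount = 0 };
	struct ret_instance ri;

	ret_instance_enter(&ri, &u);
	ret_instance_timeout(&ri);      /* pretend the deadline fired */
	ret_instance_exit(&ri);
	printf("refcount back to %d\n", atomic_load(&u.refcount));
	return 0;
}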
- a patch set from Liao Chang dropping unnecessary siglock usage in
uprobes ([9]). This one removes yet another lock for the less common
case (at least on x86-64) of single-stepped uprobes (where the probed
instruction can't be emulated).
This one needs a rebase, but it was already acked by Oleg. Liao,
please prioritize the rebase and send v4 ASAP, so this doesn't get
lost.
As you can see, lots of stuff still needs to land, and most of it is
already in good shape. I'd love to hear thoughts from the relevant
people called out above, thank you!
[0] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
[1] https://lore.kernel.org/linux-trace-kernel/20240929144201.GA9429@redhat.com/
[2] https://lore.kernel.org/linux-trace-kernel/20240903174603.3554182-1-andrii@kernel.org/
[3] https://lore.kernel.org/linux-trace-kernel/20240909071114.1150053-1-liaochang1@huawei.com/
[4] https://lore.kernel.org/linux-trace-kernel/20240523121149.575616-1-jolsa@kernel.org/
[5] https://lore.kernel.org/bpf/20241015091050.3731669-1-jolsa@kernel.org/
[6] https://lore.kernel.org/linux-trace-kernel/20240910060407.1427716-1-liaochang1@huawei.com/
[7] https://lore.kernel.org/linux-trace-kernel/20241010205644.3831427-1-andrii@kernel.org/
[8] https://lore.kernel.org/linux-trace-kernel/20241008002556.2332835-1-andrii@kernel.org/
[9] https://lore.kernel.org/linux-trace-kernel/20240815014629.2685155-1-liaochang1@huawei.com/
-- Andrii