Message-Id: <157918584866.29301.6941815715391411338.stgit@devnote2>
Date: Thu, 16 Jan 2020 23:44:09 +0900
From: Masami Hiramatsu <mhiramat@...nel.org>
To: Brendan Gregg <brendan.d.gregg@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Alexei Starovoitov <ast@...nel.org>
Cc: mhiramat@...nel.org, Ingo Molnar <mingo@...nel.org>,
bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
Daniel Borkmann <daniel@...earbox.net>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
"David S . Miller" <davem@...emloft.net>, paulmck@...nel.org,
joel@...lfernandes.org,
"Naveen N . Rao" <naveen.n.rao@...ux.ibm.com>,
Anil S Keshavamurthy <anil.s.keshavamurthy@...el.com>
Subject: [RFT PATCH 00/13] tracing: kprobes: Introduce async unregistration
Hello,
Here is an experimental series of patches for asynchronous kprobe
event disabling and unregistration. This fixes the performance
issue (from the operator's viewpoint) of disabling/removing kprobe
events.
This series is still in the testing phase, so please try it out,
especially with eBPF. You can download this series from here:
https://github.com/mhiramat/linux.git ftrace-nowait
Background
==========
At the last Plumbers, Brendan showed me a demo in which a BPF tracer
took too long (more than 30 seconds?) to exit from tracing.
I had seen a similar issue with systemtap 10 years ago and had fixed
it with an array-based batch API (see commit 9861668f7478 ("kprobes:
add (un)register_kprobes for batch registration")).
I suspected there was a similar issue here.
Investigation
=============
While I was checking the current ftrace-side performance with 256
kprobes (without ftrace-based kprobes)[1], I found there was
another issue.
----
Probes: 256 kprobe_events
Enable events
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
Disable events
real 0m 21.40s
user 0m 0.00s
sys 0m 0.02s
Remove events
real 0m 2.24s
user 0m 0.00s
sys 0m 0.01s
----
OK, removing events took more than 2 seconds for 256 probe events,
but disabling events took 21 seconds, 10 times longer than
removing.
Moreover, since perf-events (the base of the BPF tracer) disables
and removes at once, it will take even longer than that.
I also measured it without kprobe jump optimization
(echo 0 > /proc/sys/debug/kprobe-optimization) and it changed
the results as below.
----
Probes: 256 kprobe_events
Enable events
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
Disable events
real 0m 2.07s
user 0m 0.00s
sys 0m 0.04s
Remove events
real 0m 2.13s
user 0m 0.00s
sys 0m 0.01s
----
I checked the kprobes and tracing code to see what happened, and
found that there might be mutex contention waiting for unoptimization.
Disabling an optimized kprobe is done without any RCU
grace period at the kprobes API level, but it kicks the unoptimizer,
which runs afterwards from a delayed workqueue.
On the other hand, since ftrace needs to remove the disabled
probe from an event (which is on an RCU-protected list), it waits
for an RCU grace period before freeing the event.
Thus, if we disable multiple probes at once, the sequence
(1) disable the kprobe, (2) kick the (un)optimizer, (3) remove the
event from the RCU list and (4) wait for an RCU grace period is
repeated for each event. If the (un)optimizer starts during (4),
the next iteration's (1) is blocked because kprobe_mutex is
already held by the (un)optimizer.
For kprobe unoptimization, the (un)optimizer locks the global
kprobe_mutex and waits for some task to be scheduled on
all CPUs for safety, which takes a while.
So if we repeat the above process, the disabling takes
a long time (and it actually took 21 seconds).
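For illustration, here is a simplified sketch of that per-event
sequence (the comments map to the (1)-(4) steps above); this is not
the actual kernel code path, just a schematic built on the public
disable_kprobe()/synchronize_rcu() APIs:
----
#include <linux/kprobes.h>
#include <linux/rcupdate.h>

/* Schematic only: the loop body stands in for the ftrace internals. */
static void disable_probe_events(struct kprobe **kps, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		disable_kprobe(kps[i]);	/* (1) takes kprobe_mutex; may block
					 *     behind the (un)optimizer */
		/* (2) the (un)optimizer is kicked as delayed work */
		/* (3) the event is unlinked from its RCU-protected list */
		synchronize_rcu();	/* (4) one full grace-period wait,
					 *     repeated for every event */
	}
}
----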
This slowdown comes from the fact that the delay period is not long
enough for kprobes; at the moment it just waits for 5 jiffies, but
it should be longer than the batch disabling process.
Also, since such a batch disabling process involves many
RCU grace period waits, from the operator's viewpoint it takes
a long time. For example, in the above experiment, disabling
256 probes took around 2 seconds. These waits can be replaced
by call_rcu() and workqueues.
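As a rough sketch of that direction (the structure and callback
below are illustrative stand-ins, not the exact code in [5/13]), a
synchronous grace-period wait plus free can become a deferred free:
----
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct file_link_example {		/* illustrative stand-in */
	struct list_head list;
	struct rcu_head rcu;
};

static void free_link_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct file_link_example, rcu));
}

/* Before: the caller blocks for a full grace period per link. */
static void remove_link_sync(struct file_link_example *link)
{
	list_del_rcu(&link->list);
	synchronize_rcu();
	kfree(link);
}

/* After: the RCU core waits for the grace period in the background. */
static void remove_link_async(struct file_link_example *link)
{
	list_del_rcu(&link->list);
	call_rcu(&link->rcu, free_link_rcu);
}
----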
Improvements
============
This series is an attempt to solve these issues by making
several operations asynchronous. I have tried to remove the
redundant RCU synchronizations from kprobes and kprobe events
with asynchronous APIs.
Here is the result.
-----
Probes: 256 kprobe_events
Enable events
real 0m 0.03s
user 0m 0.00s
sys 0m 0.02s
Disable events
real 0m 0.05s
user 0m 0.00s
sys 0m 0.01s
Remove events
real 0m 0.03s
user 0m 0.00s
sys 0m 0.01s
-----
Note that these results are on the ftrace side; BPF will still
take some time to stop tracing, because perf_event doesn't
have an independent event-disabling operation.
Ftrace has 2 operations to stop tracing, disabling events and
removing events, so it is easy to wait for an RCU grace period
between them. On the other hand, a perf event has only a "destroy"
operation, which disables and removes the event at once.
To do that safely, in the destroy operation,
perf_trace_event_unreg() calls tracepoint_synchronize_unregister(),
which is an alias of synchronize_rcu(), for each event.
Thus I guess eBPF still takes O(n) to stop tracing, but it
might take about 1/10 of the previous time with this series.
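Schematically (with placeholder types and helpers, not the real perf
code), the destroy path does the following for each event, which is
why the total stop time stays O(n):
----
#include <linux/tracepoint.h>

struct probe_event_example { int dummy; };	/* illustrative stand-in */

static void disable_probe(struct probe_event_example *ev) { /* ... */ }
static void remove_probe(struct probe_event_example *ev) { /* ... */ }

/* perf has only this single step: disable and remove at once, with a
 * synchronous grace-period wait in between for every event. */
static void destroy_probe_event(struct probe_event_example *ev)
{
	disable_probe(ev);
	tracepoint_synchronize_unregister();	/* ~ synchronize_rcu() */
	remove_probe(ev);
}
----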
Patches
=======
The first 2 patches are a minor fix and a cleanup. I will send those
as a separate series.
[3/13] and [4/13] postpone (and further delay) the kprobe
unoptimizing process. This minimizes kprobe_mutex contention
with the disabling/unregistering process.
[5/13] modifies the ftrace side so that unused event_file_link
objects are freed via call_rcu().
[6/13] enables the kprobe-booster on preemptive kernels as a
preparation. The purpose of this is to replace synchronize_rcu()
with synchronize_rcu_tasks().
[7/13]-[10/13] refine the instruction cache reclaiming process
to be asynchronous.
[11/13] adds asynchronous unregistration APIs for kprobes, and
[12/13] and [13/13] use those APIs from the ftrace side. (A rough
usage sketch follows below.)
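For reference, here is how a caller could use such an asynchronous
API; the name unregister_kprobe_async() below is an illustrative
guess, not necessarily the function actually added in [11/13]:
----
#include <linux/kprobes.h>

/* Hypothetical declaration, for illustration only. */
void unregister_kprobe_async(struct kprobe *kp);

static struct kprobe my_probe = {
	.symbol_name = "do_sys_open",	/* example target symbol */
};

static void stop_tracing_example(void)
{
	disable_kprobe(&my_probe);		/* returns quickly */
	unregister_kprobe_async(&my_probe);	/* RCU waits and insn-slot
						 * reclaim finish in the
						 * background */
}
----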
Thank you,
[1] Here is the test script
------------------
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
# description: Register/unregister many kprobe events
cd /sys/kernel/debug/tracing
# ftrace fentry skip size depends on the machine architecture.
# Currently HAVE_KPROBES_ON_FTRACE is defined on x86 and powerpc64le
case `uname -m` in
x86_64|i[3456]86) OFFS=5;;
ppc64le) OFFS=8;;
*) OFFS=0;;
esac
N=0
echo "Setup up kprobes on first available 256 text symbols"
grep -i " t " /proc/kallsyms | cut -f3 -d" " | grep -v .*\\..* | \
while read i; do
echo p ${i}+${OFFS} >> kprobe_events && N=$((N+1)) &> /dev/null ||:
test $N -eq 256 && break
done
echo -n "Probes: " ; wc -l kprobe_events
echo "Enable events"
echo 1 | time tee events/kprobes/enable > /dev/null
sleep 2
echo "Disable events"
echo 0 | time tee events/kprobes/enable > /dev/null
sleep 2
echo "Remove events"
echo | time tee kprobe_events > /dev/null
------------------
---
Masami Hiramatsu (13):
kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
kprobes: Remove redundant arch_disarm_kprobe() call
kprobes: Postpone optimizer until a bunch of probes (un)registered
kprobes: Make optimizer delay to 1 second
tracing/kprobe: Use call_rcu to defer freeing event_file_link
kprobes: Enable kprobe-booster with CONFIG_PREEMPT=y
kprobes: Use normal list traversal API if a mutex is held
kprobes: Use workqueue for reclaiming kprobe insn cache pages
kprobes: Free kprobe_insn_page asynchronously
kprobes: Make free_*insn_slot() mutex-less
kprobes: Add asynchronous unregistration APIs
tracing/kprobe: Free probe event asynchronously
tracing/kprobe: perf_event: Remove local kprobe event asynchronously
arch/Kconfig | 4 +
arch/ia64/Kconfig | 1
arch/ia64/kernel/kprobes.c | 3
arch/powerpc/kernel/optprobes.c | 1
arch/x86/Kconfig | 1
arch/x86/kernel/kprobes/core.c | 2
include/linux/kprobes.h | 13 ++
kernel/kprobes.c | 286 ++++++++++++++++++++++++++-------------
kernel/trace/trace_dynevent.c | 5 +
kernel/trace/trace_kprobe.c | 85 +++++++++---
kernel/trace/trace_probe.c | 10 +
kernel/trace/trace_probe.h | 1
12 files changed, 293 insertions(+), 119 deletions(-)
--
Masami Hiramatsu (Linaro) <mhiramat@...nel.org>