Message-ID: <cover.1730150953.git.jpoimboe@kernel.org>
Date: Mon, 28 Oct 2024 14:47:47 -0700
From: Josh Poimboeuf <jpoimboe@...nel.org>
To: x86@...nel.org
Cc: Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
linux-kernel@...r.kernel.org,
Indu Bhagat <indu.bhagat@...cle.com>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
linux-perf-users@...r.kernel.org,
Mark Brown <broonie@...nel.org>,
linux-toolchains@...r.kernel.org,
Jordan Rome <jordalgo@...a.com>,
Sam James <sam@...too.org>,
linux-trace-kernel@...r.kernel.org,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Jens Remus <jremus@...ux.ibm.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Florian Weimer <fweimer@...hat.com>,
Andy Lutomirski <luto@...nel.org>
Subject: [PATCH v3 00/19] unwind, perf: sframe user space unwinding
This has all the changes discussed in v2, plus VDSO sframe support and
Namhyung's perf tool patches (see detailed changelog below).
I did quite a bit of testing and it seems to work well. It still needs
some binutils and glibc patches which I'll send in a reply.
Questions for perf experts:

- Is the perf_event lifetime managed correctly or do we need to do
  something to ensure it exists in unwind_user_task_work()?

  Or alternatively is the original perf_event even needed in
  unwind_user_task_work() or can a new one be created on demand?

- Is --call-graph=sframe needed for consistency?

- Should perf use the context cookie? Note that because the callback
  is usually only called once for multiple NMIs in the same entry
  context, it's possible for the PERF_RECORD_CALLCHAIN_DEFERRED event
  to arrive *before* some of the corresponding kernel events. The
  context cookie disambiguates the corner cases.
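
To illustrate the cookie question, here is a minimal sketch in C of how a
consumer could pair kernel samples with a deferred user callchain by cookie
rather than by arrival order. It is not the perf tool code from this series:
the record layouts, the field names and merge_deferred() are all hypothetical
and much simpler than the real perf ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified records; the real perf ABI differs. */
struct sample {
	uint64_t cookie;	/* entry-context cookie carried by the kernel sample */
	int user_chain;		/* id of the attached user callchain, -1 if none yet */
};

struct deferred {
	uint64_t cookie;	/* cookie of the entry context that was unwound */
	int chain_id;		/* id of the deferred user callchain */
};

/*
 * Attach a deferred user callchain to every buffered sample that carries
 * the same cookie and has no user callchain yet.  The position of a
 * sample in the buffer is irrelevant, so a sample recorded after the
 * deferred record was emitted still matches.
 *
 * Returns the number of samples that were completed.
 */
static size_t merge_deferred(struct sample *samples, size_t n,
			     const struct deferred *d)
{
	size_t matched = 0;

	assert(d != NULL);

	for (size_t i = 0; i < n; i++) {
		if (samples[i].cookie == d->cookie && samples[i].user_chain < 0) {
			samples[i].user_chain = d->chain_id;
			matched++;
		}
	}
	return matched;
}
```

With buffered samples carrying cookies 7, 8 and 7, a deferred record for
cookie 7 completes the first and third samples and leaves the second alone,
regardless of where the deferred record sat in the stream.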
Based on tip/master.
Also at:
git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-v3
v3:
- move the "deferred" logic out of perf and into unwind_user with new
unwind_user_deferred() interface [Steven, Mathieu]
- add more sframe sanity checks [Steven]
- make frame pointers optional depending on arch [Jens]
- fix perf event output [Namhyung]
- include Namhyung's perf tool patches
- enable sframe generation in VDSO
- fix build errors [robot]
v2: https://lore.kernel.org/cover.1726268190.git.jpoimboe@kernel.org
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain
v1: https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space the GNU assembler
can generate the SFrame ("Simple Frame") v2 format starting with binutils
2.41.
These patches add support for unwinding user space from the kernel using
SFrame with perf. It should be easy to add user unwinding support for
other components like ftrace.
There were two main challenges:

1) Finding .sframe sections in shared/dlopened libraries

   The kernel has no visibility into the contents of shared libraries.
   This was solved by adding a PR_ADD_SFRAME option to prctl() which
   allows the runtime linker to manually provide the in-memory address
   of an .sframe section to the kernel.

2) Dealing with page faults

   Keeping all binaries' sframe data pinned would likely waste a lot of
   memory. Instead, read it from user space on demand. That can't be
   done from perf NMI context due to page faults, so defer the unwind to
   the next user exit. Since the NMI handler doesn't do exit work,
   self-IPI and then schedule task work to be run on exit from the IPI.
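
The deferral in 2) can be modeled with a small user-space simulation in C.
This is only a sketch of the idea, not code from the series: struct task_sim,
request_deferred_unwind() and exit_to_user() are hypothetical names, the
cookie is a plain counter, and the actual unwind is replaced by a counter
increment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task state; names are illustrative, not the kernel's. */
struct task_sim {
	bool unwind_pending;	/* deferred unwind already queued for this entry? */
	uint64_t cookie;	/* identifies the current kernel-entry context */
	uint64_t next_cookie;	/* source of fresh cookies */
	unsigned int unwinds_done; /* how many times the deferred work ran */
};

/*
 * "NMI" side: user memory can't be touched here, so only record that an
 * unwind is wanted.  Repeated requests within the same entry context
 * share one cookie and one pending unwind.
 */
static uint64_t request_deferred_unwind(struct task_sim *t)
{
	assert(t != NULL);

	if (!t->unwind_pending) {
		t->cookie = ++t->next_cookie;
		t->unwind_pending = true;
	}
	return t->cookie;
}

/*
 * "Exit to user" side: page faults are fine here, so this is where the
 * .sframe data would be read and the stack walked.  It runs at most once
 * per entry context, no matter how many requests were made.
 */
static void exit_to_user(struct task_sim *t)
{
	assert(t != NULL);

	if (!t->unwind_pending)
		return;

	t->unwinds_done++;	/* stand-in for the real unwind */
	t->unwind_pending = false;
}
```

Two requests followed by one exit_to_user() return the same cookie and
trigger a single unwind. A request made after that exit gets a new cookie,
which is what lets a consumer tell the two entry contexts apart.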
Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design. And to Steven for letting me do it ;-)
Josh Poimboeuf (15):
  x86/vdso: Fix DWARF generation for getrandom()
  x86/asm: Avoid emitting DWARF CFI for non-VDSO
  x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
  x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
  x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
  x86/vdso: Enable sframe generation in VDSO
  unwind: Add user space unwinding API
  unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
  unwind: Introduce sframe user space unwinding
  unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME
  unwind: Add deferred user space unwinding API
  perf: Remove get_perf_callchain() 'init_nr' argument
  perf: Remove get_perf_callchain() 'crosstask' argument
  perf: Simplify get_perf_callchain() user logic
  perf: Add deferred user callchains

Namhyung Kim (4):
  perf tools: Minimal CALLCHAIN_DEFERRED support
  perf record: Enable defer_callchain for user callchains
  perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
  perf tools: Merge deferred user callchains

 arch/Kconfig                              |  14 +
 arch/x86/Kconfig                          |   2 +
 arch/x86/entry/vdso/Makefile              |   6 +-
 arch/x86/entry/vdso/vdso-layout.lds.S     |   5 +-
 arch/x86/entry/vdso/vdso32/system_call.S  |  10 +-
 arch/x86/entry/vdso/vgetrandom-chacha.S   |   3 +-
 arch/x86/entry/vdso/vsgx.S                |  19 +-
 arch/x86/include/asm/dwarf2.h             |  40 ++-
 arch/x86/include/asm/linkage.h            |  29 +-
 arch/x86/include/asm/mmu.h                |   2 +-
 arch/x86/include/asm/unwind_user.h        |  11 +
 arch/x86/include/asm/vdso.h               |   1 -
 fs/binfmt_elf.c                           |  35 +-
 include/linux/entry-common.h              |   3 +
 include/linux/mm_types.h                  |   3 +
 include/linux/perf_event.h                |  12 +-
 include/linux/sched.h                     |   5 +
 include/linux/sframe.h                    |  41 +++
 include/linux/unwind_user.h               |  99 ++++++
 include/uapi/linux/elf.h                  |   1 +
 include/uapi/linux/perf_event.h           |  22 +-
 include/uapi/linux/prctl.h                |   3 +
 kernel/Makefile                           |   1 +
 kernel/bpf/stackmap.c                     |  14 +-
 kernel/events/callchain.c                 |  47 +--
 kernel/events/core.c                      |  70 +++-
 kernel/fork.c                             |  14 +
 kernel/sys.c                              |  11 +
 kernel/unwind/Makefile                    |   2 +
 kernel/unwind/sframe.c                    | 380 ++++++++++++++++++++++
 kernel/unwind/sframe.h                    | 215 ++++++++++++
 kernel/unwind/user.c                      | 318 ++++++++++++++++++
 mm/init-mm.c                              |   6 +
 tools/include/uapi/linux/perf_event.h     |  22 +-
 tools/lib/perf/include/perf/event.h       |   7 +
 tools/perf/Documentation/perf-script.txt  |   5 +
 tools/perf/builtin-script.c               |  92 ++++++
 tools/perf/util/callchain.c               |  24 ++
 tools/perf/util/callchain.h               |   3 +
 tools/perf/util/event.c                   |   1 +
 tools/perf/util/evlist.c                  |   1 +
 tools/perf/util/evlist.h                  |   1 +
 tools/perf/util/evsel.c                   |  32 +-
 tools/perf/util/evsel.h                   |   1 +
 tools/perf/util/machine.c                 |   1 +
 tools/perf/util/perf_event_attr_fprintf.c |   1 +
 tools/perf/util/sample.h                  |   3 +-
 tools/perf/util/session.c                 |  78 +++++
 tools/perf/util/tool.c                    |   2 +
 tools/perf/util/tool.h                    |   4 +-
 50 files changed, 1634 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/include/asm/unwind_user.h
 create mode 100644 include/linux/sframe.h
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/user.c

--
2.47.0