[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <cover.1737511963.git.jpoimboe@kernel.org>
Date: Tue, 21 Jan 2025 18:30:52 -0800
From: Josh Poimboeuf <jpoimboe@...nel.org>
To: x86@...nel.org
Cc: Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
linux-kernel@...r.kernel.org,
Indu Bhagat <indu.bhagat@...cle.com>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
linux-perf-users@...r.kernel.org,
Mark Brown <broonie@...nel.org>,
linux-toolchains@...r.kernel.org,
Jordan Rome <jordalgo@...a.com>,
Sam James <sam@...too.org>,
linux-trace-kernel@...r.kernel.org,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Jens Remus <jremus@...ux.ibm.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Florian Weimer <fweimer@...hat.com>,
Andy Lutomirski <luto@...nel.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Weinan Liu <wnliu@...gle.com>
Subject: [PATCH v4 00/39] unwind, perf: sframe user space unwinding
This took a bit longer than expected. I fell into some rabbit holes
chasing a number of subtle bugs. I ended up rewriting the deferral code
several times. But I think the end result is much better.
The deferral request has a new interface, which helps make the
implementation MUCH simpler and less fragile. As a bonus it's now
possible for the request implementation to be NMI-safe.
The interface is similar to {task,irq}_work. The caller owns an
unwind_work struct:
struct unwind_work {
struct callback_head work;
unwind_callback_t func;
int pending;
};
For perf, struct unwind_work is embedded in struct perf_event. For
ftrace maybe it would live in task_struct?
The unwind_work can be passed to the following functions:
void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work);
If unwind_deferred_request() returns success, the callback is
guaranteed. If the callback is already pending, it returns an error,
but the returned *cookie is still valid if it's nonzero.
Questions:
- Peter, I'm not sure how well this works with Intel PEBS? This just
uses the original task regs, is that a problem?
- Namhyung, I rebased your perf tool patches on the new missing
feature validation code, do the patches still look sane?
For testing with user space, here are the latest binutils fixes:
1785837a2570 ("ld: fix PR/32297")
938fb512184d ("ld: fix wrong SFrame info for lazy IBT PLT")
47c88752f9ad ("ld: generate SFrame stack trace info for .plt.got")
An out-of-tree glibc patch is also needed -- will attach in a reply.
Code also available at
git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-v4
v4:
- split up patches better [Andrii]
- add callback guarantee [Andrii]
- support multiple non-contiguous elf text segments [Andrii]
- sframe section validation [Andrii]
- x86 compat mode support [Peter]
- implement guard(mmap_read_lock) [Peter]
- synchronize callback with perf event lifetime [Peter]
- detect toolchain sframe support with CONFIG_SFRAME_AS [Jens]
- get vdso working (with updated glibc patches) [Jens]
- rebase perf tool on new missing feature validation code
- brand new deferred interface and implementation
- make unwind_deferred_request() NMI-safe
- sframe debugging infrastructure
- fix some task_work bugs
- enclose multiple user copies in single STAC/CLAC pair for performance
- much banging head on wall, refactoring, simplification
- fix a lot of bugs
Previous revisions
------------------
v3:
https://lore.kernel.org/cover.1730150953.git.jpoimboe@kernel.org
- move the "deferred" logic out of perf and into unwind_user with new
unwind_user_deferred() interface [Steven, Mathieu]
- add more sframe sanity checks [Steven]
- make frame pointers optional depending on arch [Jens]
- fix perf event output [Namhyung]
- include Namhyung's perf tool patches
- enable sframe generation in VDSO
- fix build errors [robot]
v2:
https://lore.kernel.org/cover.1726268190.git.jpoimboe@kernel.org
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain
v1:
https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org
Original description
--------------------
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") v2 format starting with binutils
2.41.
These patches add support for unwinding user space from the kernel using
SFrame with perf. It should be easy to add user unwinding support for
other components like ftrace.
There were two main challenges:
1) Finding .sframe sections in shared/dlopened libraries
The kernel has no visibility to the contents of shared libraries.
This was solved by adding a PR_ADD_SFRAME option to prctl() which
allows the runtime linker to manually provide the in-memory address
of an .sframe section to the kernel.
2) Dealing with page faults
Keeping all binaries' sframe data pinned would likely waste a lot of
memory. Instead, read it from user space on demand. That can't be
done from perf NMI context due to page faults, so defer the unwind to
the next user exit. Since the NMI handler doesn't do exit work,
self-IPI and then schedule task work to be run on exit from the IPI.
Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design. And to Steven for letting me do it ;-)
Josh Poimboeuf (35):
task_work: Fix TWA_NMI_CURRENT error handling
task_work: Fix TWA_NMI_CURRENT race with __schedule()
mm: Add guard for mmap_read_lock
x86/vdso: Fix DWARF generation for getrandom()
x86/asm: Avoid emitting DWARF CFI for non-VDSO
x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
x86/vdso: Enable sframe generation in VDSO
x86/uaccess: Add unsafe_copy_from_user() implementation
unwind_user: Add user space unwinding API
unwind_user: Add frame pointer support
unwind_user/x86: Enable frame pointer unwinding on x86
perf/x86: Rename get_segment_base() and make it global
unwind_user: Add compat mode frame pointer support
unwind_user/x86: Enable compat mode frame pointer unwinding on x86
unwind_user/sframe: Add support for reading .sframe headers
unwind_user/sframe: Store sframe section data in per-mm maple tree
unwind_user/sframe: Add support for reading .sframe contents
unwind_user/sframe: Detect .sframe sections in executables
unwind_user/sframe: Add prctl() interface for registering .sframe
sections
unwind_user/sframe: Wire up unwind_user to sframe
unwind_user/sframe/x86: Enable sframe unwinding on x86
unwind_user/sframe: Remove .sframe section on detected corruption
unwind_user/sframe: Show file name in debug output
unwind_user/sframe: Enable debugging in uaccess regions
unwind_user/sframe: Add .sframe validation option
unwind_user/deferred: Add deferred unwinding interface
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Make unwind deferral requests NMI-safe
perf: Remove get_perf_callchain() 'init_nr' argument
perf: Remove get_perf_callchain() 'crosstask' argument
perf: Simplify get_perf_callchain() user logic
perf: Skip user unwind if !current->mm
perf: Support deferred user callchains
Namhyung Kim (4):
perf tools: Minimal CALLCHAIN_DEFERRED support
perf record: Enable defer_callchain for user callchains
perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
perf tools: Merge deferred user callchains
arch/Kconfig | 40 ++
arch/x86/Kconfig | 3 +
arch/x86/entry/vdso/Makefile | 10 +-
arch/x86/entry/vdso/vdso-layout.lds.S | 5 +-
arch/x86/entry/vdso/vdso32/system_call.S | 10 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 3 +-
arch/x86/entry/vdso/vsgx.S | 19 +-
arch/x86/events/core.c | 10 +-
arch/x86/include/asm/dwarf2.h | 54 +-
arch/x86/include/asm/linkage.h | 29 +-
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/perf_event.h | 2 +
arch/x86/include/asm/uaccess.h | 39 +-
arch/x86/include/asm/unwind_user.h | 61 +++
arch/x86/include/asm/unwind_user_types.h | 17 +
arch/x86/include/asm/vdso.h | 1 -
fs/binfmt_elf.c | 49 +-
include/asm-generic/Kbuild | 2 +
include/asm-generic/unwind_user.h | 24 +
include/asm-generic/unwind_user_types.h | 9 +
include/linux/entry-common.h | 3 +
include/linux/mm_types.h | 3 +
include/linux/mmap_lock.h | 2 +
include/linux/perf_event.h | 15 +-
include/linux/sched.h | 5 +
include/linux/sframe.h | 56 ++
include/linux/unwind_deferred.h | 52 ++
include/linux/unwind_deferred_types.h | 17 +
include/linux/unwind_user.h | 15 +
include/linux/unwind_user_types.h | 36 ++
include/uapi/linux/elf.h | 1 +
include/uapi/linux/perf_event.h | 19 +-
include/uapi/linux/prctl.h | 5 +-
kernel/Makefile | 1 +
kernel/bpf/stackmap.c | 14 +-
kernel/events/callchain.c | 47 +-
kernel/events/core.c | 112 +++-
kernel/fork.c | 14 +
kernel/sys.c | 9 +
kernel/task_work.c | 67 ++-
kernel/unwind/Makefile | 2 +
kernel/unwind/deferred.c | 266 ++++++++++
kernel/unwind/sframe.c | 595 ++++++++++++++++++++++
kernel/unwind/sframe.h | 71 +++
kernel/unwind/sframe_debug.h | 95 ++++
kernel/unwind/user.c | 146 ++++++
mm/init-mm.c | 2 +
tools/include/uapi/linux/perf_event.h | 19 +-
tools/lib/perf/include/perf/event.h | 7 +
tools/perf/Documentation/perf-script.txt | 5 +
tools/perf/builtin-script.c | 92 ++++
tools/perf/util/callchain.c | 24 +
tools/perf/util/callchain.h | 3 +
tools/perf/util/event.c | 1 +
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 39 ++
tools/perf/util/evsel.h | 1 +
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 +-
tools/perf/util/session.c | 78 +++
tools/perf/util/tool.c | 2 +
tools/perf/util/tool.h | 4 +-
64 files changed, 2208 insertions(+), 133 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/sframe.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
create mode 100644 kernel/unwind/sframe_debug.h
create mode 100644 kernel/unwind/user.c
--
2.48.1
Powered by blists - more mailing lists