[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a59509f0-5888-4663-9e82-98e27fc3e813@linux.ibm.com>
Date: Fri, 24 Oct 2025 15:58:20 +0200
From: Jens Remus <jremus@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>, Steven Rostedt <rostedt@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
bpf@...r.kernel.org, x86@...nel.org,
Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Josh Poimboeuf <jpoimboe@...nel.org>, Ingo Molnar <mingo@...nel.org>,
Jiri Olsa <jolsa@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Andrii Nakryiko <andrii@...nel.org>,
Indu Bhagat <indu.bhagat@...cle.com>,
"Jose E. Marchesi" <jemarch@....org>,
Beau Belgrave <beaub@...ux.microsoft.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Florian Weimer <fweimer@...hat.com>, Sam James <sam@...too.org>,
Kees Cook <kees@...nel.org>, "Carlos O'Donell" <codonell@...hat.com>,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>
Subject: Re: [PATCH v16 0/4] perf: Support the deferred unwinding
infrastructure
Hello Peter!
On 10/24/2025 12:41 PM, Peter Zijlstra wrote:
> On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
>>
>>> Trouble is, pretty much every unwind is 510 entries long -- this cannot
>>> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
>>> tired to find it just now. I'll try again tomorrow.
>>
>> PEBKAC
>
> Anyway, while staring at this, I noted that the perf userspace unwind
> code has a few bits that are missing from the new shiny thing.
>
> How about something like so? This add an optional arch specific unwinder
> at the very highest priority (bit 0) and uses that to do a few extra
> bits before disabling itself and falling back to whatever lower prio
> unwinder to do the actual unwinding.
unwind user sframe does not need any of this special handling, because
it knows for each IP whether the SP or FP is the CFA base register
and whether the FP and RA have been saved.
Isn't this actually specific to unwind user fp? If the IP is at
function entry, then the FP has not been setup yet. I think unwind user
fp could handle this using an arch specific is_uprobe_at_func_entry() to
determine whether to use a new frame_fp_entry instead of frame_fp. For
x86 the following frame_fp_entry should work, if I am not wrong:
#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws) \
.cfa_off = 1*(ws), \
.ra_off = -1*(ws), \
.fp_off = 0, \
.use_fp = false,
Following roughly outlines the required changes:
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
-static int unwind_user_next_fp(struct unwind_user_state *state)
+static int unwind_user_next_common(struct unwind_user_state *state,
+ const struct unwind_user_frame *frame,
+ struct pt_regs *regs)
@@ -71,6 +83,7 @@ static int unwind_user_next_common(struct unwind_user_state *state,
state->sp = sp;
if (frame->fp_off)
state->fp = fp;
+ state->topmost = false;
return 0;
}
@@ -154,6 +167,7 @@ static int unwind_user_start(struct unwind_user_state *state)
state->sp = user_stack_pointer(regs);
state->fp = frame_pointer(regs);
state->ws = compat_user_mode(regs) ? sizeof(int) : sizeof(long);
+ state->topmost = true;
return 0;
}
static int unwind_user_next_fp(struct unwind_user_state *state)
{
const struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME(state->ws)
};
const struct unwind_user_frame fp_entry_frame = {
ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
};
struct pt_regs *regs = task_pt_regs(current);
if (state->topmost && is_uprobe_at_func_entry(regs))
return unwind_user_next_common(state, &fp_entry_frame, regs);
else
return unwind_user_next_common(state, &fp_frame, regs);
}
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
@@ -43,6 +43,7 @@ struct unwind_user_state {
unsigned int ws;
enum unwind_user_type current_type;
unsigned int available_types;
+ bool topmost;
bool done;
};
What do you think?
> +++ b/arch/x86/kernel/unwind_user.c
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/unwind_user.h>
> +#include <linux/uprobes.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/task_stack.h>
> +#include <asm/processor.h>
> +#include <asm/tlbflush.h>
> +
> +int unwind_user_next_arch(struct unwind_user_state *state)
> +{
> + struct pt_regs *regs = task_pt_regs(current);
> +
> + /* only once, on the first iteration */
> + state->available_types &= ~UNWIND_USER_TYPE_ARCH;
> +
> + /* We don't know how to unwind VM86 stacks. */
> + if (regs->flags & X86_VM_MASK) {
> + state->done = true;
> + return 0;
> + }
> +
> + /*
> + * If we are called from uprobe handler, and we are indeed at the very
> + * entry to user function (which is normally a `push %rbp` instruction,
> + * under assumption of application being compiled with frame pointers),
> + * we should read return address from *regs->sp before proceeding
> + * to follow frame pointers, otherwise we'll skip immediate caller
> + * as %rbp is not yet setup.
> + */
> + if (!is_uprobe_at_func_entry(regs))
> + return -EINVAL;
> +
> +#ifdef CONFIG_COMPAT
> + if (state->ws == sizeof(int)) {
> + unsigned int retaddr;
> + int ret = get_user(retaddr, (unsigned int __user *)regs->sp);
> + if (ret)
> + return ret;
> +
> + state->ip = retaddr;
> + return 0;
> + }
> +#endif
> + unsigned long retaddr;
> + int ret = get_user(retaddr, (unsigned long __user *)regs->sp);
> + if (ret)
> + return ret;
> +
> + state->ip = retaddr;
> + return 0;
> +}
Above would then not be needed, as the unwind_user_next_fp() logic would
do the right thing.
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
> else
> return regs->sp <= ret->stack;
> }
> +
> +/*
> + * Heuristic-based check if uprobe is installed at the function entry.
> + *
> + * Under assumption of user code being compiled with frame pointers,
> + * `push %rbp/%ebp` is a good indicator that we indeed are.
> + *
> + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> + * If we get this wrong, captured stack trace might have one extra bogus
> + * entry, but the rest of stack trace will still be meaningful.
> + */
> +bool is_uprobe_at_func_entry(struct pt_regs *regs)
> +{
> + struct arch_uprobe *auprobe;
> +
> + if (!current->utask)
> + return false;
> +
> + auprobe = current->utask->auprobe;
> + if (!auprobe)
> + return false;
> +
> + /* push %rbp/%ebp */
> + if (auprobe->insn[0] == 0x55)
> + return true;
> +
> + /* endbr64 (64-bit only) */
> + if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
> + return true;
> +
> + return false;
> +}
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@...ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
Powered by blists - more mailing lists