Message-ID: <20250520115721.7a4307a2@gandalf.local.home>
Date: Tue, 20 May 2025 11:57:21 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Ingo Molnar <mingo@...nel.org>
Cc: Namhyung Kim <namhyung@...nel.org>, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org, x86@...nel.org,
Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com>, Josh Poimboeuf <jpoimboe@...nel.org>,
Peter Zijlstra <peterz@...radead.org>, Jiri Olsa <jolsa@...nel.org>, Thomas
Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, Andrii
Nakryiko <andrii@...nel.org>, "Jose E. Marchesi" <jemarch@....org>, Indu
Bhagat <indu.bhagat@...cle.com>
Subject: Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding
infrastructure
On Tue, 20 May 2025 11:35:56 +0200
Ingo Molnar <mingo@...nel.org> wrote:
> > Ah, I think I forgot about that. I believe the exit path can also be a
> > faultable path. All it needs is a hook to do the exit. Is there any
> > "task work" clean up on exit? I need to take a look.
>
> Could you please not rush this facility into v6.16? It barely had any
> design review so far, and I'm still not entirely sure about the
> approach.
Hi Ingo,
Note, there has been a lot of discussion on this approach, although it's
been mostly at conferences and in meetings. At GNU Cauldron in September
2024 (before Plumbers), Josh, Mathieu and I discussed it in quite some
detail. I've since been hosting a monthly meeting with engineers from
Google, Red Hat, EfficiOS, Oracle, Microsoft and others (I can invite you to
it if you would like). There are actually two meetings (one in an
Asia-friendly timezone and another in a European-friendly timezone). The
first patches from this went out in October 2024:
https://lore.kernel.org/all/cover.1730150953.git.jpoimboe@kernel.org/
There wasn't much discussion on it, although Peter did reply, and I believe
that we did address all of his concerns.
Josh then changed the approach from what we originally discussed. The
original idea was to simply have each tracer attach a task_work to the task
it wants to trace, thinking that would be good enough. But Mathieu and I
found that this doesn't work: even a perf event cannot handle it, because
it would need to keep track of several tasks that may migrate.
Let me state the goal of this work, as it started from a session at the
2022 Tracing Summit in London. We needed a way to get reliable user space
stack traces without using frame pointers. We discussed various methods,
including using .eh_frame, but they all had issues. We concluded that it
would be nice to have an ORC unwinder (technically it's called a stack
walker) for user space.
Then in Nov 2022, I read about "sframes" which was exactly what we were
looking for:
https://www.phoronix.com/news/GNU-Binutils-SFrame
At FOSDEM 2023, I asked Jose Marchesi (a GCC maintainer) about sframes, and
it just happened to be one of his employees (Indu Bhagat) who created it.
At Kernel Recipes 2023, Brendan Gregg, during his talk, asked if it
wouldn't be great to have user space stack walking without needing to run
everything with frame pointers. I raised my hand and asked if he had heard
about "sframes", which he had not. I explained what it was and he was very
interested. After that talk, Josh Poimboeuf came up to me and asked if he
could implement this, to which I agreed.
One thing that sframes require is that the unwind has to be done in a
faultable context. The sframe sections are like ORC in that they contain
lookup tables, but they live in the user space address space, and because
they can be large, they can't be locked into memory. This brings up the
deferred unwinding aspect.
At the LSFMMBPF 2023 conference, Indu and I ran a session on sframes to
get a better idea of how to implement this. The perf user space stack
tracing was mentioned: if frame pointers are not there, perf may copy
thousands of bytes of the user space stack into the perf buffer. If there's
a long system call, it may do this several times, and because the user
space stack does not change while the task is in the kernel, that is
thousands of bytes of duplicate data. I asked Jiri Olsa "why not just defer
the stack trace and only make a single entry", and I believe he replied "we
are thinking about it", but nothing further came of it.
Now, there's a serious push to get sframes upstream, and that will take a
few steps. These are:
1) Create a new user unwind stack call that is expected to be called in
faultable context. If a tracer (perf, ftrace, BPF or whatever) knows
it's in a faultable context and wants a trace, it can simply ask for it.
It was also asked to have a user space system call that can retrieve this
trace, so that a function in a user space application can see what is
calling it.
2) Create a deferred stack unwinding infrastructure that can be used by
many clients (perf, ftrace, BPF or whatever) and called in any context
(interrupt or NMI).
As the unwind stack call needs to be in a faultable context, and it is
very common to want a kernel stack trace along with a user space stack
trace (and this can happen in an NMI), allow the kernel stack trace to
be taken immediately and delay the user stack trace.
The accounting for this is not easy, as it has a many-to-many
relationship. You could have perf, ftrace and BPF all asking for a
delayed stack trace and they all need to be called. But each could be
wanting a stack trace from a different set of tasks. Keeping track of
which tracer gets a callback for which task is where this patch set
comes in. Or at least just the infrastructure part.
3) Add sframes.
The final part is to get this working with sframes, where a distro
builds all its applications with sframes enabled and then perf, ftrace,
BPF or whatever gets access to them through this interface.
There's quite a bit of momentum behind work happening today that is being
built on the expectation that we get to step 3. There's no use in adding
sframes to applications if the kernel can't read them. The only way to read
them is to have this deferred infrastructure.
We've discussed the current design quite a bit, but until actual users
start building on top of it, all the corner cases may not come out. That's
why I'm suggesting that we just get the basic infrastructure in this merge
window, where it's not fully enabled (there are no users of it), so that
several users can build on top of it in the next merge window and see if
anything breaks.
As perf has the biggest user space ABI, where the delayed stack trace may
be in a different event buffer than where the kernel stack trace occurred
(due to migration), it's the one I'm most concerned about getting right.
Once it is exposed to user space, it can never change. That's the one I do
want to focus on the most. But it shouldn't delay moving the non-user-space
visible aspects of the kernel side forward. If we find an issue there, it
can always be changed because it doesn't affect user space.
I've been testing this on the ftrace side and so far, everything works
fine. But the ftrace side always puts the deferred trace in the same
instance buffer as where it was triggered, as it doesn't depend on a user
space API for that.
I've also been running this on perf and it too is working well. But I don't
know the perf interface well enough to be sure there aren't other corner
cases that I may be missing.
For 6.16, I would just like to get the common infrastructure into the
kernel so that the different tracers have no dependencies on each other.
This patch set adds the changes but does not enable them yet. This way,
perf, ftrace and BPF can all work on building on top of the changes without
needing to share code.
There are only two patches that touch x86 here. I have another patch series
that removes them and implements this for ftrace, but because no
architecture has it enabled, it's just dead code. It would also allow any
architecture to build on top of the infrastructure and enable it. Kind of
like what PREEMPT_RT did.
-- Steve