Message-ID: <20250829212023.4ab9506f@gandalf.local.home>
Date: Fri, 29 Aug 2025 21:20:23 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Arnaldo Carvalho de Melo <arnaldo.melo@...il.com>, Steven Rostedt
<rostedt@...nel.org>, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org, x86@...nel.org,
Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com>, Josh Poimboeuf <jpoimboe@...nel.org>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>, Jiri
Olsa <jolsa@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
Andrii Nakryiko <andrii@...nel.org>, Indu Bhagat <indu.bhagat@...cle.com>,
"Jose E. Marchesi" <jemarch@....org>, Beau Belgrave
<beaub@...ux.microsoft.com>, Jens Remus <jremus@...ux.ibm.com>, Andrew
Morton <akpm@...ux-foundation.org>, Florian Weimer <fweimer@...hat.com>,
Sam James <sam@...too.org>, Kees Cook <kees@...nel.org>, "Carlos O'Donell"
<codonell@...hat.com>
Subject: Re: [PATCH v6 5/6] tracing: Show inode and device major:minor in
deferred user space stacktrace
On Fri, 29 Aug 2025 17:45:39 -0700
Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Fri, 29 Aug 2025 at 16:09, Steven Rostedt <rostedt@...dmis.org> wrote:
> >
> > Perf does do things differently, as I believe it processes the events as it
> > reads from the kernel (Arnaldo correct me if I'm wrong).
> >
> > For the tracefs code, the raw data gets saved directly into a file, and the
> > processing happens after the fact. If a tool is recording, it still needs a
> > way to know what those hash values mean, after the tracing is complete.
>
> But the data IS ALL THERE.
But only in the kernel. How do I expose it?
>
> Really. That's the point.
>
> It's there in the same file, it just needs those mmap events that
> whoever parses it - whether it be perf, or somebody reading some
What mmap events are you talking about? Nothing happens to be tracing mmap
events. An interrupt triggered, we want a user space stack trace for that
interrupt, it records the kernel stack trace and a cookie that gets matched
to the user stack trace. It is then deferred until it goes back to user
space and the deferred infrastructure does a callback to the tracer with
the list of addresses that represent the user space call stack.
We do a vma_lookup() to get the vma of each of those addresses. Now we make
some hash that represents that vma for each address. But there has been no
event that maps this vma to what the file is. And the vma's in these
stack traces are a subset of all the vma's. When the user finally gets
around to reading them, the vmas could be long gone. How is user space
supposed to find out what files they belong to?
Do we need to record most events to grab all the vma's and the files they
belong to? Note, one of the constraints to tracing is the buffer size. We
don't want to be recording information that we don't care about.
> tracefs code - sees the mmap data, sees the cookies (hash values) that
> implies, and then matches those cookies with the subsequent trace
> entry cookies.
That was basically what I was doing with the vma hash table. To print out
the vmas as soon as a new one is referenced. It created the event needed,
and only for the vmas we care about.
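
A minimal user-space sketch of that "print a vma the first time it is referenced" idea follows. It is not the code from the patch: the table size, the names first_reference() and record_frame(), the record layout, and the sample hashes and paths are all hypothetical, and the hash is taken as given rather than computed from a real vma.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SEEN_SLOTS 64 /* hypothetical size, must be a power of two */

struct seen_slot {
	bool used;
	uint32_t vma_hash;
};

static struct seen_slot seen[SEEN_SLOTS];

/*
 * Return true the first time a hash is seen. Also return true if the
 * table is full, so a mapping record is never silently skipped.
 */
static bool first_reference(uint32_t vma_hash)
{
	for (unsigned int i = 0; i < SEEN_SLOTS; i++) {
		struct seen_slot *s = &seen[(vma_hash + i) & (SEEN_SLOTS - 1)];

		if (!s->used) {
			s->used = true;
			s->vma_hash = vma_hash;
			return true;
		}
		if (s->vma_hash == vma_hash)
			return false;
	}
	return true;
}

/*
 * Record one user stack frame. A mapping record (hash -> file) is
 * emitted only when the vma has not been referenced before.
 * Returns the number of mapping records emitted (0 or 1).
 */
static int record_frame(uint32_t vma_hash, const char *file,
			unsigned long offset)
{
	int mapping_records = 0;

	assert(file != NULL);
	if (first_reference(vma_hash)) {
		printf("map   hash=%08x file=%s\n",
		       (unsigned int)vma_hash, file);
		mapping_records = 1;
	}
	printf("frame hash=%08x offset=0x%lx\n",
	       (unsigned int)vma_hash, offset);
	return mapping_records;
}

/* Three frames over two vmas should produce two mapping records. */
int demo_emit_once(void)
{
	int maps = 0;

	maps += record_frame(0x1111, "/usr/lib/libc.so.6", 0x2a3c0);
	maps += record_frame(0x2222, "/usr/bin/app", 0x1040);
	maps += record_frame(0x1111, "/usr/lib/libc.so.6", 0x9f010);
	return maps;
}
```

The point of the sketch is only the buffer-space trade-off: mapping records are written for the subset of vmas that actually show up in a stack trace, and each one is written once.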
>
> But what it does *NOT* need is munmap() events.
This wouldn't be recording munmap events. It would use the unmap event to
callback and remove the vma from the hash when they happened, so that if
they get reused the new ones would be printed. It's no different if we use
munmap or mmap. I could hook into the mmap event instead and check if it is
in the vma hash and if so, either reprint it, or remove it so if the vma is
in a call stack it would get reprinted.
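
The removal side can be sketched the same way. This is again a user-space illustration under assumptions, not the patch itself: needs_print(), forget_vma(), the fixed-size array and the sample hash are hypothetical stand-ins for the vma hash and the unmap callback described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PRINTED 32 /* hypothetical capacity */

static uint32_t printed[MAX_PRINTED]; /* hashes already printed */
static size_t nr_printed;

/*
 * Return true if the vma still needs its mapping printed, and remember
 * it. If the array is full the hash is not stored, so it is simply
 * printed again next time.
 */
static bool needs_print(uint32_t vma_hash)
{
	for (size_t i = 0; i < nr_printed; i++)
		if (printed[i] == vma_hash)
			return false;
	if (nr_printed < MAX_PRINTED)
		printed[nr_printed++] = vma_hash;
	return true;
}

/*
 * Stand-in for the unmap callback: forget the vma, so that a later
 * mapping that reuses it is printed again. Nothing is recorded here.
 */
static void forget_vma(uint32_t vma_hash)
{
	for (size_t i = 0; i < nr_printed; i++) {
		if (printed[i] == vma_hash) {
			/* Order does not matter: fill the hole with the last entry. */
			printed[i] = printed[--nr_printed];
			return;
		}
	}
}

/* Print once, skip the repeat, then print again after an unmap. */
int demo_reuse(void)
{
	int prints = 0;

	prints += needs_print(0xabc); /* first reference: print */
	prints += needs_print(0xabc); /* already printed: skip */
	forget_vma(0xabc);            /* unmap hook fires */
	prints += needs_print(0xabc); /* reused: print again */
	assert(nr_printed == 1);
	return prints;
}
```

Hooking mmap instead of munmap would call the same forget_vma() step from the other side; in both cases the hook only updates the table and writes nothing to the ring buffer.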
Writing the file for every mmap seems to be a waste of ring buffer space if
the majority of them are not going to be in a stack trace.
>
> What it does *NOT* need is translating each hash value for each entry
> by the kernel, when whoever reads the file can just remember and
> re-create it in user space.
What's reading the files? The applications that are being traced?
>
> I'm done arguing. You're not listening, so I'll just let you know that
I am listening. I'm just not understanding you.
> I'm not pulling garbage. I've had enough garbage in tracefs, I'm still
> smarting from having to fix up the horrendous VFS interfaces, I'm not
> going to pull anything that messes up this too.
I know you keep bringing up the tracefs eventfs issue. Hey, I asked for
help with that when I first started it. I was basically told by some of the
VFS folks (I'm not going to name names) that "don't worry, if it works it's
fine". I was very worried that I wasn't doing it right. And it wasn't until
you got involved that anyone told me that using dentry
outside of VFS was a bad idea. Most of our arguing then was because I
didn't understand that. That also led to the "garbage" code you had to fix
up.
So keep bringing that up. It just shows how much tribal knowledge is
needed to work in the kernel. Heck, the VFS folks are still arguing about
how to handle things like kernfs. Which is similar to the eventfs issue.
And that boils down to things like kernfs, eventfs and procfs having a
fundamental difference from all other file systems: they are a file
interface to the kernel itself, and not some external source. I realized
this during our arguments over eventfs. You do a write or read from a file,
and unlike other file systems, those actions trigger kernel functions
outside of vfs. But this is another topic altogether, and I only brought it
up because you did.
-- Steve