linux-kernel - Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer

Message-ID: <20251118112452.61c7de68@gandalf.local.home>
Date: Tue, 18 Nov 2025 11:24:52 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Steven Rostedt <rostedt@...nel.org>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, Masami Hiramatsu <mhiramat@...nel.org>,
 Mark Rutland <mark.rutland@....com>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Peter Zijlstra <peterz@...radead.org>, Thomas
 Gleixner <tglx@...utronix.de>, Ian Rogers <irogers@...gle.com>, Arnaldo
 Carvalho de Melo <acme@...nel.org>, Jiri Olsa <jolsa@...nel.org>, Douglas
 Raillard <douglas.raillard@....com>
Subject: Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer

On Mon, 17 Nov 2025 23:25:56 -0800
Namhyung Kim <namhyung@...nel.org> wrote:

> > As for the perf event that is triggered, it is currently a dynamic array of
> > 64-bit values. Each value is broken up into 8 bits for the type of perf
> > event and 56 bits for the counter. It only writes a per-CPU raw
> > counter and does not do any math. That would need to be done by any
> > post processing.
> 
> If you want to keep the perf events per CPU, you may need to consider CPU
> migrations for the func-graph case.  Otherwise userspace may not
> calculate the diff from the beginning correctly.

That's easily solved by user space also adding a sched_switch perf event
trigger. ;-)
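
As an illustration of the post processing this leaves to user space, here
is a minimal Python sketch that splits one of the 64-bit values described
above into its type and counter fields. The bit positions are an
assumption (type in the top 8 bits, counter in the low 56 bits); the
message only gives the field widths. The names unpack_event and
counter_delta are hypothetical, and treating the counter as wrapping
modulo 2**56 is likewise an assumption.

```python
TYPE_BITS = 8
COUNTER_BITS = 56
TYPE_MASK = (1 << TYPE_BITS) - 1
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def unpack_event(value):
    """Split one 64-bit entry into (event_type, counter).

    Assumption: the type occupies the top 8 bits and the raw counter
    the low 56 bits. Only the widths are given in the description.
    """
    event_type = (value >> COUNTER_BITS) & TYPE_MASK
    counter = value & COUNTER_MASK
    return event_type, counter


def counter_delta(start, end):
    """Difference between two raw readings of the same per-CPU counter.

    Assumption: the 56-bit field wraps modulo 2**56, so the subtraction
    is masked to stay correct across a single wrap.
    """
    return (end - start) & COUNTER_MASK
```

The delta is only meaningful when both readings come from the same CPU,
which is why the sched_switch readings matter.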


> 
> Just FYI, I did a similar thing (like the fgraph case) in uftrace and I
> grouped two related events to produce a metric.
> 
>   $ uftrace -T a@...d=pmu-cycle ~/tmp/abc
>   # DURATION     TID      FUNCTION
>               [ 521741] | main() {
>               [ 521741] |   a() {
>               [ 521741] |     /* read:pmu-cycle (cycles=482 instructions=38) */
>               [ 521741] |     b() {
>               [ 521741] |       c() {
>      0.659 us [ 521741] |         getpid();
>      1.600 us [ 521741] |       } /* c */
>      1.780 us [ 521741] |     } /* b */
>               [ 521741] |     /* diff:pmu-cycle (cycles=+7361 instructions=+3955 IPC=0.54) */
>     24.485 us [ 521741] |   } /* a */
>     34.797 us [ 521741] | } /* main */
> 
> It reads cycles and instructions events (specified by 'pmu-cycle') at
> entry and exit of the given function ('a') and shows the diff with the
> metric IPC.

I originally tried to implement this, but it became more complex than I
wanted in the kernel, as I would then need to add a hook in sched_switch,
record the perf event counter there, and keep track of it for every
task. That would require memory to be saved somewhere. I started adding it
to the function graph shadow stack and then just decided that it would be
so much easier to let user space figure it out.

By running the function graph tracer and showing the start and end counters,
as well as the counters at the sched_switch trace event, user space could do
all the math and accounting, and the code in the kernel can remain simple.
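
A rough sketch of what that user space accounting could look like, in
Python. The record layout (dicts with "kind", "tid", "prev", "next" and
"counter" keys) is hypothetical and is not the trace format from the
patches; it assumes the records for one counter type are already decoded
and sorted by timestamp, and that the counter field is 56 bits wide and
wraps.

```python
COUNTER_MASK = (1 << 56) - 1  # assumption: counter field wraps at 56 bits


def function_counter_delta(records, tid):
    """Sum a per-CPU counter over the time `tid` spent on a CPU between
    a function's entry and exit.

    `records` is a hypothetical, time-ordered list of dicts:
      {"kind": "entry" or "exit", "tid": ..., "counter": ...}
      {"kind": "switch", "prev": ..., "next": ..., "counter": ...}
    Returns None if no matching exit is seen.
    """
    total = 0
    slice_start = None  # counter reading at the start of the current slice
    inside = False

    for rec in records:
        kind = rec["kind"]
        if kind == "entry" and rec["tid"] == tid:
            inside = True
            total = 0
            slice_start = rec["counter"]
        elif kind == "exit" and rec["tid"] == tid and inside:
            if slice_start is not None:
                total += (rec["counter"] - slice_start) & COUNTER_MASK
            return total
        elif kind == "switch" and inside:
            if rec["prev"] == tid and slice_start is not None:
                # Task scheduled out: close the slice on this CPU.
                total += (rec["counter"] - slice_start) & COUNTER_MASK
                slice_start = None
            elif rec["next"] == tid:
                # Task scheduled in, possibly on another CPU: start a
                # new slice from that CPU's counter reading.
                slice_start = rec["counter"]
    return None
```

Because each slice is closed and reopened at a sched_switch reading, a
migration to a CPU with an unrelated raw counter value does not corrupt
the total, and time spent scheduled out is not charged to the function.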

-- Steve