linux-kernel - Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20251118120821.0c47ef684b53d5d9a2d6dc83@kernel.org>
Date: Tue, 18 Nov 2025 12:08:21 +0900
From: Masami Hiramatsu (Google) <mhiramat@...nel.org>
To: Steven Rostedt <rostedt@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, Masami
 Hiramatsu <mhiramat@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Peter Zijlstra <peterz@...radead.org>, Thomas
 Gleixner <tglx@...utronix.de>, Ian Rogers <irogers@...gle.com>, Namhyung
 Kim <namhyung@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, Jiri
 Olsa <jolsa@...nel.org>, Douglas Raillard <douglas.raillard@....com>
Subject: Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer

Hi Steve,

Thanks for the great idea!

On Mon, 17 Nov 2025 19:29:50 -0500
Steven Rostedt <rostedt@...nel.org> wrote:

> 
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
> 
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
> 
>   event_cache_misses
>   event_cpu_cycles
>   func-cache-misses
>   func-cpu-cycles
>   funcgraph-cache-misses
>   funcgraph-cpu-cycles
> 
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
> 
> As this will eventual work with many more perf events than just cache-misses
> and cpu-cycles , using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
> 
>   set_event_perf, set_ftrace_perf, set_fgraph_perf

What about adding a global `trigger` action file so that user can
add these "perf" actions to write into it. It is something like
stacktrace for events. (Maybe we can move stacktrace/user-stacktrace
into it too)

For pre-defined/software counters:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger

For some hardware event sources (see /sys/bus/event_source/devices/):
# echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger

echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger

If we need to set those counters for tracers and events separately,
we can add `events/trigger` and `tracer-trigger` files.

echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger

To disable counters, we can use '!' as same as event triggers.

echo !perf:cpu_cycles > trigger

To add more than 2 counters, connect it with ':'.
(or, we will allow to append new perf counters)
This allows user to set perf counter options for each events.

Maybe we also should move 'stacktrace'/'userstacktrace' option
flags to it too eventually.



> 
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
> 
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
> 
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
> 
>              is_vmalloc_addr() {
>                /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
>                /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
>              }

Just a style question: Would this mean the first line is for function entry
and the second one is function return?

> 
> User space would subtract 2869006049 - 2869004572 = 1477
> 
> Then 56 bits should be plenty.
> 
>   2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
>   416 / 4 = 104
> 
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the vale is 56 bits and
> when calculating the difference between start and end do something like:
> 
>   if (start > end)
>       end |= 1ULL << 56;
> 
>   delta = end - start;
> 
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
> 
>   cpu_cycles:1
>   cach_misses:2
>   [..]

Looks good to me. I think pre-definied events of `perf list`
will be there and have fixed numbers.

Thank you,

> 
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
> 
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
> 
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
> 
>   # cd /sys/kernel/tracing
>   # echo 1 > options/event_cpu_cycles
>   # echo 1 > options/event_cache_misses
>   # echo 1 > events/syscalls/enable
>   # cat trace
> [..]
>             bash-995     [007] .....    98.255252: sys_write -> 0x2
>             bash-995     [007] .....    98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
>             bash-995     [007] .....    98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
>             bash-995     [007] .....    98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
>             bash-995     [007] .....    98.255305: sys_dup2 -> 0x1
>             bash-995     [007] .....    98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
>             bash-995     [007] .....    98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
>             bash-995     [007] .....    98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
>             bash-995     [007] .....    98.255352: sys_fcntl -> 0x1
>             bash-995     [007] .....    98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
>             bash-995     [007] .....    98.255361: sys_close(fd: 0xa)
>             bash-995     [007] .....    98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
>             bash-995     [007] .....    98.255369: sys_close -> 0x0
> 
> 
> 
> Comments welcomed.
> 
> 
> Steven Rostedt (3):
>       tracing: Add perf events
>       ftrace: Add perf counters to function tracing
>       fgraph: Add perf counters to function graph tracer
> 
> ----
>  include/linux/trace_recursion.h      |   5 +-
>  kernel/trace/trace.c                 | 153 ++++++++++++++++++++++++++++++++-
>  kernel/trace/trace.h                 |  38 ++++++++
>  kernel/trace/trace_entries.h         |  13 +++
>  kernel/trace/trace_event_perf.c      | 162 +++++++++++++++++++++++++++++++++++
>  kernel/trace/trace_functions.c       | 124 +++++++++++++++++++++++++--
>  kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
>  kernel/trace/trace_output.c          |  70 +++++++++++++++
>  8 files changed, 670 insertions(+), 12 deletions(-)


-- 
Masami Hiramatsu (Google) <mhiramat@...nel.org>