Message-Id: <20251118120821.0c47ef684b53d5d9a2d6dc83@kernel.org>
Date: Tue, 18 Nov 2025 12:08:21 +0900
From: Masami Hiramatsu (Google) <mhiramat@...nel.org>
To: Steven Rostedt <rostedt@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, Masami
Hiramatsu <mhiramat@...nel.org>, Mark Rutland <mark.rutland@....com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, Andrew Morton
<akpm@...ux-foundation.org>, Peter Zijlstra <peterz@...radead.org>, Thomas
Gleixner <tglx@...utronix.de>, Ian Rogers <irogers@...gle.com>, Namhyung
Kim <namhyung@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, Jiri
Olsa <jolsa@...nel.org>, Douglas Raillard <douglas.raillard@....com>
Subject: Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
Hi Steve,
Thanks for the great idea!
On Mon, 17 Nov 2025 19:29:50 -0500
Steven Rostedt <rostedt@...nel.org> wrote:
>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
> event_cache_misses
> event_cpu_cycles
> func-cache-misses
> func-cpu-cycles
> funcgraph-cache-misses
> funcgraph-cpu-cycles
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventually work with many more perf events than just cache-misses
> and cpu-cycles, using options is not appropriate. Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
> set_event_perf, set_ftrace_perf, set_fgraph_perf
What about adding a global `trigger` action file so that users can
add these "perf" actions by writing to it? It would be something like
stacktrace for events. (Maybe we can move stacktrace/user-stacktrace
into it too.)
For pre-defined/software counters:
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger
For some hardware event sources (see /sys/bus/event_source/devices/):
# echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger
# echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger
If we need to set those counters for tracers and events separately,
we can add `events/trigger` and `tracer-trigger` files.
# echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger
To disable counters, we can use '!', the same as for event triggers.
# echo '!perf:cpu_cycles' > trigger
To add multiple counters, connect them with ':'
(or we could allow appending new perf counters).
This allows users to set perf counter options for each event.
Maybe we should also eventually move the 'stacktrace'/'userstacktrace'
option flags to it.
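To make the proposed syntax concrete, here is a minimal user-space sketch of how such a line could be parsed. Everything in it is an assumption for illustration: the `[!]perf:<counter>[:<counter>...]` shape is only what is suggested above, and `struct perf_trigger`, `parse_perf_trigger()` and the size limits are hypothetical names, not existing kernel code. It does not handle the `name=pmu/config=...` form.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_COUNTERS 8   /* arbitrary limit for this sketch */
#define MAX_NAME     64  /* arbitrary limit for this sketch */

/* Hypothetical parsed form of one line written to the proposed trigger file. */
struct perf_trigger {
	bool remove;                         /* true if the line began with '!' */
	size_t nr;                           /* number of counter names found */
	char names[MAX_COUNTERS][MAX_NAME];  /* e.g. "cpu_cycles" */
};

/*
 * Parse "[!]perf:<counter>[:<counter>...]" (assumed syntax).
 * Returns 0 on success, -1 if the line does not match that shape.
 */
static int parse_perf_trigger(const char *line, struct perf_trigger *out)
{
	const char *p = line;

	memset(out, 0, sizeof(*out));

	if (*p == '!') {
		out->remove = true;
		p++;
	}

	if (strncmp(p, "perf:", 5) != 0)
		return -1;
	p += 5;

	while (*p) {
		size_t len = strcspn(p, ":");

		/* Reject empty names, over-long names and too many counters. */
		if (len == 0 || len >= MAX_NAME || out->nr == MAX_COUNTERS)
			return -1;

		memcpy(out->names[out->nr], p, len);
		out->names[out->nr][len] = '\0';
		out->nr++;

		p += len;
		if (*p == ':') {
			p++;
			if (*p == '\0')  /* trailing ':' with nothing after it */
				return -1;
		}
	}

	return out->nr ? 0 : -1;
}
```

A caller would strip the trailing newline that `echo` adds before passing the line in.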
>
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
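For readers following along, a minimal sketch of splitting such a 64-bit value into its two fields. The description above does not say which end holds the type, so placing the 8-bit type in the upper bits and the counter in the lower 56 bits is an assumption here, and the helper names are hypothetical.

```c
#include <stdint.h>

#define PERF_VAL_BITS  56
#define PERF_VAL_MASK  ((UINT64_C(1) << PERF_VAL_BITS) - 1)

/* Assumed layout: bits 63..56 = event type, bits 55..0 = raw counter. */
static inline uint8_t perf_entry_type(uint64_t raw)
{
	return (uint8_t)(raw >> PERF_VAL_BITS);
}

static inline uint64_t perf_entry_value(uint64_t raw)
{
	return raw & PERF_VAL_MASK;
}

/* Inverse of the two helpers above; the counter is truncated to 56 bits. */
static inline uint64_t perf_entry_pack(uint8_t type, uint64_t value)
{
	return ((uint64_t)type << PERF_VAL_BITS) | (value & PERF_VAL_MASK);
}
```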
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
> is_vmalloc_addr() {
> /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
> /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
> }
Just a style question: Would this mean the first line is for function entry
and the second one is for function return?
>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
> 2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
> 416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the value is 56 bits and
> when calculating the difference between start and end do something like:
>
> if (start > end)
> end |= 1ULL << 56;
>
> delta = end - start;
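On the tooling side, the same wrap handling can be sketched as a small helper. This is only an illustration under the assumption that both inputs are already masked to 56 bits and that the counter wrapped at most once between the two samples; `perf_delta56()` is a hypothetical name. Masking the 64-bit difference gives the same result as the `end |= 1ULL << 56` adjustment quoted above.

```c
#include <stdint.h>

#define PERF_CNT_BITS  56
#define PERF_CNT_MASK  ((UINT64_C(1) << PERF_CNT_BITS) - 1)

/*
 * Difference between two 56-bit raw counter samples, allowing for a
 * single wrap-around. Unsigned subtraction is modulo 2^64, so masking
 * the result reduces it modulo 2^56.
 */
static inline uint64_t perf_delta56(uint64_t start, uint64_t end)
{
	return (end - start) & PERF_CNT_MASK;
}
```

With the cache_misses values from the function_graph example, `perf_delta56(2869004572, 2869006049)` yields 1477.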
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
> cpu_cycles:1
> cache_misses:2
> [..]
Looks good to me. I think the predefined events from `perf list`
will be there and have fixed numbers.
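As a rough idea of what tooling would do with such a list, here is a minimal sketch that parses one `name:id` line. The `name:id` shape is taken from the example above; the file itself does not exist yet, and `parse_event_id()` is a hypothetical helper. The id is limited to 0..255 because it has to fit in the 8-bit type field.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "name:id" line, e.g. "cpu_cycles:1".
 * A single trailing newline is tolerated.
 * Returns 0 on success, -1 on malformed input or an id above 255.
 */
static int parse_event_id(const char *line, char *name, size_t name_sz,
			  unsigned int *id)
{
	const char *colon = strrchr(line, ':');
	char *end;
	unsigned long val;
	size_t len;

	if (!colon || colon == line)
		return -1;

	len = (size_t)(colon - line);
	if (len >= name_sz)
		return -1;

	errno = 0;
	val = strtoul(colon + 1, &end, 10);
	if (errno || end == colon + 1 ||
	    (*end != '\0' && *end != '\n') || val > 255)
		return -1;

	memcpy(name, line, len);
	name[len] = '\0';
	*id = (unsigned int)val;
	return 0;
}
```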
Thank you,
>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
> # cd /sys/kernel/tracing
> # echo 1 > options/event_cpu_cycles
> # echo 1 > options/event_cache_misses
> # echo 1 > events/syscalls/enable
> # cat trace
> [..]
> bash-995 [007] ..... 98.255252: sys_write -> 0x2
> bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
> bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
> bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
> bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
> bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
> bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
> bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
> bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
> bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
> bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
> bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
> bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
>
>
> Steven Rostedt (3):
> tracing: Add perf events
> ftrace: Add perf counters to function tracing
> fgraph: Add perf counters to function graph tracer
>
> ----
> include/linux/trace_recursion.h | 5 +-
> kernel/trace/trace.c | 153 ++++++++++++++++++++++++++++++++-
> kernel/trace/trace.h | 38 ++++++++
> kernel/trace/trace_entries.h | 13 +++
> kernel/trace/trace_event_perf.c | 162 +++++++++++++++++++++++++++++++++++
> kernel/trace/trace_functions.c | 124 +++++++++++++++++++++++++--
> kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
> kernel/trace/trace_output.c | 70 +++++++++++++++
> 8 files changed, 670 insertions(+), 12 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@...nel.org>