Message-ID: <CAJZ5v0hAmgjozeX0egBs_ii_zzKXGPsPBUWwmGD+23KD++Rzqw@mail.gmail.com>
Date: Thu, 4 Dec 2025 15:57:41 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Christian Loehle <christian.loehle@....com>
Cc: Samuel Wu <wusamuel@...gle.com>, Huang Rui <ray.huang@....com>, 
	"Gautham R. Shenoy" <gautham.shenoy@....com>, Mario Limonciello <mario.limonciello@....com>, 
	Perry Yuan <perry.yuan@....com>, Jonathan Corbet <corbet@....net>, 
	"Rafael J. Wysocki" <rafael@...nel.org>, Viresh Kumar <viresh.kumar@...aro.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>, Len Brown <lenb@...nel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>, 
	Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, 
	Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>, 
	Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>, 
	KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, 
	Jiri Olsa <jolsa@...nel.org>, Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Ian Rogers <irogers@...gle.com>, 
	Adrian Hunter <adrian.hunter@...el.com>, James Clark <james.clark@...aro.org>, 
	kernel-team@...roid.com, linux-pm@...r.kernel.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, 
	bpf@...r.kernel.org, linux-perf-users@...r.kernel.org
Subject: Re: [PATCH v3 1/2] cpufreq: Replace trace_cpu_frequency with trace_policy_frequency

On Thu, Dec 4, 2025 at 1:49 PM Christian Loehle
<christian.loehle@....com> wrote:
>
> On 12/1/25 20:24, Samuel Wu wrote:
> > The existing cpu_frequency trace event can be verbose, emitting a nearly
> > identical trace event for every CPU in the policy even when their
> > frequencies are identical.
> >
> > Replace the cpu_frequency trace event with the policy_frequency trace
> > event, a more efficient alternative. From the kernel's perspective,
> > emitting a trace event once per policy instead of once per CPU saves
> > some memory and incurs less overhead. From the post-processing
> > perspective, analysis of the trace log is simplified without any loss
> > of information.
> >
> > Signed-off-by: Samuel Wu <wusamuel@...gle.com>
> > ---
> >  drivers/cpufreq/cpufreq.c      | 14 ++------------
> >  drivers/cpufreq/intel_pstate.c |  6 ++++--
> >  include/trace/events/power.h   | 24 +++++++++++++++++++++---
> >  kernel/trace/power-traces.c    |  2 +-
> >  samples/bpf/cpustat_kern.c     |  8 ++++----
> >  samples/bpf/cpustat_user.c     |  6 +++---
> >  tools/perf/builtin-timechart.c | 12 ++++++------
> >  7 files changed, 41 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 4472bb1ec83c..dd3f08f3b958 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -309,8 +309,6 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
> >                                     struct cpufreq_freqs *freqs,
> >                                     unsigned int state)
> >  {
> > -     int cpu;
> > -
> >       BUG_ON(irqs_disabled());
> >
> >       if (cpufreq_disabled())
> > @@ -344,10 +342,7 @@ static void cpufreq_notify_transition(struct cpufreq_policy *policy,
> >               adjust_jiffies(CPUFREQ_POSTCHANGE, freqs);
> >               pr_debug("FREQ: %u - CPUs: %*pbl\n", freqs->new,
> >                        cpumask_pr_args(policy->cpus));
> > -
> > -             for_each_cpu(cpu, policy->cpus)
> > -                     trace_cpu_frequency(freqs->new, cpu);
> > -
> > +             trace_policy_frequency(freqs->new, policy->cpu, policy->cpus);
> >               srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> >                                        CPUFREQ_POSTCHANGE, freqs);
> >
> > @@ -2201,7 +2196,6 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> >                                       unsigned int target_freq)
> >  {
> >       unsigned int freq;
> > -     int cpu;
> >
> >       target_freq = clamp_val(target_freq, policy->min, policy->max);
> >       freq = cpufreq_driver->fast_switch(policy, target_freq);
> > @@ -2213,11 +2207,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
> >       arch_set_freq_scale(policy->related_cpus, freq,
> >                           arch_scale_freq_ref(policy->cpu));
> >       cpufreq_stats_record_transition(policy, freq);
> > -
> > -     if (trace_cpu_frequency_enabled()) {
> > -             for_each_cpu(cpu, policy->cpus)
> > -                     trace_cpu_frequency(freq, cpu);
> > -     }
> > +     trace_policy_frequency(freq, policy->cpu, policy->cpus);
> >
> >       return freq;
> >  }
> > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> > index ec4abe374573..9724b5d19d83 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2297,7 +2297,8 @@ static int hwp_get_cpu_scaling(int cpu)
> >
> >  static void intel_pstate_set_pstate(struct cpudata *cpu, int pstate)
> >  {
> > -     trace_cpu_frequency(pstate * cpu->pstate.scaling, cpu->cpu);
> > +     trace_policy_frequency(pstate * cpu->pstate.scaling, cpu->cpu,
> > +                            cpumask_of(cpu->cpu));
> >       cpu->pstate.current_pstate = pstate;
> >       /*
> >        * Generally, there is no guarantee that this code will always run on
> > @@ -2587,7 +2588,8 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
> >
> >       target_pstate = get_target_pstate(cpu);
> >       target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
> > -     trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
> > +     trace_policy_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu,
> > +                            cpumask_of(cpu->cpu));
> >       intel_pstate_update_pstate(cpu, target_pstate);
> >
> >       sample = &cpu->sample;
> > diff --git a/include/trace/events/power.h b/include/trace/events/power.h
> > index 370f8df2fdb4..317098ffdd5f 100644
> > --- a/include/trace/events/power.h
> > +++ b/include/trace/events/power.h
> > @@ -182,11 +182,29 @@ TRACE_EVENT(pstate_sample,
> >               { PM_EVENT_RECOVER, "recover" }, \
> >               { PM_EVENT_POWEROFF, "poweroff" })
> >
> > -DEFINE_EVENT(cpu, cpu_frequency,
> > +TRACE_EVENT(policy_frequency,
> >
> > -     TP_PROTO(unsigned int frequency, unsigned int cpu_id),
> > +     TP_PROTO(unsigned int frequency, unsigned int cpu_id,
> > +              const struct cpumask *policy_cpus),
> >
> > -     TP_ARGS(frequency, cpu_id)
> > +     TP_ARGS(frequency, cpu_id, policy_cpus),
> > +
> > +     TP_STRUCT__entry(
> > +             __field(u32, state)
> > +             __field(u32, cpu_id)
> > +             __cpumask(cpumask)
> > +     ),
> > +
> > +     TP_fast_assign(
> > +             __entry->state = frequency;
> > +             __entry->cpu_id = cpu_id;
> > +             __assign_cpumask(cpumask, policy_cpus);
> > +     ),
> > +
> > +     TP_printk("state=%lu cpu_id=%lu policy_cpus=%*pb",
> > +               (unsigned long)__entry->state,
> > +               (unsigned long)__entry->cpu_id,
> > +               cpumask_pr_args((struct cpumask *)__get_dynamic_array(cpumask)))
> >  );
> >
> >  TRACE_EVENT(cpu_frequency_limits,
> > diff --git a/kernel/trace/power-traces.c b/kernel/trace/power-traces.c
> > index f2fe33573e54..a537e68a6878 100644
> > --- a/kernel/trace/power-traces.c
> > +++ b/kernel/trace/power-traces.c
> > @@ -16,5 +16,5 @@
> >
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(suspend_resume);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_idle);
> > -EXPORT_TRACEPOINT_SYMBOL_GPL(cpu_frequency);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(policy_frequency);
> >
> > diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c
> > index 7ec7143e2757..f485de0f89b2 100644
> > --- a/samples/bpf/cpustat_kern.c
> > +++ b/samples/bpf/cpustat_kern.c
> > @@ -75,9 +75,9 @@ struct {
> >  } pstate_duration SEC(".maps");
> >
> >  /*
> > - * The trace events for cpu_idle and cpu_frequency are taken from:
> > + * The trace events for cpu_idle and policy_frequency are taken from:
> >   * /sys/kernel/tracing/events/power/cpu_idle/format
> > - * /sys/kernel/tracing/events/power/cpu_frequency/format
> > + * /sys/kernel/tracing/events/power/policy_frequency/format
> >   *
> >   * These two events have same format, so define one common structure.
> >   */
> > @@ -162,7 +162,7 @@ int bpf_prog1(struct cpu_args *ctx)
> >        */
> >       if (ctx->state != (u32)-1) {
> >
> > -             /* record pstate after have first cpu_frequency event */
> > +             /* record pstate after have first policy_frequency event */
> >               if (!*pts)
> >                       return 0;
> >
> > @@ -208,7 +208,7 @@ int bpf_prog1(struct cpu_args *ctx)
> >       return 0;
> >  }
> >
> > -SEC("tracepoint/power/cpu_frequency")
> > +SEC("tracepoint/power/policy_frequency")
> >  int bpf_prog2(struct cpu_args *ctx)
> >  {
> >       u64 *pts, *cstate, *pstate, cur_ts, delta;
> > diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
> > index 356f756cba0d..f7e81f702358 100644
> > --- a/samples/bpf/cpustat_user.c
> > +++ b/samples/bpf/cpustat_user.c
> > @@ -143,12 +143,12 @@ static int cpu_stat_inject_cpu_idle_event(void)
> >
> >  /*
> >   * It's possible to have no any frequency change for long time and cannot
> > - * get ftrace event 'trace_cpu_frequency' for long period, this introduces
> > + * get ftrace event 'trace_policy_frequency' for long period, this introduces
> >   * big deviation for pstate statistics.
> >   *
> >   * To solve this issue, below code forces to set 'scaling_max_freq' to 208MHz
> > - * for triggering ftrace event 'trace_cpu_frequency' and then recovery back to
> > - * the maximum frequency value 1.2GHz.
> > + * for triggering ftrace event 'trace_policy_frequency' and then recovery back
> > + * to the maximum frequency value 1.2GHz.
> >   */
> >  static int cpu_stat_inject_cpu_frequency_event(void)
> >  {
> > diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> > index 22050c640dfa..3ef1a2fd0493 100644
> > --- a/tools/perf/builtin-timechart.c
> > +++ b/tools/perf/builtin-timechart.c
> > @@ -612,10 +612,10 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
> >  }
> >
> >  static int
> > -process_sample_cpu_frequency(struct timechart *tchart,
> > -                          struct evsel *evsel,
> > -                          struct perf_sample *sample,
> > -                          const char *backtrace __maybe_unused)
> > +process_sample_policy_frequency(struct timechart *tchart,
> > +                             struct evsel *evsel,
> > +                             struct perf_sample *sample,
> > +                             const char *backtrace __maybe_unused)
> >  {
> >       u32 state  = evsel__intval(evsel, sample, "state");
> >       u32 cpu_id = evsel__intval(evsel, sample, "cpu_id");
> > @@ -1541,7 +1541,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
> >  {
> >       const struct evsel_str_handler power_tracepoints[] = {
> >               { "power:cpu_idle",             process_sample_cpu_idle },
> > -             { "power:cpu_frequency",        process_sample_cpu_frequency },
> > +             { "power:policy_frequency",     process_sample_policy_frequency },
> >               { "sched:sched_wakeup",         process_sample_sched_wakeup },
> >               { "sched:sched_switch",         process_sample_sched_switch },
> >  #ifdef SUPPORT_OLD_POWER_EVENTS
> > @@ -1804,7 +1804,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
> >       unsigned int backtrace_args_no = ARRAY_SIZE(backtrace_args);
> >
> >       const char * const power_args[] = {
> > -             "-e", "power:cpu_frequency",
> > +             "-e", "power:policy_frequency",
> >               "-e", "power:cpu_idle",
> >       };
> >       unsigned int power_args_nr = ARRAY_SIZE(power_args);
>
> perf timechart seems to do per-CPU reporting though?
> So is this broken by no longer emitting an event per CPU? At least with a simple
> s/cpu_frequency/policy_frequency/ like here.
> Similar for the bpf samples, technically...

This kind of boils down to whether or not tracepoints can be regarded
as ABI and to what extent.

In this particular case, I'm not sure I agree with the stated motivation.

First of all, on systems with one CPU per cpufreq policy (the vast
majority of x86, including AMD, and the ARM systems using the CPPC
driver AFAICS), the "issue" at hand is actually a non-issue and
changing the name of the tracepoint alone would confuse things in user
space IIUC.  Those need to work the way they do today.

On systems with multiple CPUs per cpufreq policy there is some extra
overhead related to the cpu_frequency tracepoint, but if someone
is only interested in the "policy" frequency, they can filter out all
CPUs belonging to the same policy except for one from the traces,
can't they?
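
The post-processing filter meant above could be as simple as the sketch
below (illustrative only, not part of the patch): it keeps cpu_frequency
lines only for the representative CPUs passed on the command line, assumed
here to be the first CPU of each policy, and passes every other trace line
through unchanged. The trace line layout it parses
("cpu_frequency: state=... cpu_id=...") is an assumption based on the
event's printk format.

/*
 * Illustrative sketch only: keep cpu_frequency events for one
 * representative CPU per policy and pass all other trace lines through.
 * Reads raw trace lines on stdin; the representative CPU ids are given
 * as command line arguments (e.g. "./filter 0 4" for policies led by
 * CPUs 0 and 4).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	char line[1024];

	while (fgets(line, sizeof(line), stdin)) {
		char *p = strstr(line, "cpu_frequency: ");
		unsigned int cpu;
		int i, keep = 0;

		/* Not a cpu_frequency event (or unexpected format): keep it. */
		if (!p || sscanf(p, "cpu_frequency: state=%*u cpu_id=%u", &cpu) != 1) {
			fputs(line, stdout);
			continue;
		}

		/* Keep the event only if it belongs to a representative CPU. */
		for (i = 1; i < argc; i++)
			if ((unsigned int)atoi(argv[i]) == cpu)
				keep = 1;

		if (keep)
			fputs(line, stdout);
	}

	return 0;
}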
