Message-ID: <CAP-5=fV5gtSQOPkOLaAauBcwjgs1ohADs5O3GF+bJOx2tSynwQ@mail.gmail.com>
Date: Wed, 16 Jul 2025 17:04:36 -0700
From: Ian Rogers <irogers@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>, "Mi, Dapeng1" <dapeng1.mi@...el.com>,
Andi Kleen <ak@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>
Cc: Thomas Falcon <thomas.falcon@...el.com>, Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>, Kan Liang <kan.liang@...ux.intel.com>,
Ben Gainey <ben.gainey@....com>, James Clark <james.clark@...aro.org>,
Howard Chu <howardchu95@...il.com>, Weilin Wang <weilin.wang@...el.com>,
Levi Yun <yeoreum.yun@....com>, "Dr. David Alan Gilbert" <linux@...blig.org>,
Zhongqiu Han <quic_zhonhan@...cinc.com>, Blake Jones <blakejones@...gle.com>,
Yicong Yang <yangyicong@...ilicon.com>, Anubhav Shelat <ashelat@...hat.com>,
Thomas Richter <tmricht@...ux.ibm.com>,
Jean-Philippe Romain <jean-philippe.romain@...s.st.com>, Song Liu <song@...nel.org>,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 12/12] perf parse-events: Support user CPUs mixed with threads/processes
On Wed, Jul 16, 2025 at 1:28 PM Namhyung Kim <namhyung@...nel.org> wrote:
>
> On Fri, Jun 27, 2025 at 12:24:17PM -0700, Ian Rogers wrote:
> > Prior to this change, counting events system-wide with a specified
> > CPU worked:
> > ```
> > $ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' -a sleep 1
> >
> > Performance counter stats for 'system wide':
> >
> > 59,393,419,099 msr/tsc/
> > 33,927,965,927 msr/tsc,cpu=cpu_core/
> > 25,465,608,044 msr/tsc,cpu=cpu_atom/
> > ```
> >
> > However, when counting with a process, the user-specified CPUs were
> > ignored and the counts covered all CPUs:
> > ```
> > $ perf stat -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10
> > 10.1: Basic parsing test : Ok
> > 10.2: Parsing without PMU name : Ok
> > 10.3: Parsing with PMU name : Ok
> >
> > Performance counter stats for 'perf test -F 10':
> >
> > 59,233,549 msr/tsc/
> > 59,227,556 msr/tsc,cpu=cpu_core/
> > 59,224,053 msr/tsc,cpu=cpu_atom/
> > ```
> >
> > Make the handling of CPU maps during event parsing clearer. When an
> > event is parsed, creating an evsel, its cpus should be either the
> > PMU's cpumask or the user-specified CPUs.
> >
> > Update perf_evlist__propagate_maps so that it doesn't clobber the
> > user-specified CPUs, and make its behavior clearer: first, fix up
> > missing cpumasks; next, perform sanity checks and adjustments against
> > the global evlist CPU requests and the PMU's CPUs, including
> > simplifying to the "any CPU" (-1) value; finally, remove the event if
> > its cpumask is empty.
> >
> > So that events are opened with both a CPU and a thread, change stat's
> > create_perf_stat_counter to pass both.
> >
> > With the change things are fixed:
> > ```
> > $ perf stat --no-scale -e 'msr/tsc/,msr/tsc,cpu=cpu_core/,msr/tsc,cpu=cpu_atom/' perf test -F 10
> > 10.1: Basic parsing test : Ok
> > 10.2: Parsing without PMU name : Ok
> > 10.3: Parsing with PMU name : Ok
> >
> > Performance counter stats for 'perf test -F 10':
> >
> > 63,704,975 msr/tsc/
> > 47,060,704 msr/tsc,cpu=cpu_core/ (4.62%)
> > 16,640,591 msr/tsc,cpu=cpu_atom/ (2.18%)
> > ```
> >
> > However, note the "--no-scale" option is used. This is necessary as
> > the running time for the event on the counter isn't the same as the
> > enabled time because the thread doesn't necessarily run on the CPUs
> > specified for the counter. All counter values are scaled with:
> >
> > scaled_value = value * time_enabled / time_running
> >
> > and so without --no-scale the scaled_value becomes very large. This
> > problem already exists on hybrid systems for the same reason. Here are
> > 2 runs of the same code with an instructions event that counts the
> > same on both types of core; there is no real multiplexing happening on
> > the event:
>
> This is unfortunate. The event is in a task context but also has a CPU
> constraint. So it's not schedulable on other CPUs.
>
> A problem is that it cannot distinguish the real multiplexing..
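>
> For reference, a task+CPU event is just perf_event_open() called with
> both pid > 0 and cpu >= 0; a minimal sketch of such a call (the helper
> name is illustrative, the semantics are from perf_event_open(2)):
> ```
> #include <linux/perf_event.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
> /* pid > 0 and cpu >= 0: count for the given task, but only while it
>  * runs on the given CPU, so it cannot be scheduled elsewhere. */
> static int open_task_cpu_event(struct perf_event_attr *attr, pid_t pid, int cpu)
> {
>         return syscall(__NR_perf_event_open, attr, pid, cpu,
>                        /*group_fd=*/-1, /*flags=*/0UL);
> }
> ```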
>
> >
> > ```
> > $ perf stat -e instructions perf test -F 10
> > ...
> > Performance counter stats for 'perf test -F 10':
> >
> > 87,896,447 cpu_atom/instructions/ (14.37%)
> > 98,171,964 cpu_core/instructions/ (85.63%)
> > ...
> > $ perf stat --no-scale -e instructions perf test -F 10
> > ...
> > Performance counter stats for 'perf test -F 10':
> >
> > 13,069,890 cpu_atom/instructions/ (19.32%)
> > 83,460,274 cpu_core/instructions/ (80.68%)
> > ...
> > ```
> > The scaling has inflated the per-PMU instruction counts, and the
> > overall count, by roughly 2x.
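> >
> > Concretely, combining the counts from the two runs above:
> > ```
> > scaled:    87,896,447 + 98,171,964 = 186,068,411
> > unscaled:  13,069,890 + 83,460,274 =  96,530,164
> > ratio:     186,068,411 / 96,530,164 ~= 1.93x
> > ```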
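> >
> > And a minimal sketch of the tool-side scaling implied by the
> > scaled_value formula above (hypothetical helper name, not the exact
> > code in tools/lib/perf):
> > ```
> > #include <stdint.h>
> >
> > /* Extrapolate a counter value when the event only ran for part of its
> >  * enabled time: scaled_value = value * time_enabled / time_running. */
> > static uint64_t scale_count(uint64_t value, uint64_t time_enabled,
> >                             uint64_t time_running)
> > {
> >         if (time_running == 0)
> >                 return 0;       /* never ran: nothing to extrapolate */
> >         if (time_running < time_enabled)
> >                 return (uint64_t)((double)value * time_enabled / time_running);
> >         return value;
> > }
> > ```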
> >
> > To fix this, the kernel's behavior when a task+CPU event (or just a
> > task event on hybrid) is scheduled out needs changing. A fix could be
> > that the state isn't inactive but off for such events, so that
> > time_enabled counts don't accumulate on them.
>
> Right, maybe we need to add a new state (UNSCHEDULABLE?) to skip
> updating the enabled time.
Right, having a new state would mean that in __perf_update_times the
enabled time wouldn't increase when the CPU filter applied:
https://web.git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c#n716
```
        enum perf_event_state state = __perf_effective_state(event);
        u64 delta = now - event->tstamp;

        *enabled = event->total_time_enabled;
        if (state >= PERF_EVENT_STATE_INACTIVE)
                *enabled += delta;

        *running = event->total_time_running;
        if (state >= PERF_EVENT_STATE_ACTIVE)
                *running += delta;
```
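For illustration, here's roughly where a new state could sit so that the
">= PERF_EVENT_STATE_INACTIVE" test above excludes it (a sketch only;
the UNSCHEDULABLE name and the renumbering are assumptions, not the RFC
patch below):
```
enum perf_event_state {
        PERF_EVENT_STATE_DEAD           = -5,
        PERF_EVENT_STATE_EXIT           = -4,   /* task died, still inherit */
        PERF_EVENT_STATE_ERROR          = -3,   /* scheduling error */
        PERF_EVENT_STATE_OFF            = -2,
        PERF_EVENT_STATE_UNSCHEDULABLE  = -1,   /* hypothetical: CPU filter excludes the event */
        PERF_EVENT_STATE_INACTIVE       =  0,
        PERF_EVENT_STATE_ACTIVE         =  1,
};
```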
I sent out this RFC patch that just about makes this change:
https://lore.kernel.org/lkml/20250716223924.825772-1-irogers@google.com/
but for now it seems the only workaround is to use `--no-scale` :-(
This series is still necessary for the hybrid TMA fix for msr/tsc/.
Thanks,
Ian
> Thanks,
> Namhyung
>
> >
> > Signed-off-by: Ian Rogers <irogers@...gle.com>
> > ---
> > tools/lib/perf/evlist.c | 118 ++++++++++++++++++++++-----------
> > tools/perf/util/parse-events.c | 10 ++-
> > tools/perf/util/stat.c | 6 +-
> > 3 files changed, 86 insertions(+), 48 deletions(-)
> >
> > diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
> > index 9d9dec21f510..2d2236400220 100644
> > --- a/tools/lib/perf/evlist.c
> > +++ b/tools/lib/perf/evlist.c
> > @@ -36,49 +36,87 @@ void perf_evlist__init(struct perf_evlist *evlist)
> > static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
> > struct perf_evsel *evsel)
> > {
> > - if (evsel->system_wide) {
> > - /* System wide: set the cpu map of the evsel to all online CPUs. */
> > - perf_cpu_map__put(evsel->cpus);
> > - evsel->cpus = perf_cpu_map__new_online_cpus();
> > - } else if (evlist->has_user_cpus && evsel->is_pmu_core) {
> > - /*
> > - * User requested CPUs on a core PMU, ensure the requested CPUs
> > - * are valid by intersecting with those of the PMU.
> > - */
> > + if (perf_cpu_map__is_empty(evsel->cpus)) {
> > + if (perf_cpu_map__is_empty(evsel->pmu_cpus)) {
> > + /*
> > + * Assume the unset PMU cpus were for a system-wide
> > + * event, like a software or tracepoint.
> > + */
> > + evsel->pmu_cpus = perf_cpu_map__new_online_cpus();
> > + }
> > + if (evlist->has_user_cpus && !evsel->system_wide) {
> > + /*
> > + * Use the user CPUs unless the evsel is set to be
> > + * system wide, such as the dummy event.
> > + */
> > + evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus);
> > + } else {
> > + /*
> > + * System wide and other modes, assume the cpu map
> > + * should be set to all PMU CPUs.
> > + */
> > + evsel->cpus = perf_cpu_map__get(evsel->pmu_cpus);
> > + }
> > + }
> > + /*
> > + * Avoid "any CPU"(-1) for uncore and PMUs that require a CPU, even if
> > + * requested.
> > + */
> > + if (evsel->requires_cpu && perf_cpu_map__has_any_cpu(evsel->cpus)) {
> > perf_cpu_map__put(evsel->cpus);
> > - evsel->cpus = perf_cpu_map__intersect(evlist->user_requested_cpus, evsel->pmu_cpus);
> > + evsel->cpus = perf_cpu_map__get(evsel->pmu_cpus);
> > + }
> >
> > - /*
> > - * Empty cpu lists would eventually get opened as "any" so remove
> > - * genuinely empty ones before they're opened in the wrong place.
> > - */
> > - if (perf_cpu_map__is_empty(evsel->cpus)) {
> > - struct perf_evsel *next = perf_evlist__next(evlist, evsel);
> > -
> > - perf_evlist__remove(evlist, evsel);
> > - /* Keep idx contiguous */
> > - if (next)
> > - list_for_each_entry_from(next, &evlist->entries, node)
> > - next->idx--;
> > + /*
> > + * Globally requested CPUs replace user requested unless the evsel is
> > + * set to be system wide.
> > + */
> > + if (evlist->has_user_cpus && !evsel->system_wide) {
> > + assert(!perf_cpu_map__has_any_cpu(evlist->user_requested_cpus));
> > + if (!perf_cpu_map__equal(evsel->cpus, evlist->user_requested_cpus)) {
> > + perf_cpu_map__put(evsel->cpus);
> > + evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus);
> > }
> > - } else if (!evsel->pmu_cpus || evlist->has_user_cpus ||
> > - (!evsel->requires_cpu && perf_cpu_map__has_any_cpu(evlist->user_requested_cpus))) {
> > - /*
> > - * The PMU didn't specify a default cpu map, this isn't a core
> > - * event and the user requested CPUs or the evlist user
> > - * requested CPUs have the "any CPU" (aka dummy) CPU value. In
> > - * which case use the user requested CPUs rather than the PMU
> > - * ones.
> > - */
> > + }
> > +
> > + /* Ensure cpus only references valid PMU CPUs. */
> > + if (!perf_cpu_map__has_any_cpu(evsel->cpus) &&
> > + !perf_cpu_map__is_subset(evsel->pmu_cpus, evsel->cpus)) {
> > + struct perf_cpu_map *tmp = perf_cpu_map__intersect(evsel->pmu_cpus, evsel->cpus);
> > +
> > perf_cpu_map__put(evsel->cpus);
> > - evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus);
> > - } else if (evsel->cpus != evsel->pmu_cpus) {
> > - /*
> > - * No user requested cpu map but the PMU cpu map doesn't match
> > - * the evsel's. Reset it back to the PMU cpu map.
> > - */
> > + evsel->cpus = tmp;
> > + }
> > +
> > + /*
> > + * Was event requested on all the PMU's CPUs but the user requested is
> > + * any CPU (-1)? If so switch to using any CPU (-1) to reduce the number
> > + * of events.
> > + */
> > + if (!evsel->system_wide &&
> > + perf_cpu_map__equal(evsel->cpus, evsel->pmu_cpus) &&
> > + perf_cpu_map__has_any_cpu(evlist->user_requested_cpus)) {
> > perf_cpu_map__put(evsel->cpus);
> > - evsel->cpus = perf_cpu_map__get(evsel->pmu_cpus);
> > + evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus);
> > + }
> > +
> > + /* Sanity check assert before the evsel is potentially removed. */
> > + assert(!evsel->requires_cpu || !perf_cpu_map__has_any_cpu(evsel->cpus));
> > +
> > + /*
> > + * Empty cpu lists would eventually get opened as "any" so remove
> > + * genuinely empty ones before they're opened in the wrong place.
> > + */
> > + if (perf_cpu_map__is_empty(evsel->cpus)) {
> > + struct perf_evsel *next = perf_evlist__next(evlist, evsel);
> > +
> > + perf_evlist__remove(evlist, evsel);
> > + /* Keep idx contiguous */
> > + if (next)
> > + list_for_each_entry_from(next, &evlist->entries, node)
> > + next->idx--;
> > +
> > + return;
> > }
> >
> > if (evsel->system_wide) {
> > @@ -98,6 +136,10 @@ static void perf_evlist__propagate_maps(struct perf_evlist *evlist)
> >
> > evlist->needs_map_propagation = true;
> >
> > + /* Clear the all_cpus set which will be merged into during propagation. */
> > + perf_cpu_map__put(evlist->all_cpus);
> > + evlist->all_cpus = NULL;
> > +
> > list_for_each_entry_safe(evsel, n, &evlist->entries, node)
> > __perf_evlist__propagate_maps(evlist, evsel);
> > }
> > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> > index 4092a43aa84e..0ff7ae75d8f9 100644
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -313,20 +313,18 @@ __add_event(struct list_head *list, int *idx,
> > if (pmu) {
> > is_pmu_core = pmu->is_core;
> > pmu_cpus = perf_cpu_map__get(pmu->cpus);
> > + if (perf_cpu_map__is_empty(pmu_cpus))
> > + pmu_cpus = cpu_map__online();
> > } else {
> > is_pmu_core = (attr->type == PERF_TYPE_HARDWARE ||
> > attr->type == PERF_TYPE_HW_CACHE);
> > pmu_cpus = is_pmu_core ? cpu_map__online() : NULL;
> > }
> >
> > - if (has_user_cpus) {
> > + if (has_user_cpus)
> > cpus = perf_cpu_map__get(user_cpus);
> > - /* Existing behavior that pmu_cpus matches the given user ones. */
> > - perf_cpu_map__put(pmu_cpus);
> > - pmu_cpus = perf_cpu_map__get(user_cpus);
> > - } else {
> > + else
> > cpus = perf_cpu_map__get(pmu_cpus);
> > - }
> >
> > if (init_attr)
> > event_attr_init(attr);
> > diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
> > index 355a7d5c8ab8..8d3bcdb69d37 100644
> > --- a/tools/perf/util/stat.c
> > +++ b/tools/perf/util/stat.c
> > @@ -769,8 +769,6 @@ int create_perf_stat_counter(struct evsel *evsel,
> > attr->enable_on_exec = 1;
> > }
> >
> > - if (target__has_cpu(target) && !target__has_per_thread(target))
> > - return evsel__open_per_cpu(evsel, evsel__cpus(evsel), cpu_map_idx);
> > -
> > - return evsel__open_per_thread(evsel, evsel->core.threads);
> > + return evsel__open_per_cpu_and_thread(evsel, evsel__cpus(evsel), cpu_map_idx,
> > + evsel->core.threads);
> > }
> > --
> > 2.50.0.727.gbf7dc18ff4-goog
> >