linux-kernel - Re: [PATCH] perf tools topdown: Fix incorrect topdown slots regroup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAP-5=fXiUyb2nmP55HPKakS9-zyPmkxHy-Cmdkm6mJ4nUf+wFQ@mail.gmail.com>
Date: Mon, 25 Aug 2025 14:13:51 -0700
From: Ian Rogers <irogers@...gle.com>
To: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Kan Liang <kan.liang@...ux.intel.com>, 
	linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Dapeng Mi <dapeng1.mi@...el.com>, Xudong Hao <xudong.hao@...el.com>
Subject: Re: [PATCH] perf tools topdown: Fix incorrect topdown slots regroup

On Mon, Aug 25, 2025 at 8:54 AM Ian Rogers <irogers@...gle.com> wrote:
>
> On Mon, Aug 25, 2025 at 8:29 AM Ian Rogers <irogers@...gle.com> wrote:
> >
> > On Sun, Aug 24, 2025 at 5:54 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
> > >
> > >
> > > On 8/23/2025 1:35 AM, Ian Rogers wrote:
> > > > On Fri, Aug 22, 2025 at 1:23 AM Dapeng Mi <dapeng1.mi@...ux.intel.com> wrote:
> > > >> When running the command "perf stat -e "slots,slots" -C0 sleep 1", we
> > > >> see below error.
> > > >>
> > > >> perf stat -e "slots,slots" -C0 sleep 1
> > > >> WARNING: events were regrouped to match PMUs
> > > >>  Performance counter stats for 'CPU(s) 0':
> > > >>      <not counted>      slots
> > > >>    <not supported>      slots
> > > >>
> > > >>      1.002265769 seconds time elapsed
> > > >>
> > > >> The topdown slots events are not correctly counted. The root cause is
> > > >> that perf tools incorrectly regroup these 2 slots events into a group.
> > > >> If there are only topdown slots events but no topdown metrics events,
> > > >> the regroup should not be done since topdown slots event can only be
> > > >> programed on fixed counter 3 and multiple slots events can only be
> > > >> multiplexed to run on fixed counter 3, but grouping them blocks
> > > >> multiplexing.
> > > >>
> > > >> So avoid to regroup topdown slots events if there is no topdown metrics
> > > >> events.
> > > >>
> > > >> With this change, above command can be run successfully.
> > > >>
> > > >> perf stat -e "slots,slots" -C0 sleep 1
> > > >>  Performance counter stats for 'CPU(s) 0':
> > > >>        103,973,791      slots
> > > >>        106,488,170      slots
> > > >>
> > > >>        1.003517284 seconds time elapsed
> > > >>
> > > >> Besides, run perf stats/record test on SPR and PTL, both passed.
> > > >>
> > > >> Reported-by: Xudong Hao <xudong.hao@...el.com>
> > > >> Signed-off-by: Dapeng Mi <dapeng1.mi@...ux.intel.com>
> > > > I don't think we should do this and if we were to do it we shouldn't
> > > > do it in the common code. The perf metrics requiring a slots event is
> > > > a massive mess that never seems to end. What should we do with:
> > > > ```
> > > > $ perf stat -e "topdown-fe-bound,topdown-fe-bound" true
> > > >
> > > > Performance counter stats for 'true':
> > > >
> > > >     <not counted>      slots
> > > >     <not counted>      topdown-fe-bound
> > > >   <not supported>      topdown-fe-bound
> > > >
> > > >       0.000960472 seconds time elapsed
> > > >
> > > >       0.001060000 seconds user
> > > >       0.000000000 seconds sys
> > > >
> > > >
> > > > Some events weren't counted. Try disabling the NMI watchdog:
> > > >        echo 0 > /proc/sys/kernel/nmi_watchdog
> > > >        perf stat ...
> > > >        echo 1 > /proc/sys/kernel/nmi_watchdog
> > > > ```
> > > >
> > > > We've injected the slots event but the repeated topdown-fe-bound
> > > > causes the group to fail in a similar way. Why is repeating slots a
> > > > case we care about more?
> > > > Were we to say, okay slots is more special than the other perf metric
> > > > events then I'd prefer arch_evsel__must_be_in_group to return false
> > > > for the slots event when there are no other perf metric events in the
> > > > evlist. But then what do you do if the slots event is in a different
> > > > group like:
> > > > ```
> > > > $ perf stat -e "slots,{slots,topdown-fe-bound}" true
> > > > ```
> > > > It is pretty easy to teach the code to deduplicate events, but then
> > > > again, what about the group behavior?
> > > > It is not clear to me we can ever clean up all the mess that the perf
> > > > metric events on p-cores create and I'm in favor of having the code be
> > > > no more complex than it needs to be, so I'm not sure we should be
> > > > solving this problem.
> > >
> > > Ian, thanks for reviewing. Yeah, I agree what you said "topdown events are
> > > a massive mess", we could never solve all issues. But it's annoying for
> > > users that it reports "not counted" instead of an error. Is it possible
> > > that we take a step back and don't try to resolve this issue and just
> > > report an error (maybe plus a message) to warn users that multiple same
> > > topdown events are not allowed? Thanks.
> >
> > Hi Dapeng,
> >
> > looking at:
> > ```
> > $ sudo perf stat -vv -e
> > 'cpu_core/topdown-fe-bound/,cpu_core/topdown-fe-bound/' true
> > Using CPUID GenuineIntel-6-B7-1
> > Attempt to add: cpu_core/topdown-fe-bound/
> > ..after resolving event: cpu_core/event=0,umask=0x82/
> > Attempt to add: cpu_core/topdown-fe-bound/
> > ..after resolving event: cpu_core/event=0,umask=0x82/
> > WARNING: events were regrouped to match PMUs
> > evlist after sorting/fixing:
> > '{cpu_core/slots/,cpu_core/topdown-fe-bound/,cpu_core/topdown-fe-bound/}'
> > Control descriptor is not initialized
> > ------------------------------------------------------------
> > perf_event_attr:
> >   type                             4 (cpu_core)
> >   size                             136
> >   config                           0x400 (slots)
> >   sample_type                      IDENTIFIER
> >   read_format
> > TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
> >   disabled                         1
> >   inherit                          1
> >   enable_on_exec                   1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 2631204  cpu -1  group_fd -1  flags 0x8 = 3
> > ------------------------------------------------------------
> > perf_event_attr:
> >   type                             4 (cpu_core)
> >   size                             136
> >   config                           0x8200 (topdown-fe-bound)
> >   sample_type                      IDENTIFIER
> >   read_format
> > TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
> >   inherit                          1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 2631204  cpu -1  group_fd 3  flags 0x8 = 4
> > ------------------------------------------------------------
> > perf_event_attr:
> >   type                             4 (cpu_core)
> >   size                             136
> >   config                           0x8200 (topdown-fe-bound)
> >   sample_type                      IDENTIFIER
> >   read_format
> > TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
> >   inherit                          1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 2631204  cpu -1  group_fd 3  flags 0x8
> > sys_perf_event_open failed, error -22
> > switching off exclude_guest for PMU cpu_core
> > Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit,
> > falling back to no-inherit.
> > Warning:
> > cpu_core/topdown-fe-bound/ event is not supported by the kernel.
> > failed to read counter cpu_core/slots/
> > failed to read counter cpu_core/topdown-fe-bound/
> > failed to read counter cpu_core/topdown-fe-bound/
> >
> >  Performance counter stats for 'true':
> >
> >      <not counted>      cpu_core/slots/
> >      <not counted>      cpu_core/topdown-fe-bound/
> >    <not supported>      cpu_core/topdown-fe-bound/
> >
> >        0.000975600 seconds time elapsed
> >
> >        0.001034000 seconds user
> >        0.000000000 seconds sys
> > ```
> > We can see that we have two good perf_event_opens and then the evsel
> > open code is trying to disable features because of the perf event open
> > failing. We can improve the error message:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evsel.c?h=perf-tools-next#n3828
> > so that a perf metric event that returns EINVAL warns about the broken
> > grouping (perhaps some arch/PMU specific EINVAL handling). Perhaps if
> > this is a slots event that isn't the group leader, the event can be
> > removed from the group at this point. There is some overlap with
> > breaking weak groups, when too many events are placed within a group:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n1848
> >
> > Generating the warning message looks relatively easy, making this a
> > case that fails or changes the evlist looks to be a larger refactor.
> > Should the failing perf_event_open cause the other members of the
> > group to break? That seems like an kernel and Intel PMU decision that
> > was already made. My suggestion is we create a warning message for
> > these perf metric event EINVAL cases but otherwise leave things
> > unchanged.
>
> I also think it'd be worthwhile to add a command line option to
> disable parse_events__sort_events_and_fix_groups code so that things
> like "slots,slots" can be just passed through and not get broken. This
> may be unfortunate for uncore things like:
> ```
> $ sudo perf stat -v -e '{data_read,data_write}' -a -A true
> Using CPUID GenuineIntel-6-B7-1
> data_read -> uncore_imc_free_running_0/data_read/
> data_read -> uncore_imc_free_running_1/data_read/
> data_write -> uncore_imc_free_running_0/data_write/
> data_write -> uncore_imc_free_running_1/data_write/
> WARNING: events were regrouped to match PMUs
> evlist after sorting/fixing: '{data_read,data_write},{data_read,data_write}'
> Control descriptor is not initialized
>
> Performance counter stats for 'system wide':
>
> CPU0                 1.85 MiB  uncore_imc_free_running_0/data_read/
>
> CPU0                 0.08 MiB  uncore_imc_free_running_0/data_write/
>
> CPU0                 1.87 MiB  uncore_imc_free_running_1/data_read/
>
> CPU0                 0.07 MiB  uncore_imc_free_running_1/data_write/
>
>
>       0.000851335 seconds time elapsed
> ```
> where the different uncore PMUs for the data_read and data_write
> events won't get resorted and the groups broken, etc. Perhaps rather
> than a command line option for the evlist it can be an option on the
> event, so something like:
> ```
> $ perf stat -e 'slots:X,slots:X,{data_read,data_write}' ...
> ```
> where the X modifier is indicating not to change the group of the event.

So a patch series doing this is here:
https://lore.kernel.org/lkml/20250825211204.2784695-1-irogers@google.com/

Thanks,
Ian