linux-kernel - Re: [PATCH] perf tools topdown: Fix incorrect topdown slots regroup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fVdzM=VABXtcX_qM1_o8M5sKB5nJpojVg+Pw+XbfsRWkg@mail.gmail.com>
Date: Mon, 25 Aug 2025 08:29:12 -0700
From: Ian Rogers <irogers@...gle.com>
To: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Kan Liang <kan.liang@...ux.intel.com>, 
	linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Dapeng Mi <dapeng1.mi@...el.com>, Xudong Hao <xudong.hao@...el.com>
Subject: Re: [PATCH] perf tools topdown: Fix incorrect topdown slots regroup

On Sun, Aug 24, 2025 at 5:54 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>
>
> On 8/23/2025 1:35 AM, Ian Rogers wrote:
> > On Fri, Aug 22, 2025 at 1:23 AM Dapeng Mi <dapeng1.mi@...ux.intel.com> wrote:
> >> When running the command "perf stat -e "slots,slots" -C0 sleep 1", we
> >> see below error.
> >>
> >> perf stat -e "slots,slots" -C0 sleep 1
> >> WARNING: events were regrouped to match PMUs
> >>  Performance counter stats for 'CPU(s) 0':
> >>      <not counted>      slots
> >>    <not supported>      slots
> >>
> >>      1.002265769 seconds time elapsed
> >>
> >> The topdown slots events are not correctly counted. The root cause is
> >> that perf tools incorrectly regroup these 2 slots events into a group.
> >> If there are only topdown slots events but no topdown metrics events,
> >> the regroup should not be done since topdown slots event can only be
> >> programed on fixed counter 3 and multiple slots events can only be
> >> multiplexed to run on fixed counter 3, but grouping them blocks
> >> multiplexing.
> >>
> >> So avoid to regroup topdown slots events if there is no topdown metrics
> >> events.
> >>
> >> With this change, above command can be run successfully.
> >>
> >> perf stat -e "slots,slots" -C0 sleep 1
> >>  Performance counter stats for 'CPU(s) 0':
> >>        103,973,791      slots
> >>        106,488,170      slots
> >>
> >>        1.003517284 seconds time elapsed
> >>
> >> Besides, run perf stats/record test on SPR and PTL, both passed.
> >>
> >> Reported-by: Xudong Hao <xudong.hao@...el.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@...ux.intel.com>
> > I don't think we should do this and if we were to do it we shouldn't
> > do it in the common code. The perf metrics requiring a slots event is
> > a massive mess that never seems to end. What should we do with:
> > ```
> > $ perf stat -e "topdown-fe-bound,topdown-fe-bound" true
> >
> > Performance counter stats for 'true':
> >
> >     <not counted>      slots
> >     <not counted>      topdown-fe-bound
> >   <not supported>      topdown-fe-bound
> >
> >       0.000960472 seconds time elapsed
> >
> >       0.001060000 seconds user
> >       0.000000000 seconds sys
> >
> >
> > Some events weren't counted. Try disabling the NMI watchdog:
> >        echo 0 > /proc/sys/kernel/nmi_watchdog
> >        perf stat ...
> >        echo 1 > /proc/sys/kernel/nmi_watchdog
> > ```
> >
> > We've injected the slots event but the repeated topdown-fe-bound
> > causes the group to fail in a similar way. Why is repeating slots a
> > case we care about more?
> > Were we to say, okay slots is more special than the other perf metric
> > events then I'd prefer arch_evsel__must_be_in_group to return false
> > for the slots event when there are no other perf metric events in the
> > evlist. But then what do you do if the slots event is in a different
> > group like:
> > ```
> > $ perf stat -e "slots,{slots,topdown-fe-bound}" true
> > ```
> > It is pretty easy to teach the code to deduplicate events, but then
> > again, what about the group behavior?
> > It is not clear to me we can ever clean up all the mess that the perf
> > metric events on p-cores create and I'm in favor of having the code be
> > no more complex than it needs to be, so I'm not sure we should be
> > solving this problem.
>
> Ian, thanks for reviewing. Yeah, I agree what you said "topdown events are
> a massive mess", we could never solve all issues. But it's annoying for
> users that it reports "not counted" instead of an error. Is it possible
> that we take a step back and don't try to resolve this issue and just
> report an error (maybe plus a message) to warn users that multiple same
> topdown events are not allowed? Thanks.

Hi Dapeng,

looking at:
```
$ sudo perf stat -vv -e
'cpu_core/topdown-fe-bound/,cpu_core/topdown-fe-bound/' true
Using CPUID GenuineIntel-6-B7-1
Attempt to add: cpu_core/topdown-fe-bound/
..after resolving event: cpu_core/event=0,umask=0x82/
Attempt to add: cpu_core/topdown-fe-bound/
..after resolving event: cpu_core/event=0,umask=0x82/
WARNING: events were regrouped to match PMUs
evlist after sorting/fixing:
'{cpu_core/slots/,cpu_core/topdown-fe-bound/,cpu_core/topdown-fe-bound/}'
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
  type                             4 (cpu_core)
  size                             136
  config                           0x400 (slots)
  sample_type                      IDENTIFIER
  read_format
TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
  disabled                         1
  inherit                          1
  enable_on_exec                   1
------------------------------------------------------------
sys_perf_event_open: pid 2631204  cpu -1  group_fd -1  flags 0x8 = 3
------------------------------------------------------------
perf_event_attr:
  type                             4 (cpu_core)
  size                             136
  config                           0x8200 (topdown-fe-bound)
  sample_type                      IDENTIFIER
  read_format
TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
  inherit                          1
------------------------------------------------------------
sys_perf_event_open: pid 2631204  cpu -1  group_fd 3  flags 0x8 = 4
------------------------------------------------------------
perf_event_attr:
  type                             4 (cpu_core)
  size                             136
  config                           0x8200 (topdown-fe-bound)
  sample_type                      IDENTIFIER
  read_format
TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
  inherit                          1
------------------------------------------------------------
sys_perf_event_open: pid 2631204  cpu -1  group_fd 3  flags 0x8
sys_perf_event_open failed, error -22
switching off exclude_guest for PMU cpu_core
Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit,
falling back to no-inherit.
Warning:
cpu_core/topdown-fe-bound/ event is not supported by the kernel.
failed to read counter cpu_core/slots/
failed to read counter cpu_core/topdown-fe-bound/
failed to read counter cpu_core/topdown-fe-bound/

 Performance counter stats for 'true':

     <not counted>      cpu_core/slots/
     <not counted>      cpu_core/topdown-fe-bound/
   <not supported>      cpu_core/topdown-fe-bound/

       0.000975600 seconds time elapsed

       0.001034000 seconds user
       0.000000000 seconds sys
```
We can see that we have two good perf_event_opens and then the evsel
open code is trying to disable features because of the perf event open
failing. We can improve the error message:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evsel.c?h=perf-tools-next#n3828
so that a perf metric event that returns EINVAL warns about the broken
grouping (perhaps some arch/PMU specific EINVAL handling). Perhaps if
this is a slots event that isn't the group leader, the event can be
removed from the group at this point. There is some overlap with
breaking weak groups, when too many events are placed within a group:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n1848

Generating the warning message looks relatively easy, making this a
case that fails or changes the evlist looks to be a larger refactor.
Should the failing perf_event_open cause the other members of the
group to break? That seems like an kernel and Intel PMU decision that
was already made. My suggestion is we create a warning message for
these perf metric event EINVAL cases but otherwise leave things
unchanged.

Thanks,
Ian