lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fXqi6Q-DW8d3eV7rSapMx1FtgeBG45o=NkGtbWtj==rvQ@mail.gmail.com>
Date: Mon, 25 Aug 2025 08:54:32 -0700
From: Ian Rogers <irogers@...gle.com>
To: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Kan Liang <kan.liang@...ux.intel.com>, 
	linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Dapeng Mi <dapeng1.mi@...el.com>, Xudong Hao <xudong.hao@...el.com>
Subject: Re: [PATCH] perf tools topdown: Fix incorrect topdown slots regroup

On Mon, Aug 25, 2025 at 8:29 AM Ian Rogers <irogers@...gle.com> wrote:
>
> On Sun, Aug 24, 2025 at 5:54 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
> >
> >
> > On 8/23/2025 1:35 AM, Ian Rogers wrote:
> > > On Fri, Aug 22, 2025 at 1:23 AM Dapeng Mi <dapeng1.mi@...ux.intel.com> wrote:
> > >> When running the command "perf stat -e "slots,slots" -C0 sleep 1", we
> > >> see below error.
> > >>
> > >> perf stat -e "slots,slots" -C0 sleep 1
> > >> WARNING: events were regrouped to match PMUs
> > >>  Performance counter stats for 'CPU(s) 0':
> > >>      <not counted>      slots
> > >>    <not supported>      slots
> > >>
> > >>      1.002265769 seconds time elapsed
> > >>
> > >> The topdown slots events are not correctly counted. The root cause is
> > >> that perf tools incorrectly regroup these 2 slots events into a group.
> > >> If there are only topdown slots events but no topdown metrics events,
> > >> the regroup should not be done since topdown slots event can only be
> > >> programed on fixed counter 3 and multiple slots events can only be
> > >> multiplexed to run on fixed counter 3, but grouping them blocks
> > >> multiplexing.
> > >>
> > >> So avoid to regroup topdown slots events if there is no topdown metrics
> > >> events.
> > >>
> > >> With this change, above command can be run successfully.
> > >>
> > >> perf stat -e "slots,slots" -C0 sleep 1
> > >>  Performance counter stats for 'CPU(s) 0':
> > >>        103,973,791      slots
> > >>        106,488,170      slots
> > >>
> > >>        1.003517284 seconds time elapsed
> > >>
> > >> Besides, run perf stats/record test on SPR and PTL, both passed.
> > >>
> > >> Reported-by: Xudong Hao <xudong.hao@...el.com>
> > >> Signed-off-by: Dapeng Mi <dapeng1.mi@...ux.intel.com>
> > > I don't think we should do this and if we were to do it we shouldn't
> > > do it in the common code. The perf metrics requiring a slots event is
> > > a massive mess that never seems to end. What should we do with:
> > > ```
> > > $ perf stat -e "topdown-fe-bound,topdown-fe-bound" true
> > >
> > > Performance counter stats for 'true':
> > >
> > >     <not counted>      slots
> > >     <not counted>      topdown-fe-bound
> > >   <not supported>      topdown-fe-bound
> > >
> > >       0.000960472 seconds time elapsed
> > >
> > >       0.001060000 seconds user
> > >       0.000000000 seconds sys
> > >
> > >
> > > Some events weren't counted. Try disabling the NMI watchdog:
> > >        echo 0 > /proc/sys/kernel/nmi_watchdog
> > >        perf stat ...
> > >        echo 1 > /proc/sys/kernel/nmi_watchdog
> > > ```
> > >
> > > We've injected the slots event but the repeated topdown-fe-bound
> > > causes the group to fail in a similar way. Why is repeating slots a
> > > case we care about more?
> > > Were we to say, okay slots is more special than the other perf metric
> > > events then I'd prefer arch_evsel__must_be_in_group to return false
> > > for the slots event when there are no other perf metric events in the
> > > evlist. But then what do you do if the slots event is in a different
> > > group like:
> > > ```
> > > $ perf stat -e "slots,{slots,topdown-fe-bound}" true
> > > ```
> > > It is pretty easy to teach the code to deduplicate events, but then
> > > again, what about the group behavior?
> > > It is not clear to me we can ever clean up all the mess that the perf
> > > metric events on p-cores create and I'm in favor of having the code be
> > > no more complex than it needs to be, so I'm not sure we should be
> > > solving this problem.
> >
> > Ian, thanks for reviewing. Yeah, I agree what you said "topdown events are
> > a massive mess", we could never solve all issues. But it's annoying for
> > users that it reports "not counted" instead of an error. Is it possible
> > that we take a step back and don't try to resolve this issue and just
> > report an error (maybe plus a message) to warn users that multiple same
> > topdown events are not allowed? Thanks.
>
> Hi Dapeng,
>
> looking at:
> ```
> $ sudo perf stat -vv -e
> 'cpu_core/topdown-fe-bound/,cpu_core/topdown-fe-bound/' true
> Using CPUID GenuineIntel-6-B7-1
> Attempt to add: cpu_core/topdown-fe-bound/
> ..after resolving event: cpu_core/event=0,umask=0x82/
> Attempt to add: cpu_core/topdown-fe-bound/
> ..after resolving event: cpu_core/event=0,umask=0x82/
> WARNING: events were regrouped to match PMUs
> evlist after sorting/fixing:
> '{cpu_core/slots/,cpu_core/topdown-fe-bound/,cpu_core/topdown-fe-bound/}'
> Control descriptor is not initialized
> ------------------------------------------------------------
> perf_event_attr:
>   type                             4 (cpu_core)
>   size                             136
>   config                           0x400 (slots)
>   sample_type                      IDENTIFIER
>   read_format
> TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
>   disabled                         1
>   inherit                          1
>   enable_on_exec                   1
> ------------------------------------------------------------
> sys_perf_event_open: pid 2631204  cpu -1  group_fd -1  flags 0x8 = 3
> ------------------------------------------------------------
> perf_event_attr:
>   type                             4 (cpu_core)
>   size                             136
>   config                           0x8200 (topdown-fe-bound)
>   sample_type                      IDENTIFIER
>   read_format
> TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
>   inherit                          1
> ------------------------------------------------------------
> sys_perf_event_open: pid 2631204  cpu -1  group_fd 3  flags 0x8 = 4
> ------------------------------------------------------------
> perf_event_attr:
>   type                             4 (cpu_core)
>   size                             136
>   config                           0x8200 (topdown-fe-bound)
>   sample_type                      IDENTIFIER
>   read_format
> TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID|GROUP
>   inherit                          1
> ------------------------------------------------------------
> sys_perf_event_open: pid 2631204  cpu -1  group_fd 3  flags 0x8
> sys_perf_event_open failed, error -22
> switching off exclude_guest for PMU cpu_core
> Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit,
> falling back to no-inherit.
> Warning:
> cpu_core/topdown-fe-bound/ event is not supported by the kernel.
> failed to read counter cpu_core/slots/
> failed to read counter cpu_core/topdown-fe-bound/
> failed to read counter cpu_core/topdown-fe-bound/
>
>  Performance counter stats for 'true':
>
>      <not counted>      cpu_core/slots/
>      <not counted>      cpu_core/topdown-fe-bound/
>    <not supported>      cpu_core/topdown-fe-bound/
>
>        0.000975600 seconds time elapsed
>
>        0.001034000 seconds user
>        0.000000000 seconds sys
> ```
> We can see that we have two good perf_event_opens and then the evsel
> open code is trying to disable features because of the perf event open
> failing. We can improve the error message:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evsel.c?h=perf-tools-next#n3828
> so that a perf metric event that returns EINVAL warns about the broken
> grouping (perhaps some arch/PMU specific EINVAL handling). Perhaps if
> this is a slots event that isn't the group leader, the event can be
> removed from the group at this point. There is some overlap with
> breaking weak groups, when too many events are placed within a group:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n1848
>
> Generating the warning message looks relatively easy, making this a
> case that fails or changes the evlist looks to be a larger refactor.
> Should the failing perf_event_open cause the other members of the
> group to break? That seems like an kernel and Intel PMU decision that
> was already made. My suggestion is we create a warning message for
> these perf metric event EINVAL cases but otherwise leave things
> unchanged.

I also think it'd be worthwhile to add a command line option to
disable parse_events__sort_events_and_fix_groups code so that things
like "slots,slots" can be just passed through and not get broken. This
may be unfortunate for uncore things like:
```
$ sudo perf stat -v -e '{data_read,data_write}' -a -A true
Using CPUID GenuineIntel-6-B7-1
data_read -> uncore_imc_free_running_0/data_read/
data_read -> uncore_imc_free_running_1/data_read/
data_write -> uncore_imc_free_running_0/data_write/
data_write -> uncore_imc_free_running_1/data_write/
WARNING: events were regrouped to match PMUs
evlist after sorting/fixing: '{data_read,data_write},{data_read,data_write}'
Control descriptor is not initialized

Performance counter stats for 'system wide':

CPU0                 1.85 MiB  uncore_imc_free_running_0/data_read/

CPU0                 0.08 MiB  uncore_imc_free_running_0/data_write/

CPU0                 1.87 MiB  uncore_imc_free_running_1/data_read/

CPU0                 0.07 MiB  uncore_imc_free_running_1/data_write/


      0.000851335 seconds time elapsed
```
where the different uncore PMUs for the data_read and data_write
events won't get resorted and the groups broken, etc. Perhaps rather
than a command line option for the evlist it can be an option on the
event, so something like:
```
$ perf stat -e 'slots:X,slots:X,{data_read,data_write}' ...
```
where the X modifier is indicating not to change the group of the event.

Thanks,
Ian

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ