linux-kernel - Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake metric fixes

Message-ID: <CAP-5=fUTyNkndZMKROLMVYh=C-oNRxjWPVaS=B2MsV6+Bvktmg@mail.gmail.com>
Date: Thu, 4 Jan 2024 15:31:40 -0800
From: Ian Rogers <irogers@...gle.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>
Cc: "Wang, Weilin" <weilin.wang@...el.com>, Arnaldo Carvalho de Melo <acme@...nel.org>, 
	Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Namhyung Kim <namhyung@...nel.org>, Adrian Hunter <adrian.hunter@...el.com>, 
	linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Edward Baker <edward.baker@...el.com>
Subject: Re: [PATCH v1 1/4] perf vendor events intel: Alderlake/rocketlake
 metric fixes

On Thu, Jan 4, 2024 at 11:30 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
>
>
> On 2024-01-04 12:51 p.m., Ian Rogers wrote:
> > On Thu, Jan 4, 2024 at 6:30 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2024-01-04 8:56 a.m., Ian Rogers wrote:
> >>>> Testing tma_slow_pause
> >>>> Metric 'tma_slow_pause' not printed in:
> >>>> # Running 'internals/synthesize' benchmark:
> >>>> Computing performance of single threaded perf event synthesis by
> >>>> synthesizing events on the perf process itself:
> >>>>   Average synthesis took: 49.987 usec (+- 0.049 usec)
> >>>>   Average num. events: 47.000 (+- 0.000)
> >>>>   Average time per event 1.064 usec
> >>>>   Average data synthesis took: 53.490 usec (+- 0.033 usec)
> >>>>   Average num. events: 245.000 (+- 0.000)
> >>>>   Average time per event 0.218 usec
> >>>>
> >>>>  Performance counter stats for 'perf bench internals synthesize':
> >>>>
> >>>>      <not counted>      cpu_core/TOPDOWN.SLOTS/                                                 (0.00%)
> >>>>      <not counted>      cpu_core/topdown-retiring/                                              (0.00%)
> >>>>      <not counted>      cpu_core/topdown-mem-bound/                                             (0.00%)
> >>>>      <not counted>      cpu_core/topdown-bad-spec/                                              (0.00%)
> >>>>      <not counted>      cpu_core/topdown-fe-bound/                                              (0.00%)
> >>>>      <not counted>      cpu_core/topdown-be-bound/                                              (0.00%)
> >>>>      <not counted>      cpu_core/RESOURCE_STALLS.SCOREBOARD/                                        (0.00%)
> >>>>      <not counted>      cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/                                        (0.00%)
> >>>>      <not counted>      cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/                                        (0.00%)
> >>>>      <not counted>      cpu_core/CPU_CLK_UNHALTED.PAUSE/                                        (0.00%)
> >>>>      <not counted>      cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/                                        (0.00%)
> >>>>      <not counted>      cpu_core/CPU_CLK_UNHALTED.THREAD/                                        (0.00%)
> >>>>      <not counted>      cpu_core/ARITH.DIV_ACTIVE/                                              (0.00%)
> >>>>      <not counted>      cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/                                        (0.00%)
> >>>>      <not counted>      cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/                                        (0.00%)
> >>>>
> >>>>        1.186254766 seconds time elapsed
> >>>>
> >>>>        0.427220000 seconds user
> >>>>        0.752217000 seconds sys
> >>>> Testing smi_cycles
> >>>> Testing smi_num
> >>>> Testing tsx_aborted_cycles
> >>>> Testing tsx_cycles_per_elision
> >>>> Testing tsx_cycles_per_transaction
> >>>> Testing tsx_transactional_cycles
> >>>> test child finished with -1
> >>>> ---- end ----
> >>>> perf all metrics test: FAILED!
> >>>> root@...ber:~#
> >>> Have a try disabling the NMI watchdog. Agreed that there is more to
> >>> fix here but I think the PMU driver is in part to blame because
> >>> manually breaking the weak group of events is a fix.
> >>
> >> I think we have a NO_GROUP_EVENTS_NMI metric constraint to mark a group
> >> that requires disabling the NMI watchdog.
> >> Maybe we should mark this group with the NO_GROUP_EVENTS_NMI constraint.
> >
> > +Weilin due to the effects of event grouping.
> >
> > Thanks Kan, NO_GROUP_EVENTS_NMI would be good. Something I see for
> > tma_ports_utilized_1 that may be worsening things is:
> >
> > ```
> > Testing tma_ports_utilized_1
> > Metric 'tma_ports_utilized_1' not printed in:
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> >   Average synthesis took: 49.581 usec (+- 0.030 usec)
> >   Average num. events: 47.000 (+- 0.000)
> >   Average time per event 1.055 usec
> >   Average data synthesis took: 53.367 usec (+- 0.032 usec)
> >   Average num. events: 246.000 (+- 0.000)
> >   Average time per event 0.217 usec
> >
> >  Performance counter stats for 'perf bench internals synthesize':
> >
> >      <not counted>      cpu_core/TOPDOWN.SLOTS/                         (0.00%)
> >      <not counted>      cpu_core/topdown-retiring/                      (0.00%)
> >      <not counted>      cpu_core/topdown-mem-bound/                     (0.00%)
> >      <not counted>      cpu_core/topdown-bad-spec/                      (0.00%)
> >      <not counted>      cpu_core/topdown-fe-bound/                      (0.00%)
> >      <not counted>      cpu_core/topdown-be-bound/                      (0.00%)
> >      <not counted>      cpu_core/RESOURCE_STALLS.SCOREBOARD/            (0.00%)
> >      <not counted>      cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/             (0.00%)
> >      <not counted>      cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/           (0.00%)
> >      <not counted>      cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/             (0.00%)
> >      <not counted>      cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/           (0.00%)
> >      <not counted>      cpu_core/CPU_CLK_UNHALTED.THREAD/               (0.00%)
> >      <not counted>      cpu_core/ARITH.DIV_ACTIVE/                      (0.00%)
> >      <not counted>      cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/   (0.00%)
> >      <not counted>      cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/  (0.00%)
> >
> >        1.180394056 seconds time elapsed
> >
> >        0.409881000 seconds user
> >        0.764134000 seconds sys
> > ```
> >
> > The event EXE_ACTIVITY.1_PORTS_UTIL is repeated. This is because the
> > metric code deduplicates events based purely on their name and so
> > doesn't realize EXE_ACTIVITY.1_PORTS_UTIL is the same as
> > cpu_core@..._ACTIVITY.1_PORTS_UTIL@. This is a hybrid-only glitch, as
> > we only prefix with a PMU for hybrid metrics. I should find and fix
> > whatever leaves the one use of EXE_ACTIVITY.1_PORTS_UTIL without a PMU.
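
To illustrate the deduplication point above, here is a minimal sketch in
Python (perf's metric code is C, and this is not its implementation). The
helper names event_key and dedup_events are hypothetical. It keys each event
on a (PMU, event) pair rather than on the raw name, and it assumes that a
bare event name belongs to cpu_core, which is an assumption made only for
this example.

```python
import re

# Accepts "EVENT", "pmu/EVENT/" and the metric-expression form "pmu@EVENT@".
_EVENT_RE = re.compile(
    r"^(?:(?P<pmu>[A-Za-z0-9_]+)[/@])?(?P<event>[^/@]+)[/@]?$"
)


def event_key(name, default_pmu="cpu_core"):
    """Return a (pmu, event) pair; bare names get default_pmu (an assumption)."""
    m = _EVENT_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized event reference: {name!r}")
    return (m.group("pmu") or default_pmu, m.group("event"))


def dedup_events(names, default_pmu="cpu_core"):
    """Keep the first spelling of each event, comparing by (pmu, event)."""
    seen = set()
    unique = []
    for name in names:
        key = event_key(name, default_pmu)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique
```

With this keying, "EXE_ACTIVITY.1_PORTS_UTIL" and
"cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/" collapse to a single entry, whereas a
name-only comparison keeps both.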
> >
> > This problem doesn't occur for tma_slow_pause and I wondered if you
> > could give insight. That metric has the counters below:
> > ```
> > $ perf stat -M tma_slow_pause -a sleep 0.1
> >
> > Performance counter stats for 'system wide':
> >
> >     <not counted>      cpu_core/TOPDOWN.SLOTS/                         (0.00%)
> >     <not counted>      cpu_core/topdown-retiring/                      (0.00%)
> >     <not counted>      cpu_core/topdown-mem-bound/                     (0.00%)
> >     <not counted>      cpu_core/topdown-bad-spec/                      (0.00%)
> >     <not counted>      cpu_core/topdown-fe-bound/                      (0.00%)
> >     <not counted>      cpu_core/topdown-be-bound/                      (0.00%)
> >     <not counted>      cpu_core/RESOURCE_STALLS.SCOREBOARD/            (0.00%)
> >     <not counted>      cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/             (0.00%)
> >     <not counted>      cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/           (0.00%)
> >     <not counted>      cpu_core/CPU_CLK_UNHALTED.PAUSE/                (0.00%)
> >     <not counted>      cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/           (0.00%)
> >     <not counted>      cpu_core/CPU_CLK_UNHALTED.THREAD/               (0.00%)
> >     <not counted>      cpu_core/ARITH.DIV_ACTIVE/                      (0.00%)
> >     <not counted>      cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/   (0.00%)
> >     <not counted>      cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/  (0.00%)
> >
> >       0.102074888 seconds time elapsed
> > ```
> >
> > With -vv I see the event string is:
> > '{RESOURCE_STALLS.SCOREBOARD/metric-id=RESOURCE_STALLS.SCOREBOARD/,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL,metric-id=cpu_core!3EXE_ACTIVITY.1_PORTS_UTIL!3/,cpu_core/TOPDOWN.SLOTS,metric-id=cpu_core!3TOPDOWN.SLOTS!3/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS,metric-id=cpu_core!3EXE_ACTIVITY.BOUND_ON_LOADS!3/,cpu_core/topdown-retiring,metric-id=cpu_core!3topdown!1retiring!3/,cpu_core/topdown-mem-bound,metric-id=cpu_core!3topdown!1mem!1bound!3/,cpu_core/topdown-bad-spec,metric-id=cpu_core!3topdown!1bad!1spec!3/,CPU_CLK_UNHALTED.PAUSE/metric-id=CPU_CLK_UNHALTED.PAUSE/,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL,metric-id=cpu_core!3CYCLE_ACTIVITY.STALLS_TOTAL!3/,cpu_core/CPU_CLK_UNHALTED.THREAD,metric-id=cpu_core!3CPU_CLK_UNHALTED.THREAD!3/,cpu_core/ARITH.DIV_ACTIVE,metric-id=cpu_core!3ARITH.DIV_ACTIVE!3/,cpu_core/topdown-fe-bound,metric-id=cpu_core!3topdown!1fe!1bound!3/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc,metric-id=cpu_core!3EXE_ACTIVITY.2_PORTS_UTIL!0umask!20xc!3/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80,metric-id=cpu_core!3EXE_ACTIVITY.3_PORTS_UTIL!0umask!20x80!3/,cpu_core/topdown-be-bound,metric-id=cpu_core!3topdown!1be!1bound!3/}:W'
> >
> > which without the metric-ids becomes:
> > '{RESOURCE_STALLS.SCOREBOARD,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/,cpu_core/TOPDOWN.SLOTS/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/,cpu_core/topdown-retiring/,cpu_core/topdown-mem-bound/,cpu_core/topdown-bad-spec/,CPU_CLK_UNHALTED.PAUSE,cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/,cpu_core/CPU_CLK_UNHALTED.THREAD/,cpu_core/ARITH.DIV_ACTIVE/,cpu_core/topdown-fe-bound/,cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/,cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/,cpu_core/topdown-be-bound/}:W'
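
The reduction from the -vv event string to the one without metric-ids can be
reproduced with a short sketch. This is illustrative Python, not anything perf
does internally; strip_metric_ids is a hypothetical helper, and it assumes the
metric-id term is always the last term of an event, as it is in the output
quoted above.

```python
import re


def strip_metric_ids(event_string):
    """Remove metric-id=... terms from a perf event string."""
    # "EVENT/metric-id=.../" -> "EVENT" (metric-id was the only term).
    s = re.sub(r"/metric-id=[^/,]*/", "", event_string)
    # "pmu/EVENT,metric-id=.../" -> "pmu/EVENT/" (drop only the trailing term).
    s = re.sub(r",metric-id=[^/,]*", "", s)
    return s
```

For example, "CPU_CLK_UNHALTED.PAUSE/metric-id=CPU_CLK_UNHALTED.PAUSE/"
reduces to "CPU_CLK_UNHALTED.PAUSE", while PMU-prefixed events keep their
"cpu_core/.../" form and any umask term.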
> >
> > I count 9 non-slots/top-down counters there, but I see
> > CPU_CLK_UNHALTED.THREAD can use fixed counter 1. Should
> > perf_event_open fail for a CPU that has a pinned use of a fixed
> > counter and the group needs the fixed counter?
>
> I tried, but the idea was rejected.
>
> > I'm guessing you don't
> > want this, as CPU_CLK_UNHALTED.THREAD can also go on a generic counter
> > and the driver doesn't want to count counter usage. It seems feasible
> > to add it though. I guess we need a NO_GROUP_EVENTS_NMI whenever
> > CPU_CLK_UNHALTED.THREAD is an event and 8 generic counters are in use.
>
> Yes, it looks good to me.
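
As a rough sketch of the rule agreed above (illustrative Python only, not the
code in create_perf_json.py or in the posted fixes): split the weak group
into events, ignore the slots/topdown events, and flag the group when
CPU_CLK_UNHALTED.THREAD is present and the remaining event count is one more
than the number of generic counters. The function names are hypothetical, the
default of 8 generic counters is taken from the discussion above, and the
substring match on the event name is a simplification.

```python
def split_group(event_string):
    """Split '{a,pmu/b,umask=0xc/,c}:W' into events, keeping commas inside pmu/.../."""
    body = event_string[event_string.index("{") + 1:event_string.rindex("}")]
    events, current, in_pmu = [], [], False
    for ch in body:
        if ch == "/":
            in_pmu = not in_pmu  # toggles on entering and leaving pmu/.../
        if ch == "," and not in_pmu:
            events.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        events.append("".join(current))
    return events


def is_topdown(event):
    """True for TOPDOWN.SLOTS and the topdown-* events."""
    return "topdown" in event.lower()


def needs_no_group_events_nmi(events, num_gp_counters=8):
    """Assumed rule: THREAD present and non-topdown events == GP counters + 1."""
    core = [e for e in events if not is_topdown(e)]
    has_thread = any("CPU_CLK_UNHALTED.THREAD" in e for e in core)
    return has_thread and len(core) == num_gp_counters + 1


# The tma_slow_pause group quoted above, without metric-ids.
GROUP = (
    "{RESOURCE_STALLS.SCOREBOARD,cpu_core/EXE_ACTIVITY.1_PORTS_UTIL/,"
    "cpu_core/TOPDOWN.SLOTS/,cpu_core/EXE_ACTIVITY.BOUND_ON_LOADS/,"
    "cpu_core/topdown-retiring/,cpu_core/topdown-mem-bound/,"
    "cpu_core/topdown-bad-spec/,CPU_CLK_UNHALTED.PAUSE,"
    "cpu_core/CYCLE_ACTIVITY.STALLS_TOTAL/,cpu_core/CPU_CLK_UNHALTED.THREAD/,"
    "cpu_core/ARITH.DIV_ACTIVE/,cpu_core/topdown-fe-bound/,"
    "cpu_core/EXE_ACTIVITY.2_PORTS_UTIL,umask=0xc/,"
    "cpu_core/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/,"
    "cpu_core/topdown-be-bound/}:W"
)

print(needs_no_group_events_nmi(split_group(GROUP)))
```

For the group above this counts 9 non-topdown events including
CPU_CLK_UNHALTED.THREAD, so the group is flagged.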

The fixes have all been sent out; see the following and its links:
https://lore.kernel.org/lkml/20240104231903.775717-1-irogers@google.com/

Thanks,
Ian

> >
> > Checking on Tigerlake I see:
> > ```
> > $ perf stat -M tma_slow_pause -a sleep 0.1
> >
> > Performance counter stats for 'system wide':
> >
> >       105,210,913      TOPDOWN.SLOTS                               #      0.1 %  tma_slow_pause  (72.65%)
> >         6,701,129      topdown-retiring                            (72.65%)
> >        52,359,712      topdown-fe-bound                            (72.65%)
> >        32,904,532      topdown-be-bound                            (72.65%)
> >        14,117,814      topdown-bad-spec                            (72.65%)
> >         6,602,391      RESOURCE_STALLS.SCOREBOARD                  (76.17%)
> >         4,220,773      cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/   (76.73%)
> >           421,812      EXE_ACTIVITY.BOUND_ON_STORES                (76.69%)
> >         5,164,088      EXE_ACTIVITY.1_PORTS_UTIL                   (76.70%)
> >           299,681      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/  (76.69%)
> >               245      MISC_RETIRED.PAUSE_INST                     (76.67%)
> >        58,403,687      CPU_CLK_UNHALTED.THREAD                     (76.72%)
> >        25,297,841      CYCLE_ACTIVITY.STALLS_MEM_ANY               (76.67%)
> >         3,788,772      EXE_ACTIVITY.2_PORTS_UTIL                   (62.69%)
> >        20,973,875      CYCLE_ACTIVITY.STALLS_TOTAL                 (62.16%)
> >            68,053      ARITH.DIVIDER_ACTIVE                        (62.18%)
> >
> >       0.102624327 seconds time elapsed
> > ```
> > so 10 generic counters which would never fit and the weak group is
> > broken - the difference in the metric explaining why I've not been
> > seeing the issue. I think I need to add alderlake/sapphirerapids
> > constraints here:
> > https://github.com/captain5050/perfmon/blob/main/scripts/create_perf_json.py#L1382
> > Ideally we'd automate the constraint generation (or the PMU driver
> > would help us out by failing to open the weak group).
>
> Yes, automation would be great. The NO_GROUP_EVENTS_NMI can be set for
> a group which has CPU_CLK_UNHALTED.THREAD and the number of core events
> (except topdown) == the max number of GP counters + 1.
>
> Thanks,
> Kan
> >
> > Thanks,
> > Ian
> >
> >
> >> Thanks,
> >> Kan
> >>
> >>> Fwiw, if we
> >>> switch to the buddy watchdog mechanism then we'll no longer need to
> >>> disable the NMI watchdog:
> >>> https://lore.kernel.org/lkml/20230421155255.1.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid/
> >>
> >