linux-kernel - Re: [PATCH v4 00/23] Intel vendor events and TMA 5.01 metrics

Message-ID: <CAP-5=fW7gaskkJKnJcsfe0UO0V0wDsQaa-2NSNr3ho9urUeKBw@mail.gmail.com>
Date: Wed, 5 Feb 2025 08:35:00 -0800
From: Ian Rogers <irogers@...gle.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>
Cc: "Falcon, Thomas" <thomas.falcon@...el.com>, "Baker, Edward" <edward.baker@...el.com>, 
	"alexander.shishkin@...ux.intel.com" <alexander.shishkin@...ux.intel.com>, 
	"Biggers, Caleb" <caleb.biggers@...el.com>, "mpetlan@...hat.com" <mpetlan@...hat.com>, 
	"Taylor, Perry" <perry.taylor@...el.com>, "Hunter, Adrian" <adrian.hunter@...el.com>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "mingo@...hat.com" <mingo@...hat.com>, 
	"linux-perf-users@...r.kernel.org" <linux-perf-users@...r.kernel.org>, 
	"manivannan.sadhasivam@...aro.org" <manivannan.sadhasivam@...aro.org>, 
	"peterz@...radead.org" <peterz@...radead.org>, "Alt, Samantha" <samantha.alt@...el.com>, 
	"mark.rutland@....com" <mark.rutland@....com>, "Wang, Weilin" <weilin.wang@...el.com>, 
	"acme@...nel.org" <acme@...nel.org>, "afaerber@...e.de" <afaerber@...e.de>, 
	"jolsa@...nel.org" <jolsa@...nel.org>, "namhyung@...nel.org" <namhyung@...nel.org>
Subject: Re: [PATCH v4 00/23] Intel vendor events and TMA 5.01 metrics

On Wed, Feb 5, 2025 at 7:47 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
> On 2025-02-04 11:58 p.m., Ian Rogers wrote:
> > On Tue, Feb 4, 2025 at 8:28 PM Falcon, Thomas <thomas.falcon@...el.com> wrote:
> >>
> >> On Tue, 2025-02-04 at 13:35 -0800, Ian Rogers wrote:
> >>> On Tue, Feb 4, 2025 at 1:33 PM Ian Rogers <irogers@...gle.com> wrote:
> >>>>
> >>>> Update the Intel vendor events to the latest.
> >>>> Update the metrics to TMA 5.01.
> >>>> Add Arrowlake and Clearwaterforest support.
> >>>> Add metrics for LNL and GNR.
> >>>> Address IIO uncore issue spotted on EMR, GRR, GNR, SPR and SRF.
> >>>>
> >>>> The perf json was generated using the script:
> >>>> https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py
> >>>> with the generated json being in:
> >>>> https://github.com/intel/perfmon/tree/main/scripts/perf
> >>>>
> >>>> Thanks to Perry Taylor <perry.taylor@...el.com>, Caleb Biggers
> >>>> <caleb.biggers@...el.com>, Edward Baker <edward.baker@...el.com> and
> >>>> Weilin Wang <weilin.wang@...el.com> for helping get this patch series
> >>>> together.
> >>>>
> >>>> v4: Fix TSC events on hybrid mistakenly specifying the core PMU
> >>>>     inhibiting the use of the msr PMU.
> >>>> v3: Fixes for hybrid metrics that were missing PMU. Update to the
> >>>>     latest events.
> >>>> v2: Fix hybrid and Co-authored-by tag issues reported by
> >>>>     Arnaldo. Updates to Lunarlake and Meteorlake events. Addition of
> >>>>     Clearwaterforest.
> >>>
> >>> Sorry, forgot to add Thomas again.
> >>> https://lore.kernel.org/lkml/20250204213259.127939-1-irogers@google.com/
> >>
> >> Hi, I'm seeing some warnings like this and the all metrics test is
> >> skipped:
> >>
> >> Testing tma_info_inst_mix_iparith
> >> FP issues
> >> Cannot resolve IDs for tma_info_inst_mix_iparith:
> >> cpu_core@...T_RETIRED.ANY@ / (cpu_core@...ARITH_INST_RETIRED.SCALAR@ +
> >> cpu_core@...ARITH_INST_RETIRED.VECTOR@)
> >> Testing tma_info_inst_mix_iparith_avx128
> >> FP issues
> >> Cannot resolve IDs for tma_info_inst_mix_iparith_avx128:
> >> cpu_core@...T_RETIRED.ANY@ /
> >> (cpu_core@...ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ +
> >> cpu_core@...ARITH_INST_RETIRED.128B_PACKED_SINGLE@)
> >> Testing tma_info_inst_mix_iparith_avx256
> >> FP issues
> >> Cannot resolve IDs for tma_info_inst_mix_iparith_avx256:
> >> cpu_core@...T_RETIRED.ANY@ /
> >> (cpu_core@...ARITH_INST_RETIRED.256B_PACKED_DOUBLE@ +
> >> cpu_core@...ARITH_INST_RETIRED.256B_PACKED_SINGLE@)
> >> Testing tma_info_inst_mix_iparith_scalar_dp
> >> FP issues
> >> Cannot resolve IDs for tma_info_inst_mix_iparith_scalar_dp:
> >> cpu_core@...T_RETIRED.ANY@ /
> >> cpu_core@...ARITH_INST_RETIRED.SCALAR_DOUBLE@
> >> Testing tma_info_inst_mix_iparith_scalar_sp
> >> FP issues
> >> Cannot resolve IDs for tma_info_inst_mix_iparith_scalar_sp:
> >> cpu_core@...T_RETIRED.ANY@ /
> >> cpu_core@...ARITH_INST_RETIRED.SCALAR_SINGLE@
> >
> > Thanks Tom, we've gone from a fail to skip - so progress! I think it
> > actually isn't something to worry about. These metrics are measuring
> > vector and floating point things. We run a workload, when testing the
> > metrics, that doesn't have floating point and vector operations. This
> > causes issues with metrics for these instructions as the counters
> > don't count anything. Because of this I added some logic to just skip
> > when we see these failures:
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell/stat_all_metrics.sh?h=perf-tools-next#n51
> > but a better fix would be to have a workload with FP and AMX operations.
> >
> > You could test these metrics work manually, by running something like:
> > $ perf stat -M tma_info_inst_mix_iparith <benchmark>
> > where <benchmark> would need to contain FP or AMX instructions.
> >
>
> It should be OK to skip the "FP issues", but the "Cannot resolve IDs"
> seems to be a different issue.
>
> I found a similar error when I ran perf stat on my Arrow Lake machine.
>
> $ sudo ./perf stat
> Cannot resolve IDs for tma_memory_bound: topdown\-mem\-bound /
> (topdown\-fe\-bound + topdown\-bad\-spec + topdown\-retiring +
> topdown\-be\-bound) + 0 * slots
>
> I think the warning is because perf doesn't find all the matched events.
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/metricgroup.c?h=tmp.perf-tools-next#n326
>
>
> Then I checked the event list of tma_memory_bound.
> For cpu_atom, it requires slots and topdown-mem-bound, which should
> only be available on the p-core.
>
> +    {
> +        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> +        "DefaultMetricgroupName": "TopdownL2",
> +        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
> +        "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> +        "MetricName": "tma_memory_bound",
> +        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
> +        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
> +        "ScaleUnit": "100%",
> +        "Unit": "cpu_atom"
> +    },
>
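A minimal sketch of how such entries could be spotted in a generated metrics file, assuming the JSON has already been loaded into a list of dicts. The helper names and the P_CORE_ONLY_EVENTS set are hypothetical; the set holds only the two events named above, and the tokenizer is a simplification that does not handle every form a perf metric expression can take.

```python
import re

# Events stated in this thread to be available only on the p-core.
# Hypothetical, illustrative set; not an exhaustive list.
P_CORE_ONLY_EVENTS = {"slots", "topdown-mem-bound"}


def events_in_expr(expr: str) -> set:
    """Return the identifier-like names used in a metric expression.

    An escaped "\\-" is treated as part of a name, while a bare "-" is
    treated as an operator and ends the name.
    """
    tokens = re.findall(r"[A-Za-z_](?:[A-Za-z0-9_.]|\\-)*", expr)
    return {t.replace("\\-", "-") for t in tokens}


def atom_metrics_using_pcore_events(metrics: list) -> list:
    """List (metric name, offending events) for cpu_atom metrics."""
    problems = []
    for metric in metrics:
        if metric.get("Unit") != "cpu_atom":
            continue
        bad = events_in_expr(metric.get("MetricExpr", "")) & P_CORE_ONLY_EVENTS
        if bad:
            problems.append((metric["MetricName"], sorted(bad)))
    return problems
```

Run over the entry quoted above, this would flag tma_memory_bound for both slots and topdown-mem-bound.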
> I haven't yet checked tma_info_inst_mix_iparith, which Thomas mentioned
> above, but I suspect it is the same issue.
>
> The perf test may have to error out when the "Cannot resolve IDs"
> message is detected.
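A minimal sketch of that ordering, assuming the test has the log text for one metric available as a string, in the form shown in Thomas's report. The function name and the return labels are hypothetical; the real test is the shell script stat_all_metrics.sh, so this only illustrates that "Cannot resolve IDs" should be checked before the "FP issues" skip is considered.

```python
def classify_metric_log(log: str) -> str:
    """Classify the log for one metric as "fail", "skip" or "pass".

    Hypothetical helper for illustration; not part of perf.
    """
    # An unresolved event ID means the metric is broken for this PMU,
    # so it takes priority over any skip condition.
    if "Cannot resolve IDs" in log:
        return "fail"
    # FP/vector counters that stay at zero on a non-FP workload are benign.
    if "FP issues" in log:
        return "skip"
    return "pass"
```

With this ordering, the log entries above that contain both "FP issues" and "Cannot resolve IDs" would be reported as failures rather than skipped.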

Agreed. I think I see the issue. We're adding the extra/Valkyrie
metrics to both core types during conversion; a quick fix is to just
skip doing this for atom. I'll cut a v5 with this.
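Roughly, the quick fix amounts to something like the sketch below. This is not the code in create_perf_json.py; the function name, the argument shapes and the skip_pmus parameter are hypothetical, and only the "Unit" field and the cpu_core/cpu_atom PMU names come from the JSON discussed in this thread.

```python
def add_extra_metrics(metrics_by_pmu: dict, extra_metrics: list,
                      skip_pmus: tuple = ("cpu_atom",)) -> dict:
    """Return per-PMU metric lists with the extra metrics appended.

    PMUs listed in skip_pmus keep their original metrics untouched, so
    metrics that rely on p-core-only events are not emitted for them.
    """
    result = {}
    for pmu, metrics in metrics_by_pmu.items():
        merged = list(metrics)  # copy so the caller's lists are not mutated
        if pmu not in skip_pmus:
            # Tag each extra metric with the PMU it is being added for.
            merged.extend({**m, "Unit": pmu} for m in extra_metrics)
        result[pmu] = merged
    return result
```

For example, passing {"cpu_core": [], "cpu_atom": []} with tma_memory_bound as the only extra metric would yield it under cpu_core alone.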

Thanks,
Ian