linux-kernel - Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving utilization on Intel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fWwts12ucSAXBHW7-Q+Asm1GUnT7tTPZ3_D5xhHMZ-8ig@mail.gmail.com>
Date: Thu, 7 Nov 2024 13:00:23 -0800
From: Ian Rogers <irogers@...gle.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Perry Taylor <perry.taylor@...el.com>, 
	Samantha Alt <samantha.alt@...el.com>, Caleb Biggers <caleb.biggers@...el.com>, 
	Weilin Wang <weilin.wang@...el.com>, Edward Baker <edward.baker@...el.com>
Subject: Re: [PATCH v4 09/22] perf jevents: Add ports metric group giving
 utilization on Intel

On Thu, Nov 7, 2024 at 11:36 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
> On 2024-11-07 12:12 p.m., Ian Rogers wrote:
> > On Thu, Nov 7, 2024 at 7:00 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>
> >>
> >> On 2024-09-26 1:50 p.m., Ian Rogers wrote:
> >>> The ports metric group contains a metric for each port giving its
> >>> utilization as a ratio of cycles. The metrics are created by looking
> >>> for UOPS_DISPATCHED.PORT events.
> >>>
> >>> Signed-off-by: Ian Rogers <irogers@...gle.com>
> >>> ---
> >>>  tools/perf/pmu-events/intel_metrics.py | 33 ++++++++++++++++++++++++--
> >>>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/tools/perf/pmu-events/intel_metrics.py b/tools/perf/pmu-events/intel_metrics.py
> >>> index f4707e964f75..3ef4eb868580 100755
> >>> --- a/tools/perf/pmu-events/intel_metrics.py
> >>> +++ b/tools/perf/pmu-events/intel_metrics.py
> >>> @@ -1,12 +1,13 @@
> >>>  #!/usr/bin/env python3
> >>>  # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
> >>>  from metric import (d_ratio, has_event, max, CheckPmu, Event, JsonEncodeMetric,
> >>> -                    JsonEncodeMetricGroupDescriptions, LoadEvents, Metric,
> >>> -                    MetricGroup, MetricRef, Select)
> >>> +                    JsonEncodeMetricGroupDescriptions, Literal, LoadEvents,
> >>> +                    Metric, MetricGroup, MetricRef, Select)
> >>>  import argparse
> >>>  import json
> >>>  import math
> >>>  import os
> >>> +import re
> >>>  from typing import Optional
> >>>
> >>>  # Global command line arguments.
> >>> @@ -260,6 +261,33 @@ def IntelBr():
> >>>                       description="breakdown of retired branch instructions")
> >>>
> >>>
> >>> +def IntelPorts() -> Optional[MetricGroup]:
> >>> +  pipeline_events = json.load(open(f"{_args.events_path}/x86/{_args.model}/pipeline.json"))
> >>> +
> >>> +  core_cycles = Event("CPU_CLK_UNHALTED.THREAD_P_ANY",
> >>> +                      "CPU_CLK_UNHALTED.DISTRIBUTED",
> >>> +                      "cycles")
> >>> +  # Number of CPU cycles scaled for SMT.
> >>> +  smt_cycles = Select(core_cycles / 2, Literal("#smt_on"), core_cycles)
> >>> +
> >>> +  metrics = []
> >>> +  for x in pipeline_events:
> >>> +    if "EventName" in x and re.search("^UOPS_DISPATCHED.PORT", x["EventName"]):
> >>> +      name = x["EventName"]
> >>> +      port = re.search(r"(PORT_[0-9].*)", name).group(0).lower()
> >>> +      if name.endswith("_CORE"):
> >>> +        cyc = core_cycles
> >>> +      else:
> >>> +        cyc = smt_cycles
> >>> +      metrics.append(Metric(port, f"{port} utilization (higher is better)",
> >>> +                            d_ratio(Event(name), cyc), "100%"))
> >>
> >> The generated metric highly depends on the event name, which is very
> >> fragile. We will probably have the same event in a new generation, but
> >> with a different name. Long-term maintenance could be a problem.
> >> Is there an idea regarding how to sync the event names for new generations?
> > I agree with the idea that it is fragile, it is also strangely robust
> > as you say, new generations will gain support if they follow the same
> > naming convention. We have tests that load bearing metrics exists on
> > our platforms so maybe the appropriate place to test for existence is
> > in Weilin's metrics test.
> >
> >
> >> Maybe we should improve the event generation script and do an automatic
> >> check to tell which metrics are missed. Then we may decide if updating
> >> the new event name, dropping the metric or adding a different metric.
> > So I'm not sure it is a bug to not have the metric, if it were we
> > could just throw rather than return None. We're going to run the
> > script for every model including old models like nehalem, so I've
> > generally kept it as None. I think doing future work on testing is
> > probably best. It would also indicate use of the metric if people
> > notice it missing (not that the script aims for that 🙂 ).
>
> The maintenance is still a concern, even if we have a way to test it
> out. There is already an "official" metric published in GitHub, which is
> maintained by Intel. To be honest, I don't think there is more energy to
> maintain these "non-official" metrics.
>
> I don't think it should be a bug without these metrics. So it's very
> likely that the issue will not be addressed right away. If we cannot
> keep these metrics updated for the future platforms, I couldn't find a
> reason to have them.

So I think there are a few things:
1) I'd like there to be a non-json infrastructure for events that can
handle multiple models. Some failings of json are its inability to
validate events, long lines, lack of comments, metric expression
strings that aren't inherently sound, etc.  I'd like to make it so we
can have json metrics for everything, ie remove the hardcoded metrics
that play badly with event sharing, etc. Doing this by updating every
json would be tedious and excessively noisy.
2) There are "official" metrics from Intel and I've worked for the
establishment of that. That doesn't mean every Intel metric is in the
official metrics. Servers are better served than client machines. Core
TMA metrics are well served but uncore less so.
3) Are perf metrics perfect with some kind of warranty? Well no, as
your reviews in this thread have shown SMI cost on hybrid is likely
broken. We don't intentionally try to have broken metrics and fix them
asap when they come up. GPLv2 has an explicit "no warranty" section.
Now Intel have experimental and non-experimental events, we update the
comments of metrics using those to highlight that the underlying
events are experimental:
https://github.com/googleprodkernel/linux-perf/blob/google_tools_master/tools/perf/pmu-events/metric.py#L598
If there are bugs in the metrics then open source, sharing and fixing
benefits everyone.
4) Am I looking for energy from Intel to maintain these metrics? No.
I'm trying to stop carrying the patches just inside of Google's tree.

Thanks,
Ian