linux-kernel - Re: [PATCH v5 4/4] perf parse-events: Reapply "Prefer sysfs/JSON hardware events over legacy"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fWzzWqNAgmrDHav63Z+HMnSP0RZJ3Q7PQpuzP7Tf_HP7g@mail.gmail.com>
Date: Sat, 1 Feb 2025 00:45:04 -0800
From: Ian Rogers <irogers@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Atish Kumar Patra <atishp@...osinc.com>, Peter Zijlstra <peterz@...radead.org>, 
	Ingo Molnar <mingo@...hat.com>, Arnaldo Carvalho de Melo <acme@...nel.org>, Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, Kan Liang <kan.liang@...ux.intel.com>, 
	James Clark <james.clark@...aro.org>, Ze Gao <zegao2021@...il.com>, 
	Weilin Wang <weilin.wang@...el.com>, Dominique Martinet <asmadeus@...ewreck.org>, 
	Jean-Philippe Romain <jean-philippe.romain@...s.st.com>, Junhao He <hejunhao3@...wei.com>, 
	linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org, 
	bpf@...r.kernel.org, Aditya Bodkhe <Aditya.Bodkhe1@....com>, Leo Yan <leo.yan@....com>, 
	Beeman Strong <beeman@...osinc.com>, Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [PATCH v5 4/4] perf parse-events: Reapply "Prefer sysfs/JSON
 hardware events over legacy"

On Fri, Jan 31, 2025 at 2:42 PM Namhyung Kim <namhyung@...nel.org> wrote:
>
> On Wed, Jan 29, 2025 at 10:12:14PM -0800, Atish Kumar Patra wrote:
> > On Wed, Jan 29, 2025 at 1:55 PM Namhyung Kim <namhyung@...nel.org> wrote:
> > >
> > > On Wed, Jan 15, 2025 at 01:20:32PM -0800, Ian Rogers wrote:
> > > > On Wed, Jan 15, 2025 at 9:59 AM Namhyung Kim <namhyung@...nel.org> wrote:
> > > > >
> > > > > > On Mon, Jan 13, 2025 at 2:51 PM Ian Rogers <irogers@...gle.com> wrote:
> > > > > > > There was an explicit, and reviewed by Jiri and Arnaldo, intent with
> > > > > > > the hybrid work that using a legacy event with a hybrid PMU, even
> > > > > > > though the PMU doesn't advertise through json or sysfs the legacy
> > > > > > > event, the perf tool supports it.
> > > > >
> > > > > I thought legacy events on hybrid were converted to PMU events.
> > > >
> > > > No, when BIG.little was created nothing changed in perf events but
> > > > when Intel did hybrid they wanted to make the hybrid CPUs (atom and
> > > > performance) appear as if they were one type. The PMU event encodings
> > > > vary a lot for this on Intel, ARM has standards for the encoding.
> > > > Intel extended the legacy format to take a PMU type id:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/include/uapi/linux/perf_event.h?h=perf-tools-next#n41
> > > > "EEEEEEEE: PMU type ID"
> > > > that is in the top 32-bits of the config.
> > >
> > > Oh right, I forgot the extended type thing.  Then we can keep the legacy
> > > encoding with it on hybrid systems when users give well-known names (w/o
> > > PMU) for legacy event.
> > >
> > > >
> > > > > > >
> > > > > > > Making it so that events without PMUs are only legacy events just
> > > > > > > doesn't work. There are far too many existing uses of non-legacy
> > > > > > > events without PMU, the metrics contain 100s of examples.
> > > > >
> > > > > That's unfortunate.  It'd be nice if metrics were written with PMU
> > > > > names.
> > > >
> > > > But then we'd end up with things like on Intel:
> > > > UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD
> > > > becoming:
> > > > uncore_cha/UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD/
> > > > or just:
> > > > cha/UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD/
> > > > As a user the first works for me and doesn't have any ambiguity over
> > > > PMUs as the event name already encodes the PMU. AMD similarly place
> > > > the part of a pipeline into event names. Were we to break everybody by
> > > > requiring the PMU we'd also need to explain which PMU to use. Sites
> > > > with event lists (like https://perfmon-events.intel.com/) don't
> > > > explain the PMU and it'd be messy as on Intel you have a CHA PMU for
> > > > server chips but a CBOX on client chips, etc.
> > >
> > > While I prefer having PMU names in the JSON events/metrics, it may not
> > > be pratical to change them all.  Probably we can allow them without PMU
> > > and hope that they have unique prefixes.
> > >
> > > >
> > > > > I have a question.  What if an event name in a metric matches to
> > > > > multiple unrelated PMUs?
> > > >
> > > > The metric may break or we'd aggregate the unrelated counts together.
> > >
> > > Ok, then they should use unique names.
> > >
> > >
> > > > Take a metric like IPC as "instructions/cycles", that metric should
> > > > work on a hybrid system as they have instructions and cycles. If you
> > > > used an event for instructions like inst_retired.any then maybe the
> > > > metric will fail on one kind of core that didn't have that event. Now
> > >
> > > The metrics is for specific CPU model then the vendor should be
> > > responsible to provide accurate metrics using approapriate PMU/events
> > > IMHO.
> > >
> > >
> > > > if we have accelerators advertising instructions and cycles events, we
> > > > should be able to compute the metric for the accelerator. What could
> > > > happen today is that the accelerator will have a cpumask of a single
> > > > CPU, we could aggregate the accelerator counter into the CPU event
> > > > with the same CPU as the cpumask, we'd end up with a weird quasi CPU
> > > > and accelerator IPC metric for that CPU. What should happen is that we
> > > > get an IPC for the accelerator and IPC for each hybrid core
> > > > independently, but the way we handle evsels, CPUs, PMUs is not really
> > > > set up for that. Hopefully getting a set of PMUs into the evsel will
> > > > clear that up. Assuming all of that is cleared up, is it wrong if the
> > > > IPC metric is computed for the accelerator if it was originally
> > > > written as a CPU metric? Not really. Could there be metrics where that
> > > > is the case?
> > >
> > > Yes, I think there should be separate metrics for the accelerators.
> > >
> > >
> > > > Probably, and specifying PMUs in the event names would be
> > > > a fix. There have also been proposals that we restrict the PMUs for
> > > > certain metrics. As event names are currently so distinct it isn't a
> > > > problem we've faced yet and it is not clear it is a problem other than
> > > > highlighting tech debt in areas of the tool like aggregation.
> > > >
> > > > > > >
> > > > > > > Prior to switching json/sysfs to being the priority when a PMU is
> > > > > > > specified, it was the case that all encodings were the same, with or
> > > > > > > without a PMU.
> > > > > > >
> > > > > > > I don't think there is anything natural about assuming things about
> > > > > > > event names. Take cycles, cpu-cycles and cpu_cycles:
> > > > > > >  - cycles on x86 is only encoded via a legacy event;
> > > > > > >  - cpu-cycles on Intel exists as a sysfs event, but cpu-cycles is also
> > > > > > > a legacy event name;
> > > > > > >  - cpu_cycles exists as a sysfs event on ARM but doesn't have a
> > > > > > > corresponding legacy event name.
> > > > >
> > > > > I think the behavior should be:
> > > > >
> > > > >   cycles -> PERF_COUNT_HW_CPU_CYCLES
> > > > >   cpu-cycles -> PERF_COUNT_HW_CPU_CYCLES
> > > > >   cpu_cycles -> no legacy -> sysfs or json
> > > > >   cpu/cycles/ -> sysfs or json
> > > > >   cpu/cpu-cycles/ -> sysfs or json
> > > >
> > > > So I disagree as if you add a PMU to an event name the encoding
> > > > shouldn't change:
> > > > 1) This historically was perf's behavior.
> > >
> > > Well.. I'm not sure about the history.  I believe the logic I said above
> > > is the historic and (I think) right behavior.
> > >
> > > > 2) Different event encodings can have different behaviors (broken in
> > > > some notable cases).
> > >
> > > Yep, let's make it clear.
> > >
> > > > 3) Intuitively what wildcarding does is try to open "*/event/" where *
> > > > is every possible PMU name. Having different event encodings is
> > > > breaking that intuition it could also break situations where you try
> > > > to assert equivalence based on type/config.
> > >
> > > While I don't like the wildcard matching, I think it doesn't matter as
> > > long as we keep the above behavior.  If it can find a legacy name, then
> > > go with it, done.  If not, try all PMUs as if it's given with PMU name
> > > in the event.
> > >
> > > > 4) The legacy encodings were (are?) broken on ARM Apple M? CPUs,
> > > > that's why the priority was changed.
> > >
> > > I guess that why they use cpu_cycles.
> > >
> > > > 5) RISC-V would like the tool tackle the legacy to config mapping
> > > > challenge, rather than the PMU driver, given the potential diversity
> > > > of hardware implementations.
> > >
> > > I hope they can find a better solution. :)
> > >
> >
> > Sorry for reposing. Gmail converted it to html for some reason.
> >
> > I have posted the latest support here.
> > https://lore.kernel.org/kvm/20250127-counter_delegation-v3-12-64894d7e16d5@rivosinc.com/T/
> >
> > As of now, we have adopted a hybrid approach where a vendor can decide
> > whether to encode the legacy events
> > in the json or in the driver (if this series is merged). In absence of
> > that, every vendor has to define it in the driver.
> > We will deal with the fall out of the exploding driver when the
> > situation arrives.
>
> I don't know how hard it'd be cause I'm not familiar with RISC-V.  But
> basically you only need to maintain 9 legacy encodings (PERF_COUNT_HW_*)
> and a few dozen combinations of supported cache events (PERF_COUNT_HW_
> CACHE_*) for each vendor.  All others can go to json anyway.
>
> I think this is what all other archs (including x86) do.

This is well known to the people involved.

While the PMU driver needs to encode or avoid these event names, they
become special "legacy" names inside the perf tool. Magically a name
like cpu_cycles will wildcard match (match on >1 PMU) whilst a name
like cpu-cycles won't (only matching on core PMUs). This is completely
confusing to users. It is even more confusing when you are saying the
tool should intentionally use two different encodings.

The perf event enum types are limited but the tool recognizes more
event names and then uses legacy encodings. I have yet to hear a
sensible list of what are legacy event names, is cpu-cycles in there
or just cycles? Why on earth would you want to keep synonyms like LLC
meaning L2 cache?

The intention with "pmu syntax" for events is that the PMU clarifies
the type in the perf_event_attr. Previously it was assumed that the
PMU type would be raw (4), and the x86 PMUs even use that as their
type number. Pretending these days we don't now have hybrid core PMUs,
10s of uncore PMUs. Doing that work had to reinvent event parsing and
encoding.

If you look at the matching as it is today:
cpu_cycles -> tries to match on all PMUs
*/cpu_cycles/ -> tries to match on all PMUs
arm*/cpu_cycles/ -> tries matches on all PMUs that have arm at the start
armv8_pmuv3/cpu_cycles/ -> matches only the armv8_pmuv3 PMU

I don't see why it isn't obvious that the behavior of no PMU and the
PMU being * is expected to be exactly the same - it really is today
and that is what the code does, please try it. There just isn't a
notion of not having a PMU because even for legacy events we have to
reinvent the PMUs to inject the correct extended type information
otherwise we'd profile just a fraction of the cores. We add PMUs when
we display events to make the events more readable. There isn't a
notion of these events being legacy and not, they are just assumed to
be the same, PMU or not.

As I've explained to you, I plan to transition the metric code to use
event parsing and to union evlists rather than use strings and hash
tables. This is to fix tracepoints appearing incorrectly to always
have suffixes in the "metric-id" calculation. Recognizing modifiers
properly would end up reinventing event parsing, so let's just make
use of what we have and parse events early. It makes sense when
unioning evsels in an evlist to do it off of the perf_event_attr, this
will allow Intel's slots and topdown.slots to be correctly detected as
aliases in metrics, something of a pain in formulas today. Why would
the behavior of an event like cycles be different in non-hybrid
metrics (where PMUs generally aren't specified) and in hybrid metrics
(where PMUs generally are specified)? Events may not be recognized as
aliases because ones without a PMU in the metric will get a legacy
encoding. In your change:
https://lore.kernel.org/r/20221018020227.85905-16-namhyung@kernel.org
you assume all events with the same name are in fact the same event,
but that is making wild assumptions about what is placed in the evsel
name and I am trying to fix it in:
https://lore.kernel.org/lkml/20250201074320.746259-1-irogers@google.com/
You did similar with your proposal for hwmon events and I rejected it.
The fact that the name term in an event configuration clobbers an
evsel's name, its just the intent of the thing and the name was never
supposed to have some sacred legacy or whatever meaning.

I still see no sense in:
perf stat -e cpu_cycles ...
meaning:
perf stat -e */cpu_cycles/ ...
and:
perf stat -e cpu-cycles ...
trying to mean close to:
perf stat -e cpu/cpu-cycles/ ...
why one is implicitly a * and the other a core PMU, I mean it is the
definition of confusing. And in the latter cpu-cycles case you want
those two events to be encoded differently.

All of this is overlooking that we have 1 event that is a problem on 1
PMU on 1 architecture. If it weren't for that event we'd already have
this patch landed and consistent event encodings. By not taking the
patch it hurts Apple M, RISC-V users and my own work.

Please can you explain why keeping the current encoding is good and if
we like legacy events so much, can we revert the changes to prioritize
sysfs/json when a PMU name is present. I'm afraid what you are
explaining makes no sense to me, breaks existing platforms (Apple M)
and is a blockage to future work. Saying everyone should rewrite
everything, that's not a workable solution - not least because in some
situations (old PMU drivers on Apple M) we lack a time machine.

Thanks,
Ian