[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fWk-eDfuRH-tL5TWU8dXumOnCTKby5VKonOfjGad4TG=Q@mail.gmail.com>
Date: Sat, 25 May 2024 23:22:08 -0700
From: Ian Rogers <irogers@...gle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Namhyung Kim <namhyung@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, Leo Yan <leo.yan@...ux.dev>,
Mark Rutland <mark.rutland@....com>, Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>, Clark Williams <williams@...hat.com>,
Kate Carcia <kcarcia@...hat.com>, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, Anne Macedo <retpolanne@...teo.net>,
Bhaskar Chowdhury <unixbhaskar@...il.com>, Ethan Adams <j.ethan.adams@...il.com>,
James Clark <james.clark@....com>, Kan Liang <kan.liang@...ux.intel.com>,
Thomas Richter <tmricht@...ux.ibm.com>, Tycho Andersen <tycho@...ho.pizza>,
Yang Jihong <yangjihong@...edance.com>
Subject: Re: [GIT PULL] perf tools changes for v6.10
On Sat, May 25, 2024 at 10:22 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Sat, 25 May 2024 at 16:34, Ian Rogers <irogers@...gle.com> wrote:
> >
> > So I think we still need to figure out what:
> >
> > $ perf <command> -e <event> ...
> >
> > where <event> doesn't specify a PMU means. I'll try to enumerate the options:
>
> [ snip snip ]
>
> How about make the rule be that if the event doesn't have a specified
> PMU, then that just means "legacy rules first".
>
> IOW, if you have a fully qualified event name (maybe define that as
> "event name contains a slash), then you use the sysfs lookup.
>
> But a simple event name that doesn't contain a slash shall mean "use
> legacy lookup rules".
What to do with events with no PMU like data_read? On my Intel tigerlake laptop:
```
$ ls /sys/devices/*/events/data_read
/sys/devices/uncore_imc_free_running_0/events/data_read
/sys/devices/uncore_imc_free_running_1/events/data_read
$ perf --version
perf version 6.6.15
$ sudo perf stat --no-merge -e data_read -a sleep 0.1
Performance counter stats for 'system wide':
122.84 MiB data_read [uncore_imc_free_running_0]
123.00 MiB data_read [uncore_imc_free_running_1]
0.101826301 seconds time elapsed
```
The rule for looking up an event with no PMU specified is to try it on
every PMU - the rule is about as old as perf itself. For heterogeneous
systems (BIG.little, hybrid) legacy events (cycles, instructions..)
will be opened on every "core" PMU before the change. The behavior now
is to just be consistent and say when no PMU is specified we always
search all PMUs. This is motivated by ARM, RISC-V, .. wanting legacy
events to be a last resort if sysfs or json encodings can't be found.
> Maybe in practice that ends up being the same as your option #4 ("if
> the PMU isn't specified with <event> then we only search core PMUs")?
> I don't know the perf code well enough to be able to say.
>
> But basically, the #1 rule in the kernel is that we do not break user
> workflows. I happen to think that that is a really important rule, and
> I'm disgusted at how many other open source projects ignore that rule
> and think that "in the name of improvement, we will break the world".
So the problem was that the world was broken. It still is. Find a
BIG.little ARM and try "perf stat -a sleep 1", the counts reported
will be for either the BIG or the little cores. BIG.little is a decade
old now.
The switch to prioritize sysfs/JSON events over legacy events was
specifically so that we wouldn't break Apple M1 and newer CPUs when
making ARM properly support BIG.little when an event is specified. I
think perf was broken over 2 kernel releases on Apple devices because
of that. I had to fix issues with Apple M1 and later while not having
access to such a machine. Frankly life would have been easier if ARM
had just fixed the driver.
> And as long as "perf" is maintained in the kernel sources, that kernel
> rule will guide perf too. Because the rule is not so much "kernels are
> special" as a "Linus wants people to be able to feel confident in
> updating".
Sure, me too. This wouldn't have been an issue if Ampere had chosen a
different event name than cycles. Often the driver writers don't test
or consider perf. For example, perf will treat PMUs with a number at
the end as being something to merge together in counts. On my laptop,
uncore_imc_free_running_0 and uncore_imc_free_running_1 were merged in
the data_read example above. ARM decided to start a new convention of
putting hex addresses as suffixes many Linux releases ago, I've sent
patches to make these encodings work like the numeric case and we may
get it landed in v6.11 (a big problem being S390 already had PMUs like
cpum_cf and cf is a valid hex suffix):
https://lore.kernel.org/lkml/20240515060114.3268149-1-irogers@google.com/
In doing that work the code was tested by IBM for S390 and by Intel,
but it was fixing an ARM created problem. ARM were the first to have
BIG.little systems but contributed nothing to the perf tool to handle
it, even though each core type has a different PMU. ARM BIG.little
remains broken with the perf tool and when I fix it for them they
don't review or test the code. ARM changed and left unworking uncore
PMU naming conventions. ARM don't fix tests for their platform. ARM
don't help make perf's tests cover their different way of naming PMUs.
No one is trying to break ARM machines, but when ARM fails to do
anything other than review their own changes in the perf tree it is
something of an inevitability.
Fwiw, I am working on making perf record, perf top, etc. skip events
on non-core PMUs when they fail to open. It is a rather large and ugly
change. It is also a holiday weekend and I'm spending a lot of my time
in it addressing latent ARM problems.
Thanks,
Ian
> Linus
>
Powered by blists - more mailing lists