[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fW8exsmUg_9K09Oy6T4ZAvvD7ZbZN2sxODdqisZOR6mUA@mail.gmail.com>
Date: Thu, 23 Nov 2023 07:14:21 -0800
From: Ian Rogers <irogers@...gle.com>
To: Mark Rutland <mark.rutland@....com>
Cc: Marc Zyngier <maz@...nel.org>, Hector Martin <marcan@...can.st>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
James Clark <james.clark@....com>,
linux-perf-users@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>,
Asahi Linux <asahi@...ts.linux.dev>
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5
On Thu, Nov 23, 2023 at 6:23 AM Mark Rutland <mark.rutland@....com> wrote:
>
> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > On Tue, 21 Nov 2023 13:40:31 +0000,
> > Marc Zyngier <maz@...nel.org> wrote:
> > >
> > > [Adding key people on Cc]
> > >
> > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > Hector Martin <marcan@...can.st> wrote:
> > > >
> > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >
> > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > the PMU, but nothing works anymore.
> > >
> > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > package, but that's obviously not going to last.
> > >
> > > I'm happy to test potential fixes.
> >
> > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > CPU):
>
> Looking at this with fresh(er) eyes, I think there's a userspace bug here,
> regardless of whether one believes it's correct to convert a named-pmu event to
> a PERF_TYPE_HARDWARE event directed at that PMU.
>
> It looks like the userspace tool is dropping the extended type ID after an
> initial probe, and requests events with plain PERF_TYPE_HARDWARE (without an
> extended type ID), which explains why we seem to get events from one PMU only.
>
> More detail below...
>
> Marc, if you have time, could you run the same commands (on the same kernel)
> with a perf tool build from v6.4?
>
> > <quote>
> > maz@...ley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > apple_firestorm_pmu/cycles/ -e cycles ls
> > Using CPUID 0x00000000612f0280
> > Attempt to add: apple_icestorm_pmu/cycles=0/
> > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > Opening: unknown-hardware:HG
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > config 0xb00000000
> > disabled 1
> > ------------------------------------------------------------
>
> Here config[31:0] is 0 (PERF_COUNT_HW_CPU_CYCLES), and config[63:32] is 0xb,
> which is presumably the PMU ID for the apple_icestorm_pmu.
>
> The attr doesn't contain exclude_guest=1, so this will be rejected by the PMU
> driver due to its mode exclusion requirements.
>
> > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > sys_perf_event_open failed, error -95
>
> ... which is what we see here (this is EOPNOTSUPP, which __hw_perf_event_init()
> in drivers/perf/arm_pmu.c returns when the mode requested mode exclusion
> options aren't supported).
>
> So far, so good...
>
> > Attempt to add: apple_firestorm_pmu/cycles=0/
> > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > Control descriptor is not initialized
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> ... but here, the extended type ID has been dropped, and this event is no
> longer directed towards the apple_firestorm_pmu PMU, so the kernel can direct
> this to *any* CPU PMU...
>
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
>
> ... and *some* PMU accepts it.
>
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> Likewise here, no extended type ID...
>
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> Likewise here, no extended type ID...
>
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > apple_icestorm_pmu/cycles/: -1: 0 873709 0
> > apple_firestorm_pmu/cycles/: -1: 0 873709 0
> > cycles: -1: 0 873709 0
> > apple_icestorm_pmu/cycles/: 0 873709 0
> > apple_firestorm_pmu/cycles/: 0 873709 0
> > cycles: 0 873709 0
> >
> > Performance counter stats for 'ls':
> >
> > <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> > <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> > <not counted> cycles (0.00%)
> >
> > 0.000002250 seconds time elapsed
> >
> > 0.000000000 seconds user
> > 0.000000000 seconds sys
>
> So it looks like the tool has expanded the requested
> 'apple_icestorm_pmu/cycles/' event into three cycles events, each opened
> without an extended type ID.
>
> AFAICT, the kernel has done exactly what it has always done for
> PERF_TYPE_HARDWARE/PERF_COUNT_HW_CPU_CYCLES events: pick the first PMU which
> said it can handle them.
>
> > If I run the same thing on another CPU cluster (firestorm), I get
> > this:
> >
> > <quote>
> > maz@...ley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > apple_firestorm_pmu/cycles/ -e cycles ls
> > Using CPUID 0x00000000612f0280
> > Attempt to add: apple_icestorm_pmu/cycles=0/
> > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > Opening: unknown-hardware:HG
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > config 0xb00000000
> > disabled 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > sys_perf_event_open failed, error -95
>
> Again, we see one request with an extended type ID, which fails due to mode exclusion requirements...
>
> > Attempt to add: apple_firestorm_pmu/cycles=0/
> > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > Control descriptor is not initialized
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> ... but all subsequent requests do not have an extended type ID, and the kernel
> directs these to whichever PMU accepts the event first...
>
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> > apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> > cycles: -1: 1034653 469125 469125
> > apple_icestorm_pmu/cycles/: 1035101 469125 469125
> > apple_firestorm_pmu/cycles/: 1035035 469125 469125
> > cycles: 1034653 469125 469125
> >
> > Performance counter stats for 'ls':
> >
> > 1,035,101 apple_icestorm_pmu/cycles/
> > 1,035,035 apple_firestorm_pmu/cycles/
> > 1,034,653 cycles
> >
> > 0.000001333 seconds time elapsed
> >
> > 0.000000000 seconds user
> > 0.000000000 seconds sys
> > </quote>
>
> ... and in this case the workload was run on a CPU affine ot that arbitrary
> PMU, hence we managed to count.
>
> So AFAICT, this is a userspace bug, maybe related to the way we probe for
> supported PMU features?
Probing PMU features is done by trying to perf_event_open events. For
extended types it is a cycles event on each core PMU:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
The is_event_supported logic is here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n232
There is the following comment:
if (open_return == -EACCES) {
/*
* This happens if the paranoid value
* /proc/sys/kernel/perf_event_paranoid is set to 2
* Re-run with exclude_kernel set; we don't do that
* by default as some ARM machines do not support it.
*
*/
Thanks,
Ian
> Thanks,
> Mark.
Powered by blists - more mailing lists