Message-ID: <20250602192702.2125115-1-coltonlewis@google.com>
Date: Mon, 2 Jun 2025 19:26:45 +0000
From: Colton Lewis <coltonlewis@...gle.com>
To: kvm@...r.kernel.org
Cc: Paolo Bonzini <pbonzini@...hat.com>, Jonathan Corbet <corbet@....net>,
Russell King <linux@...linux.org.uk>, Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>, Marc Zyngier <maz@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>,
Joey Gouly <joey.gouly@....com>, Suzuki K Poulose <suzuki.poulose@....com>,
Zenghui Yu <yuzenghui@...wei.com>, Mark Rutland <mark.rutland@....com>,
Shuah Khan <shuah@...nel.org>, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
linux-perf-users@...r.kernel.org, linux-kselftest@...r.kernel.org,
Colton Lewis <coltonlewis@...gle.com>
Subject: [PATCH 00/17] ARM64 PMU Partitioning
Overview:
This series implements a new PMU scheme on ARM, a partitioned PMU
that exists alongside the existing emulated PMU and may be enabled
via the kernel command line parameter kvm.reserved_host_counters or
via the vcpu ioctl KVM_ARM_PARTITION_PMU. This is a continuation of
the RFC posted earlier this year. [1]
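For concreteness, enabling the partition from userspace might look
like the sketch below. KVM_ARM_PARTITION_PMU is the ioctl added by
this series, but the argument convention shown (a pointer to the
number of counters requested for the guest) and the already-open
vcpu_fd are assumptions made purely for illustration; the command
line alternative is simply booting with kvm.reserved_host_counters
set to the number of counters to hold back for the host.

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hedged sketch only: the real uapi is defined by the series, not here. */
  static int enable_partitioned_pmu(int vcpu_fd, unsigned int guest_counters)
  {
          if (ioctl(vcpu_fd, KVM_ARM_PARTITION_PMU, &guest_counters) < 0) {
                  perror("KVM_ARM_PARTITION_PMU");
                  return -1;
          }

          return 0;
  }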
The high-level idea, and the reason for the name, is that this
implementation takes advantage of recent CPU features to partition
the PMU counters into a host-reserved set and a guest-reserved set.
Guests are allowed untrapped hardware access to the most frequently
used PMU registers and features, but only for the guest-reserved
counters.
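As a rough illustration of the split (a minimal sketch, not the
series' code): MDCR_EL2.HPMN divides the general-purpose counters so
that indices [0, HPMN) form the guest-reserved set while indices
[HPMN, N) remain host-reserved. Assuming the total counter count and
the host reservation are already known:

  struct pmu_partition {
          unsigned int nr_counters;  /* total counters, PMCR_EL0.N on the host */
          unsigned int hpmn;         /* guest-reserved counters, MDCR_EL2.HPMN */
  };

  static inline struct pmu_partition
  pmu_partition_split(unsigned int nr_counters,
                      unsigned int reserved_host_counters)
  {
          struct pmu_partition p = {
                  .nr_counters = nr_counters,
                  /* The guest gets whatever the host did not reserve. */
                  .hpmn = nr_counters - reserved_host_counters,
          };

          return p;
  }

This is also why the guest's PMCR_EL0.N ends up equal to HPMN, as
noted in the caveats below.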
This untrapped hardware access significantly reduces the overhead of
using performance monitoring capabilities such as the `perf` tool
inside a guest VM. Register accesses that don't trap to KVM mean less
time spent in the host kernel and more time on the workloads guests
care about. This optimization especially shines at high `perf` sample
rates or with large numbers of events that require multiplexing
hardware counters.
Performance:
For example, the following tests were carried out on identical ARM
machines with 10 general-purpose counters, running identical guest
images under QEMU; the only difference was whether the guest used the
partitioned PMU or the existing emulated PMU. Some arguments have
been simplified here to clarify the purpose of each test:
1) time perf record -e ${FIFTEEN_HW_EVENTS} -F 1000 -- \
gzip -c tmpfs/random.64M.img >/dev/null
On emulated PMU this command took 4.143s real time with 0.159s system
time. On partitioned PMU this command took 3.139s real time with
0.110s system time, runtime reductions of 24.23% and 30.82%.
2) time perf stat -dd -- \
automated_specint2017.sh
On emulated PMU this benchmark completed in 3789.16s real time with
224.45s system time and a final benchmark score of 4.28. On
partitioned PMU this benchmark completed in 3525.67s real time with
15.98s system time and a final benchmark score of 4.56. That is a
6.95% reduction in runtime, 92.88% reduction in system time, and
6.54% improvement in overall benchmark score.
Seeing these improvements on something as lightweight as perf stat is
remarkable and suggests there would have been a much greater
improvement with perf record. I did not test that because I was not
confident it would even finish in a reasonable time on the emulated
PMU.
Test 3 was slightly different: I ran the workload in a VM with a
single VCPU pinned to a physical CPU and used mpstat on the host to
analyze where that physical CPU spent its time.
3) perf record -e ${FIFTEEN_HW_EVENTS} -F 4000 -- \
stress-ng --cpu 0 --timeout 30
Over a period of 30s, the CPU running the emulated PMU spent 34.96%
of its time in the host kernel and 55.85% of its time in the guest.
The CPU running the partitioned PMU spent 0.97% of its time in the
host kernel and 91.06% of its time in the guest.
Taken together, these tests represent a remarkable performance
improvement for anything perf-related when using this new PMU
implementation.
Caveats:
Because the most consistent and performant thing to do was to untrap
PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N
is always equal to the value KVM sets for MDCR_EL2.HPMN. Writes to
PMCR_EL0.N via {GET,SET}_ONE_REG, which were previously allowed, no
longer affect the guest.
These improvements come at the cost of 7-35 new registers that must
be swapped at every vcpu_load and vcpu_put when the feature is
enabled. I have been informed KVM would like to avoid paying this
cost where possible.
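For a sense of what that swap involves, here is a heavily simplified
sketch; the register list and the helpers are assumptions for
illustration, not the series' code, and the list's size scaling with
the number of guest counters is where the 7-35 range comes from.

  #include <linux/types.h>

  /* Guest PMU state swapped in at vcpu_load and back out at vcpu_put. */
  struct pmu_guest_regs {
          u64 pmcr_el0;
          u64 pmcntenset_el0;
          u64 pmintenset_el1;
          u64 pmovsset_el0;
          u64 pmselr_el0;
          u64 pmuserenr_el0;
          u64 pmccntr_el0;
          u64 pmevcntr[31];   /* one slot per possible guest counter */
          u64 pmevtyper[31];
  };

  /* Hypothetical accessors standing in for the real sysreg access code. */
  void load_guest_counter(unsigned int idx, u64 cntr, u64 typer);
  void load_guest_pmu_control(const struct pmu_guest_regs *regs);

  /* Restore the guest-reserved half of the PMU on vcpu_load. */
  static void pmu_part_load_guest(const struct pmu_guest_regs *regs,
                                  unsigned int hpmn)
  {
          unsigned int i;

          for (i = 0; i < hpmn; i++)
                  load_guest_counter(i, regs->pmevcntr[i], regs->pmevtyper[i]);

          load_guest_pmu_control(regs);
  }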
One solution is to make the trapping changes and context swapping
lazy, so that they only take place after the guest has actually
accessed the PMU; guests that never touch the PMU then never pay the
cost.
This is not done here because it is not crucial to the primary
functionality and I thought review would be more productive as soon as
I had something complete enough for reviewers to easily play with.
However, this or any better ideas are on the table for inclusion in
future re-rolls.
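Purely to illustrate the lazy idea mentioned above (nothing below is
implemented in this series; the flag, struct, and helpers are all
hypothetical), the first trapped PMU access would pay the swap cost
and drop the traps, so a guest that never touches the PMU never
reaches this path:

  struct kvm_pmu_part_sketch {
          bool guest_touched_pmu;            /* set on first trapped PMU access */
          unsigned int hpmn;
          struct pmu_guest_regs guest_regs;  /* from the earlier sketch */
  };

  void pmu_part_disable_traps(unsigned int hpmn);  /* hypothetical */

  static void pmu_part_first_access(struct kvm_pmu_part_sketch *pmu)
  {
          if (pmu->guest_touched_pmu)
                  return;

          pmu->guest_touched_pmu = true;

          /* Pay the swap cost only now that the guest has used the PMU... */
          pmu_part_load_guest(&pmu->guest_regs, pmu->hpmn);
          /* ...and stop trapping the guest-reserved registers from here on. */
          pmu_part_disable_traps(pmu->hpmn);
  }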
[1] https://lore.kernel.org/kvmarm/20250213180317.3205285-1-coltonlewis@google.com/
Colton Lewis (16):
arm64: cpufeature: Add cpucap for HPMN0
arm64: Generate sign macro for sysreg Enums
arm64: cpufeature: Add cpucap for PMICNTR
KVM: arm64: Reorganize PMU functions
KVM: arm64: Introduce method to partition the PMU
perf: arm_pmuv3: Generalize counter bitmasks
perf: arm_pmuv3: Keep out of guest counter partition
KVM: arm64: Set up FGT for Partitioned PMU
KVM: arm64: Writethrough trapped PMEVTYPER register
KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
KVM: arm64: Writethrough trapped PMOVS register
KVM: arm64: Context switch Partitioned PMU guest registers
perf: pmuv3: Handle IRQs for Partitioned PMU guest counters
KVM: arm64: Inject recorded guest interrupts
KVM: arm64: Add ioctl to partition the PMU when supported
KVM: arm64: selftests: Add test case for partitioned PMU
Marc Zyngier (1):
KVM: arm64: Cleanup PMU includes
Documentation/virt/kvm/api.rst | 16 +
arch/arm/include/asm/arm_pmuv3.h | 24 +
arch/arm64/include/asm/arm_pmuv3.h | 36 +-
arch/arm64/include/asm/kvm_host.h | 208 +++++-
arch/arm64/include/asm/kvm_pmu.h | 82 +++
arch/arm64/kernel/cpufeature.c | 15 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 24 +-
arch/arm64/kvm/debug.c | 13 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 65 +-
arch/arm64/kvm/pmu-emul.c | 629 +----------------
arch/arm64/kvm/pmu-part.c | 358 ++++++++++
arch/arm64/kvm/pmu.c | 630 ++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 54 +-
arch/arm64/tools/cpucaps | 2 +
arch/arm64/tools/gen-sysreg.awk | 1 +
arch/arm64/tools/sysreg | 6 +-
drivers/perf/arm_pmuv3.c | 55 +-
include/kvm/arm_pmu.h | 199 ------
include/linux/perf/arm_pmu.h | 15 +-
include/linux/perf/arm_pmuv3.h | 14 +-
include/uapi/linux/kvm.h | 4 +
tools/include/uapi/linux/kvm.h | 2 +
.../selftests/kvm/arm64/vpmu_counter_access.c | 40 +-
virt/kvm/kvm_main.c | 1 +
25 files changed, 1616 insertions(+), 879 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_pmu.h
create mode 100644 arch/arm64/kvm/pmu-part.c
delete mode 100644 include/kvm/arm_pmu.h
base-commit: 1b85d923ba8c9e6afaf19e26708411adde94fba8
--
2.49.0.1204.g71687c7c1d-goog