linux-kernel - Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <86plidmjwh.wl-maz@kernel.org>
Date: Wed, 19 Mar 2025 09:47:42 +0000
From: Marc Zyngier <maz@...nel.org>
To: Akihiko Odaki <akihiko.odaki@...nix.com>
Cc: Oliver Upton <oliver.upton@...ux.dev>,
	Joey Gouly <joey.gouly@....com>,
	Suzuki K Poulose <suzuki.poulose@....com>,
	Zenghui Yu
 <yuzenghui@...wei.com>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>,
	Kees Cook <kees@...nel.org>,
	"Gustavo A. R. Silva" <gustavoars@...nel.org>,
	linux-arm-kernel@...ts.infradead.org,
	kvmarm@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	linux-hardening@...r.kernel.org,
	devel@...nix.com
Subject: Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

On Wed, 19 Mar 2025 08:37:29 +0000,
Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
> 
> On 2025/03/19 16:34, Oliver Upton wrote:
> > Hi Akihiko,
> > 
> > On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote:
> >> Problem
> >> -------
> >> 
> >> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows:
> >>> The observant among you will notice that the supported_cpus
> >>> mask does not get updated for the default PMU even though it
> >>> is quite possible the selected instance supports only a
> >>> subset of cores in the system. This is intentional, and
> >>> upholds the preexisting behavior on heterogeneous systems
> >>> where vCPUs can be scheduled on any core but the guest
> >>> counters could stop working.
> >> 
> >> Despite the reference manual says counters may not continuously
> >> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0
> >> and crashes with a division-by-zero error and it also crashes when the
> >> PMU is not present.
> >> 
> >> To avoid such a problem, the userspace should pin the vCPU threads to
> >> pCPUs supported by one host PMU when initializing the vCPUs or specify
> >> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the
> >> initialization. However, QEMU/libvirt can pin vCPU threads only after the
> >> vCPUs are initialized. It also limits the pCPUs the guest can use even
> >> for VMMs that support proper pinning.
> >> 
> >> Solution
> >> --------
> >> 
> >> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt
> >> should support pinning better, but neither of them are going to happen
> >> anytime soon.
> >> 
> >> To allow running Windows on QEMU/libvirt or with heterogeneous cores,
> >> combine all host PMUs necessary to cover the cores vCPUs can run and
> >> keep PMCCNTR_EL0 working.
> > > I'm extremely uneasy about making this a generalized solution. PMUs are
> > deeply tied to the microarchitecture of a particular implementation, and
> > that isn't something we can abstract away from the guest in KVM.
> > 
> > For example, you could have an event ID that counts on only a subset of
> > cores, or better yet an event that counts something completely different
> > depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve.
> > 
> > The good news though is that the fixed PMU cycle counter is the only
> > thing guaranteed to be present in any PMUv3 implementation. Since
> > that's the only counter Windows actually needs, perhaps we could
> > special-case this in KVM.
> > 
> > I have the following (completely untested) patch, do you want to give it
> > a try? There's still going to be observable differences between PMUs
> > (e.g. CPU frequency) but at least it should get things booting.
> 
> I don't think it will work, unfortunately. perf_init_event() binds a
> perf event to a particular PMU so the event will stop working when the
> thread migrates away.
> 
> It should also be the reason why the perf program creates an event for
> each PMU. tools/perf/Documentation/intel-hybrid.txt has more
> descriptions.

But perf on non-Intel behaves pretty differently. ARM PMUs behaves
pretty differently, because there is no guarantee of homogeneous
events.

> 
> Allowing to enable more than one counter and/or an event type other
> than the cycle counter is not the goal. Enabling another event type
> may result in a garbage value, but I don't think it's worse than the
> current situation where the count stays zero; please tell me if I miss
> something.
> 
> There is still room for improvement. Returning a garbage value may not
> be worse than returning zero, but counters and event types not
> supported by some cores shouldn't be advertised as available in the
> first place. More concretely:
> 
> - The vCPU should be limited to run only on cores covered by PMUs when
> KVM_ARM_VCPU_PMU_V3 is set.

That's userspace's job. Bind to the desired PMU, and run. KVM will
actively prevent you from running on the wrong CPU.

> - PMCR_EL0.N advertised to the guest should be the minimum of ones of
> host PMUs.

How do you find out? CPUs can be hot-plugged on long after a VM has
started, bringing in a new PMU, with a different number of counters.

> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the
> result of the AND operations of ones of host PMUs.

Same problem.

> 
> Special-casing the cycle counter may make sense if we are going to fix
> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and
> PMCEID1_EL0. PMCR_EL0.N as we can simply return zero for these
> registers. We can also prevent enabling a counter that returns zero or
> a garbage value.
>
> Do you think it's worth fixing these registers? If so, I'll do that by
> special-casing the cycle counter.

I think this is really going in the wrong direction.

The whole design of the PMU emulation is that we expose a single,
architecturally correct PMU implementation. This is clearly
documented.

Furthermore, userspace is being given all the relevant information to
place vcpus on the correct physical CPUs. Why should we add this sort
of hack in the kernel, creating a new userspace ABI that we will have
to support forever, when usespace can do the correct thing right now?

Worse case, this is just a 'taskset' away, and everything will work.

Frankly, I'm not prepared to add more hacks to KVM for the sake of the
combination of broken userspace and broken guest.

	M.

-- 
Without deviation from the norm, progress is not possible.