linux-kernel - Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e8324d9d-3756-41cf-a102-28572e302368@daynix.com>
Date: Wed, 19 Mar 2025 17:37:29 +0900
From: Akihiko Odaki <akihiko.odaki@...nix.com>
To: Oliver Upton <oliver.upton@...ux.dev>
Cc: Marc Zyngier <maz@...nel.org>, Joey Gouly <joey.gouly@....com>,
 Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu
 <yuzenghui@...wei.com>, Catalin Marinas <catalin.marinas@....com>,
 Will Deacon <will@...nel.org>, Kees Cook <kees@...nel.org>,
 "Gustavo A. R. Silva" <gustavoars@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
 linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org,
 devel@...nix.com
Subject: Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

On 2025/03/19 16:34, Oliver Upton wrote:
> Hi Akihiko,
> 
> On Wed, Mar 19, 2025 at 03:33:46PM +0900, Akihiko Odaki wrote:
>> Problem
>> -------
>>
>> arch/arm64/kvm/pmu-emul.c used to have a comment saying the follows:
>>> The observant among you will notice that the supported_cpus
>>> mask does not get updated for the default PMU even though it
>>> is quite possible the selected instance supports only a
>>> subset of cores in the system. This is intentional, and
>>> upholds the preexisting behavior on heterogeneous systems
>>> where vCPUs can be scheduled on any core but the guest
>>> counters could stop working.
>>
>> Despite the reference manual says counters may not continuously
>> incrementing, Windows is not robust enough to handle stopped PMCCNTR_EL0
>> and crashes with a division-by-zero error and it also crashes when the
>> PMU is not present.
>>
>> To avoid such a problem, the userspace should pin the vCPU threads to
>> pCPUs supported by one host PMU when initializing the vCPUs or specify
>> the host PMU to use with KVM_ARM_VCPU_PMU_V3_SET_PMU after the
>> initialization. However, QEMU/libvirt can pin vCPU threads only after the
>> vCPUs are initialized. It also limits the pCPUs the guest can use even
>> for VMMs that support proper pinning.
>>
>> Solution
>> --------
>>
>> Ideally, Windows should fix the division-by-zero error and QEMU/libvirt
>> should support pinning better, but neither of them are going to happen
>> anytime soon.
>>
>> To allow running Windows on QEMU/libvirt or with heterogeneous cores,
>> combine all host PMUs necessary to cover the cores vCPUs can run and
>> keep PMCCNTR_EL0 working.
 > > I'm extremely uneasy about making this a generalized solution. PMUs are
> deeply tied to the microarchitecture of a particular implementation, and
> that isn't something we can abstract away from the guest in KVM.
> 
> For example, you could have an event ID that counts on only a subset of
> cores, or better yet an event that counts something completely different
> depending on where a vCPU lands.> > I do appreciate the issue that you're trying to solve.
> 
> The good news though is that the fixed PMU cycle counter is the only
> thing guaranteed to be present in any PMUv3 implementation. Since
> that's the only counter Windows actually needs, perhaps we could
> special-case this in KVM.
> 
> I have the following (completely untested) patch, do you want to give it
> a try? There's still going to be observable differences between PMUs
> (e.g. CPU frequency) but at least it should get things booting.

I don't think it will work, unfortunately. perf_init_event() binds a 
perf event to a particular PMU so the event will stop working when the 
thread migrates away.

It should also be the reason why the perf program creates an event for 
each PMU. tools/perf/Documentation/intel-hybrid.txt has more descriptions.

Allowing to enable more than one counter and/or an event type other than 
the cycle counter is not the goal. Enabling another event type may 
result in a garbage value, but I don't think it's worse than the current 
situation where the count stays zero; please tell me if I miss something.

There is still room for improvement. Returning a garbage value may not 
be worse than returning zero, but counters and event types not supported 
by some cores shouldn't be advertised as available in the first place. 
More concretely:

- The vCPU should be limited to run only on cores covered by PMUs when 
KVM_ARM_VCPU_PMU_V3 is set.
- PMCR_EL0.N advertised to the guest should be the minimum of ones of 
host PMUs.
- PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the 
result of the AND operations of ones of host PMUs.

Special-casing the cycle counter may make sense if we are going to fix 
the advertised values of PMCR_EL0.N, PMCEID0_EL0, and PMCEID1_EL0. 
PMCR_EL0.N as we can simply return zero for these registers. We can also 
prevent enabling a counter that returns zero or a garbage value.

Do you think it's worth fixing these registers? If so, I'll do that by 
special-casing the cycle counter.

Regards,
Akihiko Odaki

> 
> Thanks,
> Oliver
> 
> ---
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index a1bc10d7116a..913a7bab50b5 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -724,14 +724,21 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
>   		return;
>   
>   	memset(&attr, 0, sizeof(struct perf_event_attr));
> -	attr.type = arm_pmu->pmu.type;
> +
> +	if (pmc->idx == ARMV8_PMU_CYCLE_IDX) {
> +		attr.type = PERF_TYPE_HARDWARE;
> +		attr.config = PERF_COUNT_HW_CPU_CYCLES;
> +	} else {
> +		attr.type = arm_pmu->pmu.type;
> +		attr.config = eventsel;
> +	}
> +
>   	attr.size = sizeof(attr);
>   	attr.pinned = 1;
>   	attr.disabled = !kvm_pmu_counter_is_enabled(pmc);
>   	attr.exclude_user = !kvm_pmc_counts_at_el0(pmc);
>   	attr.exclude_hv = 1; /* Don't count EL2 events */
>   	attr.exclude_host = 1; /* Don't count host events */
> -	attr.config = eventsel;
>   
>   	/*
>   	 * Filter events at EL1 (i.e. vEL2) when in a hyp context based on the