Message-ID: <aabd71eb-286b-475c-a30e-d5cf5c4f2769@daynix.com>
Date: Wed, 19 Mar 2025 20:26:18 +0900
From: Akihiko Odaki <akihiko.odaki@...nix.com>
To: Marc Zyngier <maz@...nel.org>
Cc: Oliver Upton <oliver.upton@...ux.dev>, Joey Gouly <joey.gouly@....com>,
 Suzuki K Poulose <suzuki.poulose@....com>, Zenghui Yu
 <yuzenghui@...wei.com>, Catalin Marinas <catalin.marinas@....com>,
 Will Deacon <will@...nel.org>, Kees Cook <kees@...nel.org>,
 "Gustavo A. R. Silva" <gustavoars@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
 linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org,
 devel@...nix.com
Subject: Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

On 2025/03/19 20:07, Marc Zyngier wrote:
> On Wed, 19 Mar 2025 10:26:57 +0000,
> Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
>>
>>>> It should also be the reason why the perf program creates an event for
>>>> each PMU. tools/perf/Documentation/intel-hybrid.txt has more
>>>> descriptions.
>>>
>>> But perf on non-Intel behaves pretty differently. ARM PMUs behave
>>> pretty differently, because there is no guarantee of homogeneous
>>> events.
>>
>> It works in the same manner in this particular aspect (i.e., "perf
>> stat -e cycles -a" creates events for all PMUs).
> 
> But it then becomes a system-wide counter, and that's not what KVM
> needs to do.

There is also an example of program profiling:
"perf stat -e cycles -- taskset -c 16 ./triad_loop"

This also creates events for all PMUs.
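As a toy model (this is not perf's actual internals, and the PMU names are hypothetical), the fan-out of a symbolic event such as "cycles" to one event per core PMU can be sketched as:

```python
# Toy sketch: on a host with several core PMUs (e.g. big.LITTLE),
# a symbolic event like "cycles" is resolved to one perf event per
# PMU, so every CPU type on the system is covered.
def expand_event(event, core_pmus):
    """Return one (pmu, event) pair per core PMU."""
    return [(pmu, event) for pmu in core_pmus]

# Hypothetical PMU names for an Arm DynamIQ system.
pmus = ["armv8_cortex_a55", "armv8_cortex_a76"]
print(expand_event("cycles", pmus))
# -> [('armv8_cortex_a55', 'cycles'), ('armv8_cortex_a76', 'cycles')]
```

The point of contention above is not whether perf does this, but that perf does it system-wide (or per task), whereas KVM must count per vCPU.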

> 
>>>> Allowing to enable more than one counter and/or an event type other
>>>> than the cycle counter is not the goal. Enabling another event type
>>>> may result in a garbage value, but I don't think it's worse than the
>>>> current situation where the count stays zero; please tell me if I miss
>>>> something.
>>>>
>>>> There is still room for improvement. Returning a garbage value may not
>>>> be worse than returning zero, but counters and event types not
>>>> supported by some cores shouldn't be advertised as available in the
>>>> first place. More concretely:
>>>>
>>>> - The vCPU should be limited to run only on cores covered by PMUs when
>>>> KVM_ARM_VCPU_PMU_V3 is set.
>>>
>>> That's userspace's job. Bind to the desired PMU, and run. KVM will
>>> actively prevent you from running on the wrong CPU.
>>>
>>>> - PMCR_EL0.N advertised to the guest should be the minimum of ones of
>>>> host PMUs.
>>>
>>> How do you find out? CPUs can be hot-plugged long after a VM has
>>> started, bringing in a new PMU, with a different number of counters.
>>>
>>>> - PMCEID0_EL0 and PMCEID1_EL0 advertised to the guest should be the
>>>> result of the AND operations of ones of host PMUs.
>>>
>>> Same problem.
>>
>> I guess special-casing the cycle counter is the only option if the
>> kernel is going to deal with this.
> 
> Indeed. I think Oliver's idea is the least bad of them all, but man,
> this is really ugly.
> 
>>>> Special-casing the cycle counter may make sense if we are going to fix
>>>> the advertised values of PMCR_EL0.N, PMCEID0_EL0, and PMCEID1_EL0,
>>>> as we can simply return zero for these registers. We can also prevent
>>>> enabling a counter that returns zero or a garbage value.
>>>>
>>>> Do you think it's worth fixing these registers? If so, I'll do that by
>>>> special-casing the cycle counter.
>>>
>>> I think this is really going in the wrong direction.
>>>
>>> The whole design of the PMU emulation is that we expose a single,
>>> architecturally correct PMU implementation. This is clearly
>>> documented.
>>>
>>> Furthermore, userspace is being given all the relevant information to
>>> place vcpus on the correct physical CPUs. Why should we add this sort
>>> of hack in the kernel, creating a new userspace ABI that we will have
>>> to support forever, when userspace can do the correct thing right now?
>>>
>>> Worst case, this is just a 'taskset' away, and everything will work.
>>
>> It's surprisingly difficult to do that with libvirt; of course it is a
>> userspace problem though.
> 
> Sorry, I must admit I'm completely ignorant of libvirt. I tried it
> years ago, and concluded that 95% of what I needed was adequately done
> with a shell script...
> 
>>> Frankly, I'm not prepared to add more hacks to KVM for the sake of the
>>> combination of broken userspace and broken guest.
>>
>> The only counterargument I have in this regard is that some change is
>> also needed to expose all CPUs to a Windows guest even when the
>> userspace does its best. It may result in odd scheduling, but still
>> gives the best throughput.
> 
> But that'd be a new ABI, which again would require buy-in from
> userspace.  Maybe there is scope for an all CPUs, cycle-counter only
> PMUv3 exposed to the guest, but that cannot be set automatically, as
> we would otherwise regress existing setups.
> 
> At this stage, and given that you need to change userspace, I'm not
> sure what the best course of action is.

Having an explicit flag for userspace is fine for QEMU, which is what I 
care about. It can flip the flag if and only if the vCPU threads are not 
pinned to one PMU and the machine is a new setup.

I also wonder what regression you think setting it automatically would cause.
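For reference, the folding discussed earlier in the thread (taking the minimum of each host PMU's PMCR_EL0.N and the AND of their PMCEID event masks) amounts to the following. The structure and values here are illustrative, not actual KVM code:

```python
from functools import reduce

# Illustrative sketch of the folding proposed in this thread: the guest
# sees the minimum counter count (PMCR_EL0.N) across host PMUs, and only
# the events that every host PMU supports (AND of the PMCEID0/1 masks).
def fold_pmus(pmus):
    """pmus: list of (num_counters, pmceid_mask), one entry per host PMU."""
    min_n = min(n for n, _ in pmus)
    common = reduce(lambda a, b: a & b, (mask for _, mask in pmus))
    return min_n, common

# Hypothetical big.LITTLE pair: 6 counters vs 4, overlapping event sets.
print(fold_pmus([(6, 0x00FF), (4, 0x0F0F)]))  # -> (4, 15), i.e. 0x000F
```

As Marc notes above, the weakness of this scheme is CPU hotplug: a PMU that appears after the VM has started can invalidate a previously computed minimum or intersection.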

Regards,
Akihiko Odaki

> 
> Thanks,
> 
> 	M.
> 

