linux-hardening - Re: [PATCH RFC v2 1/2] KVM: arm64: PMU: Introduce KVM_ARM_VCPU_PMU_V3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aJQJdElbZJ6KzQxD@linux.dev>
Date: Wed, 6 Aug 2025 19:03:32 -0700
From: Oliver Upton <oliver.upton@...ux.dev>
To: Akihiko Odaki <odaki@....ci.i.u-tokyo.ac.jp>
Cc: Marc Zyngier <maz@...nel.org>, Joey Gouly <joey.gouly@....com>,
	Suzuki K Poulose <suzuki.poulose@....com>,
	Zenghui Yu <yuzenghui@...wei.com>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>, Kees Cook <kees@...nel.org>,
	"Gustavo A. R. Silva" <gustavoars@...nel.org>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>,
	linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
	linux-kernel@...r.kernel.org, linux-hardening@...r.kernel.org,
	devel@...nix.com, kvm@...r.kernel.org, linux-doc@...r.kernel.org,
	linux-kselftest@...r.kernel.org
Subject: Re: [PATCH RFC v2 1/2] KVM: arm64: PMU: Introduce
 KVM_ARM_VCPU_PMU_V3_COMPOSITION

On Thu, Aug 07, 2025 at 03:24:43AM +0900, Akihiko Odaki wrote:
> On 2025/08/07 2:20, Oliver Upton wrote:
> > Hi Akihiko,
> > 
> > This is an unreasonably large patch that needs to be broken down into
> > smaller patches, ideally one functional change per patch. We need this
> > even for an RFC for the sake of reviews.
> > 
> > On Wed, Aug 06, 2025 at 06:09:54PM +0900, Akihiko Odaki wrote:
> > > +static u64 kvm_pmu_get_pmc_value(struct kvm_vcpu *vcpu, u8 idx)
> > >   {
> > > -	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
> > > +	struct kvm_pmc *pmc = *kvm_vcpu_idx_to_pmc(vcpu, idx);
> > >   	u64 counter, reg, enabled, running;
> > > +	unsigned int i;
> > > -	reg = counter_index_to_reg(pmc->idx);
> > > +	reg = counter_index_to_reg(idx);
> > >   	counter = __vcpu_sys_reg(vcpu, reg);
> > >   	/*
> > >   	 * The real counter value is equal to the value of counter register plus
> > >   	 * the value perf event counts.
> > >   	 */
> > > -	if (pmc->perf_event)
> > > -		counter += perf_event_read_value(pmc->perf_event, &enabled,
> > > -						 &running);
> > > +	if (pmc)
> > > +		for (i = 0; i < pmc->nr_perf_events; i++)
> > > +			counter += perf_event_read_value(pmc->perf_events[i],
> > > +							 &enabled, &running);
> > 
> > I'm concerned that this array of events concept you're introducing is
> > going to be error-prone. An approach that reallocates a new PMU event in
> > the case of a vCPU migrating to a new PMU implementation would be
> > desirable.
> 
> I avoided allocation at migration because I was worried with disabled
> preemption. perf_event_create_kernel_counter() internally takes a mutex so
> it cannot be used if preemption is disabled.
> 
> I wonder if it is guaranteed that kvm_arch_vcpu_load() executes with
> preemption. If so, I can add a hook there to call
> perf_event_create_kernel_counter().

You don't need to allocate the event at vcpu_load(). The only guarantee
we need is that an event has been attached to the PMC prior to entering
the guest, for which the KVM_REQ_* infrastructure is quite useful.

> > 
> > > +static void reset_sample_period(struct perf_event *perf_event)
> > > +{
> > > +	struct kvm_pmc **pmc = perf_event->overflow_handler_context;
> > > +	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
> > > +	struct arm_pmu *cpu_pmu = to_arm_pmu(perf_event->pmu);
> > > +	u64 period;
> > > +
> > > +	cpu_pmu->pmu.stop(perf_event, PERF_EF_UPDATE);
> > > +
> > > +	/*
> > > +	 * Reset the sample period to the architectural limit,
> > > +	 * i.e. the point where the counter overflows.
> > > +	 */
> > > +	period = compute_period(pmc, kvm_pmu_get_pmc_value(vcpu, (*pmc)->idx));
> > > +
> > > +	local64_set(&perf_event->hw.period_left, 0);
> > > +	perf_event->attr.sample_period = period;
> > > +	perf_event->hw.sample_period = period;
> > > +
> > > +	cpu_pmu->pmu.start(perf_event, PERF_EF_RELOAD);
> > > +}
> > 
> > No, we can't start calling into the internal driver interfaces. The fact
> > that we have a pointer to the PMU is an ugly hack and shouldn't be used
> > like this.
> 
> This function was extracted from kvm_pmu_perf_overflow() and is not a new
> addition (which should have been clear if I would have split the patch as
> you noted).

Yuck, then forgive my whining.

> I thought of replacing it with perf_event_period(), but there is a catch: it
> returns -EINVAL if the 63th bit of the sample period number is set. Perhaps
> we can just specify ((1ULL << 63) - 1) if the sample period is going to be
> so long, but I conservatively avoided that. I can change it if you prefer
> that way or have an alternative idea.

Let's keep it as-is then. Sorry about the noise.

> > Can we just special-case the fixed CPU cycle counter to use
> > PERF_TYPE_HARDWARE / PERF_COUNT_HW_CPU_CYCLES? That _should_ have the
> > intended effect of opening an event on the PMU for this CPU.
> 
> I have an impression that perhaps this emulation can be more generic by
> converting eventsel to PERF_COUNT_HW_* numbers with armv8_pmuv3_perf_map in
> drivers/perf/arm_pmuv3.c, but it is not in scope of this change. The current
> code is sufficient for now.

The only cross-PMU events we will support are the fixed counters, my
strong preference is that we do not reverse-map architectural events to
generic perf events for all counters.

> > I'm not sure there's much value in preventing the user from configuring
> > the PMU event filter. Even in the case of the fixed CPU cycle counter we
> > allow userspace to filter the event.
> 
> It is possible to configure the PMU event filter, but it needs to be done
> after setting the attribute. This behavior is aligned with
> KVM_ARM_VCPU_PMU_V3_SET_PMU.

That's fine then, I missed that detail.

> They are mutually exclusive and the latest configuration takes effect.
> 
> If you set SET_PMU after COMPOSITION, SET_PMU will take effect.
> If you set COMPOSITION after SET_PMU, COMPOSITION will take effect.
> 
> I'll note that in the documentation.

This isn't what I meant. What I mean is that userspace either can use
the SET_PMU ioctl or the COMPOSITION ioctl. Once one of them has been
used the other ioctl returns an error.

We're really bad at getting ioctl ordering / interleaving right and
syzkaller has a habit of finding these mistakes. There's zero practical
value in using both of these ioctls on the same VM, let's prevent it.

> > > +	case KVM_ARM_VCPU_PMU_V3_COMPOSITION:
> > > +		return kvm_arm_pmu_v3_set_pmu_composition(vcpu);
> > 
> > I'd prefer naming this something like 'KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY'.
> > We will have the fixed instruction counter eventually which is another
> > event we could potentially provide system-wide.
> 
> The design decision to expose the cycle counter is driven by the motivation
> to satisfy FEAT_PMU for the guest rather than the host capability.
> 
> When we implement FEAT_PMUv3_ICNTR in the future, I think we will need to
> add KVM_ARM_VCPU_PMU_V3_COMPOSITION_ICNTR or something. The resultig
> emulated PMU will work only on host CPUs that have instruction counters; a
> host CPU may lack one because it doesn't implement FEAT_PMUv3_ICNTR or has a
> third-party PMU like apple-m1-cpu-pmu (this particular PMU has an
> instruction counter fortunately).

No, this is at odds with the way we handle CPU features in KVM/arm64.

Ignoring this ioctl, the goal is that after KVM_ARM_VCPU_INIT the vCPU
is initialized to the maximum possible feature set. If userspace wants
to de-feature the VM it can do so by modifying the ID registers.

OTOH, your ioctl is dealing with something structural in how we
support PMUs on the system. It should be defined as exposing all fixed
counters that we support on a given piece of hardware. On systems
without FEAT_PMUv3_ICNTR, the result should be a vPMU that only has the
CPU cycle counter.

On a system that has FEAT_PMUv3_ICNTR, userspace can still use this
ioctl and explicitly de-feature ICNTR by writing to the ID register
after initialization.

Thanks,
Oliver