linux-kernel - Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <86a59emyxo.wl-maz@kernel.org>
Date: Fri, 21 Mar 2025 10:59:47 +0000
From: Marc Zyngier <maz@...nel.org>
To: Akihiko Odaki <akihiko.odaki@...nix.com>
Cc: Oliver Upton <oliver.upton@...ux.dev>,
	Joey Gouly <joey.gouly@....com>,
	Suzuki K Poulose
 <suzuki.poulose@....com>,
	Zenghui Yu <yuzenghui@...wei.com>,
	Catalin Marinas <catalin.marinas@....com>,
	Will Deacon <will@...nel.org>,
	Kees Cook <kees@...nel.org>,
	"Gustavo A. R. Silva" <gustavoars@...nel.org>,
	linux-arm-kernel@...ts.infradead.org,
	kvmarm@...ts.linux.dev,
	linux-kernel@...r.kernel.org,
	linux-hardening@...r.kernel.org,
	devel@...nix.com
Subject: Re: [PATCH RFC] KVM: arm64: PMU: Use multiple host PMUs

On Fri, 21 Mar 2025 06:20:49 +0000,
Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
> 
> On 2025/03/21 2:14, Marc Zyngier wrote:
> > On Thu, 20 Mar 2025 09:52:59 +0000,
> > Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
> >> 
> >> On 2025/03/20 18:10, Marc Zyngier wrote:
> >>> On Thu, 20 Mar 2025 06:03:35 +0000,
> >>> Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
> >>>> 
> >>>> On 2025/03/20 3:51, Oliver Upton wrote:
> >>>>> On Wed, Mar 19, 2025 at 06:38:38PM +0000, Marc Zyngier wrote:
> >>>>>> On Wed, 19 Mar 2025 11:51:21 +0000, Akihiko Odaki <akihiko.odaki@...nix.com> wrote:
> >>>>>>> What about setting the flag automatically when a user fails to pin
> >>>>>>> vCPUs to CPUs that are covered by one PMU? There would be no change if
> >>>>>>> a user correctly pins vCPUs as it is. Otherwise, they will see a
> >>>>>>> correct feature set advertised to the guest and the cycle counter
> >>>>>>> working.
> >>>>>> 
> >>>>>> How do you know that the affinity is "correct"? VCPU affinity can be
> >>>>>> changed at any time. I, for one, do not want my VMs to change
> >>>>>> behaviour because I let the vcpus bounce around as the scheduler sees
> >>>>>> fit.
> >>>> 
> >>>> Checking the affinity when picking the default PMU; the vCPU affinity
> >>>> is the only thing that rules the choice of the default PMU even now.
> >>>> 
> >>>> Perhaps we may model the API as follows: introduce another "composite"
> >>>> PMU that works on any core but only exposes the cycle counter. Robust
> >>>> VMMs will choose it or one of hardware PMUs with
> >>>> KVM_ARM_VCPU_PMU_V3_SET_PMU. KVM will choose the default PMU according
> >>>> to the vCPU affinity at the point of KVM_ARM_VCPU_INIT otherwise. If
> >>>> the affinity is covered by one hardware PMU, that PMU will be chosen
> >>>> as the default. The "composite" PMU will be the default otherwise.
> >>> 
> >>> This makes no sense to me. A VCPU is always affine to a PMU, because
> >>> we do not support configurations where only some CPUs have a PMU. This
> >>> is an all-or-nothing situation.
> >> 
> >> At least isn't it fine to have the composite PMU with a new value for
> >> KVM_ARM_VCPU_PMU_V3_SET_PMU?
> > 
> > Not sure KVM_ARM_VCPU_PMU_V3_SET_PMU is the right hook (it takes a PMU
> > 'type', which is under control of the perf subsystem). But if we can
> > find a value that is guaranteed to be unique, why not.
> > 
> >>> More importantly, you keep suggesting the same "new default", and I
> >>> keep saying NO.
> >>> 
> >>> My position is clear: if you want a *new* behaviour, you *must* add a
> >>> new flag that the VMM explicitly provides to enable this CC-only PMU.
> >>> No change in default behaviour at all.
> >>> 
> >>> I'm not going to move from that.
> >> 
> >> Why not? It will not break anything guaranteed to work in the past.
> > 
> > It doesn't have to be guaranteed. It just has to *exist*. That's the
> > Linux ABI for you.
> > 
> >> Currently KVM only guarantees that the emulated PMU correctly counts
> >> only when
> >> 1) the vCPU affinity is contained by one PMU and
> >> 2) it will not expand
> >> 
> >> Breaking these conditions will make the behavior of the emulated PMU
> >> undefined. Now I'm proposing to remove 1).
> > 
> > And I'm saying no to that. I'm also getting tired of arguing the same
> > point on and on.
> > 
> > We currently have two different behaviours:
> > 
> > - either you explicitly set a PMU, and the affinity of this PMU
> >    constraints the affinity of the vcpus. The vcpus are not allowed to
> >    run outside of this affinity. Everything counts all the time.
> > 
> > - or you don't explicitly set a PMU, and a default PMU will be picked
> >    from the current affinity of the first vcpu. Your vcpus will be able
> >    to run anywhere, but the virtual PMU will *only* count when the
> >    vcpus are affine to the default PMU. When the vcpus are not affine
> >    to default PMU, *nothing* counts.
> > 
> > These two behaviours are ABI. They are not changing. They don't get
> > relaxed, they don't get tightened, they stay just as they are,
> > forever.
> 
> Is the latter one is really ABI? I see it as part of behaviors that
> are undefined and not included in ABI for the following reasons:
> 
> 1) It depends on the scheduler behavior, which cannot be ABI.

You're wrong. The task affinity is under complete control of
userspace. sched_setaffinity(2) is your friend.

> 2) It provides a broken PMU so the proposed behavioral change is
> similar a bug fix though I call it a undefined behavior instead of a
> bug as it is explicitly told that there is no assurance that the PMU
> works in such a scenario.

Call it undefined behaviour if you want, I call it ABI. This is how
the PMU has been handled on KVM/arm64 for the past 10 years, and the
fact that you (and I) don't like it doesn't make it less of an
existing behaviour that must be preserved.

> 3) The userspace could not have relied on it so the "no regressions"
> rule cannot be applied here; how can anyone have a userspace that
> relies on a kernel behavior that depends on scheduling?

Of course it has. Read the above.

Even more, I contend that your approach is just wrong. You want to
make random events count at random times, which is even worse. We can
at least pretend that the PMU hasn't recorded anything (which is true)
rather than saying "oh look, the cycle counter has ticked but we
didn't execute any instruction -- WTF were we doing?".

That approach is far worse than what we have today.

Let me give you an example:

- I have an asymmetric system with two types of CPUs (any odd
  big-little nonsense)

- I statically place my vcpus on physical CPUs so that only some vcpus
  are effectively generating events by using my knowledge of the
  "default PMU" behaviour.

- I perform system-wide profiling inside the guest with the full
  knowledge that only the vcpus I'm interested in will affect the
  event counts. This is deterministic.

- I can move vcpus around as I see fit to capture new counter sets
  with the same guarantees.

Your "count random stuff at random times on random CPUs" breaks this.

Is this a contrived example? Of course it is. Do we know for sure that
nobody in their right mind would do this? No, because we don't have a
mandate to decide what people can or cannot do with a behaviour that
has been established for that long.

> But for 3), Oliver raised a concern for the guest compatibility so I'd
> like to hear an explanation for that concern.

I've given it above, and in my reply to Oliver as well.

> > You want a *third* behaviour, go ahead. Define it the way you want.
> > But the behaviours described above will stay unchanged.
> > I'm looking forward to your patches implementing it, but I am also
> > done arguing on it.
> 
> I understand the discussion is tiring but I want to know the reasoning
> behind such a design decision before sending an RFC patch to a VMM
> (QEMU) so that I can explain them why it is necessary in turn.

But that's *again* where you got things completely wrong:

<allcaps>
  This is not a design decision, this is how things have always worked
  and we cannot break previous behaviours. This is unfortunate, but
  that's how the Linux kernel works.
</allcaps>

The only "design decision" is that we don't change established
behaviours, explicit or implicit. If you want to implement something
that matches this requirement, go ahead and post patches. I'll gladly
review them. If you cannot abide by this rule, then please stop
wasting your time (and mine).

Feel free to relay this to the QEMU folks who, in my experience, very
much value the notion of ABI and the preservation of existing
behaviours.

Now, that's my last email on this thread. Do no bother replying.

	M. (pretty annoyed)

-- 
Without deviation from the norm, progress is not possible.