linux-kernel - Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFEAcA8FhgcaM_OsHKB3+3Z7B_oZJqU4LHX_j9p-ZQrHfWGX7g@mail.gmail.com>
Date: Mon, 13 Oct 2025 17:48:44 +0100
From: Peter Maydell <peter.maydell@...aro.org>
To: Marc Zyngier <maz@...nel.org>
Cc: salil.mehta@...src.net, linux-kernel@...r.kernel.org, 
	linux-arm-kernel@...ts.infradead.org, salil.mehta@...wei.com, 
	jonathan.cameron@...wei.com, will@...nel.org, catalin.marinas@....com, 
	mark.rutland@....com, james.morse@....com, sudeep.holla@....com, 
	lpieralisi@...nel.org, jean-philippe@...aro.org, tglx@...utronix.de, 
	oliver.upton@...ux.dev, richard.henderson@...aro.org, andrew.jones@...ux.dev, 
	mst@...hat.com, david@...hat.com, philmd@...aro.org, ardb@...nel.org, 
	borntraeger@...ux.ibm.com, alex.bennee@...aro.org, gustavo.romero@...aro.org, 
	npiggin@...il.com, linux@...linux.org.uk, karl.heubaum@...cle.com, 
	miguel.luis@...cle.com, darren@...amperecomputing.com, 
	ilkka@...amperecomputing.com, vishnu@...amperecomputing.com, 
	gankulkarni@...amperecomputing.com, wangyanan55@...wei.com, 
	wangzhou1@...ilicon.com, linuxarm@...wei.com
Subject: Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow
 lockless read when ready

On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@...nel.org> wrote:
>
> On Mon, 13 Oct 2025 09:42:58 +0100,
> Peter Maydell <peter.maydell@...aro.org> wrote:
> >
> > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@...nel.org> wrote:
> > >
> > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > salil.mehta@...src.net wrote:
> > > >
> > > > From: Salil Mehta <salil.mehta@...wei.com>
> > > >
> > > > [A rough illustration of the problem and the probable solution]
> > > >
> > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > >
> > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > other vCPU threads. In this context, system register accesses are serialized
> > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > pause all vCPUs before issuing these IOCTLs.
> > >
> > > I'm going to repeat in public what I already said in private.
> > >
> > > Why does QEMU need to know this? I don't see how this is related to
> > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > poke at this. If QEMU needs fixing, please fix QEMU.
> >
> > I don't know the background here, but generally speaking,
> > when we do a CPU reset that includes writing all the CPU state
> > of the "this is freshly reset from userspace's point of view" vcpu
> > back to the kernel. More generally, userspace should be able to
> > read and write sysregs for a vcpu any time it likes, and not
> > arbitrarily get back -EBUSY. What does the kernel expect
> > userspace to do with an errno like that?
>
> The main issue here is that GICv3 is modelled as a device, just like
> GICv2, and that all the sysregs that are relevant to the GIC have the
> same status as the MMIO registers: they can only be accessed when the
> vcpus are not running.
>
> These sysregs are not visible through the normal ONE_REG API, and
> therefore not subjected to the "do whatever you want" rule.

Ah, I'd forgotten that. But the cpuif registers are still
per-cpu, and they do still need to be reset on vcpu reset,
and that might still happen for a single vcpu when the VM
as a whole is still running.

That said, QEMU's current code for this could be refactored
to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
We do this so we can set the userspace struct field for this
register to the right value. But we could ask the kernel for
that value once on VM startup since it's not going to change mid-run.

That would bring ICC_CTLR_EL1 into line with the other cpuif
registers, where QEMU assumes it knows what the kernel's
reset value of them is (mostly "0") and doesn't bother to ask.
This is different from how we handle ONE_REG sysregs, where
I'm pretty sure we do ask the kernel the value of all of them
on a vcpu reset. (And then write the values back again, which
is a bit silly but nobody's ever said it was a performance
problem for them :-))

> Should we have done something else when the GICv3 save/restore API was
> introduced and agreed upon with the QEMU people? Probably. Can we
> change it now? Probably not. The only thing we could relax is the
> scope of the lock when accessing a sysreg, so that we only mandate
> that the targeted vcpu is not running instead of the whole VM.
>
> And finally, if you object to this API, why should we do for GICv5,
> which is so far implemented by following the exact same principles?

I don't object to the API inherently (I don't care whether we
do these register reads via a dev ioctl or something else,
from userspace's point of view it's just "do some syscall,
get a value") -- I'm just objecting to the kernel's
implementation of it where it might return EBUSY :-)

(Also, if the kernel had failed EINVAL unconditionally for
an attempt to do this on a not-stopped VM then we'd probably
have found this mismatch in understanding about how the
API should work years ago. "Mostly works but sometimes fails
EBUSY" is the worst of all worlds.)

I haven't yet got as far as thinking about the KVM interface
for GICv5 yet...

-- PMM