[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <864is2x6z9.wl-maz@kernel.org>
Date: Tue, 14 Oct 2025 08:44:42 +0100
From: Marc Zyngier <maz@...nel.org>
To: Peter Maydell <peter.maydell@...aro.org>
Cc: salil.mehta@...src.net,
linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org,
salil.mehta@...wei.com,
jonathan.cameron@...wei.com,
will@...nel.org,
catalin.marinas@....com,
mark.rutland@....com,
james.morse@....com,
sudeep.holla@....com,
lpieralisi@...nel.org,
jean-philippe@...aro.org,
tglx@...utronix.de,
oliver.upton@...ux.dev,
richard.henderson@...aro.org,
andrew.jones@...ux.dev,
mst@...hat.com,
david@...hat.com,
philmd@...aro.org,
ardb@...nel.org,
borntraeger@...ux.ibm.com,
alex.bennee@...aro.org,
gustavo.romero@...aro.org,
npiggin@...il.com,
linux@...linux.org.uk,
karl.heubaum@...cle.com,
miguel.luis@...cle.com,
darren@...amperecomputing.com,
ilkka@...amperecomputing.com,
vishnu@...amperecomputing.com,
gankulkarni@...amperecomputing.com,
wangyanan55@...wei.com,
wangzhou1@...ilicon.com,
linuxarm@...wei.com
Subject: Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready
On Mon, 13 Oct 2025 17:48:44 +0100,
Peter Maydell <peter.maydell@...aro.org> wrote:
>
> On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@...nel.org> wrote:
> >
> > On Mon, 13 Oct 2025 09:42:58 +0100,
> > Peter Maydell <peter.maydell@...aro.org> wrote:
> > >
> > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@...nel.org> wrote:
> > > >
> > > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > > salil.mehta@...src.net wrote:
> > > > >
> > > > > From: Salil Mehta <salil.mehta@...wei.com>
> > > > >
> > > > > [A rough illustration of the problem and the probable solution]
> > > > >
> > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > > >
> > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > > other vCPU threads. In this context, system register accesses are serialized
> > > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > > pause all vCPUs before issuing these IOCTLs.
> > > >
> > > > I'm going to repeat in public what I already said in private.
> > > >
> > > > Why does QEMU need to know this? I don't see how this is related to
> > > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > > poke at this. If QEMU needs fixing, please fix QEMU.
> > >
> > > I don't know the background here, but generally speaking,
> > > when we do a CPU reset that includes writing all the CPU state
> > > of the "this is freshly reset from userspace's point of view" vcpu
> > > back to the kernel. More generally, userspace should be able to
> > > read and write sysregs for a vcpu any time it likes, and not
> > > arbitrarily get back -EBUSY. What does the kernel expect
> > > userspace to do with an errno like that?
> >
> > The main issue here is that GICv3 is modelled as a device, just like
> > GICv2, and that all the sysregs that are relevant to the GIC have the
> > same status as the MMIO registers: they can only be accessed when the
> > vcpus are not running.
> >
> > These sysregs are not visible through the normal ONE_REG API, and
> > therefore not subjected to the "do whatever you want" rule.
>
> Ah, I'd forgotten that. But the cpuif registers are still
> per-cpu, and they do still need to be reset on vcpu reset,
> and that might still happen for a single vcpu when the VM
> as a whole is still running.
>
> That said, QEMU's current code for this could be refactored
> to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> We do this so we can set the userspace struct field for this
> register to the right value. But we could ask the kernel for
> that value once on VM startup since it's not going to change mid-run.
The reset value is indeed cast in stone once the GIC has been created.
> That would bring ICC_CTLR_EL1 into line with the other cpuif
> registers, where QEMU assumes it knows what the kernel's
> reset value of them is (mostly "0") and doesn't bother to ask.
> This is different from how we handle ONE_REG sysregs, where
> I'm pretty sure we do ask the kernel the value of all of them
> on a vcpu reset. (And then write the values back again, which
> is a bit silly but nobody's ever said it was a performance
> problem for them :-))
>
> > Should we have done something else when the GICv3 save/restore API was
> > introduced and agreed upon with the QEMU people? Probably. Can we
> > change it now? Probably not. The only thing we could relax is the
> > scope of the lock when accessing a sysreg, so that we only mandate
> > that the targeted vcpu is not running instead of the whole VM.
> >
> > And finally, if you object to this API, why should we do for GICv5,
> > which is so far implemented by following the exact same principles?
>
> I don't object to the API inherently (I don't care whether we
> do these register reads via a dev ioctl or something else,
> from userspace's point of view it's just "do some syscall,
> get a value") -- I'm just objecting to the kernel's
> implementation of it where it might return EBUSY :-)
To me, EBUSY has a clear meaning: you're otherwise using the resource,
and you need to relinquish it first, while EINVAL indicates that the
kernel doesn't understand what you want.
As I said, I'm happy to look at reducing the locking to only the
target vcpu in the case of a sysreg being accessed, but EBUSY will
stay.
>
> (Also, if the kernel had failed EINVAL unconditionally for
> an attempt to do this on a not-stopped VM then we'd probably
> have found this mismatch in understanding about how the
> API should work years ago. "Mostly works but sometimes fails
> EBUSY" is the worst of all worlds.)
>
> I haven't yet got as far as thinking about the KVM interface
> for GICv5 yet...
I guess that for the time being, we'll assume that GICv3 is the
reference.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
Powered by blists - more mailing lists