linux-kernel - Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR

Message-ID: <864is2x6z9.wl-maz@kernel.org>
Date: Tue, 14 Oct 2025 08:44:42 +0100
From: Marc Zyngier <maz@...nel.org>
To: Peter Maydell <peter.maydell@...aro.org>
Cc: salil.mehta@...src.net,
	linux-kernel@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org,
	salil.mehta@...wei.com,
	jonathan.cameron@...wei.com,
	will@...nel.org,
	catalin.marinas@....com,
	mark.rutland@....com,
	james.morse@....com,
	sudeep.holla@....com,
	lpieralisi@...nel.org,
	jean-philippe@...aro.org,
	tglx@...utronix.de,
	oliver.upton@...ux.dev,
	richard.henderson@...aro.org,
	andrew.jones@...ux.dev,
	mst@...hat.com,
	david@...hat.com,
	philmd@...aro.org,
	ardb@...nel.org,
	borntraeger@...ux.ibm.com,
	alex.bennee@...aro.org,
	gustavo.romero@...aro.org,
	npiggin@...il.com,
	linux@...linux.org.uk,
	karl.heubaum@...cle.com,
	miguel.luis@...cle.com,
	darren@...amperecomputing.com,
	ilkka@...amperecomputing.com,
	vishnu@...amperecomputing.com,
	gankulkarni@...amperecomputing.com,
	wangyanan55@...wei.com,
	wangzhou1@...ilicon.com,
	linuxarm@...wei.com
Subject: Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

On Mon, 13 Oct 2025 17:48:44 +0100,
Peter Maydell <peter.maydell@...aro.org> wrote:
> 
> On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@...nel.org> wrote:
> >
> > On Mon, 13 Oct 2025 09:42:58 +0100,
> > Peter Maydell <peter.maydell@...aro.org> wrote:
> > >
> > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@...nel.org> wrote:
> > > >
> > > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > > salil.mehta@...src.net wrote:
> > > > >
> > > > > From: Salil Mehta <salil.mehta@...wei.com>
> > > > >
> > > > > [A rough illustration of the problem and the probable solution]
> > > > >
> > > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently take a slow
> > > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > > >
> > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > > other vCPU threads. In this context, system register accesses are serialized
> > > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > > > userspace (QEMU). During the handling of a PSCI CPU_ON call in userspace, a
> > > > > > cpu_reset() is executed, which reads ICC_CTLR_EL1 through KVM device attribute
> > > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > > pause all vCPUs before issuing these IOCTLs.
> > > >
> > > > I'm going to repeat in public what I already said in private.
> > > >
> > > > Why does QEMU need to know this? I don't see how this is related to
> > > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > > poke at this. If QEMU needs fixing, please fix QEMU.
> > >
> > > I don't know the background here, but generally speaking,
> > > > when we do a CPU reset, that includes writing all the CPU state
> > > of the "this is freshly reset from userspace's point of view" vcpu
> > > back to the kernel. More generally, userspace should be able to
> > > read and write sysregs for a vcpu any time it likes, and not
> > > arbitrarily get back -EBUSY. What does the kernel expect
> > > userspace to do with an errno like that?
> >
> > The main issue here is that GICv3 is modelled as a device, just like
> > GICv2, and that all the sysregs that are relevant to the GIC have the
> > same status as the MMIO registers: they can only be accessed when the
> > vcpus are not running.
> >
> > These sysregs are not visible through the normal ONE_REG API, and
> > therefore not subjected to the "do whatever you want" rule.
> 
> Ah, I'd forgotten that. But the cpuif registers are still
> per-cpu, and they do still need to be reset on vcpu reset,
> and that might still happen for a single vcpu when the VM
> as a whole is still running.
> 
> That said, QEMU's current code for this could be refactored
> to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> We do this so we can set the userspace struct field for this
> register to the right value. But we could ask the kernel for
> that value once on VM startup since it's not going to change mid-run.

The reset value is indeed cast in stone once the GIC has been created.
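
As a rough sketch of what that allows on the userspace side: the value
can be queried once after GIC creation and then reused on every vcpu
reset, with no further call into the kernel. All names below are
hypothetical (this is not QEMU's actual code), and the query function is
a stub standing in for the KVM device attribute read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the KVM device attribute read of
 * ICC_CTLR_EL1. A real VMM would issue an ioctl here; this stub
 * only counts how often it is called. */
static int kernel_reads;

static uint64_t query_icc_ctlr_reset_value(void)
{
	kernel_reads++;
	return 0x400;	/* arbitrary placeholder, not a real reset value */
}

/* Hypothetical userspace copy of the per-vcpu CPU interface state. */
struct gic_cpuif {
	uint64_t icc_ctlr_el1;
};

/* Hypothetical per-VM state holding the cached reset value. */
struct vm {
	bool     reset_value_valid;
	uint64_t icc_ctlr_reset_value;
};

/* Called once, after the GIC has been created and before vcpus run. */
static void vm_cache_icc_ctlr(struct vm *vm)
{
	vm->icc_ctlr_reset_value = query_icc_ctlr_reset_value();
	vm->reset_value_valid = true;
}

/* Per-vcpu reset: uses the cached value, never calls into the kernel. */
static void cpuif_reset(const struct vm *vm, struct gic_cpuif *cs)
{
	assert(vm->reset_value_valid);
	cs->icc_ctlr_el1 = vm->icc_ctlr_reset_value;
}
```

With this shape, resetting a single vcpu while the rest of the VM is
running does not touch the GIC device API at all.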

> That would bring ICC_CTLR_EL1 into line with the other cpuif
> registers, where QEMU assumes it knows what the kernel's
> reset value of them is (mostly "0") and doesn't bother to ask.
> This is different from how we handle ONE_REG sysregs, where
> I'm pretty sure we do ask the kernel the value of all of them
> on a vcpu reset. (And then write the values back again, which
> is a bit silly but nobody's ever said it was a performance
> problem for them :-))
>
> > Should we have done something else when the GICv3 save/restore API was
> > introduced and agreed upon with the QEMU people? Probably. Can we
> > change it now? Probably not. The only thing we could relax is the
> > scope of the lock when accessing a sysreg, so that we only mandate
> > that the targeted vcpu is not running instead of the whole VM.
> >
> > And finally, if you object to this API, why should we do for GICv5,
> > which is so far implemented by following the exact same principles?
> 
> I don't object to the API inherently (I don't care whether we
> do these register reads via a dev ioctl or something else,
> from userspace's point of view it's just "do some syscall,
> get a value") -- I'm just objecting to the kernel's
> implementation of it where it might return EBUSY :-)

To me, EBUSY has a clear meaning: you're otherwise using the resource,
and you need to relinquish it first, while EINVAL indicates that the
kernel doesn't understand what you want.

As I said, I'm happy to look at reducing the locking to only the
target vcpu in the case of a sysreg being accessed, but EBUSY will
stay.
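
To illustrate the difference in scope, here is a deliberately simplified
model, not the vgic code: a "running" flag stands in for the vcpu locks,
and every name is made up for the example. The first helper refuses the
access if any vcpu is running, the second only checks the target vcpu.
Both keep the same errno convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_VCPUS 4	/* arbitrary size for the example */

struct vcpu {
	bool     running;	/* stands in for "vcpu lock is held" */
	uint64_t icc_ctlr_el1;
};

struct vm {
	struct vcpu vcpus[NR_VCPUS];
};

/* VM-wide scope: any running vcpu makes the access fail with -EBUSY. */
static int read_sysreg_vm_scope(const struct vm *vm, size_t idx,
				uint64_t *val)
{
	assert(vm != NULL);
	if (idx >= NR_VCPUS || !val)
		return -EINVAL;	/* the request itself makes no sense */

	for (size_t i = 0; i < NR_VCPUS; i++) {
		if (vm->vcpus[i].running)
			return -EBUSY;	/* resource in use, retry later */
	}

	*val = vm->vcpus[idx].icc_ctlr_el1;
	return 0;
}

/* Reduced scope: only the targeted vcpu must not be running. */
static int read_sysreg_vcpu_scope(const struct vm *vm, size_t idx,
				  uint64_t *val)
{
	assert(vm != NULL);
	if (idx >= NR_VCPUS || !val)
		return -EINVAL;

	if (vm->vcpus[idx].running)
		return -EBUSY;	/* EBUSY stays, but only for the target */

	*val = vm->vcpus[idx].icc_ctlr_el1;
	return 0;
}
```

In this model, a read targeting a stopped vcpu succeeds with the reduced
scope even while other vcpus are running, and still gets -EBUSY if the
target itself is running.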

> 
> (Also, if the kernel had failed EINVAL unconditionally for
> an attempt to do this on a not-stopped VM then we'd probably
> have found this mismatch in understanding about how the
> API should work years ago. "Mostly works but sometimes fails
> EBUSY" is the worst of all worlds.)
> 
> I haven't yet got as far as thinking about the KVM interface
> for GICv5...

I guess that for the time being, we'll assume that GICv3 is the
reference.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.