linux-kernel - RE: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8b82541b805b4a9293f15740df73eaa8@huawei.com>
Date: Tue, 14 Oct 2025 10:24:23 +0000
From: Salil Mehta <salil.mehta@...wei.com>
To: Marc Zyngier <maz@...nel.org>, Peter Maydell <peter.maydell@...aro.org>
CC: "salil.mehta@...src.net" <salil.mehta@...src.net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-arm-kernel@...ts.infradead.org"
	<linux-arm-kernel@...ts.infradead.org>, Jonathan Cameron
	<jonathan.cameron@...wei.com>, "will@...nel.org" <will@...nel.org>,
	"catalin.marinas@....com" <catalin.marinas@....com>, "mark.rutland@....com"
	<mark.rutland@....com>, "james.morse@....com" <james.morse@....com>,
	"sudeep.holla@....com" <sudeep.holla@....com>, "lpieralisi@...nel.org"
	<lpieralisi@...nel.org>, "jean-philippe@...aro.org"
	<jean-philippe@...aro.org>, "tglx@...utronix.de" <tglx@...utronix.de>,
	"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
	"richard.henderson@...aro.org" <richard.henderson@...aro.org>,
	"andrew.jones@...ux.dev" <andrew.jones@...ux.dev>, "mst@...hat.com"
	<mst@...hat.com>, "david@...hat.com" <david@...hat.com>, "philmd@...aro.org"
	<philmd@...aro.org>, "ardb@...nel.org" <ardb@...nel.org>,
	"borntraeger@...ux.ibm.com" <borntraeger@...ux.ibm.com>,
	"alex.bennee@...aro.org" <alex.bennee@...aro.org>,
	"gustavo.romero@...aro.org" <gustavo.romero@...aro.org>, "npiggin@...il.com"
	<npiggin@...il.com>, "linux@...linux.org.uk" <linux@...linux.org.uk>,
	"karl.heubaum@...cle.com" <karl.heubaum@...cle.com>, "miguel.luis@...cle.com"
	<miguel.luis@...cle.com>, "darren@...amperecomputing.com"
	<darren@...amperecomputing.com>, "ilkka@...amperecomputing.com"
	<ilkka@...amperecomputing.com>, "vishnu@...amperecomputing.com"
	<vishnu@...amperecomputing.com>, "gankulkarni@...amperecomputing.com"
	<gankulkarni@...amperecomputing.com>, "wangyanan (Y)"
	<wangyanan55@...wei.com>, "Wangzhou (B)" <wangzhou1@...ilicon.com>, Linuxarm
	<linuxarm@...wei.com>
Subject: RE: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow
 lockless read when ready

HI Marc,

> From: Marc Zyngier <maz@...nel.org>
> Sent: Tuesday, October 14, 2025 8:45 AM
> To: Peter Maydell <peter.maydell@...aro.org>
> 
> On Mon, 13 Oct 2025 17:48:44 +0100,
> Peter Maydell <peter.maydell@...aro.org> wrote:
> >
> > On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@...nel.org> wrote:
> > >
> > > On Mon, 13 Oct 2025 09:42:58 +0100,
> > > Peter Maydell <peter.maydell@...aro.org> wrote:
> > > >
> > > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@...nel.org> wrote:
> > > > >
> > > > > On Wed, 08 Oct 2025 21:19:55 +0100, salil.mehta@...src.net
> > > > > wrote:
> > > > > >
> > > > > > From: Salil Mehta <salil.mehta@...wei.com>
> > > > > >
> > > > > > [A rough illustration of the problem and the probable
> > > > > > solution]
> > > > > >
> > > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes
> > > > > > currently takes a slow path that may acquire all vCPU locks.
> > > > > > Under workloads that exercise userspace PSCI CPU_ON flows or
> > > > > > frequent vCPU resets, this can cause vCPU lock contention in KVM
> and, in the worst cases, -EBUSY returns to userspace.
> > > > > >
> > > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in
> > > > > > KVM, these operations are executed under KVM vCPU locks in the
> > > > > > host kernel (EL1) and appear atomic to other vCPU threads. In
> > > > > > this context, system register accesses are serialized under
> > > > > > KVM vCPU locks, ensuring atomicity with respect to other
> > > > > > vCPUs. After SMCCC filtering was introduced, PSCI CPU_ON and
> > > > > > CPU_OFF calls can now exit to userspace (QEMU). During the
> > > > > > handling of PSCI CPU_ON call in userspace, a
> > > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM
> > > > > > device attribute IOCTLs. To avoid transient inconsistency and
> > > > > > -EBUSY errors, QEMU is forced to pause all vCPUs before issuing
> these IOCTLs.
> > > > >
> > > > > I'm going to repeat in public what I already said in private.
> > > > >
> > > > > Why does QEMU need to know this? I don't see how this is related
> > > > > to PSCI, and outside of save/restore, there is no reason why
> > > > > QEMU should poke at this. If QEMU needs fixing, please fix QEMU.
> > > >
> > > > I don't know the background here, but generally speaking, when we
> > > > do a CPU reset that includes writing all the CPU state of the
> > > > "this is freshly reset from userspace's point of view" vcpu back
> > > > to the kernel. More generally, userspace should be able to read
> > > > and write sysregs for a vcpu any time it likes, and not
> > > > arbitrarily get back -EBUSY. What does the kernel expect userspace
> > > > to do with an errno like that?
> > >
> > > The main issue here is that GICv3 is modelled as a device, just like
> > > GICv2, and that all the sysregs that are relevant to the GIC have
> > > the same status as the MMIO registers: they can only be accessed
> > > when the vcpus are not running.
> > >
> > > These sysregs are not visible through the normal ONE_REG API, and
> > > therefore not subjected to the "do whatever you want" rule.
> >
> > Ah, I'd forgotten that. But the cpuif registers are still per-cpu, and
> > they do still need to be reset on vcpu reset, and that might still
> > happen for a single vcpu when the VM as a whole is still running.
> >
> > That said, QEMU's current code for this could be refactored to avoid
> > the reset-time read of ICC_CTLR_EL1 from the kernel.
> > We do this so we can set the userspace struct field for this register
> > to the right value. But we could ask the kernel for that value once on
> > VM startup since it's not going to change mid-run.
> 
> The reset value is indeed cast in stone once the GIC has been created.
> 
> > That would bring ICC_CTLR_EL1 into line with the other cpuif
> > registers, where QEMU assumes it knows what the kernel's reset value
> > of them is (mostly "0") and doesn't bother to ask.
> > This is different from how we handle ONE_REG sysregs, where I'm pretty
> > sure we do ask the kernel the value of all of them on a vcpu reset.
> > (And then write the values back again, which is a bit silly but
> > nobody's ever said it was a performance problem for them :-))
> >
> > > Should we have done something else when the GICv3 save/restore API
> > > was introduced and agreed upon with the QEMU people? Probably. Can
> > > we change it now? Probably not. The only thing we could relax is the
> > > scope of the lock when accessing a sysreg, so that we only mandate
> > > that the targeted vcpu is not running instead of the whole VM.
> > >
> > > And finally, if you object to this API, why should we do for GICv5,
> > > which is so far implemented by following the exact same principles?
> >
> > I don't object to the API inherently (I don't care whether we do these
> > register reads via a dev ioctl or something else, from userspace's
> > point of view it's just "do some syscall, get a value") -- I'm just
> > objecting to the kernel's implementation of it where it might return
> > EBUSY :-)
> 
> To me, EBUSY has a clear meaning: you're otherwise using the resource, and
> you need to relinquish it first, while EINVAL indicates that the kernel doesn't
> understand what you want.
> 
> As I said, I'm happy to look at reducing the locking to only the target vcpu in
> the case of a sysreg being accessed, but EBUSY will stay.


I forgot to mention that we also saw once the issue of -EBUSY happening when
userspace GICv3 ITS resets during guest reboot and it calls KVM_DEV_ARM_ITS_CTRL_RESET.

File: kvm/vgic/vgic-its.c

static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its *its, u64 attr)
{
[...]
	mutex_lock(&kvm->lock);

	if (kvm_trylock_all_vcpus(kvm)) {
		mutex_unlock(&kvm->lock);
		return -EBUSY;
	}

	mutex_lock(&kvm->arch.config_lock);
	mutex_lock(&its->its_lock);

	switch (attr) {
	case KVM_DEV_ARM_ITS_CTRL_RESET:
		vgic_its_reset(kvm, its);
		break;
[...]
}


Best regards
Salil.

> 
> >
> > (Also, if the kernel had failed EINVAL unconditionally for an attempt
> > to do this on a not-stopped VM then we'd probably have found this
> > mismatch in understanding about how the API should work years ago.
> > "Mostly works but sometimes fails EBUSY" is the worst of all worlds.)
> >
> > I haven't yet got as far as thinking about the KVM interface for GICv5
> > yet...
> 
> I guess that for the time being, we'll assume that GICv3 is the reference.
> 
> Thanks,
> 
> 	M.
> 
> --
> Without deviation from the norm, progress is not possible.