linux-kernel - Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR

Message-ID: <867bwzxe9r.wl-maz@kernel.org>
Date: Mon, 13 Oct 2025 11:54:56 +0100
From: Marc Zyngier <maz@...nel.org>
To: Peter Maydell <peter.maydell@...aro.org>
Cc: salil.mehta@...src.net,
	linux-kernel@...r.kernel.org,
	linux-arm-kernel@...ts.infradead.org,
	salil.mehta@...wei.com,
	jonathan.cameron@...wei.com,
	will@...nel.org,
	catalin.marinas@....com,
	mark.rutland@....com,
	james.morse@....com,
	sudeep.holla@....com,
	lpieralisi@...nel.org,
	jean-philippe@...aro.org,
	tglx@...utronix.de,
	oliver.upton@...ux.dev,
	richard.henderson@...aro.org,
	andrew.jones@...ux.dev,
	mst@...hat.com,
	david@...hat.com,
	philmd@...aro.org,
	ardb@...nel.org,
	borntraeger@...ux.ibm.com,
	alex.bennee@...aro.org,
	gustavo.romero@...aro.org,
	npiggin@...il.com,
	linux@...linux.org.uk,
	karl.heubaum@...cle.com,
	miguel.luis@...cle.com,
	darren@...amperecomputing.com,
	ilkka@...amperecomputing.com,
	vishnu@...amperecomputing.com,
	gankulkarni@...amperecomputing.com,
	wangyanan55@...wei.com,
	wangzhou1@...ilicon.com,
	linuxarm@...wei.com
Subject: Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow lockless read when ready

On Mon, 13 Oct 2025 09:42:58 +0100,
Peter Maydell <peter.maydell@...aro.org> wrote:
> 
> On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@...nel.org> wrote:
> >
> > On Wed, 08 Oct 2025 21:19:55 +0100,
> > salil.mehta@...src.net wrote:
> > >
> > > From: Salil Mehta <salil.mehta@...wei.com>
> > >
> > > [A rough illustration of the problem and the probable solution]
> > >
> > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently take a slow
> > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > >
> > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > other vCPU threads. In this context, system register accesses are serialized
> > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > userspace (QEMU). While userspace handles a PSCI CPU_ON call, it performs a
> > > > cpu_reset(), which reads ICC_CTLR_EL1 through KVM device attribute IOCTLs.
> > > > To avoid transient inconsistency and -EBUSY errors, QEMU is forced to pause
> > > > all vCPUs before issuing these IOCTLs.
> >
> > I'm going to repeat in public what I already said in private.
> >
> > Why does QEMU need to know this? I don't see how this is related to
> > PSCI, and outside of save/restore, there is no reason why QEMU should
> > poke at this. If QEMU needs fixing, please fix QEMU.
> 
> I don't know the background here, but generally speaking,
> when we do a CPU reset, that includes writing all the CPU state
> of the "this is freshly reset from userspace's point of view" vcpu
> back to the kernel. More generally, userspace should be able to
> read and write sysregs for a vcpu any time it likes, and not
> arbitrarily get back -EBUSY. What does the kernel expect
> userspace to do with an errno like that?

The main issue here is that GICv3 is modelled as a device, just like
GICv2, and that all the sysregs that are relevant to the GIC have the
same status as the MMIO registers: they can only be accessed when the
vcpus are not running.

These sysregs are not visible through the normal ONE_REG API, and
are therefore not subject to the "do whatever you want" rule.
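
For illustration only, here is a minimal sketch of how the device-attribute
address for such a GIC sysreg access could be built, using ICC_CTLR_EL1
(S3_0_C12_C12_4) as the example. The numeric constants are reproduced from
memory of the arm64 KVM uapi header and should be treated as assumptions to
check against the kernel tree; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: values mirrored from arch/arm64/include/uapi/asm/kvm.h as
 * remembered; verify against your kernel headers before relying on them.
 */
#define VGIC_GRP_CPU_SYSREGS  6   /* KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS */
#define VGIC_V3_MPIDR_SHIFT   32  /* KVM_DEV_ARM_VGIC_V3_MPIDR_SHIFT */
#define SYSREG_OP0_SHIFT      14
#define SYSREG_OP1_SHIFT      11
#define SYSREG_CRN_SHIFT      7
#define SYSREG_CRM_SHIFT      3
#define SYSREG_OP2_SHIFT      0

/* Hypothetical helper: pack a sysreg encoding into the low 16 bits. */
static inline uint64_t vgic_sysreg_instr(unsigned op0, unsigned op1,
                                         unsigned crn, unsigned crm,
                                         unsigned op2)
{
    /* Each field must fit its architectural width. */
    assert(op0 < 4 && op1 < 8 && crn < 16 && crm < 16 && op2 < 8);
    return ((uint64_t)op0 << SYSREG_OP0_SHIFT) |
           ((uint64_t)op1 << SYSREG_OP1_SHIFT) |
           ((uint64_t)crn << SYSREG_CRN_SHIFT) |
           ((uint64_t)crm << SYSREG_CRM_SHIFT) |
           ((uint64_t)op2 << SYSREG_OP2_SHIFT);
}

/* Hypothetical helper: pack Aff3.Aff2.Aff1.Aff0 into a 32-bit value. */
static inline uint64_t vgic_pack_mpidr(uint8_t aff3, uint8_t aff2,
                                       uint8_t aff1, uint8_t aff0)
{
    return ((uint64_t)aff3 << 24) | ((uint64_t)aff2 << 16) |
           ((uint64_t)aff1 << 8)  |  (uint64_t)aff0;
}

/* Attribute address selecting ICC_CTLR_EL1 of the vcpu with this affinity. */
static inline uint64_t icc_ctlr_el1_attr(uint8_t aff3, uint8_t aff2,
                                         uint8_t aff1, uint8_t aff0)
{
    return (vgic_pack_mpidr(aff3, aff2, aff1, aff0) << VGIC_V3_MPIDR_SHIFT) |
           vgic_sysreg_instr(3, 0, 12, 12, 4);
}
```

Userspace would place this value in the attr field of a struct
kvm_device_attr, set group to the CPU sysregs group, and issue
KVM_GET_DEVICE_ATTR on the VGIC device fd. That is the path on which the
"vcpus must not be running" rule applies.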

Should we have done something else when the GICv3 save/restore API was
introduced and agreed upon with the QEMU people? Probably. Can we
change it now? Probably not. The only thing we could relax is the
scope of the lock when accessing a sysreg, so that we only mandate
that the targeted vcpu is not running instead of the whole VM.
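
To make the difference in scope concrete, here is a toy userspace model (not
kernel code, and not the vgic implementation). It only contrasts "no vcpu in
the VM may be running" with "the targeted vcpu must not be running". The
struct and function names are hypothetical, and the real code uses vcpu
locks rather than a simple flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_VCPUS 4

/* Toy stand-in for a VM: one "running" flag per vcpu. */
struct toy_vm {
    bool running[TOY_NR_VCPUS];
};

/* VM-wide scope: the access is refused if any vcpu is running. */
static inline int toy_access_vm_scope(const struct toy_vm *vm, size_t target)
{
    assert(vm != NULL);
    if (target >= TOY_NR_VCPUS)
        return -EINVAL;
    for (size_t i = 0; i < TOY_NR_VCPUS; i++)
        if (vm->running[i])
            return -EBUSY;
    return 0;
}

/* Relaxed scope: only the targeted vcpu has to be stopped. */
static inline int toy_access_vcpu_scope(const struct toy_vm *vm, size_t target)
{
    assert(vm != NULL);
    if (target >= TOY_NR_VCPUS)
        return -EINVAL;
    return vm->running[target] ? -EBUSY : 0;
}
```

In this model, with only vcpu 1 running, an access targeting vcpu 0 fails
with -EBUSY under the VM-wide rule but succeeds under the per-vcpu rule.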

And finally, if you object to this API, what should we do for GICv5,
which is so far implemented by following the exact same principles?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.