lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ7pxebGOj5Z1iW6VPjzj73eY1hZFjS5uXMkee8XCm09DtFcKQ@mail.gmail.com>
Date: Tue, 14 Oct 2025 03:02:44 +0000
From: Salil Mehta <salil.mehta@...src.net>
To: Peter Maydell <peter.maydell@...aro.org>
Cc: Marc Zyngier <maz@...nel.org>, linux-kernel@...r.kernel.org, 
	linux-arm-kernel@...ts.infradead.org, salil.mehta@...wei.com, 
	jonathan.cameron@...wei.com, will@...nel.org, catalin.marinas@....com, 
	mark.rutland@....com, james.morse@....com, sudeep.holla@....com, 
	lpieralisi@...nel.org, jean-philippe@...aro.org, tglx@...utronix.de, 
	oliver.upton@...ux.dev, richard.henderson@...aro.org, andrew.jones@...ux.dev, 
	mst@...hat.com, david@...hat.com, philmd@...aro.org, ardb@...nel.org, 
	borntraeger@...ux.ibm.com, alex.bennee@...aro.org, gustavo.romero@...aro.org, 
	npiggin@...il.com, linux@...linux.org.uk, karl.heubaum@...cle.com, 
	miguel.luis@...cle.com, darren@...amperecomputing.com, 
	ilkka@...amperecomputing.com, vishnu@...amperecomputing.com, 
	gankulkarni@...amperecomputing.com, wangyanan55@...wei.com, 
	wangzhou1@...ilicon.com, linuxarm@...wei.com
Subject: Re: [RFC PATCH] KVM: arm64: vgic-v3: Cache ICC_CTLR_EL1 and allow
 lockless read when ready

Hi Peter

On Mon, Oct 13, 2025 at 4:48 PM Peter Maydell <peter.maydell@...aro.org> wrote:
>
> On Mon, 13 Oct 2025 at 11:55, Marc Zyngier <maz@...nel.org> wrote:
> >
> > On Mon, 13 Oct 2025 09:42:58 +0100,
> > Peter Maydell <peter.maydell@...aro.org> wrote:
> > >
> > > On Thu, 9 Oct 2025 at 14:48, Marc Zyngier <maz@...nel.org> wrote:
> > > >
> > > > On Wed, 08 Oct 2025 21:19:55 +0100,
> > > > salil.mehta@...src.net wrote:
> > > > >
> > > > > From: Salil Mehta <salil.mehta@...wei.com>
> > > > >
> > > > > [A rough illustration of the problem and the probable solution]
> > > > >
> > > > > Userspace reads of ICC_CTLR_EL1 via KVM device attributes currently takes a slow
> > > > > path that may acquire all vCPU locks. Under workloads that exercise userspace
> > > > > PSCI CPU_ON flows or frequent vCPU resets, this can cause vCPU lock contention
> > > > > in KVM and, in the worst cases, -EBUSY returns to userspace.
> > > > >
> > > > > When PSCI CPU_ON and CPU_OFF calls are handled entirely in KVM, these operations
> > > > > are executed under KVM vCPU locks in the host kernel (EL1) and appear atomic to
> > > > > other vCPU threads. In this context, system register accesses are serialized
> > > > > under KVM vCPU locks, ensuring atomicity with respect to other vCPUs. After
> > > > > SMCCC filtering was introduced, PSCI CPU_ON and CPU_OFF calls can now exit to
> > > > > userspace (QEMU). During the handling of PSCI CPU_ON call in userspace, a
> > > > > cpu_reset() is exerted which reads ICC_CTLR_EL1 through KVM device attribute
> > > > > IOCTLs. To avoid transient inconsistency and -EBUSY errors, QEMU is forced to
> > > > > pause all vCPUs before issuing these IOCTLs.
> > > >
> > > > I'm going to repeat in public what I already said in private.
> > > >
> > > > Why does QEMU need to know this? I don't see how this is related to
> > > > PSCI, and outside of save/restore, there is no reason why QEMU should
> > > > poke at this. If QEMU needs fixing, please fix QEMU.
> > >
> > > I don't know the background here, but generally speaking,
> > > when we do a CPU reset that includes writing all the CPU state
> > > of the "this is freshly reset from userspace's point of view" vcpu
> > > back to the kernel. More generally, userspace should be able to
> > > read and write sysregs for a vcpu any time it likes, and not
> > > arbitrarily get back -EBUSY. What does the kernel expect
> > > userspace to do with an errno like that?
> >
> > The main issue here is that GICv3 is modelled as a device, just like
> > GICv2, and that all the sysregs that are relevant to the GIC have the
> > same status as the MMIO registers: they can only be accessed when the
> > vcpus are not running.
> >
> > These sysregs are not visible through the normal ONE_REG API, and
> > therefore not subjected to the "do whatever you want" rule.
>
> Ah, I'd forgotten that. But the cpuif registers are still
> per-cpu, and they do still need to be reset on vcpu reset,
> and that might still happen for a single vcpu when the VM
> as a whole is still running.
>
> That said, QEMU's current code for this could be refactored
> to avoid the reset-time read of ICC_CTLR_EL1 from the kernel.
> We do this so we can set the userspace struct field for this
> register to the right value. But we could ask the kernel for
> that value once on VM startup since it's not going to change mid-run.
>
> That would bring ICC_CTLR_EL1 into line with the other cpuif
> registers, where QEMU assumes it knows what the kernel's
> reset value of them is (mostly "0") and doesn't bother to ask.
> This is different from how we handle ONE_REG sysregs, where
> I'm pretty sure we do ask the kernel the value of all of them
> on a vcpu reset. (And then write the values back again, which
> is a bit silly but nobody's ever said it was a performance
> problem for them :-))


This is effectively what the mentioned patch in the commit-log is doing.
Pasting here again:

https://lore.kernel.org/qemu-devel/20251001010127.3092631-22-salil.mehta@opnsrc.net/

Can this patch be accepted independently then?


>
> > Should we have done something else when the GICv3 save/restore API was
> > introduced and agreed upon with the QEMU people? Probably. Can we
> > change it now? Probably not. The only thing we could relax is the
> > scope of the lock when accessing a sysreg, so that we only mandate
> > that the targeted vcpu is not running instead of the whole VM.
> >
> > And finally, if you object to this API, why should we do for GICv5,
> > which is so far implemented by following the exact same principles?
>
> I don't object to the API inherently (I don't care whether we
> do these register reads via a dev ioctl or something else,
> from userspace's point of view it's just "do some syscall,
> get a value") -- I'm just objecting to the kernel's
> implementation of it where it might return EBUSY :-)
>
> (Also, if the kernel had failed EINVAL unconditionally for
> an attempt to do this on a not-stopped VM then we'd probably
> have found this mismatch in understanding about how the
> API should work years ago. "Mostly works but sometimes fails
> EBUSY" is the worst of all worlds.)
>
> I haven't yet got as far as thinking about the KVM interface
> for GICv5 yet...
>
> -- PMM

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ