Message-ID: <ZrFYsSPaDWUHOl0N@google.com>
Date: Mon, 5 Aug 2024 15:56:49 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Michal Luczaj <mhal@...x.co>
Cc: Will Deacon <will@...nel.org>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Paolo Bonzini <pbonzini@...hat.com>, Alexander Potapenko <glider@...gle.com>, Marc Zyngier <maz@...nel.org>
Subject: Re: [PATCH] KVM: Fix error path in kvm_vm_ioctl_create_vcpu() on xa_store() failure

On Sun, Aug 04, 2024, Michal Luczaj wrote:
> On 8/1/24 14:41, Will Deacon wrote:
> > On Wed, Jul 31, 2024 at 09:18:56AM -0700, Sean Christopherson wrote:
> >> [...]
> >> Ya, the basic problem is that we have two ways of publishing the vCPU, fd and
> >> vcpu_array, with no way of setting both atomically. Given that xa_store() should
> >> never fail, I vote we do the simple thing and deliberately leak the memory.
> >
> > I'm inclined to agree. This conversation did momentarily get me worried
> > about the window between the successful create_vcpu_fd() and the
> > xa_store(), but it looks like 'kvm->online_vcpus' protects that.
> >
> > I'll spin a v2 leaking the vCPU, then.
>
> But perhaps you're right. The window you've described may be an issue.
> For example:
>
> static u64 get_time_ref_counter(struct kvm *kvm)
> {
> ...
> vcpu = kvm_get_vcpu(kvm, 0); // may still be NULL
> tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
> + hv->tsc_ref.tsc_offset;
> }
>
> u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> {
> return vcpu->arch.l1_tsc_offset +
> kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
> }
>
> After stuffing msleep() between fd install and vcpu_array store:
>
> [ 125.296110] BUG: kernel NULL pointer dereference, address: 0000000000000b38
> [ 125.296203] #PF: supervisor read access in kernel mode
> [ 125.296266] #PF: error_code(0x0000) - not-present page
> [ 125.296327] PGD 12539e067 P4D 12539e067 PUD 12539d067 PMD 0
> [ 125.296392] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 125.296454] CPU: 12 UID: 1000 PID: 1179 Comm: a.out Not tainted 6.11.0-rc1nokasan+ #19
> [ 125.296521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 125.296585] RIP: 0010:kvm_read_l1_tsc+0x6/0x50 [kvm]
> [ 125.297376] Call Trace:
> [ 125.297430] <TASK>
> [ 125.297919] get_time_ref_counter+0x70/0x90 [kvm]
> [ 125.298039] kvm_hv_get_msr_common+0xc1/0x7d0 [kvm]
> [ 125.298150] __kvm_get_msr+0x72/0xf0 [kvm]
> [ 125.298421] do_get_msr+0x16/0x50 [kvm]
> [ 125.298531] msr_io+0x9d/0x110 [kvm]
> [ 125.298626] kvm_arch_vcpu_ioctl+0xdc5/0x19c0 [kvm]
> [ 125.299345] kvm_vcpu_ioctl+0x6cc/0x920 [kvm]
> [ 125.299540] __x64_sys_ioctl+0x90/0xd0
> [ 125.299582] do_syscall_64+0x93/0x180
> [ 125.300206] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 125.300243] RIP: 0033:0x7f2d64aded2d
>
> So, is get_time_ref_counter() broken (with a trivial fix) or should it be
> considered a regression after commit afb2acb2e3a3
> ("KVM: Fix vcpu_array[0] races")?

The latter, though arguably afb2acb2e3a3 isn't really a regression since it
essentially just reverts back to the pre-Xarray code, i.e. the bug was always
there, it was just temporarily masked by a worse bug.

I don't think we want to go down the path of declaring get_time_ref_counter()
broken, because that is going to result in an impossible programming model.

Ha! We can kill two birds with one stone. If we take vcpu->mutex before installing
the file descriptor, and hold it until online_vcpus is bumped, userspace

Argh, so close, kvm_arch_vcpu_async_ioctl() throws a wrench in that idea. Double
argh, whether or not an ioctl is async is buried in arch code.

I still think it makes sense to grab vcpu->mutex for synchronous ioctls. That
way there's no visible change to userspace, and we can lean on that code to reject
the async ioctls, as I can't imagine there's a practical use case for emitting
an async ioctl without first doing a synchronous ioctl. E.g. in addition to the
below patch, plus changes to add kvm_arch_is_async_vcpu_ioctl():

	/*
	 * Some architectures have vcpu ioctls that are asynchronous to vcpu
	 * execution; mutex_lock() would break them. Disallow asynchronous
	 * ioctls until the vCPU is fully online. This can only happen if
	 * userspace has *never* done a synchronous ioctl, as acquiring the
	 * vCPU's mutex ensures the vCPU is online, i.e. this isn't a
	 * restriction for any practical use case.
	 */
	if (kvm_arch_is_async_vcpu_ioctl(ioctl)) {
		if (vcpu->vcpu_idx >= atomic_read(&kvm->online_vcpus))
			return -EINVAL;
		return kvm_vcpu_async_ioctl(filp, ioctl, arg);
	}
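
For illustration only, here is a minimal user-space sketch of that online check, not kernel code. It assumes the intent is to reject ioctls while the vCPU is not yet online, i.e. while its index has not yet been passed by online_vcpus. All model_* names are hypothetical stand-ins, and C11 atomics replace the kernel's atomic_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct kvm and struct kvm_vcpu. */
struct model_vm {
	atomic_int online_vcpus;	/* vCPUs fully published so far */
};

struct model_vcpu {
	struct model_vm *vm;
	int vcpu_idx;			/* slot assigned at creation */
};

/* A vCPU is online once online_vcpus has been bumped past its index. */
static bool model_vcpu_is_online(const struct model_vcpu *vcpu)
{
	return vcpu->vcpu_idx < atomic_load(&vcpu->vm->online_vcpus);
}

/* Gate for an ioctl: -EINVAL until the vCPU is online, 0 afterwards. */
static int model_vcpu_ioctl_gate(const struct model_vcpu *vcpu)
{
	if (!model_vcpu_is_online(vcpu))
		return -EINVAL;
	return 0;
}

/* Final step of creation: bump the count so the vCPU becomes visible. */
static void model_vcpu_publish(struct model_vcpu *vcpu)
{
	/* The model assumes vCPUs are published in index order. */
	assert(vcpu->vcpu_idx == atomic_load(&vcpu->vm->online_vcpus));
	atomic_fetch_add(&vcpu->vm->online_vcpus, 1);
}
```

In this model a vCPU created with index 0 is rejected by model_vcpu_ioctl_gate() until model_vcpu_publish() has run, which mirrors the window between installing the fd and bumping online_vcpus.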
Alternatively, we could go for the super simple change and cross our fingers that
no "real" VMM emits vCPU ioctls before KVM_CREATE_VCPU returns.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9ae9022a015f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4450,6 +4450,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
return -EINVAL;
+ if (unlikely(vcpu->vcpu_idx >= atomic_read(&kvm->online_vcpus)))
+ return -EINVAL;
+
/*
* Some architectures have vcpu ioctls that are asynchronous to vcpu
* execution; mutex_lock() would break them.
The mutex approach, sans async ioctl support:
---
virt/kvm/kvm_main.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..0a9c390b18a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4269,12 +4269,6 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
mutex_lock(&kvm->lock);
-#ifdef CONFIG_LOCKDEP
- /* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
- mutex_lock(&vcpu->mutex);
- mutex_unlock(&vcpu->mutex);
-#endif
-
if (kvm_get_vcpu_by_id(kvm, id)) {
r = -EEXIST;
goto unlock_vcpu_destroy;
@@ -4285,15 +4279,29 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
if (r)
goto unlock_vcpu_destroy;
- /* Now it's all set up, let userspace reach it */
+ /*
+ * Now it's all set up, let userspace reach it. Grab the vCPU's mutex
+ * so that userspace can't invoke vCPU ioctl()s until the vCPU is fully
+ * visible (per online_vcpus), e.g. so that KVM doesn't get tricked
+ * into a NULL-pointer dereference because KVM thinks the _current_
+ * vCPU doesn't exist. As a bonus, taking vcpu->mutex ensures lockdep
+ * knows it's taken *inside* kvm->lock.
+ */
+ mutex_lock(&vcpu->mutex);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0)
goto kvm_put_xa_release;
+ /*
+ * xa_store() should never fail, see xa_reserve() above. Leak the vCPU
+ * if the impossible happens, as userspace already has access to the
+ * vCPU, i.e. freeing the vCPU before userspace puts its file reference
+ * would trigger a use-after-free.
+ */
if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
- r = -EINVAL;
- goto kvm_put_xa_release;
+ mutex_unlock(&vcpu->mutex);
+ return -EINVAL;
}
/*
@@ -4302,6 +4310,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
*/
smp_wmb();
atomic_inc(&kvm->online_vcpus);
+ mutex_unlock(&vcpu->mutex);
mutex_unlock(&kvm->lock);
kvm_arch_vcpu_postcreate(vcpu);
@@ -4309,6 +4318,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
return r;
kvm_put_xa_release:
+ mutex_unlock(&vcpu->mutex);
kvm_put_kvm_no_destroy(kvm);
xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
unlock_vcpu_destroy:
base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
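
As a rough illustration of why holding vcpu->mutex across the fd install closes the window, here is a single-threaded, user-space walk-through of the ordering. It is a sketch under stated assumptions, not the kernel implementation: every sketch_* name is hypothetical, an atomic_flag plays the role of vcpu->mutex, and -EAGAIN marks the point where the real code would sleep in mutex_lock().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_vcpu {
	atomic_flag mutex_held;		/* toy stand-in for vcpu->mutex */
	int vcpu_idx;
};

struct sketch_vm {
	struct sketch_vcpu *fd_target;		/* reachable via the "fd" */
	struct sketch_vcpu *vcpu_array[1];	/* reachable via lookup by index */
	int online_vcpus;
};

static bool sketch_trylock(struct sketch_vcpu *vcpu)
{
	return !atomic_flag_test_and_set(&vcpu->mutex_held);
}

static void sketch_unlock(struct sketch_vcpu *vcpu)
{
	atomic_flag_clear(&vcpu->mutex_held);
}

/*
 * A synchronous "ioctl": -EBADF with no fd yet, -EAGAIN where the real code
 * would block on the mutex, -EFAULT where it would dereference NULL, else 0.
 */
static int sketch_vcpu_ioctl(struct sketch_vm *vm)
{
	struct sketch_vcpu *vcpu = vm->fd_target;
	int r;

	if (!vcpu)
		return -EBADF;
	if (!sketch_trylock(vcpu))
		return -EAGAIN;
	r = vm->vcpu_array[0] ? 0 : -EFAULT;
	sketch_unlock(vcpu);
	return r;
}

/* Walk through creation, probing with an ioctl inside the window. */
static int sketch_walkthrough(void)
{
	struct sketch_vcpu vcpu = { .mutex_held = ATOMIC_FLAG_INIT, .vcpu_idx = 0 };
	struct sketch_vm vm = { 0 };

	if (!sketch_trylock(&vcpu))		/* mutex_lock(&vcpu->mutex) */
		return -1;
	vm.fd_target = &vcpu;			/* create_vcpu_fd() */
	if (sketch_vcpu_ioctl(&vm) != -EAGAIN)	/* window: ioctl has to wait */
		return -1;
	vm.vcpu_array[vcpu.vcpu_idx] = &vcpu;	/* xa_store() */
	vm.online_vcpus++;			/* atomic_inc(&kvm->online_vcpus) */
	assert(vm.online_vcpus == 1);
	sketch_unlock(&vcpu);			/* mutex_unlock(&vcpu->mutex) */
	return sketch_vcpu_ioctl(&vm);		/* vCPU is now fully visible */
}
```

Without the lock held by the creator, the same sketch_vcpu_ioctl() call made between the fd install and the array store returns -EFAULT, which corresponds to the NULL-pointer dereference in the oops quoted above.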
--