Message-ID: <aLe3V9mgv/gK4MPV@yzhao56-desk.sh.intel.com>
Date: Wed, 3 Sep 2025 11:34:47 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "Huang, Kai"
<kai.huang@...el.com>, "ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "Weiny, Ira" <ira.weiny@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "michael.roth@....com"
<michael.roth@....com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the
initial measurement fails
On Wed, Sep 03, 2025 at 08:18:10AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> > On Tue, Sep 02, 2025, Yan Zhao wrote:
> > > But while writing another concurrency test, I found some sad news:
> > >
> > > SEAMCALL TDH_VP_INIT requires holding an exclusive lock on the TDR resource when its
> > > leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> > > upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> > > vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() prints the error
> > > "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> > >
> > > If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.
> >
> > Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
> > it's really truly necessary, then we can probably handle the change in KVM
> > since TDX is so new, but generally speaking such changes simply must not happen.
> >
> > > Note: this acquisition of the exclusive lock was not previously present in the public
> > > repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > > (The branch has been force-updated to new implementation now).
> >
> > Lovely.
>
> Hmm, this is exactly the kind of TDX module change we were just discussing
> reporting as a bug. Not clear on the timing of the change relative to landing
> upstream. We could investigate whether we could fix it in the TDX
> module. This probably falls into the category of not actually regressing any
> userspace. But it does trigger a kernel warning, so it warrants a fix, hmm.
>
> >
> > > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > > > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.
> > > Nit: we need not worry about kvm_mmu_zap_all_fast(), since it only zaps
> > > !mirror roots. The slots_lock should be for slot deletion.
> >
> > Oof, I missed that. We should have required nx_huge_pages=never for tdx=1.
> > Probably too late for that now though :-/
> >
> > > > Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> > > > kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> > > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> > > > well.
> > > >
> > > > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > > > taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> > > > problematic today, but it feels like a deadlock waiting to happen.
> > > Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> > > vcpu->mutex.
> >
> > Yikes. As does kvm_alloc_apic_access_page(), which is likely why I thought it
> > was ok to take slots_lock. But while kvm_alloc_apic_access_page() appears to be
> > called with vCPU scope, it's actually called from VM scope during vCPU creation.
> >
> > I'll chew on this, though if someone has any ideas...
> >
> > > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
> >
> > If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
> > take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> > suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
> > to protect against other races, then I guess the big hammer approach could work?
We need the big hammer approach as INIT_VCPU may race with vcpu_load()
in other vCPU ioctls.
> A duplicate TDR lock inside KVM or maybe even the arch/x86 side would make the
> reasoning easier to follow. Like, you don't need to remember "we take
> slots_lock/kvm_lock because of TDR lock", it's just 1:1. I hate the idea of
> adding more locks, and have argued against it in the past. But are we just
> fooling ourselves though? There are already more locks.
>
> Another reason to duplicate (some) locks is that it gives the scheduler more
> hints as far as waking up waiters, etc. The TDX module needs these locks to
> protect itself, so those are required. But when we just do retry loops (or let
> userspace do this), then we lose out on all of the locking goodness in the
> kernel.
>
> Anyway, just a strawman. I don't have any great ideas.
Do you think the following fix is good?
It moves KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() and uses the big hammer so that
it cannot contend with other SEAMCALLs, just as is done for tdh_mr_extend().
It passed my local concurrency test.
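(For reference: tdx_vm_state_guard used below is the VM-wide guard discussed
earlier in this thread, which I don't reproduce here. Roughly, it is assumed to
take kvm->lock, slots_lock and every vcpu->mutex, releasing them in reverse
order on exit, e.g. something along the lines of the sketch below. The helper
names are made up for illustration and the real guard may differ.)

/*
 * Illustrative sketch only (not the actual patch): a guard that serializes
 * TD-wide state changes by taking kvm->lock, slots_lock and all vcpu->mutex
 * locks, and releasing them in reverse order when it goes out of scope.
 * Needs <linux/cleanup.h>; tdx_vm_state_guard_init()/_unguard() are
 * hypothetical helper names.
 */
static void tdx_vm_state_unguard(struct kvm *kvm)
{
        kvm_unlock_all_vcpus(kvm);
        mutex_unlock(&kvm->slots_lock);
        mutex_unlock(&kvm->lock);
}

static struct kvm *tdx_vm_state_guard_init(struct kvm *kvm)
{
        int r;

        mutex_lock(&kvm->lock);
        mutex_lock(&kvm->slots_lock);

        /* kvm_lock_all_vcpus() takes every vcpu->mutex (killable). */
        r = kvm_lock_all_vcpus(kvm);
        if (r) {
                mutex_unlock(&kvm->slots_lock);
                mutex_unlock(&kvm->lock);
                return ERR_PTR(r);
        }
        return kvm;
}

DEFINE_CLASS(tdx_vm_state_guard, struct kvm *,
             if (!IS_ERR(_T)) tdx_vm_state_unguard(_T),
             tdx_vm_state_guard_init(kvm), struct kvm *kvm);

The proposed change itself: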
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 99381c8b4108..8a6f2feaab41 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3047,16 +3047,22 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
{
- u64 apic_base;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u64 apic_base;
int ret;
+ CLASS(tdx_vm_state_guard, guard)(vcpu->kvm);
+ if (IS_ERR(guard))
+ return PTR_ERR(guard);
+
if (cmd->flags)
return -EINVAL;
- if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
+ if (!is_hkid_assigned(kvm_tdx) || tdx->state != VCPU_TD_STATE_UNINITIALIZED)
return -EINVAL;
+ vcpu_load(vcpu);
/*
* TDX requires X2APIC, userspace is responsible for configuring guest
* CPUID accordingly.
@@ -3075,6 +3081,7 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
tdx->state = VCPU_TD_STATE_INITIALIZED;
+ vcpu_put(vcpu);
return 0;
}
@@ -3228,10 +3235,18 @@ int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
if (r)
return r;
- if (cmd.id != KVM_TDX_INIT_MEM_REGION)
- return -ENOIOCTLCMD;
-
- return tdx_vcpu_init_mem_region(vcpu, &cmd);
+ switch (cmd.id) {
+ case KVM_TDX_INIT_MEM_REGION:
+ r = tdx_vcpu_init_mem_region(vcpu, &cmd);
+ break;
+ case KVM_TDX_INIT_VCPU:
+ r = tdx_vcpu_init(vcpu, &cmd);
+ break;
+ default:
+ r = -ENOIOCTLCMD;
+ break;
+ }
+ return r;
}
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
@@ -3248,9 +3263,6 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
return ret;
switch (cmd.id) {
- case KVM_TDX_INIT_VCPU:
- ret = tdx_vcpu_init(vcpu, &cmd);
- break;
case KVM_TDX_GET_CPUID:
ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
break;
Besides, to unblock testing of the above code, I fixed a bug related to vcpu_load()
in the current TDX upstream code. The fixup patch is attached below.
Sean, please let me know if you want to include it in this series.
(It still lacks a Fixes tag, as I haven't determined which commit is the best fit.)
From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
From: Yan Zhao <yan.y.zhao@...el.com>
Date: Tue, 2 Sep 2025 18:31:27 -0700
Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()
During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to a vCPU id conflict). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have already associated the
vCPU with a pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
before tdx_vcpu_free() is invoked.

Though there's no need to invoke tdh_vp_flush() on such a vCPU, failing to
dissociate the vCPU from its pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
corrupts the per-pCPU list associated_tdvcpus. A later list_add() during
vcpu_load() then detects the corruption and prints the call trace shown below.

Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for vCPUs
destroyed immediately after creation, which must still be in
VCPU_TD_STATE_UNINITIALIZED state.
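(For context, the association being corrupted is maintained roughly as in the
simplified sketch below, per the existing tdx.c; IRQ handling and memory
barriers are omitted. Without the list_del(), the stale entry stays linked on
the old pCPU's list, so the next list_add() in tdx_vcpu_load() trips the list
debug check, giving the trace below.)

/*
 * Simplified sketch of the existing association tracking (details omitted).
 *
 * tdx_vcpu_load() links the vCPU into the per-pCPU list:
 *      list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
 *
 * and tdx_disassociate_vp() undoes it:
 */
static void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
{
        struct vcpu_tdx *tdx = to_tdx(vcpu);

        list_del(&tdx->cpu_list);
        vcpu->cpu = -1;
}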
kernel BUG at lib/list_debug.c:29!
Oops: invalid opcode: 0000 [#2] SMP NOPTI
RIP: 0010:__list_add_valid_or_report+0x82/0xd0
Call Trace:
<TASK>
tdx_vcpu_load+0xa8/0x120
vt_vcpu_load+0x25/0x30
kvm_arch_vcpu_load+0x81/0x300
vcpu_load+0x55/0x90
kvm_arch_vcpu_create+0x24f/0x330
kvm_vm_ioctl_create_vcpu+0x1b1/0x53
? trace_lock_release+0x6d/0xb0
kvm_vm_ioctl+0xc2/0xa60
? tty_ldisc_deref+0x16/0x20
? debug_smp_processor_id+0x17/0x20
? __fget_files+0xc2/0x1b0
? debug_smp_processor_id+0x17/0x20
? rcu_is_watching+0x13/0x70
? __fget_files+0xc2/0x1b0
? trace_lock_release+0x6d/0xb0
? lock_release+0x14/0xd0
? __fget_files+0xcc/0x1b0
__x64_sys_ioctl+0x9a/0xf0
? rcu_is_watching+0x13/0x70
x64_sys_call+0x10ee/0x20d0
do_syscall_64+0xc3/0x470
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
---
arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e99d07611393..99381c8b4108 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -837,19 +837,51 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu)
tdx_prepare_switch_to_host(vcpu);
}
+/*
+ * Life cycles for a TD and a vCPU:
+ * 1. KVM_CREATE_VM ioctl.
+ * TD state is TD_STATE_UNINITIALIZED.
+ * hkid is not assigned at this stage.
+ * 2. KVM_TDX_INIT_VM ioctl.
+ * TD transitions to TD_STATE_INITIALIZED.
+ * hkid is assigned after this stage.
+ * 3. KVM_CREATE_VCPU ioctl. (only when TD is TD_STATE_INITIALIZED).
+ * 3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED.
+ * 3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create().
+ * 3.3 (conditional) if any error encountered after kvm_arch_vcpu_create()
+ * kvm_arch_vcpu_destroy() --> tdx_vcpu_free().
+ * 4. KVM_TDX_INIT_VCPU ioctl.
+ * tdx_vcpu_init() transitions vCPU state to VCPU_TD_STATE_INITIALIZED.
+ * vCPU control structures are allocated at this stage.
+ * 5. kvm_destroy_vm().
+ * 5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), disassociates all vCPUs.
+ * (2) puts hkid to !assigned state.
+ * 5.2 kvm_destroy_vcpus() --> tdx_vcpu_free():
+ * transitions vCPU to VCPU_TD_STATE_UNINITIALIZED state.
+ * 5.3 tdx_vm_destroy()
+ * transitions TD to TD_STATE_UNINITIALIZED state.
+ *
+ * tdx_vcpu_free() can be invoked only at 3.3 or 5.2.
+ * - If at 3.3, hkid is still assigned, but the vCPU must be in
+ * VCPU_TD_STATE_UNINITIALIZED state.
+ * - If at 5.2, hkid must be !assigned and all vCPUs must be in
+ * VCPU_TD_STATE_INITIALIZED state and have been dissociated.
+ */
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
struct vcpu_tdx *tdx = to_tdx(vcpu);
int i;
+ if (vcpu->cpu != -1) {
+ KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
+ tdx_disassociate_vp(vcpu);
+ return;
+ }
/*
* It is not possible to reclaim pages while hkid is assigned. It might
- * be assigned if:
- * 1. the TD VM is being destroyed but freeing hkid failed, in which
- * case the pages are leaked
- * 2. TD VCPU creation failed and this on the error path, in which case
- * there is nothing to do anyway
+ * be assigned if the TD VM is being destroyed but freeing hkid failed,
+ * in which case the pages are leaked.
*/
if (is_hkid_assigned(kvm_tdx))
return;
--
2.43.0