Message-ID: <aLe3V9mgv/gK4MPV@yzhao56-desk.sh.intel.com>
Date: Wed, 3 Sep 2025 11:34:47 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "seanjc@...gle.com" <seanjc@...gle.com>, "Huang, Kai"
<kai.huang@...el.com>, "ackerleytng@...gle.com" <ackerleytng@...gle.com>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "Weiny, Ira" <ira.weiny@...el.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>, "michael.roth@....com"
<michael.roth@....com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the
initial measurement fails
On Wed, Sep 03, 2025 at 08:18:10AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> > On Tue, Sep 02, 2025, Yan Zhao wrote:
> > > But while writing another concurrency test, I found some sad news:
> > >
> > > SEAMCALL TDH_VP_INIT requires holding an exclusive lock on the TDR resource when its
> > > leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> > > upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on different
> > > vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() prints the error
> > > "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> > >
> > > If I switch to using v0 version of TDH_VP_INIT, the contention will be gone.
> >
> > Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
> > it's really truly necessary, then we can probably handle the change in KVM
> > since TDX is so new, but generally speaking such changes simply must not happen.
> >
> > > Note: this acquisition of the exclusive lock was not previously present in the public
> > > repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > > (The branch has been force-updated to new implementation now).
> >
> > Lovely.
>
> Hmm, this is exactly the kind of TDX module change we were just discussing
> reporting as a bug. Not clear on the timing of the change relative to landing
> upstream. We could investigate whether we could fix it in the TDX
> module. This probably falls into the category of not actually regressing any
> userspace. But it does trigger a kernel warning, so it warrants a fix, hmm.
>
> >
> > > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > > > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.
> > > Nit: we need not worry about kvm_mmu_zap_all_fast(), since it only zaps
> > > !mirror roots. The slots_lock should be for slot deletion.
> >
> > Oof, I missed that. We should have required nx_huge_pages=never for tdx=1.
> > Probably too late for that now though :-/
> >
> > > > Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> > > > kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> > > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> > > > well.
> > > >
> > > > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > > > taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> > > > problematic today, but it feels like a deadlock waiting to happen.
> > > Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> > > vcpu->mutex.
> >
> > Yikes. As does kvm_alloc_apic_access_page(), which is likely why I thought it
> > was ok to take slots_lock. But while kvm_alloc_apic_access_page() appears to be
> > called with vCPU scope, it's actually called from VM scope during vCPU creation.
> >
> > I'll chew on this, though if someone has any ideas...
> >
> > > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
> >
> > If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
> > take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> > suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
> > to protect against other races, then I guess the big hammer approach could work?
We need the big hammer approach as INIT_VCPU may race with vcpu_load()
in other vCPU ioctls.
> A duplicate TDR lock inside KVM or maybe even the arch/x86 side would make the
> reasoning easier to follow. Like, you don't need to remember "we take
> slots_lock/kvm_lock because of TDR lock", it's just 1:1. I hate the idea of
> adding more locks, and have argued against it in the past. But are we just
> fooling ourselves though? There are already more locks.
>
> Another reason to duplicate (some) locks is that it gives the scheduler more
> hints as far as waking up waiters, etc. The TDX module needs these locks to
> protect itself, so those are required. But when we just do retry loops (or let
> userspace do this), then we lose out on all of the locking goodness in the
> kernel.
>
> Anyway, just a strawman. I don't have any great ideas.
Do you think the following fix is good?
It moves KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() and uses the big hammer so that
it cannot contend with other SEAMCALLs, just as is done for tdh_mr_extend().
It passed my local concurrency test.
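(For reference: tdx_vm_state_guard used below is the VM-wide guard discussed
earlier in this thread, which I don't reproduce here. Roughly, it is assumed to
take kvm->lock, slots_lock and every vcpu->mutex, releasing them in reverse
order on exit, e.g. something along the lines of the sketch below. The helper
names are made up for illustration and the real guard may differ.)

/*
 * Illustrative sketch only (not the actual patch): a guard that serializes
 * TD-wide state changes by taking kvm->lock, slots_lock and all vcpu->mutex
 * locks, and releasing them in reverse order when it goes out of scope.
 * Needs <linux/cleanup.h>; tdx_vm_state_guard_init()/_unguard() are
 * hypothetical helper names.
 */
static void tdx_vm_state_unguard(struct kvm *kvm)
{
        kvm_unlock_all_vcpus(kvm);
        mutex_unlock(&kvm->slots_lock);
        mutex_unlock(&kvm->lock);
}

static struct kvm *tdx_vm_state_guard_init(struct kvm *kvm)
{
        int r;

        mutex_lock(&kvm->lock);
        mutex_lock(&kvm->slots_lock);

        /* kvm_lock_all_vcpus() takes every vcpu->mutex (killable). */
        r = kvm_lock_all_vcpus(kvm);
        if (r) {
                mutex_unlock(&kvm->slots_lock);
                mutex_unlock(&kvm->lock);
                return ERR_PTR(r);
        }
        return kvm;
}

DEFINE_CLASS(tdx_vm_state_guard, struct kvm *,
             if (!IS_ERR(_T)) tdx_vm_state_unguard(_T),
             tdx_vm_state_guard_init(kvm), struct kvm *kvm);

The proposed change itself: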
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 99381c8b4108..8a6f2feaab41 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3047,16 +3047,22 @@ static int tdx_vcpu_get_cpuid(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
{
- u64 apic_base;
+ struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
struct vcpu_tdx *tdx = to_tdx(vcpu);
+ u64 apic_base;
int ret;
+ CLASS(tdx_vm_state_guard, guard)(vcpu->kvm);
+ if (IS_ERR(guard))
+ return PTR_ERR(guard);
+
if (cmd->flags)
return -EINVAL;
- if (tdx->state != VCPU_TD_STATE_UNINITIALIZED)
+ if (!is_hkid_assigned(kvm_tdx) || tdx->state != VCPU_TD_STATE_UNINITIALIZED)
return -EINVAL;
+ vcpu_load(vcpu);
/*
* TDX requires X2APIC, userspace is responsible for configuring guest
* CPUID accordingly.
@@ -3075,6 +3081,7 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
tdx->state = VCPU_TD_STATE_INITIALIZED;
+ vcpu_put(vcpu);
return 0;
}
@@ -3228,10 +3235,18 @@ int tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
if (r)
return r;
- if (cmd.id != KVM_TDX_INIT_MEM_REGION)
- return -ENOIOCTLCMD;
-
- return tdx_vcpu_init_mem_region(vcpu, &cmd);
+ switch (cmd.id) {
+ case KVM_TDX_INIT_MEM_REGION:
+ r = tdx_vcpu_init_mem_region(vcpu, &cmd);
+ break;
+ case KVM_TDX_INIT_VCPU:
+ r = tdx_vcpu_init(vcpu, &cmd);
+ break;
+ default:
+ r = -ENOIOCTLCMD;
+ break;
+ }
+ return r;
}
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
@@ -3248,9 +3263,6 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
return ret;
switch (cmd.id) {
- case KVM_TDX_INIT_VCPU:
- ret = tdx_vcpu_init(vcpu, &cmd);
- break;
case KVM_TDX_GET_CPUID:
ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
break;
Besides, to unblock testing of the above code, I fixed a bug related to vcpu_load()
in the current TDX upstream code. The fixup patch is attached below.
Sean, please let me know if you want to include it in this series.
(It still lacks a Fixes tag, as I haven't determined which commit is the best fit.)
From 0d1ba6d60315e34bdb0e54acceb6e8dd0fbdb262 Mon Sep 17 00:00:00 2001
From: Yan Zhao <yan.y.zhao@...el.com>
Date: Tue, 2 Sep 2025 18:31:27 -0700
Subject: [PATCH 1/2] KVM: TDX: Fix list_add corruption during vcpu_load()
During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to a vCPU id conflict). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have already associated the
vCPU with a pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
before tdx_vcpu_free() is invoked.

Though there's no need to invoke tdh_vp_flush() on such a vCPU, failing to
dissociate the vCPU from its pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
corrupts the per-pCPU list associated_tdvcpus. A later list_add() during
vcpu_load() then detects the corruption and prints the call trace shown below.

Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for vCPUs
destroyed immediately after creation, which must still be in
VCPU_TD_STATE_UNINITIALIZED state.
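(For context, the association being corrupted is maintained roughly as in the
simplified sketch below, per the existing tdx.c; IRQ handling and memory
barriers are omitted. Without the list_del(), the stale entry stays linked on
the old pCPU's list, so the next list_add() in tdx_vcpu_load() trips the list
debug check, giving the trace below.)

/*
 * Simplified sketch of the existing association tracking (details omitted).
 *
 * tdx_vcpu_load() links the vCPU into the per-pCPU list:
 *      list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
 *
 * and tdx_disassociate_vp() undoes it:
 */
static void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
{
        struct vcpu_tdx *tdx = to_tdx(vcpu);

        list_del(&tdx->cpu_list);
        vcpu->cpu = -1;
}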
kernel BUG at lib/list_debug.c:29!
Oops: invalid opcode: 0000 [#2] SMP NOPTI
RIP: 0010:__list_add_valid_or_report+0x82/0xd0
Call Trace:
<TASK>
tdx_vcpu_load+0xa8/0x120
vt_vcpu_load+0x25/0x30
kvm_arch_vcpu_load+0x81/0x300
vcpu_load+0x55/0x90
kvm_arch_vcpu_create+0x24f/0x330
kvm_vm_ioctl_create_vcpu+0x1b1/0x53
? trace_lock_release+0x6d/0xb0
kvm_vm_ioctl+0xc2/0xa60
? tty_ldisc_deref+0x16/0x20
? debug_smp_processor_id+0x17/0x20
? __fget_files+0xc2/0x1b0
? debug_smp_processor_id+0x17/0x20
? rcu_is_watching+0x13/0x70
? __fget_files+0xc2/0x1b0
? trace_lock_release+0x6d/0xb0
? lock_release+0x14/0xd0
? __fget_files+0xcc/0x1b0
__x64_sys_ioctl+0x9a/0xf0
? rcu_is_watching+0x13/0x70
x64_sys_call+0x10ee/0x20d0
do_syscall_64+0xc3/0x470
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
---
arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e99d07611393..99381c8b4108 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -837,19 +837,51 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu)
tdx_prepare_switch_to_host(vcpu);
}
+/*
+ * Life cycles for a TD and a vCPU:
+ * 1. KVM_CREATE_VM ioctl.
+ * TD state is TD_STATE_UNINITIALIZED.
+ * hkid is not assigned at this stage.
+ * 2. KVM_TDX_INIT_VM ioctl.
+ * TD transitions to TD_STATE_INITIALIZED.
+ * hkid is assigned after this stage.
+ * 3. KVM_CREATE_VCPU ioctl. (only when TD is TD_STATE_INITIALIZED).
+ * 3.1 tdx_vcpu_create() transitions vCPU state to VCPU_TD_STATE_UNINITIALIZED.
+ * 3.2 vcpu_load() and vcpu_put() in kvm_arch_vcpu_create().
+ * 3.3 (conditional) if any error encountered after kvm_arch_vcpu_create()
+ * kvm_arch_vcpu_destroy() --> tdx_vcpu_free().
+ * 4. KVM_TDX_INIT_VCPU ioctl.
+ * tdx_vcpu_init() transitions vCPU state to VCPU_TD_STATE_INITIALIZED.
+ * vCPU control structures are allocated at this stage.
+ * 5. kvm_destroy_vm().
+ * 5.1 tdx_mmu_release_hkid(): (1) tdh_vp_flush(), disassociates all vCPUs.
+ * (2) puts hkid to !assigned state.
+ * 5.2 kvm_destroy_vcpus() --> tdx_vcpu_free():
+ * transitions vCPU to VCPU_TD_STATE_UNINITIALIZED state.
+ * 5.3 tdx_vm_destroy()
+ * transitions TD to TD_STATE_UNINITIALIZED state.
+ *
+ * tdx_vcpu_free() can be invoked only at 3.3 or 5.2.
+ * - If at 3.3, hkid is still assigned, but the vCPU must be in
+ * VCPU_TD_STATE_UNINITIALIZED state.
+ * - If at 5.2, hkid must be !assigned and all vCPUs must be in
+ * VCPU_TD_STATE_INITIALIZED state and have been dissociated.
+ */
void tdx_vcpu_free(struct kvm_vcpu *vcpu)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
struct vcpu_tdx *tdx = to_tdx(vcpu);
int i;
+ if (vcpu->cpu != -1) {
+ KVM_BUG_ON(tdx->state == VCPU_TD_STATE_INITIALIZED, vcpu->kvm);
+ tdx_disassociate_vp(vcpu);
+ return;
+ }
/*
* It is not possible to reclaim pages while hkid is assigned. It might
- * be assigned if:
- * 1. the TD VM is being destroyed but freeing hkid failed, in which
- * case the pages are leaked
- * 2. TD VCPU creation failed and this on the error path, in which case
- * there is nothing to do anyway
+ * be assigned if the TD VM is being destroyed but freeing hkid failed,
+ * in which case the pages are leaked.
*/
if (is_hkid_assigned(kvm_tdx))
return;
--
2.43.0