Message-ID: <aLcjppW1eiCrxJPC@google.com>
Date: Tue, 2 Sep 2025 10:04:38 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yan Zhao <yan.y.zhao@...el.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>, Kai Huang <kai.huang@...el.com>,
"ackerleytng@...gle.com" <ackerleytng@...gle.com>, Vishal Annapurve <vannapurve@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Ira Weiny <ira.weiny@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "michael.roth@....com" <michael.roth@....com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial
measurement fails
On Tue, Sep 02, 2025, Yan Zhao wrote:
> But while writing another concurrency test, I found some sad news:
>
> SEAMCALL TDH_VP_INIT requires holding an exclusive lock on the TDR resource when
> its leaf_opcode.version > 0. So, when I use v1 (which is the current value in
> upstream, for x2apic?) to test executing the KVM_TDX_INIT_VCPU ioctl on different
> vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() prints the error
> "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
>
> If I switch to the v0 version of TDH_VP_INIT, the contention is gone.
Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
it's really truly necessary, then we can probably handle the change in KVM
since TDX is so new, but generally speaking such changes simply must not happen.
> Note: this acquisition of the exclusive lock was not previously present in the
> public repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> (The branch has been force-updated to new implementation now).
Lovely.
> > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.
> Nit: we don't need to worry about kvm_mmu_zap_all_fast(), since it only zaps
> !mirror roots. The slots_lock should be for slots deletion.
Oof, I missed that. We should have required nx_huge_pages=never for tdx=1.
Probably too late for that now though :-/
> > Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> > kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> > well.
> >
> > Anyways, I think we need to avoid the "synchronous" ioctl path anyways, because
> > taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> > problematic today, but it feels like a deadlock waiting to happen.
> Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock
> inside vcpu->mutex.
Yikes. As does kvm_alloc_apic_access_page(), which is likely why I thought it
was ok to take slots_lock. But while kvm_alloc_apic_access_page() appears to be
called with vCPU scope, it's actually called from VM scope during vCPU creation.
I'll chew on this, though if someone has any ideas...
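FWIW, the rough shape I have in mind for the async path is something like the
below. It's completely untested, it ignores the "take _all_ vCPU mutexes" part,
and the helper names are made up; the only real point is that the async hook is
invoked by kvm_vcpu_ioctl() *before* vcpu->mutex is taken, so the locks can be
acquired in kvm->lock -> slots_lock -> vcpu->mutex order:

	/*
	 * Sketch: handle KVM_TDX_INIT_MEM_REGION from the async vCPU ioctl
	 * hook so that vcpu->mutex is taken *after* kvm->lock and slots_lock.
	 * tdx_ioctl_is_init_mem_region() and tdx_do_init_mem_region() are
	 * made-up placeholders.
	 */
	long tdx_vcpu_async_ioctl(struct kvm_vcpu *vcpu, unsigned int ioctl,
				  unsigned long arg)
	{
		struct kvm *kvm = vcpu->kvm;
		long r;

		if (!tdx_ioctl_is_init_mem_region(ioctl, arg))
			return -ENOIOCTLCMD;

		mutex_lock(&kvm->lock);
		mutex_lock(&kvm->slots_lock);
		mutex_lock(&vcpu->mutex);

		r = tdx_do_init_mem_region(vcpu, arg);

		mutex_unlock(&vcpu->mutex);
		mutex_unlock(&kvm->slots_lock);
		mutex_unlock(&kvm->lock);

		return r;
	}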
> So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
to protect against other races, then I guess the big hammer approach could work?
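E.g. the dedicated mutex could look something like this (untested; the field
name, the tdh_vp_init() arguments, and the error handling are approximations of
what's actually in the series):

	/*
	 * Sketch: serialize TDH_VP_INIT with a dedicated per-VM mutex so that
	 * concurrent KVM_TDX_INIT_VCPU calls can't trip over the v1 leaf's
	 * exclusive TDR lock.  "init_vcpu_lock" is a made-up field name.
	 */
	static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
		struct vcpu_tdx *tdx = to_tdx(vcpu);
		u64 err;

		mutex_lock(&kvm_tdx->init_vcpu_lock);
		err = tdh_vp_init(&tdx->vp, vcpu_rcx, vcpu->vcpu_id);
		mutex_unlock(&kvm_tdx->init_vcpu_lock);

		if (KVM_BUG_ON(err, vcpu->kvm))
			return -EIO;

		return 0;
	}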