Message-ID: <77cd1034c59b23bae2bbf3693bf6a740d283d787.camel@intel.com>
Date: Wed, 3 Sep 2025 00:18:10 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "seanjc@...gle.com" <seanjc@...gle.com>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>
CC: "Huang, Kai" <kai.huang@...el.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Weiny, Ira"
<ira.weiny@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"michael.roth@....com" <michael.roth@....com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>
Subject: Re: [RFC PATCH v2 12/18] KVM: TDX: Bug the VM if extended the initial
measurement fails
On Tue, 2025-09-02 at 10:04 -0700, Sean Christopherson wrote:
> On Tue, Sep 02, 2025, Yan Zhao wrote:
> > But while writing another concurrency test, I found some sad news:
> >
> > SEAMCALL TDH_VP_INIT requires holding an exclusive lock on the TDR resource
> > when its leaf_opcode.version > 0. So, when I use v1 (which is the current
> > value in upstream, for x2apic?) to test executing ioctl KVM_TDX_INIT_VCPU on
> > different vCPUs concurrently, the TDX_BUG_ON() following tdh_vp_init() prints
> > the error "SEAMCALL TDH_VP_INIT failed: 0x8000020000000080".
> >
> > If I switch to the v0 version of TDH_VP_INIT, the contention is gone.
>
> Uh, so that's exactly the type of breaking ABI change that isn't acceptable. If
> it's really truly necessary, then we can probably handle the change in KVM
> since TDX is so new, but generally speaking such changes simply must not happen.
>
> > Note: this acquisition of the exclusive lock was not previously present in the public
> > repo https://github.com/intel/tdx-module.git, branch tdx_1.5.
> > (The branch has been force-updated to new implementation now).
>
> Lovely.
Hmm, this is exactly the kind of TDX module change we were just discussing
reporting as a bug. It's not clear how the timing of the change lines up with
TDX landing upstream. We could investigate whether we could fix it in the TDX
module. This probably falls into the category of not actually regressing any
userspace, but it does trigger a kernel warning, so it warrants a fix, hmm.
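If the module side can't be fixed, KVM could also just serialize the v1
TDH.VP.INIT calls itself so the exclusive TDR lock is never contended. A rough
sketch only; the per-VM mutex, the error helper, and the exact argument list
below are assumptions on my part:

	/*
	 * Illustrative sketch, not upstream code: wrap TDH.VP.INIT in a
	 * hypothetical per-VM mutex so concurrent KVM_TDX_INIT_VCPU
	 * ioctls never contend the v1 leaf's exclusive TDR lock inside
	 * the TDX module.
	 */
	static int tdx_vp_init_serialized(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
		struct vcpu_tdx *tdx = to_tdx(vcpu);
		u64 err;

		mutex_lock(&kvm_tdx->vp_init_lock);	/* hypothetical member */
		err = tdh_vp_init(&tdx->vp, vcpu_rcx);	/* argument list approximate */
		mutex_unlock(&kvm_tdx->vp_init_lock);

		if (err) {
			pr_tdx_error(TDH_VP_INIT, err);	/* assumed error helper */
			return -EIO;
		}
		return 0;
	}

That trades a tiny bit of init-time parallelism for never seeing
TDX_OPERAND_BUSY from this path.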
>
> > > Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to prevent
> > > kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs from interfering.
> > Nit: we don't need to worry about kvm_mmu_zap_all_fast(), since it only zaps
> > !mirror roots. The slots_lock should be for slots deletion.
>
> Oof, I missed that. We should have required nx_huge_pages=never for tdx=1.
> Probably too late for that now though :-/
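For illustration, requiring that could have been a one-time check when enabling
TDX, something like the below; I'm going from memory on the mitigation flag
name, so treat it as an assumption:

	/*
	 * Hedged sketch: refuse to enable TDX unless the NX huge page
	 * mitigation is hard-disabled (nx_huge_pages=never), so the
	 * mitigation's zapping never runs for TD guests.  The flag name
	 * is assumed.
	 */
	if (enable_tdx && !nx_hugepage_mitigation_hard_disabled) {
		pr_err("tdx=1 requires nx_huge_pages=never\n");
		return -EINVAL;
	}
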
>
> > > Doing that for a vCPU ioctl is a bit awkward, but not awful. E.g. we can abuse
> > > kvm_arch_vcpu_async_ioctl(). In hindsight, a more clever approach would have
> > > been to make KVM_TDX_INIT_MEM_REGION a VM-scoped ioctl that takes a vCPU fd. Oh
> > > well.
> > >
> > > Anyways, I think we need to avoid the "synchronous" ioctl path, because
> > > taking kvm->slots_lock inside vcpu->mutex is gross. AFAICT it's not actively
> > > problematic today, but it feels like a deadlock waiting to happen.
> > Note: It looks like kvm_inhibit_apic_access_page() also takes kvm->slots_lock inside
> > vcpu->mutex.
>
> Yikes. As does kvm_alloc_apic_access_page(), which is likely why I thought it
> was ok to take slots_lock. But while kvm_alloc_apic_access_page() appears to be
> called with vCPU scope, it's actually called from VM scope during vCPU creation.
>
> I'll chew on this, though if someone has any ideas...
>
> > So, do we need to move KVM_TDX_INIT_VCPU to tdx_vcpu_async_ioctl() as well?
>
> If it's _just_ INIT_VCPU that can race (assuming the VM-scoped state transitions
> take all vcpu->mutex locks, as proposed), then a dedicated mutex (spinlock?) would
> suffice, and probably would be preferable. If INIT_VCPU needs to take kvm->lock
> to protect against other races, then I guess the big hammer approach could work?
A duplicate TDR lock inside KVM, or maybe even on the arch/x86 side, would make
the reasoning easier to follow. Like, you don't need to remember "we take
slots_lock/kvm->lock because of the TDR lock", it's just 1:1. I hate the idea
of adding more locks, and have argued against it in the past. But are we just
fooling ourselves? There are already more locks.
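Concretely, I'm picturing something like the below; strawman only, every name
in it is made up:

	/*
	 * Strawman: mirror the TDX module's TDR lock with a KVM-side
	 * rw_semaphore.  SEAMCALL leafs that grab the TDR lock
	 * exclusively in the module (e.g. the v1 TDH.VP.INIT) take it
	 * for write, shared users take it for read, so KVM's locking is
	 * 1:1 with the module's rules.
	 */
	static u64 tdh_vp_init_tdr_locked(struct kvm_tdx *kvm_tdx,
					  struct tdx_vp *vp, u64 vcpu_rcx)
	{
		u64 err;

		down_write(&kvm_tdx->tdr_lock);		/* hypothetical member */
		err = tdh_vp_init(vp, vcpu_rcx);	/* argument list approximate */
		up_write(&kvm_tdx->tdr_lock);

		return err;
	}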
Another reason to duplicate (some) locks is that it gives the scheduler more
hints for waking up waiters, etc. The TDX module needs these locks to protect
itself, so those are required. But when we just do retry loops (or make
userspace do them), we lose out on all of the locking goodness in the kernel.
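That is, instead of roughly this (hand-wavy version of today's pattern):

	/*
	 * Hand-wavy illustration: spin until the module's internal lock
	 * is free.  The scheduler gets no hint about what we're waiting
	 * on, there's no fairness, and lockdep sees nothing.  The
	 * busy-check helper name is assumed.
	 */
	do {
		err = tdh_vp_init(vp, vcpu_rcx);
	} while (tdx_operand_busy(err));

...with a real kernel lock the waiters actually sleep and lockdep can verify
the ordering.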
Anyway, just a strawman. I don't have any great ideas.