Message-ID: <aTouJ0SaFp9pluUN@yzhao56-desk.sh.intel.com>
Date: Thu, 11 Dec 2025 10:36:23 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Vishal Annapurve <vannapurve@...gle.com>
CC: Sagi Shahar <sagis@...gle.com>, <pbonzini@...hat.com>,
<seanjc@...gle.com>, <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
<x86@...nel.org>, <rick.p.edgecombe@...el.com>, <dave.hansen@...el.com>,
<kas@...nel.org>, <tabba@...gle.com>, <ackerleytng@...gle.com>,
<quic_eberman@...cinc.com>, <michael.roth@....com>, <david@...hat.com>,
<vbabka@...e.cz>, <thomas.lendacky@....com>, <pgonda@...gle.com>,
<zhiquan1.li@...el.com>, <fan.du@...el.com>, <jun.miao@...el.com>,
<ira.weiny@...el.com>, <isaku.yamahata@...el.com>, <xiaoyao.li@...el.com>,
<binbin.wu@...ux.intel.com>, <chao.p.peng@...el.com>
Subject: Re: [RFC PATCH v2 21/23] KVM: TDX: Preallocate PAMT pages to be used
in split path
On Wed, Dec 10, 2025 at 05:42:44PM -0800, Vishal Annapurve wrote:
> On Sun, Dec 7, 2025 at 9:51 PM Yan Zhao <yan.y.zhao@...el.com> wrote:
> >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 6b6c46c27390..508b133df903 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1591,6 +1591,8 @@ struct kvm_arch {
> > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
> > > > struct kvm_mmu_memory_cache split_desc_cache;
> > > >
> > > > + struct kvm_mmu_memory_cache pamt_page_cache;
> > > > +
> > >
> > > The latest DPAMT patches use a per-vcpu tdx_prealloc struct to handle
> > > preallocating pages for pamt. I'm wondering if you've considered how
> > > this would work here since some of the calls requiring pamt originate
> > > from user space ioctls and therefore are not associated with a vcpu.
> > I'll use a per-VM tdx_prealloc struct for splitting here, similar to the
> > per-VM pamt_page_cache.
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 43dd295b7fd6..91bea25da528 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -48,6 +48,9 @@ struct kvm_tdx {
> > * Set/unset is protected with kvm->mmu_lock.
> > */
> > bool wait_for_sept_zap;
> > +
> > + spinlock_t external_kvm_split_cache_lock;
> > + struct tdx_prealloc prealloc_split_cache;
> > };
> >
> > > Since the tdx_prealloc is a per vcpu struct there are no race issues
> > > when multiple vcpus need to add pamt pages, but it would be trickier
> > > here because, theoretically, multiple threads could split different
> > > pages simultaneously.
> > A spin lock external_kvm_split_cache_lock is introduced to protect the cache
> > enqueue and dequeue.
> > (a) When tdp_mmu_split_huge_pages_root() is invoked under write mmu_lock:
> > - Since cache dequeue is already under write mmu_lock in
> > tdp_mmu_split_huge_page()-->tdx_sept_split_private_spte(), acquiring/
> > releasing another spin lock doesn't matter.
> > - Though the cache enqueue in topup_external_split_cache() is not protected
> > by mmu_lock, protecting enqueue with a spinlock should not reduce
> > concurrency.
>
> Even with the spin lock protecting the cache topup/consumption
> operations, is it possible that one split operation context consumes
> the top-up performed by another split operation, causing the
> subsequent consumption to fail?
The sequence of check topup, topup, and consume is as follows:

1. write_lock(&kvm->mmu_lock)
   check topup
2. write_unlock(&kvm->mmu_lock)
   topup (get/put split lock to enqueue)
3. write_lock(&kvm->mmu_lock)
   check topup (goto 2 if topup is necessary) (*)
   get split lock
   consume
   put split lock
   write_unlock(&kvm->mmu_lock)
Note: due to the "iter.yielded = true" and "continue" after the topup (see the
diff I posted in my last reply), consuming does not directly follow the topup,
i.e., there is step 3.
Because of the re-check (*) in step 3, and because consuming is done under the
write mmu_lock, it's impossible for splits in other threads to consume pages
allocated for this split.
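To make this concrete, here is a rough sketch of the write-lock path. It is
simplified (no iter.yielded/continue handling), and the helpers
external_split_cache_empty(), topup_external_split_cache() and
consume_external_split_cache() are illustrative names, not the exact
functions in the posted diff:

static int split_prepare_write_locked(struct kvm *kvm, struct kvm_tdx *kvm_tdx)
{
        int r;

        write_lock(&kvm->mmu_lock);                           /* step 1 */
        while (external_split_cache_empty(kvm_tdx)) {         /* check topup */
                write_unlock(&kvm->mmu_lock);                 /* step 2 */

                /* enqueue PAMT pages; gets/puts the split lock internally */
                r = topup_external_split_cache(kvm_tdx);
                if (r)
                        return r;

                write_lock(&kvm->mmu_lock);                   /* step 3 */
                /* loop re-checks; back to step 2 if drained meanwhile (*) */
        }

        spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
        consume_external_split_cache(kvm_tdx);                /* dequeue */
        spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);

        write_unlock(&kvm->mmu_lock);
        return 0;
}

Since both the re-check and the consume happen with the write mmu_lock held,
no other split can slip in between them and steal the just-topped-up pages.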
> > (b) When tdp_mmu_split_huge_pages_root() is invoked under read mmu_lock:
> > Introducing a new spinlock may hurt concurrency for a brief duration (which
> > is necessary).
Let's talk more about this potential future use case (i.e., multiple callers
of tdp_mmu_split_huge_pages_root() under shared mmu_lock).
The sequence would be:

1. read_lock(&kvm->mmu_lock)
   check topup
2. read_unlock(&kvm->mmu_lock)
   topup (get/put split lock to enqueue)
3. read_lock(&kvm->mmu_lock)
   check topup (goto 2 if topup is necessary)
   get split lock
   check topup (return retry if topup is necessary) (**)
   consume
   put split lock
   read_unlock(&kvm->mmu_lock)
Because of the re-check (**) in step 3, and because consuming is done under
both the split lock and the read mmu_lock, it's also impossible for splits in
other threads to consume pages allocated for this split.
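The same idea for a hypothetical shared-mmu_lock caller, again with the
made-up helper names from the sketch above; the second check under the split
lock corresponds to (**):

static int split_prepare_read_locked(struct kvm *kvm, struct kvm_tdx *kvm_tdx)
{
        int r;

retry:
        read_lock(&kvm->mmu_lock);                            /* step 1/3 */
        if (external_split_cache_empty(kvm_tdx)) {            /* check topup */
                read_unlock(&kvm->mmu_lock);                  /* step 2 */
                r = topup_external_split_cache(kvm_tdx);
                if (r)
                        return r;
                goto retry;
        }

        spin_lock(&kvm_tdx->external_kvm_split_cache_lock);
        if (external_split_cache_empty(kvm_tdx)) {            /* (**) */
                /* another reader consumed the topup; top up again */
                spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);
                read_unlock(&kvm->mmu_lock);
                goto retry;
        }
        consume_external_split_cache(kvm_tdx);
        spin_unlock(&kvm_tdx->external_kvm_split_cache_lock);

        read_unlock(&kvm->mmu_lock);
        return 0;
}

The read mmu_lock alone can't order two readers against each other, so the
consume has to re-check under the split lock, which is what (**) provides.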
> > However, there's no known (future) use case for multiple threads invoking
> > tdp_mmu_split_huge_pages_root() on mirror root under shared mmu_lock.
> >
> > For future splitting under shared mmu_lock in the fault path, we'll use
> > the per-vCPU tdx_prealloc instead of the per-VM cache. TDX can leverage
> > kvm_get_running_vcpu() to differentiate between the two caches.
> >
> >
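For reference, the cache selection mentioned above could look roughly like
the following; the per-vCPU field name and the helper are made up, while
kvm_get_running_vcpu() and prealloc_split_cache come from this thread.
kvm_get_running_vcpu() returns NULL outside vCPU context (e.g. in a
VM-scoped ioctl), in which case the per-VM cache would be picked:

static struct tdx_prealloc *tdx_get_split_prealloc(struct kvm_tdx *kvm_tdx)
{
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

        /* fault path: running in vCPU context, use the per-vCPU prealloc */
        if (vcpu)
                return &to_tdx(vcpu)->prealloc;

        /* ioctl path (e.g. a split requested from user space), per-VM cache */
        return &kvm_tdx->prealloc_split_cache;
}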