Message-ID: <aK/sdr2OQqYv9DBZ@yzhao56-desk.sh.intel.com>
Date: Thu, 28 Aug 2025 13:43:18 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Paolo Bonzini <pbonzini@...hat.com>, <kvm@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, Michael Roth <michael.roth@....com>, "Ira
 Weiny" <ira.weiny@...el.com>, Vishal Annapurve <vannapurve@...gle.com>, "Rick
 Edgecombe" <rick.p.edgecombe@...el.com>
Subject: Re: [RFC PATCH 09/12] KVM: TDX: Fold
 tdx_mem_page_record_premap_cnt() into its sole caller

On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> On Wed, Aug 27, 2025, Yan Zhao wrote:
> > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > >  		return -EIO;
> > >  
> > >  	/*
> > > -	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
> > > -	 * barrier in tdx_td_finalize().
> > > +	 * Ensure pre_fault_allowed is read by kvm_arch_vcpu_pre_fault_memory()
> > > +	 * before kvm_tdx->state.  Userspace must not be allowed to pre-fault
> > > +	 * arbitrary memory until the initial memory image is finalized.  Pairs
> > > +	 * with the smp_wmb() in tdx_td_finalize().
> > >  	 */
> > >  	smp_rmb();
> > > -	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> > > -		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > >  
> > > -	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
> > > +	/*
> > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > +	 * to prevent running the TD with uninitialized memory.
> > To prevent the mismatch between mirror EPT and the S-EPT?
> 
> No?  Because KVM bumps the count when installing the S-EPT and decrements it
> on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
Hmm, I think there must be some misunderstanding.

Before userspace invokes KVM_TDX_FINALIZE_VM,
=======
1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION).
   (1) KVM holds slot_lock and filemap lock.
   (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in
       patch 2).
       KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate
       that there's a page mapped in M-EPT, while it's not yet installed in
       S-EPT.
   (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the
       page has been mapped in S-EPT too.
       
   As the name of nr_premapped indicates, the count means a page is pre-mapped
   in the M-EPT, before its real mapping in the S-EPT.
   If ADD fails in step (3), nr_premapped will not be decreased.

   With only the normal path, nr_premapped should return to 0 after all
   KVM_TDX_INIT_MEM_REGIONs complete.
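
   The normal-path accounting above can be sketched as a toy model (this is
   plain userspace C, not the kernel code; premap_page()/page_added() are
   illustrative stand-ins for steps (2) and (3)):

   ```c
   #include <assert.h>

   /* Toy model of nr_premapped in the normal path: step (2) bumps the
    * count when the page is mapped in the M-EPT, step (3) drops it after
    * a successful TDH.MEM.PAGE.ADD.
    */
   static long nr_premapped;

   /* Page mapped in the M-EPT, not yet in the S-EPT. */
   static void premap_page(void)  { nr_premapped++; }

   /* TDH.MEM.PAGE.ADD succeeded: page is now in the S-EPT too. */
   static void page_added(void)   { nr_premapped--; }

   int main(void)
   {
           for (int i = 0; i < 4; i++) {   /* four pages in the region */
                   premap_page();          /* step (2) */
                   page_added();           /* step (3) */
           }
           assert(nr_premapped == 0);      /* back to 0 after INIT_MEM_REGION */
           return 0;
   }
   ```

   A failed ADD in step (3) would simply skip page_added(), leaving the
   count non-zero.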
      

2. Expected zap paths (e.g. If userspace does something strange, such as
   removing a slot after KVM_TDX_INIT_MEM_REGION)
   Those zap paths could be triggered by
   1) userspace performs a page attribute conversion
   2) userspace invokes gmem punch hole
   3) userspace removes a slot
   As all those paths hold either the slot_lock or the filemap lock, they can't
   contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region() holds the
   slot_lock and, internally, the filemap lock).
   Consequently, those zaps must occur
   a) before kvm_tdp_map_page() or
   b) after TDH.MEM.PAGE.ADD.
   For a), tdx_sept_zap_private_spte() won't be invoked, as the page is not
           mapped in the M-EPT either;
   For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE
           SEAMCALLs follow the ADD.
   nr_premapped is therefore unchanged, since the zap does not change the
   consistency between the M-EPT and the S-EPT.
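
   In the same toy-model style (userspace C, not kernel code), case 2 is the
   observation that a serialized zap can only land before the M-EPT map or
   after the ADD, so the counter is untouched either way:

   ```c
   #include <assert.h>

   /* Toy model of case 2: because the zap paths serialize with
    * tdx_vcpu_init_mem_region() via slot_lock/filemap lock, a zap lands
    * either before the M-EPT map (nothing to zap) or after PAGE.ADD
    * (BLOCK+REMOVE succeed); nr_premapped is untouched in both cases.
    */
   static long nr_premapped;

   int main(void)
   {
           /* a) zap before kvm_tdp_map_page(): no M-EPT entry, no-op */
           assert(nr_premapped == 0);

           nr_premapped++;   /* map in the M-EPT */
           nr_premapped--;   /* TDH.MEM.PAGE.ADD */

           /* b) zap after TDH.MEM.PAGE.ADD: BLOCK+REMOVE succeed,
            * count untouched */
           assert(nr_premapped == 0);
           return 0;
   }
   ```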

3. Unexpected zaps (such as kvm_zap_gfn_range()).
   These zaps are currently just paranoia; they are not found in any existing
   path yet. That is, we want to detect any future or overlooked code that
   invokes kvm_zap_gfn_range() (or zaps under read mmu_lock).

   As those zaps do not necessarily hold the slot_lock or filemap lock, they may
   occur after installing the M-EPT entry and before installing the S-EPT entry.
   As a result, the BLOCK fails and tdx_is_sept_zap_err_due_to_premap() returns
   true.
   nr_premapped is decreased here to reflect that the count of pages mapped in
   the M-EPT but not in the S-EPT has dropped.

   A TDH.MEM.PAGE.ADD after this zap can still succeed. If this occurs, the page
   will be mapped in the S-EPT only. As KVM also decreases nr_premapped after a
   successful TDH.MEM.PAGE.ADD, nr_premapped will end up below 0, so we will be
   able to detect those unexpected zaps.
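
   Continuing the toy model (again userspace C, not kernel code), case 3 is the
   double decrement that drives the counter negative:

   ```c
   #include <assert.h>

   /* Toy model of case 3: an unexpected zap lands between the M-EPT map
    * and TDH.MEM.PAGE.ADD.  BLOCK fails, the "due to premap" path drops
    * the count, and the later successful ADD drops it again, so the
    * counter goes negative and the bug is detectable.
    */
   static long nr_premapped;

   int main(void)
   {
           nr_premapped++;   /* page mapped in the M-EPT */
           nr_premapped--;   /* tdx_is_sept_zap_err_due_to_premap() == true */
           nr_premapped--;   /* TDH.MEM.PAGE.ADD still succeeds afterwards */

           assert(nr_premapped < 0);   /* unexpected zap detected */
           return 0;
   }
   ```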
   

When userspace invokes KVM_TDX_FINALIZE_VM,
=======
The nr_premapped must be 0 before tdx_td_finalize() succeeds.

The nr_premapped could be 0 if
(1) userspace invokes KVM_TDX_INIT_MEM_REGIONs in the normal way.
(2) userspace never triggers any KVM_TDX_INIT_MEM_REGION.
(3) userspace triggers KVM_TDX_INIT_MEM_REGION but zaps all initial memory
    regions.

For (2) and (3), KVM_TDX_FINALIZE_VM can still succeed, so the TD can still run
with uninitialized memory.

> > e.g., Before KVM_TDX_FINALIZE_VM, if userspace performs a zap after the
> > TDH.MEM.PAGE.ADD, the page will be removed from the S-EPT. The count of
> > nr_premapped will not change after the successful TDH.MEM.RANGE.BLOCK and
> > TDH.MEM.PAGE.REMOVE.
> 
> Eww.  It would be nice to close that hole, but I suppose it's futile, e.g. the
> underlying problem is unexpectedly removing pages from the initial, whether the
> VMM is doing stupid things before vs. after FINALIZE doesn't really matter.
Are you referring to the above "case 2 Expected zap paths"?

It's equivalent to userspace never triggering any KVM_TDX_INIT_MEM_REGION.
After all, we can't force userspace to invoke KVM_TDX_INIT_MEM_REGION.

I don't think there's a hole from the guest point of view. See below.

> > As a result, the TD will still run with uninitialized memory.
> 
> No?  Because BLOCK+REMOVE means there are no valid S-EPT mappings.  There's a
> "hole" that the guest might not expect, but that hole will trigger an EPT
> violation and only get "filled" if the guest explicitly accepts an AUG'd page.

If the TD runs with uninitialized memory,
- for a linux guest, it will cause the TD to access unaccepted memory and get
  killed by KVM;
- for a non-linux guest configured with #VE, the guest will see a #VE and be
  informed that the page must be accepted before access. Though the guest
  should not be able to run without any initial code, there's no security
  problem.


> Side topic, why does KVM tolerate tdh_mem_page_add() failure?  IIUC, playing
We don't. It returns -EBUSY or -EIO immediately.

> nice with tdh_mem_page_add() failure necessitates both the
> tdx_is_sept_zap_err_due_to_premap() craziness and the check in tdx_td_finalize()
> that all pending pages have been consumed.

tdx_is_sept_zap_err_due_to_premap() detects the error of BLOCK, which is caused
by executing BLOCK before ADD.

> What reasonable use case is there for gracefully handling tdh_mem_page_add() failure?
If tdh_mem_page_add() fails, the KVM_TDX_INIT_MEM_REGION just fails.

> If there is a need to handle failure, I gotta imagine it's only for the -EBUSY
> case.  And if it's only for -EBUSY, why can't that be handled by retrying in
> tdx_vcpu_init_mem_region()?  If tdx_vcpu_init_mem_region() guarantees that all
I analyzed the contention status of tdh_mem_sept_add() at
https://lore.kernel.org/kvm/20250113021050.18828-1-yan.y.zhao@intel.com.

As userspace is expected to execute KVM_TDX_INIT_MEM_REGION on only one vCPU,
returning -EBUSY instead of retrying looks safer and easier.

> pages mapped into the S-EPT are ADDed, then it can assert that there are no
> pending pages when it completes (even if it "fails"), and similarly
> tdx_td_finalize() can KVM_BUG_ON/WARN_ON the number of pending pages being
> non-zero.
tdx_td_finalize() currently just returns -EINVAL when nr_premapped is non-zero.
A KVM_BUG_ON()/WARN_ON() would also be OK.

> > > +	 */
> > > +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> > > +		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> > > +			return -EIO;
> > > +
> > > +		atomic64_inc(&kvm_tdx->nr_premapped);
> > > +		return 0;
> > > +	}
> > > +
> > > +	return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > >  }
> > >  
> > >  static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > > -- 
> > > 2.51.0.268.g9569e192d0-goog
> > > 
