lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fcfafa17b29cd24018c3f18f075a9f83b7f2f6e6.camel@intel.com>
Date: Thu, 28 Aug 2025 18:52:24 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "seanjc@...gle.com" <seanjc@...gle.com>, "Zhao, Yan Y"
	<yan.y.zhao@...el.com>
CC: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"michael.roth@....com" <michael.roth@....com>, "Weiny, Ira"
	<ira.weiny@...el.com>
Subject: Re: [RFC PATCH 09/12] KVM: TDX: Fold tdx_mem_page_record_premap_cnt()
 into its sole caller

On Thu, 2025-08-28 at 10:00 -0700, Sean Christopherson wrote:
> On Thu, Aug 28, 2025, Yan Zhao wrote:
> > On Wed, Aug 27, 2025 at 12:08:27PM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 27, 2025, Yan Zhao wrote:
> > > > On Tue, Aug 26, 2025 at 05:05:19PM -0700, Sean Christopherson wrote:
> > > > > @@ -1641,14 +1618,30 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > +	/*
> > > > > +	 * If the TD isn't finalized/runnable, then userspace is initializing
> > > > > +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> > > > > +	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> > > > > +	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> > > > > +	 * the counter to ensure all mapped pages have been added to the image,
> > > > > +	 * to prevent running the TD with uninitialized memory.
> > > > To prevent the mismatch between mirror EPT and the S-EPT?
> > > 
> > > No?  Because KVM bumps the count when installing the S-EPT and decrements it
> > > on AUG, so I don't see how nr_premapped guards against M-EPT vs. S-EPT issues?
> > Hmm, I think there must be some misunderstanding.
> 
> Yeah, I forgot that AUG and ADD create the leaf S-EPT entries.
> 
> > Before userspace invokes KVM_TDX_FINALIZE_VM,
> > =======
> > 1. the normal path (userspace invokes KVM_TDX_INIT_MEM_REGION).
> >    (1) KVM holds slot_lock and filemap lock.
> >    (2) KVM invokes kvm_tdp_map_page() (or kvm_tdp_mmu_map_private_pfn() in
> >        patch 2).
> >        KVM increases nr_premapped in tdx_sept_set_private_spte() to indicate
> >        that there's a page mapped in M-EPT, while it's not yet installed in
> >        S-EPT.
> >    (3) KVM invokes TDH.MEM.PAGE.ADD and decreases nr_premapped, indicating the
> >        page has been mapped in S-EPT too.
> >        
> >    As the name of nr_premapped indicates, the count means a page is pre-mapped
> >    in the M-EPT, before its real mapping in the S-EPT.
> >    If ADD fails in step (3), nr_premapped will not be decreased.
> > 
> >    With mere the normal path, nr_premapped should return back to 0 after all
> >    KVM_TDX_INIT_MEM_REGIONs.
> >       
> > 
> > 2. Expected zap paths (e.g. If userspace does something strange, such as
> >    removing a slot after KVM_TDX_INIT_MEM_REGION)
> >    Those zap paths could be triggered by
> >    1) userspace performs a page attribute conversion
> >    2) userspace invokes gmem punch hole
> >    3) userspace removes a slot
> >    As all those paths either hold a slot_lock or a filemap lock, they can't
> >    contend with tdx_vcpu_init_mem_region() (tdx_vcpu_init_mem_region holds both
> >    slot_lock and internally filemap lock).
> >    Consequently, those zaps must occur
> >    a) before kvm_tdp_map_page() or
> >    b) after TDH.MEM.PAGE.ADD.
> >    For a), tdx_sept_zap_private_spte() won't not be invoked as the page is not
> >            mapped in M-EPT either;
> >    For b), tdx_sept_zap_private_spte() should succeed, as the BLOCK and REMOVE
> >            SEAMCALLs are following the ADD.
> >    nr_premapped is therere unchanged, since it does not change the consistency
> >    between M-EPT and S-EPT.
> > 
> > 3. Unexpected zaps (such as kvm_zap_gfn_range()).
> 
> Side topic related to kvm_zap_gfn_range(), the KVM_BUG_ON() in vt_refresh_apicv_exec_ctrl()
> is flawed.  If kvm_recalculate_apic_map() fails to allocate an optimized map, KVM
> will mark APICv as inhibited, i.e. the associated WARN_ON_ONCE() is effectively
> user-triggerable.
> 
> Easiest thing would be to mark the vCPU as dead (though we obviously need
> "KVM: Never clear KVM_REQ_VM_DEAD from a vCPU's requests" for that to be robust).
> 
> 
> 
I'm going need to look up the related apic discussions from the base series and
circle back.

[snip]

> > tdx_td_finalize() now just returns -EINVAL in case of nr_premapped being !0.
> > KVM_BUG_ON/WARN_ON should be also ok.
> 
> Ok, so I vaguely recall that I may have pushed back on using a scratch field in
> "struct kvm_tdx" for temporary data (or maybe it was abusing vcpus[0] that I
> disliked?), but what we ended up with is far worse.

I think it was also that the tdh_mr_extend() loop was too heavyweight for the
fault path. But that was before we got to the kick+lock stuff.

> 
> For all intents and purposes, nr_premapped _is_ a temporary scratch field, but
> with access rules that are all but impossible to understand, e.g. there's practically
> zero chance anyone could suss out complications with "Unexpected zaps", and the
> existence of that super subtle edge case necessitates using an atomic because
> KVM can't strictly guarantee that access to the field is mutually exclusive.  And
> that also means it's inherently racy, e.g. if a zap happens while tdx_td_finalize()
> is checking nr_premapped, what happens?
> 
> The real killer is that deferring TDH.MEM.PAGE.ADD and TDH.MR_EXTEND until after
> the map completes and mmu_lock is dropped means that failure at that point leaves
> the TDP MMU in an inconsistent state, where the M-EPT has a present page but the
> S-EPT does not.  Eww.
> 
> Note, in no way am I trying to blame anyone; quite the opposite, you've done an
> admirable job to get all of this landed.  And I apologize if any of my past
> feedback led y'all down this road.  I suspect my prefaulting idea really screwed
> things up; sorry :-(

It's unfortunate we didn't have the gmem mmap() support then. Otherwise we could
have just encrypted it in place.

Otherwise, I'm really glad to see these cleanups/scrutiny. I kind of got the
impression that you wanted to see less TDX churn for a bit. The fact is, the TDX
base support still needs more work like this.

> 
> Back to the code, unless I'm missing yet another complication, I think the obvious
> fix to all of this is to pass the source page and metadata flags via a scratch
> field in "struct kvm_tdx", and then do PAGE.ADD and MR.EXTEND in
> tdx_sept_set_private_spte().  Then there is no need to keep track of pending
> pages, because the M-EPT and S-EPT are always consistent.  E.g. if PAGE.ADD fails
> with -EBUSY, then KVM will naturally revert the M-EPT entry from FROZEN to !PRESENT.
> It also allows KVM to KVM_BUG_ON() MR.EXTEND failure, because it should be impossible
> for the S-EPT entry to be modified between PAGE.ADD and MR.EXTEND.
> 
> Diff on top below for feedback on the idea.  A proper series for this would simply
> replace several of the patches, e.g. asserting that slots_lock is held on
> tdx_is_sept_zap_err_due_to_premap() is wrong.

This works. The "stuffing data on the vCPU" part is a little ugly as it was
before, but the other solutions were worse. Especially nr_premap was problematic
for a number of reasons that keeps growing.

So, big Acked-by: Rick Edgecombe <rick.p.edgecombe@...el.com>

> 
> ---
>  arch/x86/kvm/vmx/tdx.c | 157 +++++++++++++++++------------------------
>  arch/x86/kvm/vmx/tdx.h |  11 +--
>  2 files changed, 70 insertions(+), 98 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f9ac590e8ff0..5d981a061442 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,56 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>  }
>  
> +
> +struct kvm_tdx_page_add {
> +	struct page *src;
> +	unsigned long flags;
> +};
> +
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	int i;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add->src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	if (!(kvm_tdx->page_add->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return 0;
> +
> +	/*
> +	 * Extend the measurement while holding mmu_lock to ensure MR.EXTEND
> +	 * can't fail, e.g. due to the S-EPT entry being zapped after PAGE.ADD.
> +	 * Note!  If extended the measurement fails, bug the VM, but do NOT
> +	 * return an error, as mapping the page in the S-EPT succeeded and so
> +	 * needs to be tracked in KVM's mirror page tables.
> +	 */
> +	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +		err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> +		if (KVM_BUG_ON(err, kvm)) {
> +			pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> +			break;
> +		}
> +	}
> +	return 0;
> +}
> +
>  static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>  			    enum pg_level level, kvm_pfn_t pfn)
>  {
> @@ -1627,21 +1677,11 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>  
>  	/*
>  	 * If the TD isn't finalized/runnable, then userspace is initializing
> -	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Increment the number of
> -	 * pages that need to be initialized via TDH.MEM.PAGE.ADD (PAGE.ADD
> -	 * requires a pre-existing S-EPT mapping).  KVM_TDX_FINALIZE_VM checks
> -	 * the counter to ensure all mapped pages have been added to the image,
> -	 * to prevent running the TD with uninitialized memory.
> +	 * the VM image via KVM_TDX_INIT_MEM_REGION.  Add the page to the TD,
> +	 * and optionally extend the measurement with the page contents.
>  	 */
> -	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> -		lockdep_assert_held(&kvm->slots_lock);
> -
> -		if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> -			return -EIO;
> -
> -		kvm_tdx->nr_pending_tdh_mem_page_adds++;
> -		return 0;
> -	}
> +	if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> +		return tdx_mem_page_add(kvm, gfn, level, pfn);
>  
>  	return tdx_mem_page_aug(kvm, gfn, level, pfn);
>  }
> @@ -1716,39 +1756,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
>  	return 0;
>  }
>  
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> -					     u64 entry, int level)
> -{
> -	if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> -		return false;
> -
> -	if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> -		return false;
> -
> -	if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> -		return false;
> -
> -	return true;
> -}
> -
>  static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  				     enum pg_level level, struct page *page)
>  {
> @@ -1768,15 +1775,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>  		err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
>  		tdx_no_vcpus_enter_stop(kvm);
>  	}
> -	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> -		lockdep_assert_held(&kvm->slots_lock);
> -
> -		if (KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm))
> -			return -EIO;
> -
> -		return 0;
> -	}
> -
>  	if (KVM_BUG_ON(err, kvm)) {
>  		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
>  		return -EIO;
> @@ -2842,12 +2840,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  
>  	if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
>  		return -EINVAL;
> -	/*
> -	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> -	 * TDH.MEM.PAGE.ADD().
> -	 */
> -	if (kvm_tdx->nr_pending_tdh_mem_page_adds)
> -		return -EINVAL;
>  
>  	cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
>  	if (tdx_operand_busy(cmd->hw_error))
> @@ -3131,50 +3123,29 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  {
>  	struct tdx_gmem_post_populate_arg *arg = _arg;
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	u64 err, entry, level_state;
> -	gpa_t gpa = gfn_to_gpa(gfn);
> -	struct page *src_page;
> -	int ret, i;
> +	struct kvm_tdx_page_add page_add = {
> +		.flags = arg->flags,
> +	};
> +	int ret;
>  
> -	lockdep_assert_held(&kvm->slots_lock);
> +	if (KVM_BUG_ON(kvm_tdx->page_add, kvm))
> +		return -EIO;
>  
>  	/*
>  	 * Get the source page if it has been faulted in. Return failure if the
>  	 * source page has been swapped out or unmapped in primary memory.
>  	 */
> -	ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> +	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page_add.src);
>  	if (ret < 0)
>  		return ret;
>  	if (ret != 1)
>  		return -ENOMEM;
>  
> +	kvm_tdx->page_add = &page_add;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add = NULL;
>  
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> -
> -	KVM_BUG_ON(--kvm_tdx->nr_pending_tdh_mem_page_adds < 0, kvm);
> -
> -	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> -		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> -			err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry,
> -					    &level_state);
> -			if (err) {
> -				ret = -EIO;
> -				break;
> -			}
> -		}
> -	}
> -
> -out:
> -	put_page(src_page);
> +	put_page(page_add.src);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 45d86f9fa41c..39e0c3bcc866 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -21,6 +21,8 @@ enum kvm_tdx_state {
>  	TD_STATE_RUNNABLE,
>  };
>  
> +struct kvm_tdx_add_page;
> +
>  struct kvm_tdx {
>  	struct kvm kvm;
>  
> @@ -37,12 +39,11 @@ struct kvm_tdx {
>  	struct tdx_td td;
>  
>  	/*
> -	 * The number of pages that KVM_TDX_INIT_MEM_REGION has mapped into the
> -	 * S-EPT, but not yet initialized via TDH.MEM.PAGE_ADD.  Used to sanity
> -	 * check adding pages to the image, and to ensure that all pages have
> -	 * been initialized before finalizing the TD.
> +	 * Scratch structure used to pass the source page and metadata flags to
> +	 * tdx_mem_page_add.  Protected by slots_lock, and non-NULL only when
> +	 * mapping a private pfn via tdx_gmem_post_populate().
>  	 */
> -	unsigned long nr_pending_tdh_mem_page_adds;
> +	struct kvm_tdx_page_add *page_add;
>  
>  	/*
>  	 * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do
> 
> base-commit: 7c7a3893b102bdeb4826f7140280b7b16081b385
> --

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ