Message-ID: <17ef5f493d5ef6d76c4dc9ca19f1d4d7fe4c73f9.camel@intel.com>
Date: Fri, 29 Aug 2025 23:42:13 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "pbonzini@...hat.com" <pbonzini@...hat.com>, "seanjc@...gle.com"
<seanjc@...gle.com>
CC: "Huang, Kai" <kai.huang@...el.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "Annapurve, Vishal" <vannapurve@...gle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Zhao, Yan Y"
<yan.y.zhao@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "michael.roth@....com"
<michael.roth@....com>
Subject: Re: [RFC PATCH v2 13/18] KVM: TDX: ADD pages to the TD image while
populating mirror EPT entries
On Thu, 2025-08-28 at 17:06 -0700, Sean Christopherson wrote:
> When populating the initial memory image for a TDX guest, ADD pages to the
> TD as part of establishing the mappings in the mirror EPT, as opposed to
> creating the mappings and then doing ADD after the fact. Doing ADD in the
> S-EPT callbacks eliminates the need to track "premapped" pages, as the
> mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
> KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).
>
> Eliminating the hole where the M-EPT can have a mapping that doesn't exist
> in the S-EPT in turn obviates the need to handle errors that are unique to
> encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).
>
> Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
> for unconsumed "premap" entries during tdx_td_finalize(), as there simply
> can't be any such entries. Dropping that check in particular reduces the
> overall cognitive load, as the management of nr_premapped with respect
> to removal of S-EPT is _very_ subtle. E.g. successful removal of an S-EPT
> entry after it completed ADD doesn't adjust nr_premapped, but it's not
> clear why that's "ok" but having half-baked entries is not (it's not truly
> "ok" in that removing pages from the image will likely prevent the guest
> from booting, but from KVM's perspective it's "ok").
>
> Doing ADD in the S-EPT path requires passing an argument via a scratch
> field, but the current approach of tracking the number of "premapped"
> pages effectively does the same. And the "premapped" counter is much more
> dangerous, as it doesn't have a singular lock to protect its usage, since
> nr_premapped can be modified as soon as mmu_lock is dropped, at least in
> theory. I.e. nr_premapped is guarded by slots_lock, but only for "happy"
> paths.
>
> Note, this approach was used/tried at various points in TDX development,
> but was ultimately discarded due to a desire to avoid stashing temporary
> state in kvm_tdx. But as above, KVM ended up with such state anyways,
> and fully committing to using temporary state provides better access
> rules (100% guarded by slots_lock), and makes several edge cases flat out
> impossible.
>
> Note #2, continue to extend the measurement outside of mmu_lock, as it's
> a slow operation (typically 16 SEAMCALLs per page whose data is included
> in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
> for consistency purposes. However, MR.EXTEND isn't _that_ slow, e.g.
> ~1ms latency to measure a full page, so if it needs to be done under
> mmu_lock in the future, e.g. because KVM gains a flow that can remove
> S-EPT entries uring KVM_TDX_INIT_MEM_REGION, then extending the
^using
> measurement can also be moved into the S-EPT mapping path (again, only if
> absolutely necessary). P.S. _If_ MR.EXTEND is moved into the S-EPT path,
> take care not to return an error up the stack if TDH_MR_EXTEND fails, as
> removing the M-EPT entry but not the S-EPT entry would result in
> inconsistent state!
>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
But there are some possibly unintended changes below.
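
For anyone following along, the scratch-field handoff and the "revert the M-EPT entry if ADD fails" rule from the commit message can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions, not the patch's code: struct model_vm and every model_* name are hypothetical stand-ins, the slots_locked flag stands in for holding kvm->slots_lock, and the two booleans stand in for the M-EPT and S-EPT leaf entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these are kernel types. */
struct model_vm {
	bool slots_locked;	/* models holding kvm->slots_lock */
	const void *add_src;	/* models kvm_tdx->page_add_src */
	bool mirror_present;	/* models the M-EPT leaf entry */
	bool sept_present;	/* models the S-EPT leaf entry */
	bool fail_add;		/* knob: make the modeled ADD report busy */
};

/* Models the S-EPT callback: it consumes the scratch pointer, never sets it. */
static int model_page_add(struct model_vm *vm)
{
	if (!vm->slots_locked || !vm->add_src)
		return -EIO;
	if (vm->fail_add)
		return -EBUSY;

	vm->sept_present = true;
	return 0;
}

/* Models the mapping path: install the mirror entry, revert it if ADD fails. */
static int model_map_private(struct model_vm *vm)
{
	int ret;

	vm->mirror_present = true;
	ret = model_page_add(vm);
	if (ret)
		vm->mirror_present = false;
	return ret;
}

/* Models the populate path: the scratch pointer is valid only across the map call. */
static int model_post_populate(struct model_vm *vm, const void *src)
{
	int ret;

	if (vm->add_src)	/* a stale pointer would indicate a bug */
		return -EIO;

	vm->add_src = src;
	ret = model_map_private(vm);
	vm->add_src = NULL;
	return ret;
}
```

In this model the two "present" flags can never disagree once model_post_populate() returns, which is the property that makes a separate "premapped" counter unnecessary.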
> arch/x86/kvm/vmx/tdx.c | 116 ++++++++++++++---------------------------
> arch/x86/kvm/vmx/tdx.h | 8 ++-
> 2 files changed, 46 insertions(+), 78 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index bc92e87a1dbb..00c3dc376690 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,6 +1586,32 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> }
>
> +static int tdx_mem_page_add(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			    kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err, entry, level_state;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
> +	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
> +		return -EIO;
> +
> +	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> +			       kvm_tdx->page_add_src, &entry, &level_state);
> +	if (unlikely(tdx_operand_busy(err)))
> +		return -EBUSY;
> +
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_ADD, err, entry, level_state);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, kvm_pfn_t pfn)
> {
> @@ -1627,19 +1653,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>
> /*
> * If the TD isn't finalized/runnable, then userspace is initializing
> - * the VM image via KVM_TDX_INIT_MEM_REGION. Increment the number of
> - * pages that need to be mapped and initialized via TDH.MEM.PAGE.ADD.
> - * KVM_TDX_FINALIZE_VM checks the counter to ensure all mapped pages
> - * have been added to the image, to prevent running the TD with a
> - * valid mapping in the mirror EPT, but not in the S-EPT.
> + * the VM image via KVM_TDX_INIT_MEM_REGION; ADD the page to the TD.
> */
> - if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE)) {
> - if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm))
> - return -EIO;
> -
> - atomic64_inc(&kvm_tdx->nr_premapped);
> - return 0;
> - }
> + if (unlikely(kvm_tdx->state != TD_STATE_RUNNABLE))
> + return tdx_mem_page_add(kvm, gfn, level, pfn);
>
> return tdx_mem_page_aug(kvm, gfn, level, pfn);
> }
> @@ -1665,39 +1682,6 @@ static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
> return 0;
> }
>
> -/*
> - * Check if the error returned from a SEPT zap SEAMCALL is due to that a page is
> - * mapped by KVM_TDX_INIT_MEM_REGION without tdh_mem_page_add() being called
> - * successfully.
> - *
> - * Since tdh_mem_sept_add() must have been invoked successfully before a
> - * non-leaf entry present in the mirrored page table, the SEPT ZAP related
> - * SEAMCALLs should not encounter err TDX_EPT_WALK_FAILED. They should instead
> - * find TDX_EPT_ENTRY_STATE_INCORRECT due to an empty leaf entry found in the
> - * SEPT.
> - *
> - * Further check if the returned entry from SEPT walking is with RWX permissions
> - * to filter out anything unexpected.
> - *
> - * Note: @level is pg_level, not the tdx_level. The tdx_level extracted from
> - * level_state returned from a SEAMCALL error is the same as that passed into
> - * the SEAMCALL.
> - */
> -static int tdx_is_sept_zap_err_due_to_premap(struct kvm_tdx *kvm_tdx, u64 err,
> - u64 entry, int level)
> -{
> - if (!err || kvm_tdx->state == TD_STATE_RUNNABLE)
> - return false;
> -
> - if (err != (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))
> - return false;
> -
> - if ((is_last_spte(entry, level) && (entry & VMX_EPT_RWX_MASK)))
> - return false;
> -
> - return true;
> -}
> -
> static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> enum pg_level level, struct page *page)
> {
> @@ -1717,12 +1701,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> err = tdh_mem_range_block(&kvm_tdx->td, gpa, tdx_level, &entry, &level_state);
> tdx_no_vcpus_enter_stop(kvm);
> }
> - if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level)) {
> - if (KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm))
> - return -EIO;
> -
> - return 0;
> - }
>
> if (KVM_BUG_ON(err, kvm)) {
> pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
> @@ -2827,12 +2805,6 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>
> if (!is_hkid_assigned(kvm_tdx) || kvm_tdx->state == TD_STATE_RUNNABLE)
> return -EINVAL;
> - /*
> - * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> - * TDH.MEM.PAGE.ADD().
> - */
> - if (atomic64_read(&kvm_tdx->nr_premapped))
> - return -EINVAL;
>
> cmd->hw_error = tdh_mr_finalize(&kvm_tdx->td);
> if (tdx_operand_busy(cmd->hw_error))
> @@ -3116,11 +3088,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> {
> struct tdx_gmem_post_populate_arg *arg = _arg;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> - u64 err, entry, level_state;
> gpa_t gpa = gfn_to_gpa(gfn);
> + u64 err, entry, level_state;
Fine, but was this reordering intentional?
> struct page *src_page;
> int ret, i;
>
> + if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> + return -EIO;
> +
> /*
> * Get the source page if it has been faulted in. Return failure if the
> * source page has been swapped out or unmapped in primary memory.
> @@ -3131,22 +3106,14 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>  	if (ret != 1)
>  		return -ENOMEM;
>
> +	kvm_tdx->page_add_src = src_page;
>  	ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> -	if (ret < 0)
> -		goto out;
> +	kvm_tdx->page_add_src = NULL;
>
> -	ret = 0;
> -	err = tdh_mem_page_add(&kvm_tdx->td, gpa, pfn_to_page(pfn),
> -			       src_page, &entry, &level_state);
> -	if (err) {
> -		ret = unlikely(tdx_operand_busy(err)) ? -EBUSY : -EIO;
> -		goto out;
> -	}
> +	put_page(src_page);
>
> -	KVM_BUG_ON(atomic64_dec_return(&kvm_tdx->nr_premapped) < 0, kvm);
> -
> -	if (!(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> -		goto out;
> +	if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> +		return ret;
>
> /*
> * Note, MR.EXTEND can fail if the S-EPT mapping is somehow removed
> @@ -3159,14 +3126,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &entry, &level_state);
> if (KVM_BUG_ON(err, kvm)) {
> pr_tdx_error_2(TDH_MR_EXTEND, err, entry, level_state);
> - ret = -EIO;
> - goto out;
> + return -EIO;
> }
> }
>
> -out:
> - put_page(src_page);
> - return ret;
> + return 0;
> }
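
As an aside on the "16 SEAMCALLs per page" figure in the commit message: the measurement loop walks the page in fixed-size chunks and stops at the first failure. A minimal userspace sketch of that shape follows. The 256-byte chunk size is an assumption chosen because it yields 16 calls for a 4 KiB page, and the model_* names and callback type are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE		4096u
/* Assumption: 256-byte extend granularity, i.e. 4096 / 256 = 16 calls per page. */
#define MODEL_EXTEND_CHUNK	256u

/* Hypothetical stand-in for one extend call; returns 0 on success. */
typedef int (*model_extend_fn)(unsigned long gpa, void *ctx);

/* Extend the measurement one chunk at a time; bail on the first error. */
static int model_measure_page(unsigned long gpa, model_extend_fn extend,
			      void *ctx)
{
	unsigned int off;

	for (off = 0; off < MODEL_PAGE_SIZE; off += MODEL_EXTEND_CHUNK) {
		int err = extend(gpa + off, ctx);

		if (err)
			return err;
	}
	return 0;
}

/* Example callback that only counts how often it was invoked. */
static int model_count_calls(unsigned long gpa, void *ctx)
{
	(void)gpa;
	++*(unsigned int *)ctx;
	return 0;
}
```

Note that the sketch does no unwinding on failure; it only propagates the error, which mirrors the point that a failed extend must not leave the mirror and S-EPT state out of sync.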
>
> static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index ca39a9391db1..1b00adbbaf77 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -36,8 +36,12 @@ struct kvm_tdx {
>
> struct tdx_td td;
>
> -	/* For KVM_TDX_INIT_MEM_REGION. */
> -	atomic64_t nr_premapped;
> +	/*
> +	 * Scratch pointer used to pass the source page to tdx_mem_page_add.
> +	 * Protected by slots_lock, and non-NULL only when mapping a private
> +	 * pfn via tdx_gmem_post_populate().
> +	 */
> +	struct page *page_add_src;
>
> /*
> * Prevent vCPUs from TD entry to ensure SEPT zap related SEAMCALLs do