linux-kernel - Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <84ccd200-d457-4b67-864d-d40d6aa732ac@linux.intel.com>
Date: Fri, 29 Aug 2025 16:36:05 +0800
From: Binbin Wu <binbin.wu@...ux.intel.com>
To: Sean Christopherson <seanjc@...gle.com>,
 Paolo Bonzini <pbonzini@...hat.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
 Ira Weiny <ira.weiny@...el.com>, Kai Huang <kai.huang@...el.com>,
 Michael Roth <michael.roth@....com>, Yan Zhao <yan.y.zhao@...el.com>,
 Vishal Annapurve <vannapurve@...gle.com>,
 Rick Edgecombe <rick.p.edgecombe@...el.com>,
 Ackerley Tng <ackerleytng@...gle.com>
Subject: Re: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in
 S-EPT management



On 8/29/2025 8:06 AM, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@...el.com>
>
> Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
> doesn't support page migration in any capacity, i.e. there are no migrate
> callbacks because guest_memfd pages *can't* be migrated.  See the WARN in
> kvm_gmem_migrate_folio().
>
> Eliminating TDX's explicit pinning will also enable guest_memfd to support
> in-place conversion between shared and private memory[1][2].  Because KVM
> cannot distinguish between speculative/transient refcounts and the
> intentional refcount for TDX on private pages[3], failing to release
> private page refcount in TDX could cause guest_memfd to indefinitely wait
> on decreasing the refcount for the splitting.
>
> Under normal conditions, not holding an extra page refcount in TDX is safe
> because guest_memfd ensures pages are retained until its invalidation
> notification to KVM MMU is completed. However, if there're bugs in KVM/TDX
> module, not holding an extra refcount when a page is mapped in S-EPT could
> result in a page being released from guest_memfd while still mapped in the
> S-EPT.  But, doing work to make a fatal error slightly less fatal is a net
> negative when that extra work adds complexity and confusion.
>
> Several approaches were considered to address the refcount issue, including
>    - Attempting to modify the KVM unmap operation to return a failure,
>      which was deemed too complex and potentially incorrect[4].
>   - Increasing the folio reference count only upon S-EPT zapping failure[5].
>   - Use page flags or page_ext to indicate a page is still used by TDX[6],
>     which does not work for HVO (HugeTLB Vmemmap Optimization).
>    - Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
Nit: alignment issue with the bullets.

Otherwise,
Reviewed-by: Binbin Wu <binbin.wu@...ux.intel.com>

>
> Due to the complexity or inappropriateness of these approaches, and the
> fact that S-EPT zapping failure is currently only possible when there are
> bugs in the KVM or TDX module, which is very rare in a production kernel,
> a straightforward approach of simply not holding the page reference count
> in TDX was chosen[8].
>
> When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
> vCPUs and mark the VM as dead. Although there is a potential window that a
> private page mapped in the S-EPT could be reallocated and used outside the
> VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
> information. To be robust against bugs, the user can enable panic_on_warn
> as normal.
>
> Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
> Link: https://youtu.be/UnBKahkAon4 [2]
> Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
> Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
> Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
> Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
> Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
> Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
> Suggested-by: Vishal Annapurve <vannapurve@...gle.com>
> Suggested-by: Ackerley Tng <ackerleytng@...gle.com>
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@...el.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> Reviewed-by: Ira Weiny <ira.weiny@...el.com>
> Reviewed-by: Kai Huang <kai.huang@...el.com>
> [sean: extract out of hugepage series, massage changelog accordingly]
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
>   1 file changed, 4 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c83e1ff02827..f24f8635b433 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>   }
>   
> -static void tdx_unpin(struct kvm *kvm, struct page *page)
> -{
> -	put_page(page);
> -}
> -
>   static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> -			    enum pg_level level, struct page *page)
> +			    enum pg_level level, kvm_pfn_t pfn)
>   {
>   	int tdx_level = pg_level_to_tdx_sept_level(level);
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct page *page = pfn_to_page(pfn);
>   	gpa_t gpa = gfn_to_gpa(gfn);
>   	u64 entry, level_state;
>   	u64 err;
>   
>   	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
> -	if (unlikely(tdx_operand_busy(err))) {
> -		tdx_unpin(kvm, page);
> +	if (unlikely(tdx_operand_busy(err)))
>   		return -EBUSY;
> -	}
>   
>   	if (KVM_BUG_ON(err, kvm)) {
>   		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> -		tdx_unpin(kvm, page);
>   		return -EIO;
>   	}
>   
> @@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   				     enum pg_level level, kvm_pfn_t pfn)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> -	struct page *page = pfn_to_page(pfn);
>   
>   	/* TODO: handle large pages. */
>   	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
>   		return -EINVAL;
>   
> -	/*
> -	 * Because guest_memfd doesn't support page migration with
> -	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> -	 * migration.  Until guest_memfd supports page migration, prevent page
> -	 * migration.
> -	 * TODO: Once guest_memfd introduces callback on page migration,
> -	 * implement it and remove get_page/put_page().
> -	 */
> -	get_page(page);
> -
>   	/*
>   	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
>   	 * barrier in tdx_td_finalize().
>   	 */
>   	smp_rmb();
>   	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
> -		return tdx_mem_page_aug(kvm, gfn, level, page);
> +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
>   
>   	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
>   }
> @@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   		return -EIO;
>   	}
>   	tdx_clear_page(page);
> -	tdx_unpin(kvm, page);
>   	return 0;
>   }
>   
> @@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
>   	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
>   		atomic64_dec(&kvm_tdx->nr_premapped);
> -		tdx_unpin(kvm, page);
>   		return 0;
>   	}
>