Message-ID: <20250424030926.554-1-yan.y.zhao@intel.com>
Date: Thu, 24 Apr 2025 11:09:26 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: pbonzini@...hat.com,
seanjc@...gle.com
Cc: linux-kernel@...r.kernel.org,
kvm@...r.kernel.org,
x86@...nel.org,
rick.p.edgecombe@...el.com,
dave.hansen@...el.com,
kirill.shutemov@...el.com,
tabba@...gle.com,
ackerleytng@...gle.com,
quic_eberman@...cinc.com,
michael.roth@....com,
david@...hat.com,
vannapurve@...gle.com,
vbabka@...e.cz,
jroedel@...e.de,
thomas.lendacky@....com,
pgonda@...gle.com,
zhiquan1.li@...el.com,
fan.du@...el.com,
jun.miao@...el.com,
ira.weiny@...el.com,
isaku.yamahata@...el.com,
xiaoyao.li@...el.com,
binbin.wu@...ux.intel.com,
chao.p.peng@...el.com,
Yan Zhao <yan.y.zhao@...el.com>
Subject: [RFC PATCH 21/21] KVM: x86: Ignore splitting huge pages in fault path for TDX
Since handling the BUSY error from SEAMCALLs for splitting huge pages is more
involved in the fault path, where kvm->mmu_lock is held for reading, simply
ignore huge page splitting requests in the fault path for TDX.
Splitting in the fault path can now be caused by vCPUs' concurrent ACCEPT
operations at different levels, e.g. one vCPU accepts at 2MB level while
another vCPU accepts at 4KB level. Since the first vCPU will ultimately accept
the whole 2MB range, it is fine to ignore the mapping request (which leads to
huge page splitting) caused by the second vCPU's 4KB ACCEPT operation.
A rare case that could lead to splitting in the fault path is when a TD is
configured to receive #VE and accesses memory before the ACCEPT operation.
When a vCPU accesses a private GFN without any guest-preferred level, KVM
could create the mapping at 2MB level. If the TD then performs the ACCEPT
operation only at 4KB level, splitting in the fault path will be triggered.
However, this is not regarded as a typical use case, as a TD usually accepts
pages in the order 1GB->2MB->4KB.
The worst outcome of ignoring the resulting splitting request is an endless
EPT violation. This would not happen for a Linux guest, which does not
expect any #VE.
Ignoring huge page splitting in the fault path is achieved in 3 parts:
1. In the KVM TDP MMU, allow splitting huge pages in the fault path for the
   mirror page table and propagate the splitting request to TDX.
2. Enhance the hook split_external_spt by passing in the shared/exclusive
   status of kvm->mmu_lock.
3. In TDX's implementation of the hook split_external_spt, do nothing except
   set the max_level of the vCPU's next fault on the GFN range being split
   to 2MB and return -EBUSY.
Then, after tdx_sept_split_private_spt() returns, TDX's EPT violation handler
may (a) return to the guest directly (when a signal/interrupt is pending) or
(b) retry locally in TDX's code.
- for (a), the TD can retry the ACCEPT operation and find the memory has been
  accepted at 2MB level by another vCPU.
- for (b), since violation_request_level is set to 2MB, the next
  kvm_mmu_page_fault() will return RET_PF_SPURIOUS, causing re-entry into
  the TD.
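
For reference, below is a minimal sketch of how the hint recorded by
tdx_sept_split_private_spt() might be consumed on the retried fault. The
actual consumer is introduced earlier in this series; the function name and
the fallback level in the sketch are illustrative assumptions only, not part
of this patch.

  /*
   * Illustrative sketch (assumed names): clamp the mapping level of a
   * retried fault to the level recorded when a fault-path split request
   * was ignored.
   */
  static int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
          struct vcpu_tdx *tdx = to_tdx(vcpu);

          /* Honor the hint only for GFNs inside the recorded range. */
          if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
                  return tdx->violation_request_level;

          /* Assumed default maximum level for TDX private memory. */
          return PG_LEVEL_2M;
  }

With violation_request_level set to 2MB, the retried fault targets the whole
2MB range at 2MB level, finds it already mapped, and is therefore spurious.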
Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 44 ++++++++++++++++++++-------------
arch/x86/kvm/vmx/tdx.c | 25 ++++++++++++++++++-
arch/x86/kvm/vmx/x86_ops.h | 5 ++--
4 files changed, 55 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5167458742bf..faae82eefd99 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1814,7 +1814,7 @@ struct kvm_x86_ops {
/* Split the external page table into smaller page tables */
int (*split_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *external_spt);
+ void *external_spt, bool mmu_lock_shared);
bool (*has_wbinvd_exit)(void);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d3fba5d11ea2..1b2bacde009f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -388,15 +388,15 @@ static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
}
static int split_external_spt(struct kvm *kvm, gfn_t gfn, u64 old_spte,
- u64 new_spte, int level)
+ u64 new_spte, int level, bool shared)
{
void *external_spt = get_external_spt(gfn, new_spte, level);
int ret;
KVM_BUG_ON(!external_spt, kvm);
- ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt);
- KVM_BUG_ON(ret, kvm);
+ ret = static_call(kvm_x86_split_external_spt)(kvm, gfn, level, external_spt, shared);
+ KVM_BUG_ON(ret && !shared, kvm);
return ret;
}
@@ -536,11 +536,13 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
{
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
int ret = 0;
- KVM_BUG_ON(was_present, kvm);
+ /* leaf to leaf or non-leaf to non-leaf updates are not allowed */
+ KVM_BUG_ON((was_leaf && is_leaf) || (was_present && !was_leaf && !is_leaf), kvm);
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -551,18 +553,28 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
return -EBUSY;
- /*
- * Use different call to either set up middle level
- * external page table, or leaf.
- */
- if (is_leaf) {
- ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
- } else {
- void *external_spt = get_external_spt(gfn, new_spte, level);
+ if (!was_present) {
+ /*
+ * Propagate to install a new leaf or non-leaf entry in external
+ * page table.
+ */
+ if (is_leaf) {
+ ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+ } else {
+ void *external_spt = get_external_spt(gfn, new_spte, level);
- KVM_BUG_ON(!external_spt, kvm);
- ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
+ KVM_BUG_ON(!external_spt, kvm);
+ ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
+ }
+ } else if (was_leaf && is_present && !is_leaf) {
+ /* demote */
+ ret = split_external_spt(kvm, gfn, old_spte, new_spte, level, true);
+ } else {
+ /* Promotion is not supported by mirror root (TDX) */
+ KVM_BUG_ON(1, kvm);
+ ret = -EOPNOTSUPP;
}
+
if (ret)
__kvm_tdp_mmu_write_spte(sptep, old_spte);
else
@@ -784,7 +796,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
if (!is_shadow_present_pte(new_spte))
remove_external_spte(kvm, gfn, old_spte, level);
else if (is_last_spte(old_spte, level) && !is_last_spte(new_spte, level))
- split_external_spt(kvm, gfn, old_spte, new_spte, level);
+ split_external_spt(kvm, gfn, old_spte, new_spte, level, false);
else
KVM_BUG_ON(1, kvm);
}
@@ -1315,8 +1327,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
if (is_shadow_present_pte(iter.old_spte)) {
- /* Don't support large page for mirrored roots (TDX) */
- KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
} else {
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e24d1cbcc762..e994a6c08a75 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1834,7 +1834,7 @@ static int tdx_spte_demote_private_spte(struct kvm *kvm, gfn_t gfn,
}
int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *private_spt)
+ void *private_spt, bool mmu_lock_shared)
{
struct page *page = virt_to_page(private_spt);
int ret;
@@ -1842,6 +1842,29 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
return -EINVAL;
+ /*
+ * Split request with mmu_lock held for reading can only occur when one
+ * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
+ * Ignore this 4KB mapping request by setting violation_request_level to
+ * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
+ * would be a spurious fault. The vCPU accepting at 2MB will accept the
+ * whole 2MB range.
+ */
+ if (mmu_lock_shared) {
+ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ if (KVM_BUG_ON(!vcpu, kvm))
+ return -EOPNOTSUPP;
+
+ /* Request to map as 2MB leaf for the whole 2MB range */
+ tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
+ tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
+ tdx->violation_request_level = level;
+
+ return -EBUSY;
+ }
+
ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
return ret;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0619e9390e5d..fcba76887508 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -159,7 +159,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
enum pg_level level, kvm_pfn_t pfn);
int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
- void *private_spt);
+ void *private_spt, bool mmu_lock_shared);
void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
@@ -228,7 +228,8 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
static inline int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn,
enum pg_level level,
- void *private_spt)
+ void *private_spt,
+ bool mmu_lock_shared)
{
return -EOPNOTSUPP;
}
--
2.43.2