Message-ID: <20260106102136.25108-1-yan.y.zhao@intel.com>
Date: Tue,  6 Jan 2026 18:21:36 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: pbonzini@...hat.com,
	seanjc@...gle.com
Cc: linux-kernel@...r.kernel.org,
	kvm@...r.kernel.org,
	x86@...nel.org,
	rick.p.edgecombe@...el.com,
	dave.hansen@...el.com,
	kas@...nel.org,
	tabba@...gle.com,
	ackerleytng@...gle.com,
	michael.roth@....com,
	david@...nel.org,
	vannapurve@...gle.com,
	sagis@...gle.com,
	vbabka@...e.cz,
	thomas.lendacky@....com,
	nik.borisov@...e.com,
	pgonda@...gle.com,
	fan.du@...el.com,
	jun.miao@...el.com,
	francescolavra.fl@...il.com,
	jgross@...e.com,
	ira.weiny@...el.com,
	isaku.yamahata@...el.com,
	xiaoyao.li@...el.com,
	kai.huang@...el.com,
	binbin.wu@...ux.intel.com,
	chao.p.peng@...el.com,
	chao.gao@...el.com,
	yan.y.zhao@...el.com
Subject: [PATCH v3 11/24] KVM: x86/mmu: Introduce kvm_split_cross_boundary_leafs()

Introduce kvm_split_cross_boundary_leafs() to split huge leaf entries that
cross the boundary of a specified range.

Splitting huge leaf entries that cross the range boundary is essential before
zapping a specified range in the mirror root. It ensures that the subsequent
zap does not affect any GFNs outside the specified range, which is crucial for
the mirror root because GFNs zapped from the private page table must be
re-accepted by the guest (via ACCEPT) after being faulted back in. e.g., if a
2MB leaf maps GFNs [0x200, 0x400) and only [0x300, 0x400) should be zapped,
zapping the whole unsplit leaf would force the guest to re-accept
[0x200, 0x300) as well.

kvm_split_cross_boundary_leafs() reuses the main logic of
tdp_mmu_split_huge_pages_root(), but only splits huge leaf entries whose
mapping ranges cross the start or end of the specified range. When a split is
necessary, kvm->mmu_lock may be temporarily released for memory allocation, so
returning -ENOMEM is possible.

Since tdp_mmu_split_huge_pages_root() is originally invoked by dirty page
tracking functions that flush the TLB unconditionally at the end, it doesn't
flush the TLB before temporarily releasing mmu_lock.

Do not enhance tdp_mmu_split_huge_pages_root() to return a split or flush
status to kvm_split_cross_boundary_leafs(). Such a status could be inaccurate
when multiple threads try to split the same memory range concurrently: even if
kvm_split_cross_boundary_leafs() reported split/flush as false, splits could
still have occurred in the specified range from other threads while mmu_lock
was temporarily released.

Therefore, callers of kvm_split_cross_boundary_leafs() need to determine
how/when to flush the TLB according to their use case:

- If the split is triggered in a fault path for TDX, the hardware shouldn't
  have cached the old huge translation, so there is no need to flush the TLB.

- If the split is triggered by zaps in guest_memfd punch hole or page
  conversion, the TLB flush can be delayed until after the zaps (see the
  sketch below this list).

- If the use case relies on pure split status (e.g., splitting for PML), flush
  the TLB unconditionally. (Just hypothetical; no such use case currently
  exists for kvm_split_cross_boundary_leafs().)
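
For illustration, here is a minimal caller sketch for the second case above
(split, then zap precisely, then flush). Only kvm_split_cross_boundary_leafs(),
kvm_unmap_gfn_range() and kvm_flush_remote_tlbs() come from KVM; the wrapper,
its name and its locking context are hypothetical:

  /*
   * Hypothetical helper: zap only [range->start, range->end), splitting any
   * huge leafs that straddle the boundaries first so that no GFN outside the
   * range is affected. Assumes mmu_lock is held for write and
   * range->may_block is true.
   */
  static int zap_range_precisely(struct kvm *kvm, struct kvm_gfn_range *range)
  {
  	int r;

  	/* May drop mmu_lock to allocate memory, hence may fail with -ENOMEM. */
  	r = kvm_split_cross_boundary_leafs(kvm, range, false);
  	if (r)
  		return r;

  	/* Zap the now precisely bounded range; flush only after the zap. */
  	if (kvm_unmap_gfn_range(kvm, range))
  		kvm_flush_remote_tlbs(kvm);

  	return 0;
  }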

Signed-off-by: Xiaoyao Li <xiaoyao.li@...el.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@...el.com>
Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
---
v3:
- s/only_cross_bounday/only_cross_boundary. (Kai)
- Do not return flush status; have the callers determine how/when to flush
  the TLB.
- Always pass "flush" as false to tdp_mmu_iter_cond_resched(). (Kai)
- Added a default implementation of kvm_split_cross_boundary_leafs() for
  non-x86 platforms.
- Removed middle level function tdp_mmu_split_cross_boundary_leafs().
- Use EXPORT_SYMBOL_FOR_KVM_INTERNAL().

RFC v2:
- Rename the API to kvm_split_cross_boundary_leafs().
- Make the API to be usable for direct roots or under shared mmu_lock.
- Leverage the main logic from tdp_mmu_split_huge_pages_root(). (Rick)

RFC v1:
- Split patch.
- Introduced the API kvm_split_boundary_leafs(), refined the logic, and
  simplified the code.
---
 arch/x86/kvm/mmu/mmu.c     | 34 ++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.h |  3 +++
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/kvm_main.c        |  7 +++++++
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b4f2e3ced716..f40af7ac75b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1644,6 +1644,40 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				 start, end - 1, can_yield, true, flush);
 }
 
+/*
+ * Split huge leafs that cross the boundary of the specified range.
+ * Only the TDP MMU is supported. Do nothing if !tdp_mmu_enabled.
+ *
+ * This API does not flush the TLB. Callers need to determine how/when to
+ * flush the TLB according to their use case, e.g.,
+ * - No flush needed, e.g., if the split is in a fault path or a TLB flush has
+ *   already been ensured.
+ * - Delay the TLB flush until after zaps if the split is invoked for precise
+ *   zapping.
+ * - Unconditionally flush the TLB if a use case relies on pure split status
+ *   (e.g., splitting for PML).
+ *
+ * Return: 0 on success, <0 on failure.
+ */
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+				   bool shared)
+{
+	int ret = 0;
+
+	lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+			    lockdep_is_held(&kvm->slots_lock) ||
+			    srcu_read_lock_held(&kvm->srcu));
+
+	if (!range->may_block)
+		return -EOPNOTSUPP;
+
+	if (tdp_mmu_enabled)
+		ret = kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(kvm, range,
+								       shared);
+	return ret;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_split_cross_boundary_leafs);
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 074209d91ec3..b984027343b7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1600,10 +1600,17 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	return ret;
 }
 
+static bool iter_cross_boundary(struct tdp_iter *iter, gfn_t start, gfn_t end)
+{
+	return !(iter->gfn >= start &&
+		 (iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) <= end);
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
-					 int target_level, bool shared)
+					 int target_level, bool shared,
+					 bool only_cross_boundary)
 {
 	struct kvm_mmu_page *sp = NULL;
 	struct tdp_iter iter;
@@ -1615,6 +1622,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 * level into one lower level. For example, if we encounter a 1GB page
 	 * we split it into 512 2MB pages.
 	 *
+	 * When only_cross_boundary is true, only split huge pages above the
+	 * target level into one lower level when they cross the start or end
+	 * boundary.
+	 *
 	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
 	 * to visit an SPTE before ever visiting its children, which means we
 	 * will correctly recursively split huge pages that are more than one
@@ -1629,6 +1640,10 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
 			continue;
 
+		if (only_cross_boundary &&
+		    !iter_cross_boundary(&iter, start, end))
+			continue;
+
 		if (!sp) {
 			rcu_read_unlock();
 
@@ -1692,12 +1707,35 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
-		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+		r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level,
+						  shared, false);
+		if (r) {
+			kvm_tdp_mmu_put_root(kvm, root);
+			break;
+		}
+	}
+}
+
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+						     struct kvm_gfn_range *range,
+						     bool shared)
+{
+	enum kvm_tdp_mmu_root_types types;
+	struct kvm_mmu_page *root;
+	int r = 0;
+
+	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+	types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types) {
+		r = tdp_mmu_split_huge_pages_root(kvm, root, range->start, range->end,
+						  PG_LEVEL_4K, shared, true);
 		if (r) {
 			kvm_tdp_mmu_put_root(kvm, root);
 			break;
 		}
 	}
+	return r;
 }
 
 static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index bd62977c9199..c20b1416e4b2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -70,6 +70,9 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
 				  enum kvm_tdp_mmu_root_types root_types);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+int kvm_tdp_mmu_gfn_range_split_cross_boundary_leafs(struct kvm *kvm,
+						     struct kvm_gfn_range *range,
+						     bool shared);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8144d27e6c12..e563bb22c481 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -275,6 +275,8 @@ struct kvm_gfn_range {
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+int kvm_split_cross_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range,
+				   bool shared);
 #endif
 
 enum {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7ab2324d10..feeef7747099 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -910,6 +910,13 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
+int __weak kvm_split_cross_boundary_leafs(struct kvm *kvm,
+					  struct kvm_gfn_range *range,
+					  bool shared)
+{
+	return 0;
+}
+
 #else  /* !CONFIG_KVM_GENERIC_MMU_NOTIFIER */
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
-- 
2.43.2

