Message-ID: <20240517090348.GN168153@ls.amr.corp.intel.com>
Date: Fri, 17 May 2024 02:03:48 -0700
From: Isaku Yamahata <isaku.yamahata@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc: "Yamahata, Isaku" <isaku.yamahata@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"seanjc@...gle.com" <seanjc@...gle.com>,
"Huang, Kai" <kai.huang@...el.com>,
"sagis@...gle.com" <sagis@...gle.com>,
"isaku.yamahata@...ux.intel.com" <isaku.yamahata@...ux.intel.com>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
"Aktas, Erdem" <erdemaktas@...gle.com>,
"Zhao, Yan Y" <yan.y.zhao@...el.com>,
"dmatlack@...gle.com" <dmatlack@...gle.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [PATCH 10/16] KVM: x86/tdp_mmu: Support TDX private mapping for
TDP MMU
On Fri, May 17, 2024 at 02:35:46AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@...el.com> wrote:
> Here is a diff of an attempt to merge all the feedback so far. It's on top of
> the dev branch from this series.
>
> On Thu, 2024-05-16 at 12:42 -0700, Isaku Yamahata wrote:
> > - rename role.is_private => role.is_mirrored_pt
>
> Agreed.
>
> >
> > - sp->gfn: gfn without shared bit.
> >
> > - fault->address: without gfn_shared_mask
> > Actually it doesn't matter much. We can use gpa with gfn_shared_mask.
>
> I left fault->addr with shared bits. It's not used anymore for TDX except in the
> tracepoint, which I think makes sense.
As discussed with Kai [1], make fault->addr represent the real fault address.
[1] https://lore.kernel.org/kvm/20240517081440.GM168153@ls.amr.corp.intel.com/
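To be concrete, something like this untested sketch is what I mean (the helper
is made up; in the real code this would happen where the fault is initialized):

static void kvm_tdx_fill_fault(struct kvm *kvm, struct kvm_page_fault *fault,
			       gpa_t cr2_or_gpa)
{
	/* Keep the shared bit: this is the address the guest actually faulted on. */
	fault->addr = cr2_or_gpa;
	/* Internally KVM tracks the GFN with the shared bit stripped. */
	fault->gfn = gpa_to_gfn(cr2_or_gpa) & ~kvm_gfn_shared_mask(kvm);
}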
>
> >
> > - Update struct tdp_iter
> > struct tdp_iter
> > gfn: gfn without shared bit
> >
> > /* Add new members */
> >
> > /* Indicates which PT to walk. */
> > bool mirrored_pt;
> >
> > // This is used in tdp_iter_refresh_sptep()
> > // shared gfn_mask if mirrored_pt
> > // 0 if !mirrored_pt
> > gfn_shared_mask
> >
> > - Pass mirrored_pt and gfn_shared_mask to
> > tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
> >
> > and update tdp_iter_refresh_sptep()
> > static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> > ...
> > iter->sptep = iter->pt_path[iter->level - 1] +
> > SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask,
> > iter->level);
>
> I tried something else. The iterators still have gfn's with shared bits, but the
> addition of the shared bit is wrapped in tdp_mmu_for_each_pte(), so
> kvm_tdp_mmu_map() and similar don't have to handle the shared bits. They just
> pass in a root, and tdp_mmu_for_each_pte() knows how to adjust the GFN. Like:
>
> #define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
> for_each_tdp_pte(_iter, _root, \
> kvm_gfn_for_root(_kvm, _root, _start), \
> kvm_gfn_for_root(_kvm, _root, _end))
I'm wondering whether we can remove kvm_gfn_for_root() entirely.
> I also changed the callers to use the new enum to specify roots. This way they
> can pass something with a nice name instead of true/false for bool private.
This is nice.
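Something like the following is what I picture (the names here are only
illustrative; your diff may use different ones):

enum kvm_tdp_mmu_root_types {
	KVM_DIRECT_ROOTS = BIT(0),	/* shared/direct page tables */
	KVM_MIRROR_ROOTS = BIT(1),	/* mirrored (private) page tables */
	KVM_ALL_ROOTS	 = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
};

That also composes nicely with __for_each_tdp_mmu_root() taking a root type
bitmask.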
> Keeping a gfn_shared_mask inside the iterator didn't seem any clearer to me, and
> was a bit more cumbersome. But please compare it.
>
> >
> > Change for_each_tdp_pte_min_level() accordingly.
> > Also update the iterators that call this.
> >
> > #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end) \
> >         for (tdp_iter_start(&iter, root, min_level, start, \
> >              mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0); \
> >              iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end); \
> >              tdp_iter_next(&iter))
>
> I liked it a lot because the callers don't need to manually call
> kvm_gfn_for_root() anymore. But I tried it and it required adding kvm to a lot
> of the iterator call sites. I ended up removing it, but I'm not sure.
..
> > - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> > use iter->mirror_pt or pass down mirror_pt.
>
> You mean just rename it, or something else?
Scratch this. I thought Kai didn't like using role [2], but now it seems okay [3].
[2] https://lore.kernel.org/kvm/4ba18e4e-5971-4683-82eb-63c985e98e6b@intel.com/
> I don't think using kvm_mmu_page.role is correct.
[3] https://lore.kernel.org/kvm/20240517081440.GM168153@ls.amr.corp.intel.com/
> I think you can just get @mirrored_pt from the sptep:
> mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
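Yes, that works for me. An untested sketch of what it could look like (helper
name made up):

static bool sptep_is_mirrored(tdp_ptep_t sptep)
{
	/* The parent sp's role tells which page table this SPTE lives in. */
	return is_mirrored_sp(sptep_to_sp(rcu_dereference(sptep)));
}

Then the spte handlers can derive it locally instead of passing mirror_pt down.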
> Anyway below is a first cut based on the discussion.
>
> A few other things:
> 1. kvm_is_private_gpa() is moved into Intel code. kvm_gfn_shared_mask() remains
> for only two operations in common code:
> - kvm_gfn_for_root() <- required for zapping/mapping
> - Stripping the bit when setting fault.gfn <- possible to remove if we strip
> cr2_or_gpa
> 2. I also played with changing KVM_PRIVATE_ROOTS to KVM_MIRROR_ROOTS.
> Unfortunately there is still some confusion between private and mirrored. For
> example you walk a mirror root (which is what is actually happening), but as you
> do so you have to allocate private page tables, as well as call out to x86_ops
> named private. So those concepts are effectively linked and used a bit
> interchangeably.
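(For readers following the thread: kvm_is_private_gpa() is essentially just a
check of the shared bit, something like the below. Written from memory, so the
version in the Intel code may differ slightly.)

static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
{
	gfn_t mask = kvm_gfn_shared_mask(kvm);

	/* A GPA is private when the VM has a shared bit and that bit is clear. */
	return mask && !(gpa_to_gfn(gpa) & mask);
}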
On top of your patch, I created the following patch to remove kvm_gfn_for_root().
Although I haven't tested it yet, I think the following shows my idea.
- Add gfn_shared_mask to struct tdp_iter.
- Use iter.gfn_shared_mask to determine the starting sptep in the root.
- Remove kvm_gfn_for_root()
---
arch/x86/kvm/mmu/mmu_internal.h | 10 -------
arch/x86/kvm/mmu/tdp_iter.c | 5 ++--
arch/x86/kvm/mmu/tdp_iter.h | 16 ++++++-----
arch/x86/kvm/mmu/tdp_mmu.c | 48 ++++++++++-----------------------
4 files changed, 26 insertions(+), 53 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2b1b2a980b03..9676af0cb133 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -180,16 +180,6 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
}
-static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
- gfn_t gfn)
-{
- gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
-
- /* Set shared bit if not private */
- gfn_for_root |= -(gfn_t)!is_mirrored_sp(root) & kvm_gfn_shared_mask(kvm);
- return gfn_for_root;
-}
-
static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
{
/*
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..c5f2ca1ceede 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
- SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+ SPTE_INDEX((iter->gfn | iter->gfn_shared_mask) << PAGE_SHIFT, iter->level);
iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
}
@@ -37,7 +37,7 @@ void tdp_iter_restart(struct tdp_iter *iter)
* rooted at root_pt, starting with the walk to translate next_last_level_gfn.
*/
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn)
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_shared_mask)
{
if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
(root->role.level > PT64_ROOT_MAX_LEVEL))) {
@@ -46,6 +46,7 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
}
iter->next_last_level_gfn = next_last_level_gfn;
+ iter->gfn_shared_mask = gfn_shared_mask;
iter->root_level = root->role.level;
iter->min_level = min_level;
iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 8a64bcef9deb..274b42707f0a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,8 +91,9 @@ struct tdp_iter {
tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
/* A pointer to the current SPTE */
tdp_ptep_t sptep;
- /* The lowest GFN (shared bits included) mapped by the current SPTE */
+ /* The lowest GFN (shared bits excluded) mapped by the current SPTE */
gfn_t gfn;
+ gfn_t gfn_shared_mask;
/* The level of the root page given to the iterator */
int root_level;
/* The lowest level the iterator should traverse to */
@@ -120,18 +121,19 @@ struct tdp_iter {
* Iterates over every SPTE mapping the GFN range [start, end) in a
* preorder traversal.
*/
-#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
- for (tdp_iter_start(&iter, root, min_level, start); \
- iter.valid && iter.gfn < end; \
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
+ for (tdp_iter_start(&iter, root, min_level, start, \
+ is_mirrored_sp(root) ? 0 : kvm_gfn_shared_mask(kvm)); \
+ iter.valid && iter.gfn < end; \
tdp_iter_next(&iter))
-#define for_each_tdp_pte(iter, root, start, end) \
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, kvm, root, start, end) \
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
tdp_ptep_t spte_to_child_pt(u64 pte, int level);
void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
- int min_level, gfn_t next_last_level_gfn);
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_shared_mask);
void tdp_iter_next(struct tdp_iter *iter);
void tdp_iter_restart(struct tdp_iter *iter);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f13016e210b..bf7aa87eb593 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -862,20 +862,18 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
iter->gfn, iter->level);
}
-#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
- for_each_tdp_pte(_iter, _root, _start, _end)
+#define tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
-#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
- tdp_root_for_each_pte(_iter, _root, _start, _end) \
+#define tdp_root_for_each_leaf_pte(_iter, _kvm, _root, _start, _end) \
+ tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
if (!is_shadow_present_pte(_iter.old_spte) || \
!is_last_spte(_iter.old_spte, _iter.level)) \
continue; \
else
#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
- for_each_tdp_pte(_iter, _root, \
- kvm_gfn_for_root(_kvm, _root, _start), \
- kvm_gfn_for_root(_kvm, _root, _end))
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
/*
* Yield if the MMU lock is contended or this thread needs to return control
@@ -941,7 +939,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t end = tdp_mmu_max_gfn_exclusive();
gfn_t start = 0;
- for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1043,17 +1041,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
lockdep_assert_held_write(&kvm->mmu_lock);
- /*
- * start and end doesn't have GFN shared bit. This function zaps
- * a region including alias. Adjust shared bit of [start, end) if the
- * root is shared.
- */
- start = kvm_gfn_for_root(kvm, root, start);
- end = kvm_gfn_for_root(kvm, root, end);
-
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
@@ -1448,19 +1438,9 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
* into this helper allow blocking; it'd be dead, wasteful code.
*/
__for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
- gfn_t start, end;
-
- /*
- * For TDX shared mapping, set GFN shared bit to the range,
- * so the handler() doesn't need to set it, to avoid duplicated
- * code in multiple handler()s.
- */
- start = kvm_gfn_for_root(kvm, root, range->start);
- end = kvm_gfn_for_root(kvm, root, range->end);
-
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, start, end)
+ tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
rcu_read_unlock();
@@ -1543,7 +1523,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
- for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1706,7 +1686,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
* level above the target level (e.g. splitting a 1GB to 512 2MB pages,
* and then splitting each of those to 512 4KB pages).
*/
- for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
@@ -1791,7 +1771,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_pte(iter, root, start, end) {
+ tdp_root_for_each_pte(iter, kvm, root, start, end) {
retry:
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
@@ -1846,7 +1826,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
+ tdp_root_for_each_leaf_pte(iter, kvm, root, gfn + __ffs(mask),
gfn + BITS_PER_LONG) {
if (!mask)
break;
@@ -1903,7 +1883,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_2M, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
@@ -1973,7 +1953,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_lock();
- for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, gfn, gfn + 1) {
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
--
2.43.2
--
Isaku Yamahata <isaku.yamahata@...el.com>