Message-ID: <aEnbDya7OOXdO85q@google.com>
Date: Wed, 11 Jun 2025 12:37:51 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: Xiaoyao Li <xiaoyao.li@...el.com>, rick.p.edgecombe@...el.com, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, yan.y.zhao@...el.com, reinette.chatre@...el.com, 
	kai.huang@...el.com, adrian.hunter@...el.com, isaku.yamahata@...el.com, 
	Binbin Wu <binbin.wu@...ux.intel.com>, tony.lindgren@...ux.intel.com
Subject: Re: [PATCH] KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY

On Wed, Jun 11, 2025, Paolo Bonzini wrote:
> On Wed, Jun 11, 2025 at 8:10 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > > +     direct_bits = 0;
> > >       if (kvm_arch_has_private_mem(vcpu->kvm) &&
> > >           kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> > >               error_code |= PFERR_PRIVATE_ACCESS;
> > > +     else
> > > +             direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
> > 
> > Eww.  It's bad enough that TDX bleeds its mirror needs into common MMU code,
> > but stuffing vendor specific GPA bits in common code goes too far.  Actually,
> > all of this goes too far.  There's zero reason any code outside of TDX needs to
> > *explicitly* care whether mirrors or "direct" MMUs have mandatory gfn bits.
> > 
> > Back to the main topic, KVM needs to have a single source of truth when it comes
> > to whether a fault is private and thus mirrored (or not).  Common KVM needs to be
> > aware of aliased GFN bits, but absolutely nothing outside of TDX (including
> > common VMX code) should be aware that the mirror vs. "direct" distinction (I
> > hate that terminology; KVM has far, far too much history and baggage with
> > "direct") is tied to the existence and polarity of aliased GFN bits.
> > 
> > To detect a mirror fault:
> > 
> >   static inline bool kvm_is_mirror_fault(struct kvm *kvm, u64 error_code)
> >   {
> >         return kvm_has_mirrored_tdp(kvm) &&
> >                (error_code & PFERR_PRIVATE_ACCESS);
> >   }
> > 
> > And for TDX, it should darn well explicitly track the shared GPA mask:
> > 
> >   static bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> >   {
> >         /* For TDX the direct mask is the shared mask. */
> >         return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
> >   }
> 
> My fault - this is more similar, at least in spirit, to what
> Yan and Xiaoyao had tested earlier:
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 52acf99d40a0..209103bf0f30 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types
>  static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
>  							      struct kvm_page_fault *fault)
>  {
> -	if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
> +	if (unlikely(fault->is_private))
>  		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
> 
> and I instead proposed the version that you hate with such ardor.
> 
> My reasoning was that I preferred to have the pre-fault scenario "look like"
> what you get while the VM runs.

Yes, 100% agreed.  I forgot fault->addr has the unmodified GPA, whereas fault->gfn
has the unaliased GFN. :-/
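
(For reference, roughly how the two fields end up diverging on a TDX-style
setup; a sketch from memory, not verbatim from the tree:

	fault->addr = cr2_or_gpa;	/* raw GPA, aliased (shared) bit intact */
	fault->gfn  = gpa_to_gfn(cr2_or_gpa) &
		      ~kvm_gfn_direct_bits(vcpu->kvm);	/* alias bits stripped */

so anything keyed off fault->gfn alone has lost the alias information.)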

> > Outside of TDX, detecting mirrors, and anti-aliasing logic, the only use of
> > kvm_gfn_direct_bits() is to constrain TDP MMU walks to the appropriate gfn range.
> > And for that, we can simply use kvm_mmu_page.gfn, with a kvm_x86_ops hook to get
> > the TDP MMU root GFN (root allocation is a slow path, the CALL+RET is a non-issue).
> > 
> > Compile tested only, and obviously needs to be split into multiple patches.
> 
> Also obviously needs to be delayed to 6.17, since a working fix can be a
> one-line change. :)

Ya, definitely.

> (Plus your kvm_is_gfn_alias() test, which should be included anyway and
> independently.)
> 
> What do you hate less between Yan's idea above and this patch? Just tell me
> and I'll handle posting v2.

As much as it pains me, this version :-(

There are other things that rely on the GPA being "correct", e.g. walking the
SPTEs in fast_page_fault().  So for a 6.16 fix, this is the safer and more complete
option.
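
(Concretely, the legacy path in fast_page_fault() walks with the raw address,
per the hunk below:

	sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);

so a pre-fault that reaches it needs the right bits stuffed into the GPA up
front.)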

Ugh, and the whole tdp_mmu_get_root_for_fault() handling is broken.
is_page_fault_stale() only looks at mmu->root.hpa, i.e. could theoretically blow
up if the shared root is somehow valid but the mirror root is not.  Probably can't
happen in practice, but it's ugly.

Oof, and I've no idea what kvm_tdp_mmu_fast_pf_get_last_sptep() is doing.  It
says:

	/* Fast pf is not supported for mirrored roots  */

but I don't see anything that actually enforces that.
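
E.g. there's no equivalent of this (untested sketch) anywhere in the path:

	/* fast_page_fault() has the error code, so it _could_ assert this. */
	if (WARN_ON_ONCE(kvm_is_mirror_fault(vcpu->kvm, fault->error_code)))
		return RET_PF_INVALID;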

So tdp_mmu_get_root_for_fault() should be a generic kvm_mmu_get_root_for_fault(),
and tdp_mmu_get_root() simply shouldn't exist.

As for stuffing the correct GPA, with kvm_mmu_get_root_for_fault() being generic
and the root holding its gfn modifier, kvm_tdp_map_page() can simply OR in the
appropriate gfn (and maybe WARN if there's overlap?).
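
E.g. the overlap check could be as simple as (untested, not in the blob below):

	/* The caller-provided GPA must not already contain the root's bits. */
	if (WARN_ON_ONCE(gpa & gfn_to_gpa(root->gfn)))
		return -EINVAL;
	gpa |= gfn_to_gpa(root->gfn);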

Something like so (on top of the other untested blob):

-----
 arch/x86/kvm/mmu/mmu.c     |  8 ++++++--
 arch/x86/kvm/mmu/spte.h    | 15 +++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++-----
 arch/x86/kvm/mmu/tdp_mmu.h | 21 ++-------------------
 4 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0228d49ac363..3bcc8d4848bd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3589,7 +3589,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		u64 new_spte;
 
 		if (tdp_mmu_enabled)
-			sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->gfn, &spte);
+			sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault, &spte);
 		else
 			sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
 
@@ -4682,7 +4682,7 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 				struct kvm_page_fault *fault)
 {
-	struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
+	struct kvm_mmu_page *sp = kvm_mmu_get_root_for_fault(vcpu, fault);
 
 	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
 	if (sp && is_obsolete_sp(vcpu->kvm, sp))
@@ -4849,6 +4849,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
 {
+	struct kvm_mmu_page *root = __kvm_mmu_get_root_for_fault(vcpu, error_code);
 	int r;
 
 	/*
@@ -4858,6 +4859,9 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
 		return -EOPNOTSUPP;
 
+	/* Fold the root's mandatory GPA bits, e.g. TDX's shared bit, into the GPA. */
+	gpa |= gfn_to_gpa(root->gfn);
+
 	do {
 		if (signal_pending(current))
 			return -EINTR;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1e94f081bdaf..68e7979ac1fe 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -280,6 +280,21 @@ static inline bool is_mirror_sptep(tdp_ptep_t sptep)
 	return is_mirror_sp(sptep_to_sp(rcu_dereference(sptep)));
 }
 
+static inline struct kvm_mmu_page *__kvm_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
+								u64 error_code)
+{
+	if (unlikely(kvm_is_mirror_fault(vcpu->kvm, error_code)))
+		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+
+	return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
+static inline struct kvm_mmu_page *kvm_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
+							      struct kvm_page_fault *fault)
+{
+	return __kvm_mmu_get_root_for_fault(vcpu, fault->error_code);
+}
+
 static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
 {
 	return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 15daf4353ccc..ecfffc6fbb73 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1240,7 +1240,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
  */
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
-	struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
+	struct kvm_mmu_page *root = kvm_mmu_get_root_for_fault(vcpu, fault);
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
@@ -1967,15 +1967,15 @@ EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
  *
  * WARNING: This function is only intended to be called during fast_page_fault.
  */
-u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu,
+					struct kvm_page_fault *fault,
 					u64 *spte)
 {
-	/* Fast pf is not supported for mirrored roots  */
-	struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_DIRECT_ROOTS);
+	struct kvm_mmu_page *root = kvm_mmu_get_root_for_fault(vcpu, fault);
 	struct tdp_iter iter;
 	tdp_ptep_t sptep = NULL;
 
-	for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
+	for_each_tdp_pte(iter, vcpu->kvm, root, fault->gfn, fault->gfn + 1) {
 		*spte = iter.old_spte;
 		sptep = iter.sptep;
 	}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 397309dfc73f..f75888474b73 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -45,24 +45,6 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
 	return ret;
 }
 
-static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
-							      struct kvm_page_fault *fault)
-{
-	if (unlikely(kvm_is_mirror_fault(vcpu->kvm, fault->error_code)))
-		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
-
-	return root_to_sp(vcpu->arch.mmu->root.hpa);
-}
-
-static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
-						    enum kvm_tdp_mmu_root_types type)
-{
-	if (unlikely(type == KVM_MIRROR_ROOTS))
-		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
-
-	return root_to_sp(vcpu->arch.mmu->root.hpa);
-}
-
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
@@ -109,7 +91,8 @@ static inline void kvm_tdp_mmu_walk_lockless_end(void)
 
 int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 			 int *root_level);
-u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu,
+					struct kvm_page_fault *fault,
 					u64 *spte);
 
 #ifdef CONFIG_X86_64

base-commit: 1abe48190d919c44e69aae17beb9e55d83db2303
-- 
