Message-ID: <aXgzlo1BsTjUIVzc@yzhao56-desk.sh.intel.com>
Date: Tue, 27 Jan 2026 11:40:06 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
CC: <pbonzini@...hat.com>, <linux-kernel@...r.kernel.org>,
<kvm@...r.kernel.org>, <x86@...nel.org>, <rick.p.edgecombe@...el.com>,
<dave.hansen@...el.com>, <kas@...nel.org>, <tabba@...gle.com>,
<ackerleytng@...gle.com>, <michael.roth@....com>, <david@...nel.org>,
<vannapurve@...gle.com>, <sagis@...gle.com>, <vbabka@...e.cz>,
<thomas.lendacky@....com>, <nik.borisov@...e.com>, <pgonda@...gle.com>,
<fan.du@...el.com>, <jun.miao@...el.com>, <francescolavra.fl@...il.com>,
<jgross@...e.com>, <ira.weiny@...el.com>, <isaku.yamahata@...el.com>,
<xiaoyao.li@...el.com>, <kai.huang@...el.com>, <binbin.wu@...ux.intel.com>,
<chao.p.peng@...el.com>, <chao.gao@...el.com>
Subject: Re: [PATCH v3 06/24] KVM: x86/mmu: Disallow page merging (huge page
adjustment) for mirror root
On Mon, Jan 26, 2026 at 08:08:31AM -0800, Sean Christopherson wrote:
> On Fri, Jan 16, 2026, Yan Zhao wrote:
> > Hi Sean,
> > Thanks for the review!
> >
> > On Thu, Jan 15, 2026 at 02:49:59PM -0800, Sean Christopherson wrote:
> > > On Tue, Jan 06, 2026, Yan Zhao wrote:
> > > > From: Rick P Edgecombe <rick.p.edgecombe@...el.com>
> > > >
> > > > Disallow page merging (huge page adjustment) for the mirror root by
> > > > utilizing disallowed_hugepage_adjust().
> > >
> > > Why? What is this actually doing? The below explains "how" but I'm baffled as
> > > to the purpose. I'm guessing there are hints in the surrounding patches, but I
> > > haven't read them in depth, and shouldn't need to in order to understand the
> > > primary reason behind a change.
> > Sorry for missing the background. I will explain the "why" in the patch log in
> > the next version.
> >
> > The reason for introducing this patch is to disallow page merging for TDX. I
> > explained the reasons to disallow page merging in the cover letter:
> >
> > "
> > 7. Page merging (page promotion)
> >
> > Promotion is disallowed, because:
> >
> > - The current TDX module requires all 4KB leafs to be either all PENDING
> > or all ACCEPTED before a successful promotion to 2MB. This requirement
> > prevents successful page merging after partially converting a 2MB
> > range from private to shared and then back to private, which is the
> > primary scenario necessitating page promotion.
> >
> > - tdh_mem_page_promote() depends on tdh_mem_range_block() in the current
> > TDX module. Consequently, handling BUSY errors is complex, as page
> > merging typically occurs in the fault path under shared mmu_lock.
> >
> > - Limited amount of initial private memory (typically ~4MB) means the
> > need for page merging during TD build time is minimal.
> > "
>
> > However, we don't support page merging yet. Specifically, for the above
> > scenario, the purpose is to avoid handling errors from
> > tdh_mem_page_promote(), a SEAMCALL that currently must be preceded by
> > tdh_mem_range_block(). To handle a promotion error (e.g., due to BUSY) while
> > holding mmu_lock for read, we may need to introduce several spinlocks and
> > guarantees from the guest to ensure that tdh_mem_range_unblock() succeeds in
> > restoring the S-EPT status.
> >
> > Therefore, we introduced this patch for simplicity, and because the promotion
> > scenario is not common.
>
> Say that in the changelog! Describing the "how" in detail is completely unnecessary,
I'll keep it in mind in the future!
> or at least it should be. Because I strongly disagree with Rick's opinion from
> the RFC that kvm_tdp_mmu_map() should check kvm_has_mirrored_tdp()[*].
>
> : I think part of the thing that is bugging me is that
> : nx_huge_page_workaround_enabled is not conceptually about whether the specific
> : fault/level needs to disallow huge page adjustments, it's whether it needs to
> : check if it does. Then disallowed_hugepage_adjust() does the actual specific
> : checking. But for the mirror logic the check is the same for both. It's
> : asymmetric with NX huge pages, and just sort of jammed in. It would be easier to
> : follow if the kvm_tdp_mmu_map() conditional checked whether mirror TDP was
> : "active", rather than the mirror role.
>
> [*] http://lore.kernel.org/all/eea0bf7925c3b9c16573be8e144ddcc77b54cc92.camel@intel.com
>
> If the changelog explains _why_, and the code is actually commented, then calling
> into disallowed_hugepage_adjust() for all faults in a VM with mirrored roots is
> nonsensical, because the code won't match the comment.
Thanks a lot! It looks good to me.
> From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
> Date: Tue, 22 Apr 2025 10:21:12 +0800
> Subject: [PATCH] KVM: x86/mmu: Prevent hugepage promotion for mirror roots in
> fault path
>
> Disallow hugepage promotion in the TDP MMU for mirror roots as KVM doesn't
> currently support promoting S-EPT entries due to the complexity incurred
> by the TDX-Module's rules for hugepage promotion.
>
> - The current TDX-Module requires all 4KB leafs to be either all PENDING
> or all ACCEPTED before a successful promotion to 2MB. This requirement
> prevents successful page merging after partially converting a 2MB
> range from private to shared and then back to private, which is the
> primary scenario necessitating page promotion.
>
> - The TDX-Module effectively requires a break-before-make sequence (to
> satisfy its TLB flushing rules), i.e. creates a window of time where a
> different vCPU can encounter faults on a SPTE that KVM is trying to
> promote to a hugepage. To avoid unexpected BUSY errors, KVM would need
> to FREEZE the non-leaf SPTE before replacing it with a huge SPTE.
>
> Disable hugepage promotion for all map() operations, as supporting page
> promotion when building the initial image is still non-trivial, and the
> vast majority of images are ~4MB or less, i.e. the benefit of creating
> hugepages during TD build time is minimal.
>
> Signed-off-by: Edgecombe, Rick P <rick.p.edgecombe@...el.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@...el.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@...el.com>
> [sean: check root, add comment, rewrite changelog]
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 3 ++-
> arch/x86/kvm/mmu/tdp_mmu.c | 12 +++++++++++-
> 2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4ecbf216d96f..45650f70eeab 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3419,7 +3419,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte) &&
> -	    spte_to_child_sp(spte)->nx_huge_page_disallowed) {
> +	    ((spte_to_child_sp(spte)->nx_huge_page_disallowed) ||
> +	     is_mirror_sp(spte_to_child_sp(spte)))) {
>  		/*
>  		 * A small SPTE exists for this pfn, but FNAME(fetch),
>  		 * direct_map(), or kvm_tdp_mmu_map() would like to create a
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 321dbde77d3f..0fe3be41594f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1232,7 +1232,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
>  		int r;
>
> -		if (fault->nx_huge_page_workaround_enabled)
> +		/*
> +		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
> +		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
> +		 * enabled, as doing so will cause significant thrashing if one
> +		 * or more leaf SPTEs needs to be executable.
> +		 *
> +		 * Disallow hugepage promotion for mirror roots as KVM doesn't
> +		 * (yet) support promoting S-EPT entries while holding mmu_lock
> +		 * for read (due to complexity induced by the TDX-Module APIs).
> +		 */
> +		if (fault->nx_huge_page_workaround_enabled || is_mirror_sp(root))
A small nit:
Here, we check is_mirror_sp(root).
However, not far from here in kvm_tdp_mmu_map(), there is another
is_mirror_sp() check, which should yield the same result since
sp->role.is_mirror is inherited from the parent:

	if (is_mirror_sp(sp))
		kvm_mmu_alloc_external_spt(vcpu, sp);

So, do you think we can save the is_mirror status in a local variable?
Like this:

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b524b44733b8..c54befec3042 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1300,6 +1300,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
+	bool is_mirror = root && is_mirror_sp(root);
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
@@ -1316,7 +1317,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
 		int r;
 
-		if (fault->nx_huge_page_workaround_enabled)
+		/*
+		 * Don't replace a page table (non-leaf) SPTE with a huge SPTE
+		 * (a.k.a. hugepage promotion) if the NX hugepage workaround is
+		 * enabled, as doing so will cause significant thrashing if one
+		 * or more leaf SPTEs needs to be executable.
+		 *
+		 * Disallow hugepage promotion for mirror roots as KVM doesn't
+		 * (yet) support promoting S-EPT entries while holding mmu_lock
+		 * for read (due to complexity induced by the TDX-Module APIs).
+		 */
+		if (fault->nx_huge_page_workaround_enabled || is_mirror)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		/*
@@ -1340,7 +1351,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 */
 		sp = tdp_mmu_alloc_sp(vcpu);
 		tdp_mmu_init_child_sp(sp, &iter);
-		if (is_mirror_sp(sp))
+		if (is_mirror)
 			kvm_mmu_alloc_external_spt(vcpu, sp);
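
For illustration only, the equivalence the local variable relies on can be
modeled in a few lines of userspace C. This is a rough sketch with toy types
(every toy_* name is hypothetical, not a kernel definition), and it assumes,
as noted above, that a child page simply copies its parent's role with only
the level changed:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for illustration; NOT the kernel's structures. */
struct toy_role {
	bool is_mirror;
	int level;
};

struct toy_sp {
	struct toy_role role;
};

static bool toy_is_mirror_sp(const struct toy_sp *sp)
{
	return sp->role.is_mirror;
}

/* Assumption: the child copies the parent's role; only the level drops. */
static void toy_init_child_sp(struct toy_sp *child, const struct toy_sp *parent)
{
	child->role = parent->role;
	child->role.level = parent->role.level - 1;
}

/*
 * Cache the root's is_mirror once, walk down to level 1, and report
 * whether every child page agrees with the cached value.
 */
static bool toy_cached_flag_matches_walk(bool root_is_mirror, int root_level)
{
	struct toy_sp parent = {
		.role = { .is_mirror = root_is_mirror, .level = root_level },
	};
	const bool is_mirror = toy_is_mirror_sp(&parent);

	while (parent.role.level > 1) {
		struct toy_sp child;

		toy_init_child_sp(&child, &parent);
		if (toy_is_mirror_sp(&child) != is_mirror)
			return false;
		parent = child;
	}
	return true;
}
```

Under that assumption the walk never disagrees with the value cached from the
root, for both mirror and non-mirror roots, so checking the flag once should
be equivalent to checking each sp.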