Message-ID: <aF0Kg8FcHVMvsqSo@yzhao56-desk.sh.intel.com>
Date: Thu, 26 Jun 2025 16:53:23 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "Du, Fan" <fan.du@...el.com>, "Li, Xiaoyao" <xiaoyao.li@...el.com>,
	"Huang, Kai" <kai.huang@...el.com>, "Shutemov, Kirill"
	<kirill.shutemov@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
	<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
	<zhiquan1.li@...el.com>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "michael.roth@....com" <michael.roth@....com>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "Peng, Chao P"
	<chao.p.peng@...el.com>, "Yamahata, Isaku" <isaku.yamahata@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"binbin.wu@...ux.intel.com" <binbin.wu@...ux.intel.com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "Annapurve, Vishal" <vannapurve@...gle.com>,
	"tabba@...gle.com" <tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>,
	"Miao, Jun" <jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
 RUNNABLE

On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > Could we provide the info via the private_max_mapping_level hook (i.e. via
> > > > tdx_gmem_private_max_mapping_level())?
> > > 
> > > This is one of the previous two methods discussed. Can you elaborate on what you
> > > are trying to say?
> > I don't get why we can't use the existing tdx_gmem_private_max_mapping_level()
> > to convey the max_level at which a vendor wants a GFN to be mapped.
> > 
> > Before TDX huge pages, tdx_gmem_private_max_mapping_level() always returns 4KB;
> > after TDX huge pages, it returns
> > - 4KB during the TD build stage
> > - at TD runtime: 4KB or 2MB
> > 
> > Why does KVM need to care how the vendor determines this max_level?
> > I think a vendor should have the freedom to decide based on software
> > limitations, the guest's wishes, hardware bugs, or whatever.
> 
> I don't see that big of a difference between "KVM" and "vendor". TDX code is KVM
> code. Just because it's in tdx.c doesn't mean it's ok for it to be hard to trace
> the logic.
> 
> I'm not sure what Sean's objection was to that approach, or if he objected to
> just the weird SIZE_MISMATCH behavior of the TDX module. I think you already
> know why I don't prefer it:
>  - Requiring demote in the fault handler. This requires an additional write lock
> inside the mmu read lock, or TDX module changes. Although now I wonder if the
> interrupt error code related problems will get worse with this solution. The
> solution is currently not settled.
>  - Requiring passing args on the vCPU struct, which as you point out will work
> functionally today only because the prefault stuff will avoid seeing it. But
> it's fragile.
>  - The page size behavior is a bit implicit.
Hmm, strictly speaking, none of these three are the fault of
tdx_gmem_private_max_mapping_level().

With tdx_gmem_private_max_mapping_level() to pass in the level, we can track
KVM_LPAGE_GUEST_INHIBIT within tdx.c without changing lpage_info.
tdx.c then has the freedom to change KVM_LPAGE_GUEST_INHIBIT to some more
flexible scheme in the future while keeping the KVM MMU core intact.

But with lpage_info, it looks like we can save some memory.
The downside is that we may need to update the KVM MMU core in case of future
changes.
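
For illustration, a minimal sketch of what such a tdx.c-internal scheme could
look like; the guest_inhibit_2m xarray and both helpers below are hypothetical,
not code from this series:

static bool tdx_test_guest_inhibit_2m(struct kvm_tdx *kvm_tdx, gfn_t gfn)
{
        /* One xarray slot per 2MB region (512 4KB pages). */
        return xa_load(&kvm_tdx->guest_inhibit_2m, gfn >> 9) != NULL;
}

static int tdx_set_guest_inhibit_2m(struct kvm_tdx *kvm_tdx, gfn_t gfn)
{
        /* Assumes xa_init(&kvm_tdx->guest_inhibit_2m) at TD creation. */
        return xa_err(xa_store(&kvm_tdx->guest_inhibit_2m, gfn >> 9,
                               xa_mk_value(1), GFP_KERNEL));
}

This would keep the policy entirely inside tdx.c, at the cost of separate
allocations instead of reusing the existing lpage_info array.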

> I'm coming back to this draft after PUCK. Sean shared his thoughts there. I'll
> try to summarize. He didn't like how the page size requirements were passed
> through the fault handler in a "transient" way. That "transient" property covers
> both of the options for passing the size info through the fault handler that we
> were debating. He also didn't like how the TDX arch requires KVM to map at a
> specific host page size around accept. Michael Roth brought up that SNP has the
> same requirement, but it can do the zap-and-refault approach.
> 
> We then discussed this lpage_info idea. He was in favor of it, but not, I'd say,
> overly enthusiastic. In a "least worst option" kind of way.
> 
> I think the biggest downside is the MMU write lock. Our goal for this series is
> to help performance, not to get huge page sizes. So if we do this idea, we can't
> fully wave our hands that any optimization is premature. It *is* an
> optimization. We need to either convince ourselves that the overall benefit is
> still there, or have a plan to adapt the guest to avoid 4k accepts, which we
> were previously discussing requiring anyway.
> 
> But I much prefer the deterministic behavior of this approach from a
> maintainability perspective.
> 
> > 
> > > > Or what about introducing a vendor hook in __kvm_mmu_max_mapping_level() for a
> > > > private fault?
> > > > 
> > > > > Maybe we could have EPT violations that contain 4k accept sizes first update the
> > > > > attribute for the GFN to be accepted or not, like have tdx.c call out to set
> > > > > kvm_lpage_info->disallow_lpage in the rarer case of 4k accept size? Or something
> > > > Something like kvm_lpage_info->disallow_lpage would disallow later page
> > > > promotion, though we don't support it right now.
> > > 
> > > Well I was originally thinking it would not set kvm_lpage_info->disallow_lpage
> > > directly, but rely on the logic that checks for mixed attributes. But more
> > > below...
> > > 
> > > > 
> > > > > like that. Maybe set an "accepted" attribute, or something. Not sure if it could be
> > > > Setting an "accepted" attribute in the EPT violation handler?
> > > > It's a little odd, as the accept operation is not yet completed.
> > > 
> > > I guess the question in both of these comments is: what is the life cycle? The
> > > guest could call TDG.MEM.PAGE.RELEASE to unaccept it as well. Oh, geez. It looks
> > > like TDG.MEM.PAGE.RELEASE will give the same size hints in the EPT violation. So
> > > an accept attribute is not going to work, at least without TDX module changes.
> > > 
> > > 
> > > Actually, the problem we have doesn't fit the mixed-attributes behavior. If many
> > > vCPUs accept a 2MB region at 4k page size, the entire 2MB range could become
> > > non-mixed, and then individual accepts would fail.
> > > 
> > > 
> > > So instead there could be a KVM_LPAGE_GUEST_INHIBIT that doesn't get cleared
> > Set KVM_LPAGE_GUEST_INHIBIT via a TDVMCALL?
> > 
> > Or just set the KVM_LPAGE_GUEST_INHIBIT when an EPT violation contains 4KB
> > level info?
> 
> Yes, that's the idea. 2MB accepts can behave like normal.
> 
> > 
> > I guess it's the latter one, as it avoids modifying both the EDK2 and Linux
> > guests. I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > 2MB" during the boot-up of a TD with 4GB memory.
> 
> Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> spread across? What memory size did you use? What was your guest memory
> configuration?
The guest memory is 4GB, with 8 vCPUs.
The memory slot layout is:
slot 1: base gfn=0, npages=0x80000 (2GB starting at GPA 0)
slot 2: base gfn=0x100000, npages=0x80000 (2GB starting at GPA 4GB)
slot 3: base gfn=0xffc00, npages=0x400 (4MB just below GPA 4GB)

The GFN spread for the ~2710 instances is:
GFNs 0x806-0x9ff (once for each of the 506 pages)
GFNs 0x7e800-0x7e9ff (once for each of the 512 pages)
GFNs 0x7d3ff-0x7e7fe (repeated private-to-shared and shared-to-private
                      conversions are conducted on this range), with the top 3
                      among them being:
     0x7d9d7 (974 times)
     0x7d9da (476 times)
     0x7d9d9 (156 times)

All those instances are from vCPU 0, when the guest is in EDK2 and during early
kernel boot.

Based on my observation, the count of these instances does not scale with guest
memory. In other words, the count remains roughly the same even when the guest
memory is increased to 8GB.
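
For reference, a sketch of the kind of instrumentation that could produce
counts like these: log each ACCEPT-triggered EPT violation whose requested
level is 4KB. The mask/shift names below are the ones used in this series;
the snippet itself is illustrative only and assumes the context of
tdx_handle_ept_violation():

        u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
        u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
                        TDX_EXT_EXIT_QUAL_INFO_SHIFT;

        /* Level encoding in the extended exit qualification: 0 == 4KB. */
        if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT &&
            (eeq_info & GENMASK(2, 0)) + 1 == PG_LEVEL_4K)
                trace_printk("4KB accept: vcpu %d gfn 0x%llx\n",
                             vcpu->vcpu_id, gpa_to_gfn(gpa));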

> > But does it mean TDX needs to hold the write mmu_lock in the EPT violation
> > handler and set KVM_LPAGE_GUEST_INHIBIT on finding a violation that carries
> > 4KB level info?
> 
> I think so. I didn't check the reason, but the other similar code took it. Maybe
> not? If we don't need to take mmu write lock, then this idea seems like a clear
> winner to me.
Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to
split. So, if we don't want to support splitting under the read mmu_lock, we
need to take the write mmu_lock.

I drafted a change as below (I will refine some parts of it later).
The write mmu_lock is taken 11 times on average during VM boot.

There's no significant difference in the count of 2M mappings
during guest kernel boot to login. On average:
before this patch: 1144 2M mappings
after this patch:  1143 2M mappings

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index f999c15d8d3e..d4e98728f600 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
 {
        return gfn & kvm_gfn_direct_bits(kvm);
 }
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
+
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f0afee2e283a..28c511d8b372 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
  */
 #define KVM_LPAGE_MIXED_FLAG   BIT(31)

+#define KVM_LPAGE_GUEST_INHIBIT_FLAG   BIT(30)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
                                            gfn_t gfn, int count)
 {
@@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,

                old = linfo->disallow_lpage;
                linfo->disallow_lpage += count;
-               WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+               WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
+                             (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
        }
 }

@@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
        bool ret = 0;

        lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
-                           lockdep_is_held(&kvm->slots_lock));
+                           lockdep_is_held(&kvm->slots_lock) ||
+                           srcu_read_lock_held(&kvm->srcu));

        if (tdp_mmu_enabled)
                ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);

        return ret;
 }
+EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);

 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
@@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
                vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }

+bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+       return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
+
+void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
+{
+       lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
+}
+EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
+
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                int level)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 244fd22683db..4028423cf595 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
        if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
                return -EINVAL;

-       /*
-        * Split request with mmu_lock held for reading can only occur when one
-        * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
-        * Ignore this 4KB mapping request by setting violation_request_level to
-        * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
-        * would be a spurious fault. The vCPU accepting at 2MB will accept the
-        * whole 2MB range.
-        */
-       if (mmu_lock_shared) {
-               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-               struct vcpu_tdx *tdx = to_tdx(vcpu);
-
-               if (KVM_BUG_ON(!vcpu, kvm))
-                       return -EOPNOTSUPP;
-
-               /* Request to map as 2MB leaf for the whole 2MB range */
-               tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
-               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
-               tdx->violation_request_level = level;
-
-               return -EBUSY;
-       }
+       if (mmu_lock_shared)
+               return -EOPNOTSUPP;

        ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
        if (ret <= 0)
@@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
        return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
 }

-static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
+static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
 {
        struct vcpu_tdx *tdx = to_tdx(vcpu);
+       struct kvm *kvm = vcpu->kvm;
+       gfn_t gfn = gpa_to_gfn(gpa);
+       struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
        int level = -1;
+       u64 eeq_info;

-       u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
+       if (!slot)
+               return 0;

-       u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
-                       TDX_EXT_EXIT_QUAL_INFO_SHIFT;
+       if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
+           TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
+               return 0;

-       if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
-               level = (eeq_info & GENMASK(2, 0)) + 1;
+       eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
+                   TDX_EXT_EXIT_QUAL_INFO_SHIFT;

-               tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
-               tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
-               tdx->violation_request_level = level;
-       } else {
-               tdx->violation_gfn_start = -1;
-               tdx->violation_gfn_end = -1;
-               tdx->violation_request_level = -1;
+       level = (eeq_info & GENMASK(2, 0)) + 1;
+
+       if (level == PG_LEVEL_4K) {
+               if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
+                       struct kvm_gfn_range gfn_range = {
+                               .start = gfn,
+                               .end = gfn + 1,
+                               .slot = slot,
+                               .may_block = true,
+                               .attr_filter = KVM_FILTER_PRIVATE,
+                       };
+
+                       scoped_guard(write_lock, &kvm->mmu_lock) {
+                               int ret;
+
+                               ret = kvm_split_boundary_leafs(kvm, &gfn_range);
+
+                               if (ret)
+                                       return ret;
+
+                               hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
+                       }
+               }
        }
+
+       return 0;
 }

 static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
@@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
                 */
                exit_qual = EPT_VIOLATION_ACC_WRITE;

-               tdx_get_accept_level(vcpu, gpa);
+               if (tdx_check_accept_level(vcpu, gpa))
+                       return RET_PF_RETRY;

                /* Only private GPA triggers zero-step mitigation */
                local_retry = true;
@@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)

        vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

-       tdx->violation_gfn_start = -1;
-       tdx->violation_gfn_end = -1;
-       tdx->violation_request_level = -1;
        return 0;

 free_tdcx:
@@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
                                       gfn_t gfn, bool prefetch)
 {
-       struct vcpu_tdx *tdx = to_tdx(vcpu);
-
-       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
+       if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
                return PG_LEVEL_4K;

-       if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
-               return tdx->violation_request_level;
-
        return PG_LEVEL_2M;
 }

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd18a01f63d..3a3077666ee6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn

        return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);

 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {
