Message-ID: <aGNK2tO2W6+GWtt3@yzhao56-desk.sh.intel.com>
Date: Tue, 1 Jul 2025 10:41:30 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Tue, Jul 01, 2025 at 08:42:33AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > >
> > > > I guess it's the latter one, as it avoids modifying both EDK2 and the Linux
> > > > guest. I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > > > 2MB" during the boot-up of a TD with 4GB memory.
> > >
> > > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > > spread across? What memory size did you use? What was your guest memory
> > > configuration?
> > The guest memory is 4GB, 8 vCPUs.
> > The memory slots layout is:
> > slot 1: base gfn=0, npages=0x80000
> > slot 2: base gfn=0x100000, npages=0x80000
> > slot 3: base gfn=0xffc00, npages=0x400
> >
> > The GFN spread for the ~2710 instances is:
> > GFNs 0x806-0x9ff (1 time for each of 506 pages)
> > GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> > GFNs 0x7d3ff-0x7e7fe (repeated private-to-shared and shared-to-private
> > conversions are performed on this range), with the top 3 among them being:
> > 0x7d9da (476 times)
> > 0x7d9d9 (156 times)
> > 0x7d9d7 (974 times)
> >
> > All those instances are from vCPU 0, when the guest is in EDK2 and during early
> > kernel boot.
> >
> > Based on my observation, the count of these instances does not scale with guest
> > memory. In other words, the count remains roughly the same even when the guest
> > memory is increased to 8GB.
>
> So the impact would be negligible. The mmu write lock would not meet much, if
> any, contention.
>
> >
> > > > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > > > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> > >
> > > I think so. I didn't check the reason, but the other similar code took it. Maybe
> > > not? If we don't need to take mmu write lock, then this idea seems like a clear
> > > winner to me.
> > Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to
> > split.
> > So, if we don't want to support splitting under read mmu_lock, we need to take
> > write mmu_lock.
> >
> > I drafted a change as below (I will refine some parts of it later).
> > On average, the write mmu_lock is taken 11 times during VM boot.
> >
> > There's no significant difference in the count of 2M mappings
> > during guest kernel boot to login. On average:
> > before this patch: 1144 2M mappings
> > after this patch: 1143 2M mappings.
>
> Oh, hmm. Well, it's not a strong argument against it.
>
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index f999c15d8d3e..d4e98728f600 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> > {
> > return gfn & kvm_gfn_direct_bits(kvm);
> > }
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +
> > #endif
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f0afee2e283a..28c511d8b372 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
> > */
> > #define KVM_LPAGE_MIXED_FLAG BIT(31)
> >
> > +#define KVM_LPAGE_GUEST_INHIBIT_FLAG BIT(30)
> > +
> > static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > gfn_t gfn, int count)
> > {
> > @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> >
> > old = linfo->disallow_lpage;
> > linfo->disallow_lpage += count;
> > - WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> > + WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> > + (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
> > }
> > }
> >
> > @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> > bool ret = 0;
> >
> > lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > - lockdep_is_held(&kvm->slots_lock));
> > + lockdep_is_held(&kvm->slots_lock) ||
> > + srcu_read_lock_held(&kvm->srcu));
> >
> > if (tdp_mmu_enabled)
> > ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> >
> > return ret;
> > }
> > +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
> >
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
> > }
> >
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > + return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > + lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> > +
> > #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
> > int level)
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 244fd22683db..4028423cf595 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> > return -EINVAL;
> >
> > - /*
> > - * Split request with mmu_lock held for reading can only occur when one
> > - * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> > - * Ignore this 4KB mapping request by setting violation_request_level to
> > - * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> > - * would be a spurious fault. The vCPU accepting at 2MB will accept the
> > - * whole 2MB range.
> > - */
> > - if (mmu_lock_shared) {
> > - struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > - struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > - if (KVM_BUG_ON(!vcpu, kvm))
> > - return -EOPNOTSUPP;
> > -
> > - /* Request to map as 2MB leaf for the whole 2MB range */
> > - tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> > - tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > - tdx->violation_request_level = level;
> > -
> > - return -EBUSY;
> > - }
> > + if (mmu_lock_shared)
> > + return -EOPNOTSUPP;
> >
> > ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> > if (ret <= 0)
> > @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> >
> > -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > {
> > struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + gfn_t gfn = gpa_to_gfn(gpa);
> > + struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > int level = -1;
> > + u64 eeq_info;
> >
> > - u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (!slot)
> > + return 0;
> >
> > - u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > - TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > + if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> > + TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> >
> > - if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> > - level = (eeq_info & GENMASK(2, 0)) + 1;
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> >
> > - tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> > - tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > - tdx->violation_request_level = level;
> > - } else {
> > - tdx->violation_gfn_start = -1;
> > - tdx->violation_gfn_end = -1;
> > - tdx->violation_request_level = -1;
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> > + struct kvm_gfn_range gfn_range = {
> > + .start = gfn,
> > + .end = gfn + 1,
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> > +
> > + if (ret)
> > + return ret;
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
>
>
> Can you explain what you found regarding the write lock need?
Here, the write lock protects 2 steps:
(1) update lpage_info.
(2) try splitting if there's any existing 2MB mapping.
The write mmu_lock is needed because lpage_info is read under read mmu_lock in
kvm_tdp_mmu_map().
kvm_tdp_mmu_map
  kvm_mmu_hugepage_adjust
    kvm_lpage_info_max_mapping_level
If we updated lpage_info under the read mmu_lock, other vCPUs might map at a
stale 2MB level even after lpage_info is updated by hugepage_set_guest_inhibit().
Therefore, we must perform the splitting under the write mmu_lock to ensure
there are no 2MB mappings left after hugepage_set_guest_inhibit().
Otherwise, during the later mapping in __vmx_handle_ept_violation(), splitting
in the fault path could be triggered, as the KVM MMU would find the goal level
is 4KB while an existing 2MB mapping is present.
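To make it concrete, below is a minimal sketch of the sequence the write
mmu_lock protects, restated from the draft patch above (the helper name is
hypothetical and error handling is trimmed):

static int tdx_inhibit_2m_and_split(struct kvm *kvm,
				    struct kvm_memory_slot *slot, gfn_t gfn)
{
	struct kvm_gfn_range range = {
		.start = gfn,
		.end = gfn + 1,
		.slot = slot,
		.may_block = true,
		.attr_filter = KVM_FILTER_PRIVATE,
	};

	scoped_guard(write_lock, &kvm->mmu_lock) {
		/* Demote any existing 2MB leaf covering this gfn first. */
		int ret = kvm_split_boundary_leafs(kvm, &range);

		if (ret)
			return ret;

		/*
		 * Then publish the flag. Readers of lpage_info in
		 * kvm_tdp_mmu_map() hold the read mmu_lock, so once the
		 * write lock drops they observe the flag and cannot fault
		 * in a new 2MB leaf for this range.
		 */
		hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	}

	return 0;
}

Doing either step alone under the read lock would leave a window for a stale
2MB mapping.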
> For most accept
> cases, we could fault in the PTE's on the read lock. And in the future we could
The actual mapping at the 4KB level is still done under the read mmu_lock in
__vmx_handle_ept_violation().
> have a demote that could work under read lock, as we talked. So
> kvm_split_boundary_leafs() often or could be unneeded or work under read lock
> when needed.
Could we leave the "demote under read lock" as a future optimization?
> What is the problem in hugepage_set_guest_inhibit() that requires the write
> lock?
As above, to prevent other vCPUs from reading a stale mapping level, and to
avoid splitting under the read mmu_lock.
Since guest_inhibit is set one-way, we can test it using
hugepage_test_guest_inhibit() without holding the lock. The need to take the
write mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced
(in my testing, 11 times during VM boot).
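A sketch of how the lock-free test composes with the locked path (again with
hypothetical naming, building on the sketch earlier in this mail):

static int tdx_maybe_inhibit_2m(struct kvm *kvm,
				struct kvm_memory_slot *slot, gfn_t gfn)
{
	/*
	 * The flag is one-way: a racy read can only see a false negative,
	 * which harmlessly falls through to the write-locked slow path.
	 */
	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;

	return tdx_inhibit_2m_and_split(kvm, slot, gfn);
}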
> But in any case, it seems like we have *a* solution here. It doesn't seem like
> there are any big downsides. Should we close it?
I think it's good, as long as Sean doesn't disagree :)
> > + }
> > + }
> > }
> > +
> > + return 0;
> > }
> >
> > static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > */
> > exit_qual = EPT_VIOLATION_ACC_WRITE;
> >
> > - tdx_get_accept_level(vcpu, gpa);
> > + if (tdx_check_accept_level(vcpu, gpa))
> > + return RET_PF_RETRY;
> >
> > /* Only private GPA triggers zero-step mitigation */
> > local_retry = true;
> > @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> >
> > vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >
> > - tdx->violation_gfn_start = -1;
> > - tdx->violation_gfn_end = -1;
> > - tdx->violation_request_level = -1;
> > return 0;
> >
> > free_tdcx:
> > @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> > gfn_t gfn, bool prefetch)
> > {
> > - struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > - if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> > + if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
> > return PG_LEVEL_4K;
> >
> > - if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > - return tdx->violation_request_level;
> > -
> > return PG_LEVEL_2M;
> > }
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index acd18a01f63d..3a3077666ee6 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
> >
> > return NULL;
> > }
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
> >
> > bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> > {
>