Message-ID: <dab82e2c91c8ad019cda835ef8d528a7101509fa.camel@intel.com>
Date: Tue, 1 Jul 2025 00:42:33 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "Zhao, Yan Y" <yan.y.zhao@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > >
> > > I guess it's the latter one, as it avoids modifications to both EDK2 and the
> > > Linux guest. I observed ~2710 instances of "guest accepts at 4KB when KVM can
> > > map at 2MB" during the boot-up of a TD with 4GB memory.
> >
> > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > spread across? What memory size did you use? What was your guest memory
> > configuration?
> The guest memory is 4GB, 8 vCPUs.
> The memory slots layout is:
> slot 1: base gfn=0, npages=0x80000
> slot 2: base gfn=0x100000, npages=0x80000
> slot 3: base gfn=0xffc00, npages=0x400
>
> The GFN spread for the ~2710 instances is:
> GFNs 0x806-0x9ff (1 time for each of 506 pages)
> GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> GFNs 0x7d3ff-0x7e7fe (repeated private-to-shared and shared-to-private
> conversions are performed on this range), with the top 3 among them being:
> 0x7d9da (476 times)
> 0x7d9d9 (156 times)
> 0x7d9d7 (974 times)
>
> All those instances are from vCPU 0, when the guest is in EDK2 and during early
> kernel boot.
>
> Based on my observation, the count of these instances does not scale with guest
> memory. In other words, the count remains roughly the same even when the guest
> memory is increased to 8GB.
So the impact would be negligible. The mmu write lock would not see much, if
any, contention.
>
> > > But does it mean TDX needs to hold the write mmu_lock in the EPT violation
> > > handler and set KVM_LPAGE_GUEST_INHIBIT on finding a violation that carries
> > > 4KB level info?
> >
> > I think so. I didn't check the reason, but the other similar code takes it.
> > Maybe not? If we don't need to take the mmu write lock, then this idea seems
> > like a clear winner to me.
> Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to
> split. So, if we don't want to support splitting under the read mmu_lock, we
> need to take the write mmu_lock.
>
> I drafted a change as below (I will refine some parts of it later).
> On average, the write mmu_lock is taken 11 times during VM boot.
>
> There's no significant difference in the count of 2M mappings.
> During guest kernel boot to login, on average:
> before this patch: 1144 2M mappings
> after this patch: 1143 2M mappings.
Oh, hmm. Well, it's not a strong argument against it.
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index f999c15d8d3e..d4e98728f600 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> {
> return gfn & kvm_gfn_direct_bits(kvm);
> }
> +
> +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> +
> #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f0afee2e283a..28c511d8b372 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
> */
> #define KVM_LPAGE_MIXED_FLAG BIT(31)
>
> +#define KVM_LPAGE_GUEST_INHIBIT_FLAG BIT(30)
> +
> static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> gfn_t gfn, int count)
> {
> @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
>
> old = linfo->disallow_lpage;
> linfo->disallow_lpage += count;
> - WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> + WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> + (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
> }
> }
>
> @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> bool ret = 0;
>
> lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> - lockdep_is_held(&kvm->slots_lock));
> + lockdep_is_held(&kvm->slots_lock) ||
> + srcu_read_lock_held(&kvm->srcu));
>
> if (tdp_mmu_enabled)
> ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
>
> return ret;
> }
> +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
>
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
> }
>
> +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> +{
> + return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> +}
> +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> +
> +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> +{
> + lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> +}
> +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> +
> #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
> int level)
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 244fd22683db..4028423cf595 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> return -EINVAL;
>
> - /*
> - * Split request with mmu_lock held for reading can only occur when one
> - * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> - * Ignore this 4KB mapping request by setting violation_request_level to
> - * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> - * would be a spurious fault. The vCPU accepting at 2MB will accept the
> - * whole 2MB range.
> - */
> - if (mmu_lock_shared) {
> - struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> - struct vcpu_tdx *tdx = to_tdx(vcpu);
> -
> - if (KVM_BUG_ON(!vcpu, kvm))
> - return -EOPNOTSUPP;
> -
> - /* Request to map as 2MB leaf for the whole 2MB range */
> - tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> - tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> - tdx->violation_request_level = level;
> -
> - return -EBUSY;
> - }
> + if (mmu_lock_shared)
> + return -EOPNOTSUPP;
>
> ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> if (ret <= 0)
> @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> }
>
> -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> {
> struct vcpu_tdx *tdx = to_tdx(vcpu);
> + struct kvm *kvm = vcpu->kvm;
> + gfn_t gfn = gpa_to_gfn(gpa);
> + struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> int level = -1;
> + u64 eeq_info;
>
> - u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> + if (!slot)
> + return 0;
>
> - u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> - TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> + if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> + TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> + return 0;
>
> - if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> - level = (eeq_info & GENMASK(2, 0)) + 1;
> + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
>
> - tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> - tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> - tdx->violation_request_level = level;
> - } else {
> - tdx->violation_gfn_start = -1;
> - tdx->violation_gfn_end = -1;
> - tdx->violation_request_level = -1;
> + level = (eeq_info & GENMASK(2, 0)) + 1;
> +
> + if (level == PG_LEVEL_4K) {
> + if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> + struct kvm_gfn_range gfn_range = {
> + .start = gfn,
> + .end = gfn + 1,
> + .slot = slot,
> + .may_block = true,
> + .attr_filter = KVM_FILTER_PRIVATE,
> + };
> +
> + scoped_guard(write_lock, &kvm->mmu_lock) {
> + int ret;
> +
> + ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> +
> + if (ret)
> + return ret;
> +
> + hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
Can you explain what you found regarding the need for the write lock? For most
accept cases, we could fault in the PTEs under the read lock. And in the future
we could have a demote that works under the read lock, as we discussed. So
kvm_split_boundary_leafs() would often be unneeded, or could work under the
read lock when it is needed.
What is the problem in hugepage_set_guest_inhibit() that requires the write
lock?
But in any case, it seems like we have *a* solution here. It doesn't seem like
there are any big downsides. Should we close it?
> + }
> + }
> }
> +
> + return 0;
> }
>
> static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> */
> exit_qual = EPT_VIOLATION_ACC_WRITE;
>
> - tdx_get_accept_level(vcpu, gpa);
> + if (tdx_check_accept_level(vcpu, gpa))
> + return RET_PF_RETRY;
>
> /* Only private GPA triggers zero-step mitigation */
> local_retry = true;
> @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>
> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>
> - tdx->violation_gfn_start = -1;
> - tdx->violation_gfn_end = -1;
> - tdx->violation_request_level = -1;
> return 0;
>
> free_tdcx:
> @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> gfn_t gfn, bool prefetch)
> {
> - struct vcpu_tdx *tdx = to_tdx(vcpu);
> -
> - if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> + if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
> return PG_LEVEL_4K;
>
> - if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> - return tdx->violation_request_level;
> -
> return PG_LEVEL_2M;
> }
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index acd18a01f63d..3a3077666ee6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>
> return NULL;
> }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>
> bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> {