Message-ID: <aGNK2tO2W6+GWtt3@yzhao56-desk.sh.intel.com>
Date: Tue, 1 Jul 2025 10:41:30 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>, "Li, Xiaoyao"
<xiaoyao.li@...el.com>, "Huang, Kai" <kai.huang@...el.com>, "Du, Fan"
<fan.du@...el.com>, "Hansen, Dave" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>, "thomas.lendacky@....com"
<thomas.lendacky@....com>, "vbabka@...e.cz" <vbabka@...e.cz>, "Li, Zhiquan1"
<zhiquan1.li@...el.com>, "Shutemov, Kirill" <kirill.shutemov@...el.com>,
"michael.roth@....com" <michael.roth@....com>, "seanjc@...gle.com"
<seanjc@...gle.com>, "Weiny, Ira" <ira.weiny@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "pbonzini@...hat.com" <pbonzini@...hat.com>,
"Yamahata, Isaku" <isaku.yamahata@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "binbin.wu@...ux.intel.com"
<binbin.wu@...ux.intel.com>, "ackerleytng@...gle.com"
<ackerleytng@...gle.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Annapurve, Vishal" <vannapurve@...gle.com>, "tabba@...gle.com"
<tabba@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>, "Miao, Jun"
<jun.miao@...el.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 09/21] KVM: TDX: Enable 2MB mapping size after TD is
RUNNABLE
On Tue, Jul 01, 2025 at 08:42:33AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2025-06-26 at 16:53 +0800, Yan Zhao wrote:
> > On Wed, Jun 25, 2025 at 10:47:47PM +0800, Edgecombe, Rick P wrote:
> > > On Wed, 2025-06-25 at 17:28 +0800, Yan Zhao wrote:
> > > > On Wed, Jun 25, 2025 at 02:35:59AM +0800, Edgecombe, Rick P wrote:
> > > > > On Tue, 2025-06-24 at 17:57 +0800, Yan Zhao wrote:
> > > > >
> > > > I guess it's the latter one, as it avoids modifying both EDK2 and the Linux
> > > > guest. I observed ~2710 instances of "guest accepts at 4KB when KVM can map at
> > > > 2MB" during the boot-up of a TD with 4GB memory.
> > >
> > > Oh, wow that is more than I expected. Did you notice how many vCPUs they were
> > > spread across? What memory size did you use? What was your guest memory
> > > configuration?
> > The guest memory is 4GB, 8 vCPUs.
> > The memory slots layout is:
> > slot 1: base gfn=0, npages=0x80000
> > slot 2: base gfn=0x100000, npages=0x80000
> > slot 3: base gfn=0xffc00, npages=0x400
> >
> > The GFN spread for the ~2710 instances is:
> > GFNs 0x806-0x9ff (1 time for each of 506 pages)
> > GFNs 0x7e800-0x7e9ff (1 time for each of 512 pages)
> > GFNs 0x7d3ff-0x7e7fe (repeated private-to-shared and shared-to-private
> > conversions are performed on this range), with the top 3 among them being:
> > 0x7d9da (476 times)
> > 0x7d9d9 (156 times)
> > 0x7d9d7 (974 times)
> >
> > All those instances are from vCPU 0, when the guest is in EDK2 and during early
> > kernel boot.
> >
> > Based on my observation, the count of these instances does not scale with guest
> > memory. In other words, the count remains roughly the same even when the guest
> > memory is increased to 8GB.
>
> So the impact would be negligible. The mmu write lock would not meet much, if
> any, contention.
>
> >
> > > > But does it mean TDX needs to hold write mmu_lock in the EPT violation handler
> > > > and set KVM_LPAGE_GUEST_INHIBIT on finding a violation carries 4KB level info?
> > >
> > > I think so. I didn't check the reason, but the other similar code took it. Maybe
> > > not? If we don't need to take mmu write lock, then this idea seems like a clear
> > > winner to me.
> > Hmm, setting KVM_LPAGE_GUEST_INHIBIT needs to be followed by an attempt to
> > split.
> > So, if we don't want to support splitting under read mmu_lock, we need to take
> > write mmu_lock.
> >
> > I drafted a change as below (I will refine some parts of it later).
> > On average, the write mmu_lock is taken 11 times during VM boot.
> >
> > There's no significant difference in the count of 2M mappings
> > during guest kernel boot to login. On average:
> > before this patch: 1144 2M mappings
> > after this patch: 1143 2M mappings.
>
> Oh, hmm. Well, it's not a strong argument against it.
>
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index f999c15d8d3e..d4e98728f600 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -322,4 +322,8 @@ static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
> > {
> > return gfn & kvm_gfn_direct_bits(kvm);
> > }
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level);
> > +
> > #endif
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f0afee2e283a..28c511d8b372 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -721,6 +721,8 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
> > */
> > #define KVM_LPAGE_MIXED_FLAG BIT(31)
> >
> > +#define KVM_LPAGE_GUEST_INHIBIT_FLAG BIT(30)
> > +
> > static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> > gfn_t gfn, int count)
> > {
> > @@ -732,7 +734,8 @@ static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
> >
> > old = linfo->disallow_lpage;
> > linfo->disallow_lpage += count;
> > - WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
> > + WARN_ON_ONCE((old ^ linfo->disallow_lpage) &
> > + (KVM_LPAGE_MIXED_FLAG | KVM_LPAGE_GUEST_INHIBIT_FLAG));
> > }
> > }
> >
> > @@ -1653,13 +1656,15 @@ int kvm_split_boundary_leafs(struct kvm *kvm, struct kvm_gfn_range *range)
> > bool ret = 0;
> >
> > lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
> > - lockdep_is_held(&kvm->slots_lock));
> > + lockdep_is_held(&kvm->slots_lock) ||
> > + srcu_read_lock_held(&kvm->srcu));
> >
> > if (tdp_mmu_enabled)
> > ret = kvm_tdp_mmu_gfn_range_split_boundary(kvm, range);
> >
> > return ret;
> > }
> > +EXPORT_SYMBOL_GPL(kvm_split_boundary_leafs);
> >
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > @@ -7734,6 +7739,18 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
> > }
> >
> > +bool hugepage_test_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > + return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_test_guest_inhibit);
> > +
> > +void hugepage_set_guest_inhibit(struct kvm_memory_slot *slot, gfn_t gfn, int level)
> > +{
> > + lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_GUEST_INHIBIT_FLAG;
> > +}
> > +EXPORT_SYMBOL_GPL(hugepage_set_guest_inhibit);
> > +
> > #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
> > int level)
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 244fd22683db..4028423cf595 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1852,28 +1852,8 @@ int tdx_sept_split_private_spt(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > if (KVM_BUG_ON(to_kvm_tdx(kvm)->state != TD_STATE_RUNNABLE || level != PG_LEVEL_2M, kvm))
> > return -EINVAL;
> >
> > - /*
> > - * Split request with mmu_lock held for reading can only occur when one
> > - * vCPU accepts at 2MB level while another vCPU accepts at 4KB level.
> > - * Ignore this 4KB mapping request by setting violation_request_level to
> > - * 2MB and returning -EBUSY for retry. Then the next fault at 2MB level
> > - * would be a spurious fault. The vCPU accepting at 2MB will accept the
> > - * whole 2MB range.
> > - */
> > - if (mmu_lock_shared) {
> > - struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > - struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > - if (KVM_BUG_ON(!vcpu, kvm))
> > - return -EOPNOTSUPP;
> > -
> > - /* Request to map as 2MB leaf for the whole 2MB range */
> > - tdx->violation_gfn_start = gfn_round_for_level(gfn, level);
> > - tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > - tdx->violation_request_level = level;
> > -
> > - return -EBUSY;
> > - }
> > + if (mmu_lock_shared)
> > + return -EOPNOTSUPP;
> >
> > ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> > if (ret <= 0)
> > @@ -1937,28 +1917,51 @@ static inline bool tdx_is_sept_violation_unexpected_pending(struct kvm_vcpu *vcp
> > return !(eq & EPT_VIOLATION_PROT_MASK) && !(eq & EPT_VIOLATION_EXEC_FOR_RING3_LIN);
> > }
> >
> > -static inline void tdx_get_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > +static inline int tdx_check_accept_level(struct kvm_vcpu *vcpu, gpa_t gpa)
> > {
> > struct vcpu_tdx *tdx = to_tdx(vcpu);
> > + struct kvm *kvm = vcpu->kvm;
> > + gfn_t gfn = gpa_to_gfn(gpa);
> > + struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > int level = -1;
> > + u64 eeq_info;
> >
> > - u64 eeq_type = tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK;
> > + if (!slot)
> > + return 0;
> >
> > - u32 eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > - TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> > + if ((tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_TYPE_MASK) !=
> > + TDX_EXT_EXIT_QUAL_TYPE_ACCEPT)
> > + return 0;
> >
> > - if (eeq_type == TDX_EXT_EXIT_QUAL_TYPE_ACCEPT) {
> > - level = (eeq_info & GENMASK(2, 0)) + 1;
> > + eeq_info = (tdx->ext_exit_qualification & TDX_EXT_EXIT_QUAL_INFO_MASK) >>
> > + TDX_EXT_EXIT_QUAL_INFO_SHIFT;
> >
> > - tdx->violation_gfn_start = gfn_round_for_level(gpa_to_gfn(gpa), level);
> > - tdx->violation_gfn_end = tdx->violation_gfn_start + KVM_PAGES_PER_HPAGE(level);
> > - tdx->violation_request_level = level;
> > - } else {
> > - tdx->violation_gfn_start = -1;
> > - tdx->violation_gfn_end = -1;
> > - tdx->violation_request_level = -1;
> > + level = (eeq_info & GENMASK(2, 0)) + 1;
> > +
> > + if (level == PG_LEVEL_4K) {
> > + if (!hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M)) {
> > + struct kvm_gfn_range gfn_range = {
> > + .start = gfn,
> > + .end = gfn + 1,
> > + .slot = slot,
> > + .may_block = true,
> > + .attr_filter = KVM_FILTER_PRIVATE,
> > + };
> > +
> > + scoped_guard(write_lock, &kvm->mmu_lock) {
> > + int ret;
> > +
> > + ret = kvm_split_boundary_leafs(kvm, &gfn_range);
> > +
> > + if (ret)
> > + return ret;
> > +
> > + hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
>
>
> Can you explain what you found regarding the write lock need?
Here, the write lock protects 2 steps:
(1) update lpage_info.
(2) try splitting if there's any existing 2MB mapping.
The write mmu_lock is needed because lpage_info is read under read mmu_lock in
kvm_tdp_mmu_map().
kvm_tdp_mmu_map
  kvm_mmu_hugepage_adjust
    kvm_lpage_info_max_mapping_level
If we updated lpage_info under the read mmu_lock, other vCPUs might map at a
stale 2MB level even after lpage_info is updated by hugepage_set_guest_inhibit().
Therefore, we must perform the splitting under the write mmu_lock to ensure
there are no 2MB mappings left after hugepage_set_guest_inhibit().
Otherwise, during the later mapping in __vmx_handle_ept_violation(), splitting
in the fault path could be triggered, as the KVM MMU would find the goal level
is 4KB while an existing 2MB mapping is present.
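To make it concrete, below is a minimal sketch of the sequence the write
mmu_lock protects, restated from the draft patch above (the helper name is
hypothetical and error handling is trimmed):

static int tdx_inhibit_2m_and_split(struct kvm *kvm,
				    struct kvm_memory_slot *slot, gfn_t gfn)
{
	struct kvm_gfn_range range = {
		.start = gfn,
		.end = gfn + 1,
		.slot = slot,
		.may_block = true,
		.attr_filter = KVM_FILTER_PRIVATE,
	};

	scoped_guard(write_lock, &kvm->mmu_lock) {
		/* Demote any existing 2MB leaf covering this gfn first. */
		int ret = kvm_split_boundary_leafs(kvm, &range);

		if (ret)
			return ret;

		/*
		 * Then publish the flag. Readers of lpage_info in
		 * kvm_tdp_mmu_map() hold the read mmu_lock, so once the
		 * write lock drops they observe the flag and cannot fault
		 * in a new 2MB leaf for this range.
		 */
		hugepage_set_guest_inhibit(slot, gfn, PG_LEVEL_2M);
	}

	return 0;
}

Doing either step alone under the read lock would leave a window for a stale
2MB mapping.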
> For most accept
> cases, we could fault in the PTE's on the read lock. And in the future we could
The actual mapping at the 4KB level is still done under the read mmu_lock in
__vmx_handle_ept_violation().
> have a demote that could work under read lock, as we talked. So
> kvm_split_boundary_leafs() often or could be unneeded or work under read lock
> when needed.
Could we leave the "demote under read lock" as a future optimization?
> What is the problem in hugepage_set_guest_inhibit() that requires the write
> lock?
As above, to prevent other vCPUs from reading a stale mapping level, and to
avoid splitting under the read mmu_lock.
Since guest_inhibit is set one-way, we can test it using
hugepage_test_guest_inhibit() without holding the lock. The need to take the
write mmu_lock for hugepage_set_guest_inhibit() is then greatly reduced
(in my testing, 11 times during VM boot).
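A sketch of how the lock-free test composes with the locked path (again with
hypothetical naming, building on the sketch earlier in this mail):

static int tdx_maybe_inhibit_2m(struct kvm *kvm,
				struct kvm_memory_slot *slot, gfn_t gfn)
{
	/*
	 * The flag is one-way: a racy read can only see a false negative,
	 * which harmlessly falls through to the write-locked slow path.
	 */
	if (hugepage_test_guest_inhibit(slot, gfn, PG_LEVEL_2M))
		return 0;

	return tdx_inhibit_2m_and_split(kvm, slot, gfn);
}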
> But in any case, it seems like we have *a* solution here. It doesn't seem like
> there are any big downsides. Should we close it?
I think it's good, as long as Sean doesn't disagree :)
> > + }
> > + }
> > }
> > +
> > + return 0;
> > }
> >
> > static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > @@ -1987,7 +1990,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > */
> > exit_qual = EPT_VIOLATION_ACC_WRITE;
> >
> > - tdx_get_accept_level(vcpu, gpa);
> > + if (tdx_check_accept_level(vcpu, gpa))
> > + return RET_PF_RETRY;
> >
> > /* Only private GPA triggers zero-step mitigation */
> > local_retry = true;
> > @@ -3022,9 +3026,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> >
> > vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >
> > - tdx->violation_gfn_start = -1;
> > - tdx->violation_gfn_end = -1;
> > - tdx->violation_request_level = -1;
> > return 0;
> >
> > free_tdcx:
> > @@ -3373,14 +3374,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> > int tdx_gmem_private_max_mapping_level(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> > gfn_t gfn, bool prefetch)
> > {
> > - struct vcpu_tdx *tdx = to_tdx(vcpu);
> > -
> > - if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE) || prefetch))
> > + if (unlikely((to_kvm_tdx(vcpu->kvm)->state != TD_STATE_RUNNABLE)))
> > return PG_LEVEL_4K;
> >
> > - if (gfn >= tdx->violation_gfn_start && gfn < tdx->violation_gfn_end)
> > - return tdx->violation_request_level;
> > -
> > return PG_LEVEL_2M;
> > }
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index acd18a01f63d..3a3077666ee6 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2610,6 +2610,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
> >
> > return NULL;
> > }
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
> >
> > bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
> > {
>