Message-ID: <aHWjPSIdp5B-2UBl@google.com>
Date: Mon, 14 Jul 2025 17:39:25 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Vitaly Kuznetsov <vkuznets@...hat.com>
Cc: Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, linux-kernel@...r.kernel.org, 
	alanjiang@...rosoft.com, chinang.ma@...rosoft.com, 
	andrea.pellegrini@...rosoft.com, Kevin Tian <kevin.tian@...el.com>, 
	"K. Y. Srinivasan" <kys@...rosoft.com>, Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, 
	Dexuan Cui <decui@...rosoft.com>, linux-hyperv@...r.kernel.org, 
	Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes

On Wed, Jul 09, 2025, Vitaly Kuznetsov wrote:
> Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com> writes:
> 
> > On 27/06/2025 10:31, Vitaly Kuznetsov wrote:
> >> Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com> writes:
> >> 
> >>> Use Hyper-V's HvCallFlushGuestPhysicalAddressSpace for local TLB flushes.
> >>> This makes any KVM_REQ_TLB_FLUSH_CURRENT (such as on root alloc) visible to
> >>> all CPUs which means we no longer need to do a KVM_REQ_TLB_FLUSH on CPU
> >>> migration.
> >>>
> >>> The goal is to avoid invept-global in KVM_REQ_TLB_FLUSH. Hyper-V uses a
> >>> shadow page table for the nested hypervisor (KVM) and has to invalidate all
> >>> EPT roots when invept-global is issued.

What all does "invalidate" mean here?  E.g. is this "just" a hardware TLB flush,
or is Hyper-V blasting away and rebuilding page tables?  If it's the latter, that
seems like a Hyper-V bug/problem.

> >>> This has a performance impact on all nested VMs.  KVM issues
> >>> KVM_REQ_TLB_FLUSH on CPU migration, and under load the performance hit
> >>> causes vCPUs to use up more of their slice of CPU time, leading to more
> >>> CPU migrations. This has a snowball effect and causes CPU usage spikes.

Is the performance hit due to flushing *hardware* TLBs, or due to Hyper-V needing
to rebuild shadow page tables?

> >>> By issuing the hypercall we are now guaranteed that any root modification
> >>> that requires a local TLB flush becomes visible to all CPUs. The same
> >>> hypercall is already used in kvm_arch_flush_remote_tlbs and
> >>> kvm_arch_flush_remote_tlbs_range.  The KVM expectation is that roots are
> >>> flushed locally on alloc and we achieve consistency on migration by
> >>> flushing all roots - the new behavior of achieving consistency on alloc on
> >>> Hyper-V is a superset of the expected guarantees. This makes the
> >>> KVM_REQ_TLB_FLUSH on CPU migration no longer necessary on Hyper-V.
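
To make the proposal concrete, the change amounts to something like the below
(a sketch reconstructed from the diff fragment quoted further down, not the
actual patch: hv_use_ept_flush() is a made-up placeholder for however the
patch detects KVM-on-Hyper-V):

static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	u64 root_hpa = mmu->root.hpa;

	/* No flush required if the current context is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	if (enable_ept) {
		/*
		 * hyperv_flush_guest_mapping() has invept-single semantics
		 * across all pCPUs, so a successful hypercall makes the
		 * invept-global on CPU migration unnecessary.  Fall back
		 * to a local flush if the hypercall fails.
		 */
		if (!hv_use_ept_flush(vcpu->kvm) ||
		    hyperv_flush_guest_mapping(root_hpa))
			ept_sync_context(construct_eptp(vcpu, root_hpa,
							mmu->root_role.level));
	} else {
		vpid_sync_context(vmx_get_current_vpid(vcpu));
	}
}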
> >> 
> >> Sounds reasonable overall, my only concern (not sure if valid or not) is
> >> that using the hypercall for local flushes is going to be more expensive
> >> than the invept-context we do today, and thus while performance is
> >> improved for the scenario where vCPUs migrate a lot, we will take a
> >> hit in other cases.
> >> 
> >
> 
> Sorry for delayed reply!
> 
> ....
> 
> >>>  		return;
> >>>  
> >>> -	if (enable_ept)
> >>> +	if (enable_ept) {
> >>> +		/*
> >>> +		 * hyperv_flush_guest_mapping() has the semantics of
> >>> +		 * invept-single across all pCPUs. This makes root
> >>> +		 * modifications consistent across pCPUs, so an invept-global
> >>> +		 * on migration is no longer required.

Unfortunately, this isn't quite right.  If vCPU0 and vCPU1 share an EPT root,
APIC virtualization is enabled, and vCPU0 is running with x2APIC but vCPU1 is
running with xAPIC, then KVM needs to flush TLBs if vCPU1 is loaded on a "new"
pCPU, because vCPU0 could have inserted non-vAPIC TLB entries for that pCPU.

Hrm, but KVM doesn't actually handle that properly.  KVM only forces a TLB flush
if the vCPU wasn't already loaded, so if vCPU0 and vCPU1 are running on the same
pCPU, i.e. vCPU1 isn't being migrated to the pCPU that was previously running
vCPU0, then I believe vCPU1 could consume stale TLB entries.

Setting that aside for the moment, I would much prefer to elide this TLB flush
whenever possible, irrespective of whether KVM is running on bare metal or in a
VM, and irrespective of the host hypervisor.  And then if/when SVM is converted
to use per-vCPU ASIDs[*], give SVM the exact same treatment.  More below.

[*] https://lore.kernel.org/all/aFXrFKvZcJ3dN4k_@google.com

> >> HvCallFlushGuestPhysicalAddressSpace sounds like a heavy operation as it
> >> affects all processors. Is there any visible performance impact of this
> >> change when there are no migrations (e.g. with vCPU pinning)? Or do we
> >> believe that Hyper-V actually handles invept-context the exact same way?
> >> 
> > I'm going to have to do some more investigation to answer that - do you have an
> > idea of a workload that would be sensitive to TLB flushes that I could compare
> > this on?
> >
> > In terms of cost, Hyper-V needs to invalidate the VM's shadow page table for a root
> > and do the TLB flush. The first part is CPU intensive but is the same in both cases
> > (hypercall and invept-single). The TLB flush part will require a bit more work for
> > the hypercall as it needs to happen on all cores, and the TLB will now be empty
> > for that root.
> >
> > My assumption is that these local TLB flushes are rather rare as they will
> > only happen when:
> > - new root is allocated
> > - we need to switch to a special root
> >
> 
> KVM's MMU is an amazing maze so I'd appreciate it if someone more
> knowledgeable corrects me; my understanding is that we call
> *_flush_tlb_current() from two places:
> 
> kvm_mmu_load(), and this covers the two cases above. These should not be
> common under normal circumstances but can be frequent in some special
> cases, e.g. when running a nested setup. Given that we're already
> running on top of Hyper-V, this means 3+ level nesting, which I don't
> believe anyone really cares about.

Heh, don't be too sure about that.  People just love running "containers" inside
VMs, without thinking too hard about what they're doing :-)

In general, I don't like effectively turning KVM_REQ_TLB_FLUSH_CURRENT into
kvm_flush_remote_tlbs(), and I *really* don't like doing so for one specific
setup.  It's hard enough to capture the differences between KVM's various TLB
flush hooks/requests, and special-casing KVM-on-Hyper-V is just asking for
unique bugs.

Conceptually, I _think_ this is pretty straightforward: when a root is allocated,
flush the root on all *pCPUs*.  KVM currently flushes the root when a vCPU first
uses a root, which necessitates flushing on migration.

Alternatively, KVM could track which pCPUs a vCPU has run on, but that would get
expensive, and blasting a flush on alloc should be much simpler.
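
Purely for comparison, the tracking alternative would look something like the
below, assuming a new used_cpus cpumask in kvm_mmu_page and reusing the
kvm_flush_tlb_root() helper from the diff at the end (both hypothetical):

static void kvm_mmu_track_root_use(struct kvm_mmu_page *root)
{
	/* Would run on every vCPU load, i.e. in a hot path. */
	cpumask_set_cpu(raw_smp_processor_id(), &root->used_cpus);
}

static void kvm_mmu_flush_root_users(struct kvm *kvm, struct kvm_mmu_page *root)
{
	struct kvm_tlb_flush_root data = {
		.kvm = kvm,
		.root = __pa(root->spt),
	};

	/* Flush only the pCPUs that ever ran with this root loaded. */
	on_each_cpu_mask(&root->used_cpus, kvm_flush_tlb_root, &data, 1);
}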

The two wrinkles I can think of are the x2APIC vs. xAPIC problem above (which I
think needs to be handled no matter what), and CPU hotplug (which is easy enough
to handle, I just didn't type it up).
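
(For the hotplug wrinkle, one obvious shape is a cpuhp callback that runs on
the incoming CPU, e.g. the below; the flush_tlb_all_local() hook and the state
name are invented for illustration.)

static int kvm_mmu_cpu_online(unsigned int cpu)
{
	/*
	 * Runs on the CPU coming online.  Roots allocated while this CPU
	 * was offline were broadcast-flushed without it, so nuke all TLB
	 * entries locally before any vCPU can be loaded here.
	 */
	kvm_x86_call(flush_tlb_all_local)();
	return 0;
}

	/* Registered once at init, e.g.: */
	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "kvm/mmu:online",
			  kvm_mmu_cpu_online, NULL);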

It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
root_to_sp()), but for the core concept/skeleton, I think this is it?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..298130445182 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
 
+struct kvm_tlb_flush_root {
+       struct kvm *kvm;
+       hpa_t root;
+};
+
+static void kvm_flush_tlb_root(void *__data)
+{
+       struct kvm_tlb_flush_root *data = __data;
+
+       kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
+}
+
+void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+       struct kvm_tlb_flush_root data = {
+               .kvm = kvm,
+               .root = __pa(root->spt),
+       };
+
+       /*
+        * Flush any TLB entries for the new root, the provenance of the root
+        * is unknown.  Even if KVM ensures there are no stale TLB entries
+        * for a freed root, in theory another hypervisor could have left
+        * stale entries.  Flushing on alloc also allows KVM to skip the TLB
+        * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
+        * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
+        * migrated to a different pCPU.
+        */
+       on_each_cpu(kvm_flush_tlb_root, &data, 1);
+}
+
 static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
                            u8 level)
 {
@@ -3852,7 +3883,8 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
        WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
        sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
-       ++sp->root_count;
+       if (!sp->root_count++)
+               kvm_mmu_flush_all_tlbs_root(vcpu->kvm, sp);
 
        return __pa(sp->spt);
 }
@@ -5961,15 +5993,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
        kvm_mmu_sync_roots(vcpu);
 
        kvm_mmu_load_pgd(vcpu);
-
-       /*
-        * Flush any TLB entries for the new root, the provenance of the root
-        * is unknown.  Even if KVM ensures there are no stale TLB entries
-        * for a freed root, in theory another hypervisor could have left
-        * stale entries.  Flushing on alloc also allows KVM to skip the TLB
-        * flush when freeing a root (see kvm_tdp_mmu_put_root()).
-        */
-       kvm_x86_call(flush_tlb_current)(vcpu);
 out:
        return r;
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 65f3c89d7c5d..3cbf0d612f5e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -167,6 +167,8 @@ static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
        return sp->role.is_mirror;
 }
 
+void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root);
+
 static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
        /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f3d7229b2c1..3ff36d09b4fa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -302,6 +302,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
         */
        refcount_set(&root->tdp_mmu_root_count, 2);
        list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
+       kvm_mmu_flush_all_tlbs_root(vcpu->kvm, root);
 
 out_spin_unlock:
        spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
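
For completeness, the VMX flush_tlb_root() hook might end up looking roughly
like the below once the EPTP can be constructed from a kvm_mmu_page instead of
a vCPU; the name and the A/D handling are guesses, per the caveat above the
diff:

static void vmx_flush_tlb_root(struct kvm *kvm, hpa_t root_hpa)
{
	struct kvm_mmu_page *sp = root_to_sp(root_hpa);
	u64 eptp = VMX_EPTP_MT_WB | root_hpa;

	/* Pull the page-walk level from the root, not from a vCPU. */
	eptp |= (sp->role.level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;

	/* Ditto for A/D bits; role.ad_disabled is the existing tracking. */
	if (!sp->role.ad_disabled)
		eptp |= VMX_EPTP_AD_ENABLE_BIT;

	ept_sync_context(eptp);
}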
