Message-ID: <aJKW9gTeyh0-pvcg@google.com>
Date: Tue, 5 Aug 2025 16:42:46 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
linux-kernel@...r.kernel.org, alanjiang@...rosoft.com,
chinang.ma@...rosoft.com, andrea.pellegrini@...rosoft.com,
Kevin Tian <kevin.tian@...el.com>, "K. Y. Srinivasan" <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>, linux-hyperv@...r.kernel.org,
Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes
On Tue, Aug 05, 2025, Jeremi Piotrowski wrote:
> On 05/08/2025 01:09, Sean Christopherson wrote:
> > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> >> Sean Christopherson <seanjc@...gle.com> writes:
> >>> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >>> +{
> >>> +	struct kvm_tlb_flush_root data = {
> >>> +		.kvm = kvm,
> >>> +		.root = __pa(root->spt),
> >>> +	};
> >>> +
> >>> +	/*
> >>> +	 * Flush any TLB entries for the new root, as the provenance of the
> >>> +	 * root is unknown.  Even if KVM ensures there are no stale TLB entries
> >>> +	 * for a freed root, in theory another hypervisor could have left
> >>> +	 * stale entries.  Flushing on alloc also allows KVM to skip the TLB
> >>> +	 * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
> >>> +	 * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
> >>> +	 * migrated to a different pCPU.
> >>> +	 */
> >>> +	on_each_cpu(kvm_flush_tlb_root, &data, 1);
> >>
> >> Would it make sense to complement this with e.g. a CPU mask tracking all
> >> the pCPUs where the VM has ever been seen running (+ a flush when a new
> >> one is added to it)?
> >>
> >> I'm worried about the potential performance impact for a case when a
> >> huge host is running a lot of small VMs in 'partitioning' mode
> >> (i.e. when all vCPUs are pinned). Additionally, this may have a negative
> >> impact on RT use-cases where each unnecessary interruption can be seen as
> >> problematic.
> >
> > Oof, right. And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
> > vCPUs using nested TDP could generate a lot of noisy IRQs through a VM. Hrm.
> >
> > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
> > pCPU tracking wouldn't work. I do think you've got the right idea with a pCPU mask
> > though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.
>
> Sorry for the delay in replying, I've been sidetracked a bit.
No worries, I guarantee my delays will make your delays pale in comparison :-D
> I like this idea more; not special-casing the TLB flushing approach per hypervisor is
> preferable.
>
> >
> > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
> > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2. Allocating a cpumask for each
> > TDP MMU root seems reasonable. Then on task migration, instead of doing a global
> > INVEPT, only INVEPT the current and prev_roots (because getting a new root will
> > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
> > has already done a flush for the root.
>
> Just to make sure I follow: by current+prev_roots do you mean literally those
> (i.e. cached prev roots) or all roots on kvm->arch.tdp_mmu_roots?
The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure.
> So this would mean: on pCPU migration, check whether the current mmu is_tdp_mmu_active()
> and, if so, perform INVEPT-single over the roots instead of INVEPT-global. Otherwise stick
> with KVM_REQ_TLB_FLUSH.
No, KVM would still need to ensure shadow roots are flushed as well, because KVM
doesn't flush TLBs when switching to a previous root (see fast_pgd_switch()).
More at the bottom.
> Would there need to be a check for is_guest_mode(), or that the switch is
> coming from the vmx/nested.c? I suppose not because nested doesn't seem to
> use TDP MMU.
Nested can use the TDP MMU, though there's practically no code in KVM that explicitly
deals with this scenario. If L1 is using legacy shadow paging, i.e. is NOT using
EPT/NPT, then KVM will use the TDP MMU to map L2 (with kvm_mmu_page_role.guest_mode=1
to differentiate from the L1 TDP MMU).
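To make that concrete, the role bit is the only thing that distinguishes such a root.
Purely illustrative (the helper name is made up, not actual KVM code; assumes the
tdp_mmu_page flag and role.guest_mode fields of kvm_mmu_page):

static bool example_root_is_l2_tdp_root(struct kvm_mmu_page *root)
{
	/*
	 * An L2 root built by the TDP MMU (L1 not using EPT/NPT) is still a
	 * TDP MMU page; role.guest_mode is what differentiates it from L1's
	 * TDP MMU roots.
	 */
	return root->tdp_mmu_page && root->role.guest_mode;
}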
> > Or we could do the optimized tracking for all roots. x86 supports at most 8192
> > CPUs, which means 1KiB per root. That doesn't seem at all painful given that
> > each shadow page consumes 4KiB...
>
> Similar question here: which all roots would need to be tracked+flushed for shadow
> paging? pae_roots?
Same general answer, "root" and all "prev_roots" entries. KVM uses up to two
"struct kvm_mmu" instances to actually map memory into the guest: root_mmu and
guest_mmu. The third instance, nested_mmu, is used to model gva->gpa translations
for L2, i.e. is used only to walk L2 stage-1 page tables, and is never used to
map memory into the guest, i.e. can't have entries in hardware TLBs.
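As illustration only (made-up helper name, assuming the current root/prev_roots layout
of struct kvm_mmu), visiting every root that can have hardware TLB entries would look
something like:

static void example_for_each_hw_root(struct kvm_vcpu *vcpu,
				     void (*fn)(struct kvm_vcpu *vcpu, hpa_t root_hpa))
{
	/* nested_mmu is deliberately excluded, it never maps memory into the guest. */
	struct kvm_mmu *mmus[] = { &vcpu->arch.root_mmu, &vcpu->arch.guest_mmu };
	int i, j;

	for (i = 0; i < ARRAY_SIZE(mmus); i++) {
		if (VALID_PAGE(mmus[i]->root.hpa))
			fn(vcpu, mmus[i]->root.hpa);

		for (j = 0; j < KVM_MMU_NUM_PREV_ROOTS; j++) {
			if (VALID_PAGE(mmus[i]->prev_roots[j].hpa))
				fn(vcpu, mmus[i]->prev_roots[j].hpa);
		}
	}
}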
The basic gist is to add a cpumask in each root, and then elide TLB flushes on
pCPU migration if KVM has flushed the root at least once. Patch 5/5 in the attached
set of patches provides a *very* rough sketch. Hopefully it's enough to convey the
core idea.
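For reference, the gist condensed into pseudo-KVM (the field and helper names here are
invented for illustration; the attached 5/5 is the real sketch):

static void example_flush_root_on_migration(struct kvm_vcpu *vcpu,
					    struct kvm_mmu_page *root)
{
	/* Assumes preemption is disabled, as it is in vcpu_load(). */
	int cpu = smp_processor_id();

	/*
	 * Elide the flush: this pCPU has already flushed this root at least
	 * once, so it can't hold entries from a previous life of the root.
	 */
	if (cpumask_test_cpu(cpu, root->example_flushed_cpus))
		return;

	/*
	 * Hypothetical per-root flush, e.g. INVEPT single-context on the
	 * root's EPTP for VMX.
	 */
	example_invept_root(vcpu, root);

	cpumask_set_cpu(cpu, root->example_flushed_cpus);
}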
Patches 1-4 compile, but are otherwise untested. I'll post patches 1-3 as a small
series once they're tested, as those cleanups are worth doing irrespective of any
optimizations we make to pCPU migration.
P.S. everyone and their mother thinks guest_mmu and nested_mmu are terrible names,
but no one has come up with names good enough to convince everyone to get out from
behind the bikeshed :-)
View attachment "0001-KVM-VMX-Hoist-construct_eptp-up-in-vmx.c.patch" of type "text/x-diff" (1724 bytes)
View attachment "0002-KVM-nVMX-Hardcode-dummy-EPTP-used-for-early-nested-c.patch" of type "text/x-diff" (2484 bytes)
View attachment "0003-KVM-VMX-Use-kvm_mmu_page-role-to-construct-EPTP-not-.patch" of type "text/x-diff" (2625 bytes)
View attachment "0004-KVM-VMX-Flush-only-active-EPT-roots-on-pCPU-migratio.patch" of type "text/x-diff" (1999 bytes)
View attachment "0005-KVM-VMX-Sketch-in-possible-framework-for-eliding-TLB.patch" of type "text/x-diff" (3486 bytes)