Message-ID: <aJE9x_pjBVIdiEJN@google.com>
Date: Mon, 4 Aug 2025 16:09:59 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Vitaly Kuznetsov <vkuznets@...hat.com>
Cc: Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, linux-kernel@...r.kernel.org, 
	alanjiang@...rosoft.com, chinang.ma@...rosoft.com, 
	andrea.pellegrini@...rosoft.com, Kevin Tian <kevin.tian@...el.com>, 
	"K. Y. Srinivasan" <kys@...rosoft.com>, Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, 
	Dexuan Cui <decui@...rosoft.com>, linux-hyperv@...r.kernel.org, 
	Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB flushes

On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> > It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
> > level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
> > root_to_sp()), but for the core concept/skeleton, I think this is it?
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6e838cb6c9e1..298130445182 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
> >  
> > +struct kvm_tlb_flush_root {
> > +       struct kvm *kvm;
> > +       hpa_t root;
> > +};
> > +
> > +static void kvm_flush_tlb_root(void *__data)
> > +{
> > +       struct kvm_tlb_flush_root *data = __data;
> > +
> > +       kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
> > +}
> > +
> > +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > +       struct kvm_tlb_flush_root data = {
> > +               .kvm = kvm,
> > +               .root = __pa(root->spt),
> > +       };
> > +
> > +       /*
> > +        * Flush any TLB entries for the new root, the provenance of the root
> > +        * is unknown.  Even if KVM ensures there are no stale TLB entries
> > +        * for a freed root, in theory another hypervisor could have left
> > +        * stale entries.  Flushing on alloc also allows KVM to skip the TLB
> > +        * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
> > +        * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
> > +        * migrated to a different pCPU.
> > +        */
> > +       on_each_cpu(kvm_flush_tlb_root, &data, 1);
> 
> Would it make sense to complement this with e.g. a CPU mask tracking all
> the pCPUs where the VM has ever been seen running (+ a flush when a new
> one is added to it)?
> 
> I'm worried about the potential performance impact for a case when a
> huge host is running a lot of small VMs in 'partitioning' mode
> (i.e. when all vCPUs are pinned). Additionally, this may have a negative
> impact on RT use-cases where each unnecessary interruption can be seen
> as problematic.

Oof, right.  And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
vCPUs using nested TDP could generate a lot of noisy IRQs through a VM.  Hrm.

So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
pCPU tracking wouldn't work.  I do think you've got the right idea with a pCPU mask
though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.

With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2.  Allocating a cpumask for each
TDP MMU root seems reasonable.  Then on task migration, instead of doing a global
INVEPT, only INVEPT the current and prev_roots (because getting a new root will
trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
has already done a flush for the root.

Or we could do the optimized tracking for all roots.  x86 supports at most 8192
CPUs, which means 1KiB per root.  That doesn't seem at all painful given that
each shadow page consumes 4KiB...
