Message-ID: <aJ87AGnK9J0mafoi@LAPTOP-I1KNRUTF.localdomain>
Date: Fri, 15 Aug 2025 15:49:52 +0200
From: Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	linux-kernel@...r.kernel.org, alanjiang@...rosoft.com,
	chinang.ma@...rosoft.com, andrea.pellegrini@...rosoft.com,
	Kevin Tian <kevin.tian@...el.com>,
	"K. Y. Srinivasan" <kys@...rosoft.com>,
	Haiyang Zhang <haiyangz@...rosoft.com>,
	Wei Liu <wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>,
	linux-hyperv@...r.kernel.org, Paolo Bonzini <pbonzini@...hat.com>,
	kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB
 flushes

On Tue, Aug 05, 2025 at 04:42:46PM -0700, Sean Christopherson wrote:
> On Tue, Aug 05, 2025, Jeremi Piotrowski wrote:
> > On 05/08/2025 01:09, Sean Christopherson wrote:
> > > On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
> > >> Sean Christopherson <seanjc@...gle.com> writes:

(snip)

> > > 
> > > Oof, right.  And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
> > > vCPUs using nested TDP could generate a lot of noisy IRQs through a VM.  Hrm.
> > > 
> > > So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
> > > pCPU tracking wouldn't work.  I do think you've got the right idea with a pCPU mask
> > > though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.
> > 
> > Sorry for the delay in replying, I've been sidetracked a bit.
> 
> No worries, I guarantee my delays will make your delays pale in comparison :-D
> 
> > I like this idea more; not special-casing the TLB flushing approach per hypervisor
> > is preferable.
> > 
> > > 
> > > With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
> > > SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2.  Allocating a cpumask for each
> > > TDP MMU root seems reasonable.  Then on task migration, instead of doing a global
> > > INVEPT, only INVEPT the current and prev_roots (because getting a new root will
> > > trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
> > > has already done a flush for the root.
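
To make sure I'm reading this right, and ignoring the per-pCPU elision for a
second, the VMX side of the migration path would end up looking roughly like
the below (the function name is made up, and I'm only assuming that
ept_sync_context()/construct_eptp() can be reused for the per-root INVEPT):

static void vmx_flush_roots_on_migration(struct kvm_vcpu *vcpu)
{
        struct kvm_mmu *mmu = vcpu->arch.mmu;
        struct kvm_mmu_page *sp;
        int i;

        /* INVEPT-single the current root instead of a global INVEPT. */
        if (VALID_PAGE(mmu->root.hpa))
                ept_sync_context(construct_eptp(vcpu, mmu->root.hpa,
                                                mmu->root_role.level));

        /*
         * Cover the cached previous roots too, since switching back to one
         * of them doesn't flush.  Brand new roots get flushed in
         * kvm_mmu_load() and don't need handling here.
         */
        for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
                if (!VALID_PAGE(mmu->prev_roots[i].hpa))
                        continue;

                sp = root_to_sp(mmu->prev_roots[i].hpa);
                if (!sp)        /* dummy root, nothing to flush */
                        continue;

                ept_sync_context(construct_eptp(vcpu, mmu->prev_roots[i].hpa,
                                                sp->role.level));
        }
}
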
> > 
> > Just to make sure I follow: by current+prev_roots do you mean literally those
> > (i.e. the cached prev roots) or all roots on kvm->arch.tdp_mmu_roots?
> 
> The former, i.e. "root" and all "prev_roots" entries in a kvm_mmu structure.
> 
> > So this would mean: on pCPU migration, check whether the current mmu is_tdp_mmu_active()
> > and then perform INVEPT-single over the roots instead of INVEPT-global. Otherwise stick
> > to the KVM_REQ_TLB_FLUSH.
> 
> No, KVM would still need to ensure shadow roots are flushed as well, because KVM
> doesn't flush TLBs when switching to a previous root (see fast_pgd_switch()).
> More at the bottom.
> 
> > Would there need to be a check for is_guest_mode(), or that the switch is
> > coming from vmx/nested.c? I suppose not because nested doesn't seem to
> > use TDP MMU.
> 
> Nested can use the TDP MMU, though there's practically no code in KVM that explicitly
> deals with this scenario.  If L1 is using legacy shadow paging, i.e. is NOT using
> EPT/NPT, then KVM will use the TDP MMU to map L2 (with kvm_mmu_page_role.guest_mode=1
> to differentiate from the L1 TDP MMU).
> 
> > > Or we could do the optimized tracking for all roots.  x86 supports at most 8192
> > > CPUs, which means 1KiB per root.  That doesn't seem at all painful given that
> > > each shadow page consumes 4KiB...
> > 
> > Similar question here: which all roots would need to be tracked+flushed for shadow
> > paging? pae_roots?
> 
> Same general answer, "root" and all "prev_roots" entries.  KVM uses up to two
> "struct kvm_mmu" instances to actually map memory into the guest: root_mmu and
> guest_mmu.  The third instance, nested_mmu, is used to model gva->gpa translations
> for L2, i.e. is used only to walk L2 stage-1 page tables, and is never used to
> map memory into the guest, i.e. can't have entries in hardware TLBs.
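
So if I follow, the full set of roots that can be live in hardware TLBs would
be walked with something like this (made-up helper, just to check which mmus
matter):

static void kvm_for_each_hw_tlb_root(struct kvm_vcpu *vcpu,
                                     void (*fn)(struct kvm_vcpu *vcpu, hpa_t root))
{
        /* nested_mmu is skipped on purpose: it only models L2 gva->gpa walks. */
        struct kvm_mmu *mmus[] = { &vcpu->arch.root_mmu, &vcpu->arch.guest_mmu };
        int i, j;

        for (i = 0; i < ARRAY_SIZE(mmus); i++) {
                if (VALID_PAGE(mmus[i]->root.hpa))
                        fn(vcpu, mmus[i]->root.hpa);

                for (j = 0; j < KVM_MMU_NUM_PREV_ROOTS; j++) {
                        if (VALID_PAGE(mmus[i]->prev_roots[j].hpa))
                                fn(vcpu, mmus[i]->prev_roots[j].hpa);
                }
        }
}
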
> 
> The basic gist is to add a cpumask in each root, and then elide TLB flushes on
> pCPU migration if KVM has flushed the root at least once.  Patch 5/5 in the attached
> set of patches provides a *very* rough sketch.  Hopefully it's enough to convey the
> core idea.
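
Restating the gist in code so I'm sure I've got it (flushed_cpus is the new,
optional per-root mask; I assume it also has to be invalidated wherever a full
flush of the root is forced):

static bool kvm_mmu_root_needs_flush(struct kvm_mmu_page *root_sp)
{
        int cpu = raw_smp_processor_id();

        /* No mask allocated => tracking disabled, always flush. */
        if (!root_sp->flushed_cpus)
                return true;

        /* First use of this root on this pCPU: flush once and record it. */
        if (!cpumask_test_cpu(cpu, root_sp->flushed_cpus)) {
                cpumask_set_cpu(cpu, root_sp->flushed_cpus);
                return true;
        }

        return false;
}
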
> 
> Patches 1-4 compile, but are otherwise untested.  I'll post patches 1-3 as a small
> series once they're tested, as those cleanups are worth doing irrespective of any
> optimizations we make to pCPU migration.
> 

Thanks for the detailed explanation and the patches, Sean!
I started working on extending patch 5 and wanted to post it here to make sure
I'm on the right track.

It works in testing so far and shows promising performance - it gets rid of all
the pathological cases I saw before.

I haven't checked whether I broke SVM yet, and I need to figure out a way to
always keep the cpumask "offstack" so that we don't blow up every struct
kvm_mmu_page instance with an inline cpumask - it needs to stay optional.
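
The least intrusive option I can think of is to not use cpumask_var_t at all
(it is only a pointer with CONFIG_CPUMASK_OFFSTACK=y) and instead hang a plain
struct cpumask pointer off the struct, allocated only when the optimization is
enabled, e.g. (the knob name is made up):

static int kvm_mmu_alloc_flushed_cpus(struct kvm_mmu_page *sp)
{
        /* Hypothetical knob gating the whole optimization. */
        if (!track_root_flushes)
                return 0;

        /* cpumask_size() sizes the bitmap for the configured number of CPUs. */
        sp->flushed_cpus = kzalloc(cpumask_size(), GFP_KERNEL_ACCOUNT);
        return sp->flushed_cpus ? 0 : -ENOMEM;
}

The allocation would presumably have to happen outside of mmu_lock, or come
out of the existing memory caches, but that looks manageable.
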

I also came across kvm_mmu_is_dummy_root(); that check is already included in
root_to_sp(). Can you think of any other checks we might need to handle?
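
To be concrete, this is the shape I'm leaning towards, relying on root_to_sp()
returning NULL for the dummy root (kvm_mmu_root_needs_flush() being the helper
sketched above):

static bool kvm_root_hpa_needs_flush(hpa_t root)
{
        struct kvm_mmu_page *sp = root_to_sp(root);

        /*
         * Dummy/invalid roots have no kvm_mmu_page and thus no mask to
         * consult; conservatively fall back to flushing.
         */
        if (!sp)
                return true;

        return kvm_mmu_root_needs_flush(sp);
}
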

View attachment "0001-KVM-VMX-Sketch-in-possible-framework-for-eliding-TLB.patch" of type "text/x-diff" (8065 bytes)
