Message-ID: <ce7ef1f0-c098-4669-85f3-b6ebb437a568@linux.microsoft.com>
Date: Tue, 5 Aug 2025 20:04:09 +0200
From: Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>
To: Sean Christopherson <seanjc@...gle.com>,
 Vitaly Kuznetsov <vkuznets@...hat.com>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, linux-kernel@...r.kernel.org,
 alanjiang@...rosoft.com, chinang.ma@...rosoft.com,
 andrea.pellegrini@...rosoft.com, Kevin Tian <kevin.tian@...el.com>,
 "K. Y. Srinivasan" <kys@...rosoft.com>,
 Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>,
 Dexuan Cui <decui@...rosoft.com>, linux-hyperv@...r.kernel.org,
 Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] KVM: VMX: Use Hyper-V EPT flush for local TLB
 flushes

On 05/08/2025 01:09, Sean Christopherson wrote:
> On Mon, Aug 04, 2025, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@...gle.com> writes:
>>> It'll take more work than the below, e.g. to have VMX's construct_eptp() pull the
>>> level and A/D bits from kvm_mmu_page (vendor code can get at the kvm_mmu_page with
>>> root_to_sp()), but for the core concept/skeleton, I think this is it?
>>>
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 6e838cb6c9e1..298130445182 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -3839,6 +3839,37 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
>>>  }
>>>  EXPORT_SYMBOL_GPL(kvm_mmu_free_guest_mode_roots);
>>>  
>>> +struct kvm_tlb_flush_root {
>>> +       struct kvm *kvm;
>>> +       hpa_t root;
>>> +};
>>> +
>>> +static void kvm_flush_tlb_root(void *__data)
>>> +{
>>> +       struct kvm_tlb_flush_root *data = __data;
>>> +
>>> +       kvm_x86_call(flush_tlb_root)(data->kvm, data->root);
>>> +}
>>> +
>>> +void kvm_mmu_flush_all_tlbs_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>> +{
>>> +       struct kvm_tlb_flush_root data = {
>>> +               .kvm = kvm,
>>> +               .root = __pa(root->spt),
>>> +       };
>>> +
>>> +       /*
>>> +        * Flush any TLB entries for the new root, the provenance of the root
>>> +        * is unknown.  Even if KVM ensures there are no stale TLB entries
>>> +        * for a freed root, in theory another hypervisor could have left
>>> +        * stale entries.  Flushing on alloc also allows KVM to skip the TLB
>>> +        * flush when freeing a root (see kvm_tdp_mmu_put_root()), and flushing
>>> +        * TLBs on all CPUs allows KVM to elide TLB flushes when a vCPU is
>>> +        * migrated to a different pCPU.
>>> +        */
>>> +       on_each_cpu(kvm_flush_tlb_root, &data, 1);
>>
>> Would it make sense to complement this with e.g. a CPU mask tracking all
>> the pCPUs where the VM has ever been seen running (+ a flush when a new
>> one is added to it)?
>>
>> I'm worried about the potential performance impact for a case when a
>> huge host is running a lot of small VMs in 'partitioning' mode
>> (i.e. when all vCPUs are pinned). Additionally, this may have a negative
>> impact on RT use-cases where each unnecessary interruption can be seen
>> as problematic.
> 
> Oof, right.  And it's not even a VM-to-VM noisy neighbor problem, e.g. a few
> vCPUs using nested TDP could generate a lot of noisy IRQs through a VM.  Hrm.
> 
> So I think the basic idea is so flawed/garbage that even enhancing it with per-VM
> pCPU tracking wouldn't work.  I do think you've got the right idea with a pCPU mask
> though, but instead of using a mask to scope IPIs, use it to elide TLB flushes.

Sorry for the delay in replying, I've been sidetracked a bit.

I like this idea more; not special-casing the TLB flushing approach per hypervisor is
preferable.

> 
> With the TDP MMU, KVM can have at most 6 non-nested roots active at any given time:
> SMM vs. non-SMM, 4-level vs. 5-level, L1 vs. L2.  Allocating a cpumask for each
> TDP MMU root seems reasonable.  Then on task migration, instead of doing a global
> INVEPT, only INVEPT the current and prev_roots (because getting a new root will
> trigger a flush in kvm_mmu_load()), and skip INVEPT on TDP MMU roots if the pCPU
> has already done a flush for the root.

Just to make sure I follow: by current+prev_roots do you mean literally those (i.e. the
cached prev_roots), or all the roots on kvm->arch.tdp_mmu_roots?

So this would mean: on pCPU migration, check whether the current MMU is_tdp_mmu_active(),
and if so perform INVEPT-single over those roots instead of INVEPT-global; otherwise stick
with KVM_REQ_TLB_FLUSH.

Would there need to be a check for is_guest_mode(), or that the switch is coming from
vmx/nested.c? I suppose not, because nested doesn't seem to use the TDP MMU.
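
To make sure I'm picturing the same flow, here is a rough (untested) sketch of the
migration path I have in mind. Treat the names as placeholders: "flushed_cpus" is a
made-up per-root cpumask_t added to struct kvm_mmu_page, and flush_tlb_root is the
vendor hook from your skeleton above.

/*
 * Sketch only: flushed_cpus tracks which pCPUs have already done a flush
 * for this root since it was allocated.
 */
static void kvm_mmu_flush_root_on_this_cpu(struct kvm_vcpu *vcpu, hpa_t root_hpa)
{
        struct kvm_mmu_page *sp;

        if (!VALID_PAGE(root_hpa))
                return;

        sp = root_to_sp(root_hpa);
        if (!sp)
                return;

        /* Elide the flush if this pCPU already flushed this root. */
        if (cpumask_test_and_set_cpu(smp_processor_id(), &sp->flushed_cpus))
                return;

        kvm_x86_call(flush_tlb_root)(vcpu->kvm, root_hpa);
}

/*
 * Called with preemption disabled when a pCPU migration is detected
 * (e.g. from the vcpu_load path), instead of requesting a global flush.
 */
static void kvm_mmu_flush_roots_on_migration(struct kvm_vcpu *vcpu)
{
        struct kvm_mmu *mmu = vcpu->arch.mmu;
        int i;

        kvm_mmu_flush_root_on_this_cpu(vcpu, mmu->root.hpa);
        for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
                kvm_mmu_flush_root_on_this_cpu(vcpu, mmu->prev_roots[i].hpa);
}

The test_and_set keeps it to one atomic per root on migration, and I guess the mask would
start out empty on root allocation and get populated by the flush in kvm_mmu_load() and by
migrations like the above.
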
> 
> Or we could do the optimized tracking for all roots.  x86 supports at most 8192
> CPUs, which means 1KiB per root.  That doesn't seem at all painful given that
> each shadow page consumes 4KiB...

Similar question here: exactly which roots would need to be tracked+flushed for shadow
paging? The pae_roots?
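
For reference, the per-root cost I arrive at for that option, assuming the mask from the
sketch above is embedded directly in struct kvm_mmu_page (again, placeholder names):

#include <linux/cpumask.h>

/*
 * With CONFIG_NR_CPUS=8192, cpumask_t is 8192 bits / 8 = 1024 bytes, i.e.
 * 1KiB per root next to the 4KiB spt page each shadow page already consumes;
 * the at most 6 non-nested TDP MMU roots would be ~6KiB, shadow paging roots more.
 */
struct kvm_mmu_page_sketch {            /* stand-in for struct kvm_mmu_page */
        u64 *spt;                       /* the 4KiB root page */
        cpumask_t flushed_cpus;         /* hypothetical: 1KiB with NR_CPUS=8192 */
};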

