Message-ID: <5f42d1ef-f6b7-c339-32b9-f4cf48c21841@redhat.com>
Date: Mon, 14 Feb 2022 17:34:02 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
vkuznets@...hat.com, mlevitsk@...hat.com, dmatlack@...gle.com
Subject: Re: [PATCH 12/12] KVM: x86: do not unload MMU roots on all role
changes
On 2/11/22 19:48, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> kvm_mmu_reset_context is called on all role changes and right now it
>> calls kvm_mmu_unload. With the legacy MMU this is a relatively cheap
>> operation; the previous PGDs remain in the hash table and are picked
>> up immediately on the next page fault. With the TDP MMU, however, the
>> roots are thrown away for good and a full rebuild of the page tables is
>> necessary, which is many times more expensive.
>>
>> Fortunately, throwing away the roots is not necessary except when
>> the manual says a TLB flush is required:
>>
>> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>> the x86 architecture specification)
>>
>> - changing CPUID (which changes the interpretation of page tables in
>> ways not reflected by the role).
>>
>> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
>>
>> Except for these cases, once the MMU has updated the CPU/MMU roles
>> and metadata it is enough to force-reload the current value of CR3.
>> KVM will look up the cached roots for an entry with the right role and
>> PGD, and only if the cache misses a new root will be created.
>>
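For reference, the cached-root lookup described above is conceptually the
loop below, simplified from fast_pgd_switch()/cached_root_available() in
mmu.c (the real code also rotates prev_roots MRU-first; treat the helper
and field names as approximate, they vary across kernel versions):

    /* A cached root is reusable only if both its PGD and role match. */
    static bool root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
                            union kvm_mmu_page_role role)
    {
            return root->pgd == pgd && VALID_PAGE(root->hpa) &&
                   role.word == to_shadow_page(root->hpa)->role.word;
    }

    static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
                                      union kvm_mmu_page_role new_role)
    {
            struct kvm_mmu *mmu = vcpu->arch.mmu;
            unsigned int i;

            for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
                    if (root_usable(&mmu->prev_roots[i], new_pgd, new_role))
                            return true;    /* cache hit, reuse the root */
            }
            return false;                   /* miss: a new root is built */
    }
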
>> Measuring with vmexit.flat from kvm-unit-tests shows the following
>> improvement:
>>
>>              TDP    legacy    shadow
>>   before    46754     5096      5150
>>   after      4879     4875      5006
>>
>> which is for very small page tables. The impact is, however, much larger
>> when running as an L1 hypervisor, because the new page tables cause
>> extra work for L0 to shadow them.
>>
>> Reported-by: Brad Spengler <spender@...ecurity.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@...hat.com>
>> ---
>> arch/x86/kvm/mmu/mmu.c | 7 ++++---
>> arch/x86/kvm/x86.c | 27 ++++++++++++++++++---------
>> 2 files changed, 22 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 38b40ddcaad7..dbd4e98ba426 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>> void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> {
>> /*
>> - * Invalidate all MMU roles to force them to reinitialize as CPUID
>> - * information is factored into reserved bit calculations.
>> + * Invalidate all MMU roles and roots to force them to reinitialize,
>> + * as CPUID information is factored into reserved bit calculations.
>> *
>> * Correctly handling multiple vCPU models with respect to paging and
>> * physical address properties) in a single VM would require tracking
>> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>> vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>> vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
>> + kvm_mmu_unload(vcpu);
>> kvm_mmu_reset_context(vcpu);
>>
>> /*
>> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>
>> void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>> {
>> - kvm_mmu_unload(vcpu);
>> kvm_init_mmu(vcpu);
>> + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
>
> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> affected, e.g. SMM transitions, KVM_SET_SREG, etc...
SMM exit does flush the TLB because RSM clears CR0.PG (I did check this
:)). SMM re-entry then does not need to flush. But I don't think SMM
exit should flush the TLB *for non-SMM roles*.
For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree
it is certainly safer to keep it that way.
> Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> The call to kvm_mmu_new_pgd() is also
*white noise*
> To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}(). In
> the future we can/should work on avoiding unload in all paths, but again, future
> problem.
I disagree on this. There aren't many calls to kvm_mmu_reset_context.
>>
>> - if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>> + if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>> + /* Flush the TLB if CR0.PG is changed 1 -> 0. */
>> + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>> + kvm_mmu_unload(vcpu);
>
> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> to the comment, or with SMEP handling. And the SMEP handling isn't coherent with
> respect to the changelog. Please elaborate :-)
Yep, will do (the CR0.PG=0 case is similar to the CR4.PCIDE=0 case
below). Using kvm_mmu_unload() avoids loading a cached root just to
throw it away immediately after, but I can change this to a new
KVM_REQ_MMU_UPDATE_ROOT flag that does
kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
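Something like this, to make it concrete (hypothetical sketch; the request
does not exist yet and the request number is made up):

    /* hypothetical new request, number chosen arbitrarily */
    #define KVM_REQ_MMU_UPDATE_ROOT        KVM_ARCH_REQ(31)

    /* in kvm_post_set_cr0(), instead of calling kvm_mmu_unload(): */
            kvm_make_request(KVM_REQ_MMU_UPDATE_ROOT, vcpu);

    /* ... and in vcpu_enter_guest(), with the other request handlers: */
            if (kvm_check_request(KVM_REQ_MMU_UPDATE_ROOT, vcpu))
                    kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);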
By the way, I have a possibly stupid question. In kvm_set_cr3 (called
e.g. from emulator_set_cr()) there is
    if (cr3 != kvm_read_cr3(vcpu))
            kvm_mmu_new_pgd(vcpu, cr3);
What makes this work if mmu_is_nested(vcpu)? Should this also have an
"if (... & !tdp_enabled)"?
>> - else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
>> + if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
>> + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>
> This mishandles CR4.PGE. Per the comment above, the if-elif-elif sequence relies
> on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.
>
> For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
> MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
> doesn't emulate global pages.
Makes sense, yes. It also needs to handle flushing the current PCID
when changing CR4.PAE (previously done for "free" by
kvm_mmu_reset_context), but I agree with the idea.
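For example, something like this could sit next to the PCIDE handling
(hypothetical and untested; whether KVM_REQ_TLB_FLUSH_CURRENT is the right
scope is exactly what would need to be double-checked):

    /* Toggling CR4.PAE changes the paging mode: flush the current context. */
    if ((cr4 ^ old_cr4) & X86_CR4_PAE)
            kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);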
Paolo
> This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.
>
>
> ---
> arch/x86/kvm/mmu/mmu.c | 4 ++--
> arch/x86/kvm/x86.c | 42 +++++++++++++++++++++++++++++-------------
> 2 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e41834748d52..c477c519c784 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
> void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> /*
> - * Invalidate all MMU roles to force them to reinitialize as CPUID
> - * information is factored into reserved bit calculations.
> + * Invalidate all MMU roles and roots to force them to reinitialize,
> + * as CPUID information is factored into reserved bit calculations.
> *
> * Correctly handling multiple vCPU models with respect to paging and
> * physical address properties) in a single VM would require tracking
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 782dc9cd31d8..b8dad04301ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
> }
> EXPORT_SYMBOL_GPL(load_pdptrs);
>
> +static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
> +{
> + kvm_init_mmu(vcpu);
> + kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> +}
> +
> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
> {
> if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> kvm_clear_async_pf_completion_queue(vcpu);
> kvm_async_pf_hash_reset(vcpu);
> +
> + /*
> + * Clearing CR0.PG is architecturally defined as flushing the
> + * TLB from the guest's perspective.
> + */
> + if (!(cr0 & X86_CR0_PG))
> + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> }
>
> if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> - kvm_mmu_reset_context(vcpu);
> + kvm_post_set_cr_reinit_mmu(vcpu);
>
> if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
> void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
> {
> /*
> - * If any role bit is changed, the MMU needs to be reset.
> - *
> * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
> * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
> * according to the SDM; however, stale prev_roots could be reused
> * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
> * free them all. KVM_REQ_MMU_RELOAD is fit for the both cases; it
> * is slow, but changing CR4.PCIDE is a rare case.
> - *
> - * If CR4.PGE is changed, the guest TLB must be flushed.
> - *
> - * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> - * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> - * the usage of "else if".
> */
> - if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> - kvm_mmu_reset_context(vcpu);
> - else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> + if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
> kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> - else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
> + return;
> + }
> +
> + /* If any role bit is changed, the MMU needs to be reinitialized. */
> + if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> + kvm_post_set_cr_reinit_mmu(vcpu);
> +
> + /*
> + * Setting SMEP or toggling PGE is architecturally defined as flushing
> + * the TLB from the guest's perspective. Note, because the shadow MMU
> + * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
> + */
> + if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
> + ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
> kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> }
> EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
>
> base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
> --
>