[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANgfPd-PiULe2eSaSK1BiHgWZ8z54vJfjbc5GaDfexi8x5eorQ@mail.gmail.com>
Date: Wed, 25 May 2022 16:12:28 -0700
From: Ben Gardon <bgardon@...gle.com>
To: kvm <kvm@...r.kernel.org>, Paolo Bonzini <pbonzini@...hat.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Peter Xu <peterx@...hat.com>,
Sean Christopherson <seanjc@...gle.com>,
David Matlack <dmatlack@...gle.com>,
Jim Mattson <jmattson@...gle.com>,
David Dunn <daviddunn@...gle.com>,
Jing Zhang <jingzhangos@...gle.com>,
Junaid Shahid <junaids@...gle.com>
Subject: Re: [PATCH] KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
On Wed, May 25, 2022 at 4:09 PM Ben Gardon <bgardon@...gle.com> wrote:
>
> When disabling dirty logging, zap non-leaf parent entries to allow
> replacement with huge pages instead of recursing and zapping all of the
> child, leaf entries. This reduces the number of TLB flushes required.
>
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
> to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
> the shadow MMU. This patch reduces the disable dirty log time with the
> TDP MMU to ~3 seconds.
>
After Sean pointed out that I was changing the zapping scheme to zap
non-leaf SPTEs in my in-place promotion series, I started to wonder if
that would provide good-enough performance without the complexity of
in-place promo. Turns out it does! This relatively simple patch gives
essentially the same disable time as the in-place promo series.
The main downside to this approach is that it does cause all the vCPUs
to take page faults, so it may still be worth investigating in-place
promotion.
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> patch introduced no new failures.
>
> Signed-off-by: Ben Gardon <bgardon@...gle.com>
> ---
> arch/x86/kvm/mmu/tdp_iter.c | 9 +++++++++
> arch/x86/kvm/mmu/tdp_iter.h | 1 +
> arch/x86/kvm/mmu/tdp_mmu.c | 38 +++++++++++++++++++++++++++++++------
> 3 files changed, 42 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index 6d3b3e5a5533..ee4802d7b36c 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -145,6 +145,15 @@ static bool try_step_up(struct tdp_iter *iter)
> return true;
> }
>
> +/*
> + * Step the iterator back up a level in the paging structure. Should only be
> + * used when the iterator is below the root level.
> + */
> +void tdp_iter_step_up(struct tdp_iter *iter)
> +{
> + WARN_ON(!try_step_up(iter));
> +}
> +
> /*
> * Step to the next SPTE in a pre-order traversal of the paging structure.
> * To get to the next SPTE, the iterator either steps down towards the goal
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index f0af385c56e0..adfca0cf94d3 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -114,5 +114,6 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> int min_level, gfn_t next_last_level_gfn);
> void tdp_iter_next(struct tdp_iter *iter);
> void tdp_iter_restart(struct tdp_iter *iter);
> +void tdp_iter_step_up(struct tdp_iter *iter);
>
> #endif /* __KVM_X86_MMU_TDP_ITER_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 841feaa48be5..7b9265d67131 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1742,12 +1742,12 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> gfn_t start = slot->base_gfn;
> gfn_t end = start + slot->npages;
> struct tdp_iter iter;
> + int max_mapping_level;
> kvm_pfn_t pfn;
>
> rcu_read_lock();
>
> tdp_root_for_each_pte(iter, root, start, end) {
> -retry:
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
>
> @@ -1755,15 +1755,41 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> !is_last_spte(iter.old_spte, iter.level))
> continue;
>
> + /*
> + * This is a leaf SPTE. Check if the PFN it maps can
> + * be mapped at a higher level.
> + */
> pfn = spte_to_pfn(iter.old_spte);
> - if (kvm_is_reserved_pfn(pfn) ||
> - iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
> - pfn, PG_LEVEL_NUM))
> +
> + if (kvm_is_reserved_pfn(pfn))
> continue;
>
> + max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> + iter.gfn, pfn, PG_LEVEL_NUM);
> +
> + WARN_ON(max_mapping_level < iter.level);
> +
> + /*
> + * If this page is already mapped at the highest
> + * viable level, there's nothing more to do.
> + */
> + if (max_mapping_level == iter.level)
> + continue;
> +
> + /*
> + * The page can be remapped at a higher level, so step
> + * up to zap the parent SPTE.
> + */
> + while (max_mapping_level > iter.level)
> + tdp_iter_step_up(&iter);
> +
> /* Note, a successful atomic zap also does a remote TLB flush. */
> - if (tdp_mmu_zap_spte_atomic(kvm, &iter))
> - goto retry;
> + tdp_mmu_zap_spte_atomic(kvm, &iter);
> +
> + /*
> + * If the atomic zap fails, the iter will recurse back into
> + * the same subtree to retry.
> + */
> }
>
> rcu_read_unlock();
> --
> 2.36.1.124.g0e6072fb45-goog
>
Powered by blists - more mailing lists