linux-kernel - Re: [PATCH] KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANgfPd-uAXu6aCoEji2BeiHWJnw-PrictuQLOYqLKrQ47WFydw@mail.gmail.com>
Date:   Thu, 26 May 2022 08:57:43 -0700
From:   Ben Gardon <bgardon@...gle.com>
To:     Yuan Yao <yuan.yao@...ux.intel.com>
Cc:     kvm <kvm@...r.kernel.org>, Paolo Bonzini <pbonzini@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Xu <peterx@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        David Matlack <dmatlack@...gle.com>,
        Jim Mattson <jmattson@...gle.com>,
        David Dunn <daviddunn@...gle.com>,
        Jing Zhang <jingzhangos@...gle.com>,
        Junaid Shahid <junaids@...gle.com>
Subject: Re: [PATCH] KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging

On Wed, May 25, 2022 at 6:30 PM Yuan Yao <yuan.yao@...ux.intel.com> wrote:
>
> On Wed, May 25, 2022 at 11:09:04PM +0000, Ben Gardon wrote:
> > When disabling dirty logging, zap non-leaf parent entries to allow
> > replacement with huge pages instead of recursing and zapping all of the
> > child, leaf entries. This reduces the number of TLB flushes required.
> >
> > Currently disabling dirty logging with the TDP MMU is extremely slow.
> > On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
> > to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
> > the shadow MMU. This patch reduces the disable dirty log time with the
> > TDP MMU to ~3 seconds.
> >
> > Testing:
> > Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> > patch introduced no new failures.
> >
> > Signed-off-by: Ben Gardon <bgardon@...gle.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_iter.c |  9 +++++++++
> >  arch/x86/kvm/mmu/tdp_iter.h |  1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c  | 38 +++++++++++++++++++++++++++++++------
> >  3 files changed, 42 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> > index 6d3b3e5a5533..ee4802d7b36c 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.c
> > +++ b/arch/x86/kvm/mmu/tdp_iter.c
> > @@ -145,6 +145,15 @@ static bool try_step_up(struct tdp_iter *iter)
> >       return true;
> >  }
> >
> > +/*
> > + * Step the iterator back up a level in the paging structure. Should only be
> > + * used when the iterator is below the root level.
> > + */
> > +void tdp_iter_step_up(struct tdp_iter *iter)
> > +{
> > +     WARN_ON(!try_step_up(iter));
> > +}
> > +
> >  /*
> >   * Step to the next SPTE in a pre-order traversal of the paging structure.
> >   * To get to the next SPTE, the iterator either steps down towards the goal
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index f0af385c56e0..adfca0cf94d3 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -114,5 +114,6 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
> >                   int min_level, gfn_t next_last_level_gfn);
> >  void tdp_iter_next(struct tdp_iter *iter);
> >  void tdp_iter_restart(struct tdp_iter *iter);
> > +void tdp_iter_step_up(struct tdp_iter *iter);
> >
> >  #endif /* __KVM_X86_MMU_TDP_ITER_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 841feaa48be5..7b9265d67131 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1742,12 +1742,12 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >       gfn_t start = slot->base_gfn;
> >       gfn_t end = start + slot->npages;
> >       struct tdp_iter iter;
> > +     int max_mapping_level;
> >       kvm_pfn_t pfn;
> >
> >       rcu_read_lock();
> >
> >       tdp_root_for_each_pte(iter, root, start, end) {
> > -retry:
> >               if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> >                       continue;
> >
> > @@ -1755,15 +1755,41 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >                   !is_last_spte(iter.old_spte, iter.level))
> >                       continue;
> >
> > +             /*
> > +              * This is a leaf SPTE. Check if the PFN it maps can
> > +              * be mapped at a higher level.
> > +              */
> >               pfn = spte_to_pfn(iter.old_spte);
> > -             if (kvm_is_reserved_pfn(pfn) ||
> > -                 iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
> > -                                                         pfn, PG_LEVEL_NUM))
> > +
> > +             if (kvm_is_reserved_pfn(pfn))
> >                       continue;
> >
> > +             max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
> > +                             iter.gfn, pfn, PG_LEVEL_NUM);
> > +
> > +             WARN_ON(max_mapping_level < iter.level);
> > +
> > +             /*
> > +              * If this page is already mapped at the highest
> > +              * viable level, there's nothing more to do.
> > +              */
> > +             if (max_mapping_level == iter.level)
> > +                     continue;
> > +
> > +             /*
> > +              * The page can be remapped at a higher level, so step
> > +              * up to zap the parent SPTE.
> > +              */
> > +             while (max_mapping_level > iter.level)
> > +                     tdp_iter_step_up(&iter);
>
> So the benefit from this is:
> Before: Zap 512 ptes in 4K level page table do TLB flush 512 times.
> Now: Zap higher level 1 2MB level pte do TLB flush 1 time, event
>      it also handles all 512 lower level 4K ptes, but just atomic operation
>      there, see handle_removed_pt().
>
> Is my understanding correct ?

Yes, that's exactly right.

>
> > +
> >               /* Note, a successful atomic zap also does a remote TLB flush. */
> > -             if (tdp_mmu_zap_spte_atomic(kvm, &iter))
> > -                     goto retry;
> > +             tdp_mmu_zap_spte_atomic(kvm, &iter);
> > +
> > +             /*
> > +              * If the atomic zap fails, the iter will recurse back into
> > +              * the same subtree to retry.
> > +              */
> >       }
> >
> >       rcu_read_unlock();
> > --
> > 2.36.1.124.g0e6072fb45-goog
> >