Date:   Tue, 1 Mar 2022 09:57:05 -0800
From:   Ben Gardon <bgardon@...gle.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Christian Borntraeger <borntraeger@...ux.ibm.com>,
        Janosch Frank <frankja@...ux.ibm.com>,
        Claudio Imbrenda <imbrenda@...ux.ibm.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        David Hildenbrand <david@...hat.com>,
        kvm <kvm@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
        David Matlack <dmatlack@...gle.com>,
        Mingwei Zhang <mizhang@...gle.com>
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker

On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap.  If the root being zapped
> happens to be fully populated with 4KiB leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds.  Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64TiB of guest memory and 4KiB SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>

Reviewed-by: Ben Gardon <bgardon@...gle.com>

> ---
>  arch/x86/kvm/mmu/mmu_internal.h |  8 +++-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 65 ++++++++++++++++++++++++++++-----
>  2 files changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index be063b6c91b7..1bff453f7cbe 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -65,7 +65,13 @@ struct kvm_mmu_page {
>                 struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
>                 tdp_ptep_t ptep;
>         };
> -       DECLARE_BITMAP(unsync_child_bitmap, 512);
> +       union {
> +               DECLARE_BITMAP(unsync_child_bitmap, 512);
> +               struct {
> +                       struct work_struct tdp_mmu_async_work;
> +                       void *tdp_mmu_async_data;
> +               };
> +       };

At some point (probably not in this series since it's so long already)
it would be good to organize kvm_mmu_page. It looks like we've got
quite a few anonymous unions in there for TDP / Shadow MMU fields.

>
>         struct list_head lpage_disallowed_link;
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ec28a88c6376..4151e61245a7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
>  static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                              bool shared);
>
> +static void tdp_mmu_zap_root_async(struct work_struct *work)
> +{
> +       struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> +                                                tdp_mmu_async_work);
> +       struct kvm *kvm = root->tdp_mmu_async_data;
> +
> +       read_lock(&kvm->mmu_lock);
> +
> +       /*
> +        * A TLB flush is not necessary as KVM performs a local TLB flush when
> +        * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> +        * to a different pCPU.  Note, the local TLB flush on reuse also
> +        * invalidates any paging-structure-cache entries, i.e. TLB entries for
> +        * intermediate paging structures, that may be zapped, as such entries
> +        * are associated with the ASID on both VMX and SVM.
> +        */
> +       tdp_mmu_zap_root(kvm, root, true);
> +
> +       /*
> +        * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> +        * avoiding an infinite loop.  By design, the root is reachable while
> +        * it's being asynchronously zapped, thus a different task can put its
> +        * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> +        * asynchronously zapped root is unavoidable.
> +        */
> +       kvm_tdp_mmu_put_root(kvm, root, true);
> +
> +       read_unlock(&kvm->mmu_lock);
> +
> +       kvm_put_kvm(kvm);
> +}
> +
>  void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                           bool shared)
>  {
> @@ -142,15 +174,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>         refcount_set(&root->tdp_mmu_root_count, 1);
>
>         /*
> -        * Zap the root, then put the refcount "acquired" above.   Recursively
> -        * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> -        * infinite loop by freeing invalid roots.  By design, the root is
> -        * reachable while it's being zapped, thus a different task can put its
> -        * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> -        * defunct root is unavoidable.
> +        * Attempt to acquire a reference to KVM itself.  If KVM is alive, then
> +        * zap the root asynchronously in a worker, otherwise it must be zapped
> +        * directly here.  Wait to do this check until after the refcount is
> +        * reset so that tdp_mmu_zap_root() can safely yield.
> +        *
> +        * In both flows, zap the root, then put the refcount "acquired" above.
> +        * When putting the reference, use kvm_tdp_mmu_put_root() to test the
> +        * above logic for avoiding an infinite loop by freeing invalid roots.
> +        * By design, the root is reachable while it's being zapped, thus a
> +        * different task can put its last reference, i.e. flowing through
> +        * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
>          */
> -       tdp_mmu_zap_root(kvm, root, shared);
> -       kvm_tdp_mmu_put_root(kvm, root, shared);
> +       if (kvm_get_kvm_safe(kvm)) {
> +               root->tdp_mmu_async_data = kvm;
> +               INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
> +               schedule_work(&root->tdp_mmu_async_work);
> +       } else {
> +               tdp_mmu_zap_root(kvm, root, shared);
> +               kvm_tdp_mmu_put_root(kvm, root, shared);
> +       }
>  }
>
>  enum tdp_mmu_roots_iter_type {
> @@ -954,7 +997,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>
>         /*
>          * Zap all roots, including invalid roots, as all SPTEs must be dropped
> -        * before returning to the caller.
> +        * before returning to the caller.  Zap directly even if the root is
> +        * also being zapped by a worker.  Walking zapped top-level SPTEs isn't
> +        * all that expensive and mmu_lock is already held, which means the
> +        * worker has yielded, i.e. flushing the work instead of zapping here
> +        * isn't guaranteed to be any faster.
>          *
>          * A TLB flush is unnecessary, KVM zaps everything if and only the VM
>          * is being destroyed or the userspace VMM has exited.  In both cases,
> --
> 2.35.1.574.g5d30c73bfb-goog
>
