linux-kernel - Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YiKwFznqqiB9VRyn@google.com>
Date:   Sat, 5 Mar 2022 00:34:31 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        David Hildenbrand <david@...hat.com>,
        David Matlack <dmatlack@...gle.com>,
        Ben Gardon <bgardon@...gle.com>,
        Mingwei Zhang <mizhang@...gle.com>
Subject: Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via
 asynchronous worker

On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> On 3/4/22 17:02, Sean Christopherson wrote:
> > On Fri, Mar 04, 2022, Paolo Bonzini wrote:
> > > On 3/3/22 22:32, Sean Christopherson wrote:
> > > I didn't remove the paragraph from the commit message, but I think it's
> > > unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
> > > kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
> > > a reference to the VM.
> > > 
> > > I think I don't even need to check kvm->users_count in the defunct root
> > > case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
> > > before it checks that the lists are empty.
> > 
> > Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> > we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> > and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> > the WARN_ON that there are no roots on the list.
> 
> Good, for the record these are the commit messages I have:
> 
>     KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>     Use the system worker threads to zap the roots invalidated
>     by the TDP MMU's "fast zap" mechanism, implemented by
>     kvm_tdp_mmu_invalidate_all_roots().
>     At this point, apart from allowing some parallelism in the zapping of
>     roots, the workqueue is a glorified linked list: work items are added and
>     flushed entirely within a single kvm->slots_lock critical section.  However,
>     the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
>     assumes that it owns a reference to all invalid roots; therefore, no
>     one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
>     invalidated roots on a linked list... erm, on a workqueue ensures that
>     tdp_mmu_zap_root_work() only puts back those extra references that
>     kvm_mmu_zap_all_invalidated_roots() had gifted to it.
> 
> and
> 
>     KVM: x86/mmu: Zap defunct roots via asynchronous worker
>     Zap defunct roots, a.k.a. roots that have been invalidated after their
>     last reference was initially dropped, asynchronously via the existing work
>     queue instead of forcing the work upon the unfortunate task that happened
>     to drop the last reference.
>     If a vCPU task drops the last reference, the vCPU is effectively blocked
>     by the host for the entire duration of the zap.  If the root being zapped
>     happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
>     being active, the zap can take several hundred seconds.  Unsurprisingly,
>     most guests are unhappy if a vCPU disappears for hundreds of seconds.
>     E.g. running a synthetic selftest that triggers a vCPU root zap with
>     ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
>     Offloading the zap to a worker drops the block time to <100ms.
>     There is an important nuance to this change.  If the same work item
>     was queued twice before the work function has run, it would only
>     execute once and one reference would be leaked.  Therefore, now that
>     queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
>     kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
>     skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
>     must return only after those skipped roots have been zapped as well.
>     These two requirements can be satisfied only if _all_ places that
>     change invalid to true now schedule the worker before releasing the
>     mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
>     kvm_tdp_mmu_invalidate_all_roots().

Very nice!