linux-kernel - Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via asynchronous worker

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <8b8c28cf-cf54-f889-be7d-afc9f5430ecd@redhat.com>
Date:   Fri, 4 Mar 2022 19:11:55 +0100
From:   Paolo Bonzini <pbonzini@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        David Hildenbrand <david@...hat.com>,
        David Matlack <dmatlack@...gle.com>,
        Ben Gardon <bgardon@...gle.com>,
        Mingwei Zhang <mizhang@...gle.com>
Subject: Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via
 asynchronous worker

On 3/4/22 17:02, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>> On 3/3/22 22:32, Sean Christopherson wrote:
>> I didn't remove the paragraph from the commit message, but I think it's
>> unnecessary now.  The workqueue is flushed in kvm_mmu_zap_all_fast() and
>> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
>> a reference to the VM.
>>
>> I think I don't even need to check kvm->users_count in the defunct root
>> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
>> before it checks that the lists are empty.
> 
> Yes, that should work.  IIRC, the WARN_ONs will tell us/you quite quickly if
> we're wrong :-)  mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
> and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
> the WARN_ON that there are no roots on the list.

Good, for the record these are the commit messages I have:

     KVM: x86/mmu: Zap invalidated roots via asynchronous worker
     
     Use the system worker threads to zap the roots invalidated
     by the TDP MMU's "fast zap" mechanism, implemented by
     kvm_tdp_mmu_invalidate_all_roots().
     
     At this point, apart from allowing some parallelism in the zapping of
     roots, the workqueue is a glorified linked list: work items are added and
     flushed entirely within a single kvm->slots_lock critical section.  However,
     the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
     assumes that it owns a reference to all invalid roots; therefore, no
     one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
     invalidated roots on a linked list... erm, on a workqueue ensures that
     tdp_mmu_zap_root_work() only puts back those extra references that
     kvm_mmu_zap_all_invalidated_roots() had gifted to it.

and

     KVM: x86/mmu: Zap defunct roots via asynchronous worker
     
     Zap defunct roots, a.k.a. roots that have been invalidated after their
     last reference was initially dropped, asynchronously via the existing work
     queue instead of forcing the work upon the unfortunate task that happened
     to drop the last reference.
     
     If a vCPU task drops the last reference, the vCPU is effectively blocked
     by the host for the entire duration of the zap.  If the root being zapped
     happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
     being active, the zap can take several hundred seconds.  Unsurprisingly,
     most guests are unhappy if a vCPU disappears for hundreds of seconds.
     
     E.g. running a synthetic selftest that triggers a vCPU root zap with
     ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
     Offloading the zap to a worker drops the block time to <100ms.
     
     There is an important nuance to this change.  If the same work item
     was queued twice before the work function has run, it would only
     execute once and one reference would be leaked.  Therefore, now that
     queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
     kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
     skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
     must return only after those skipped roots have been zapped as well.
     These two requirements can be satisfied only if _all_ places that
     change invalid to true now schedule the worker before releasing the
     mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
     kvm_tdp_mmu_invalidate_all_roots().

Paolo