[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20497464-0606-7ea5-89b8-8f5cd56a1a68@redhat.com>
Date: Sat, 5 Mar 2022 20:53:42 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
David Hildenbrand <david@...hat.com>,
David Matlack <dmatlack@...gle.com>,
Ben Gardon <bgardon@...gle.com>,
Mingwei Zhang <mizhang@...gle.com>
Subject: Re: [PATCH v4 21/30] KVM: x86/mmu: Zap invalidated roots via
asynchronous worker
On 3/5/22 01:34, Sean Christopherson wrote:
> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>> On 3/4/22 17:02, Sean Christopherson wrote:
>>> On Fri, Mar 04, 2022, Paolo Bonzini wrote:
>>>> On 3/3/22 22:32, Sean Christopherson wrote:
>>>> I didn't remove the paragraph from the commit message, but I think it's
>>>> unnecessary now. The workqueue is flushed in kvm_mmu_zap_all_fast() and
>>>> kvm_mmu_uninit_tdp_mmu(), unlike the buggy patch, so it doesn't need to take
>>>> a reference to the VM.
>>>>
>>>> I think I don't even need to check kvm->users_count in the defunct root
>>>> case, as long as kvm_mmu_uninit_tdp_mmu() flushes and destroys the workqueue
>>>> before it checks that the lists are empty.
>>>
>>> Yes, that should work. IIRC, the WARN_ONs will tell us/you quite quickly if
>>> we're wrong :-) mmu_notifier_unregister() will call the "slow" kvm_mmu_zap_all()
>>> and thus ensure all non-root pages zapped, but "leaking" a worker will trigger
>>> the WARN_ON that there are no roots on the list.
>>
>> Good, for the record these are the commit messages I have:
I'm seeing some hangs in ~50% of installation jobs, both Windows and
Linux. I have not yet tried to reproduce outside the automated tests,
or to bisect, but I'll try to push at least the first part of the series
for 5.18.
Paolo
>> KVM: x86/mmu: Zap invalidated roots via asynchronous worker
>> Use the system worker threads to zap the roots invalidated
>> by the TDP MMU's "fast zap" mechanism, implemented by
>> kvm_tdp_mmu_invalidate_all_roots().
>> At this point, apart from allowing some parallelism in the zapping of
>> roots, the workqueue is a glorified linked list: work items are added and
>> flushed entirely within a single kvm->slots_lock critical section. However,
>> the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
>> assumes that it owns a reference to all invalid roots; therefore, no
>> one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the
>> invalidated roots on a linked list... erm, on a workqueue ensures that
>> tdp_mmu_zap_root_work() only puts back those extra references that
>> kvm_mmu_zap_all_invalidated_roots() had gifted to it.
>>
>> and
>>
>> KVM: x86/mmu: Zap defunct roots via asynchronous worker
>> Zap defunct roots, a.k.a. roots that have been invalidated after their
>> last reference was initially dropped, asynchronously via the existing work
>> queue instead of forcing the work upon the unfortunate task that happened
>> to drop the last reference.
>> If a vCPU task drops the last reference, the vCPU is effectively blocked
>> by the host for the entire duration of the zap. If the root being zapped
>> happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
>> being active, the zap can take several hundred seconds. Unsurprisingly,
>> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>> E.g. running a synthetic selftest that triggers a vCPU root zap with
>> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
>> Offloading the zap to a worker drops the block time to <100ms.
>> There is an important nuance to this change. If the same work item
>> was queued twice before the work function has run, it would only
>> execute once and one reference would be leaked. Therefore, now that
>> queueing items is not anymore protected by write_lock(&kvm->mmu_lock),
>> kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
>> skip already invalid roots. On the other hand, kvm_mmu_zap_all_fast()
>> must return only after those skipped roots have been zapped as well.
>> These two requirements can be satisfied only if _all_ places that
>> change invalid to true now schedule the worker before releasing the
>> mmu_lock. There are just two, kvm_tdp_mmu_put_root() and
>> kvm_tdp_mmu_invalidate_all_roots().
>
> Very nice!
>
Powered by blists - more mailing lists