Message-ID: <CANgfPd-pRRoAD3=eWgFOVmaHJnpnbUhwT84p4oJp1NovhxUCow@mail.gmail.com>
Date: Mon, 28 Feb 2022 16:43:03 -0800
From: Ben Gardon <bgardon@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Janosch Frank <frankja@...ux.ibm.com>,
Claudio Imbrenda <imbrenda@...ux.ibm.com>,
Vitaly Kuznetsov <vkuznets@...hat.com>,
Wanpeng Li <wanpengli@...cent.com>,
Jim Mattson <jmattson@...gle.com>,
Joerg Roedel <joro@...tes.org>,
David Hildenbrand <david@...hat.com>,
kvm <kvm@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
David Matlack <dmatlack@...gle.com>,
Mingwei Zhang <mizhang@...gle.com>
Subject: Re: [PATCH v3 21/28] KVM: x86/mmu: Zap roots in two passes to avoid
inducing RCU stalls
On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> When zapping a TDP MMU root, perform the zap in two passes to avoid
> zapping an entire top-level SPTE while holding RCU, which can induce RCU
> stalls. In the first pass, zap SPTEs at PG_LEVEL_1G, and then
> zap top-level entries in the second pass.
>
> With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
> SPTEs can take up to ~7 seconds (time varies based on kernel config,
> number of (v)CPUs, etc.). With 5-level paging, that time can balloon
> well into hundreds of seconds.
>
> Before remote TLB flushes were omitted, the problem was even worse as
> waiting for all active vCPUs to respond to the IPI introduced significant
> overhead for VMs with large numbers of vCPUs.
>
> By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
> the amount of work that is done without dropping RCU protection is
> strictly bounded, with the worst case latency for a single operation
> being less than 100ms.
>
> Zapping at 1gb in the first pass is not arbitrary. First and foremost,
> KVM relies on being able to zap 1gb shadow pages in a single shot when
> replacing a shadow page with a hugepage. Zapping a 1gb shadow page
> that is fully populated with 4kb dirty SPTEs also triggers the worst case
> latency due to writing back the struct page accessed/dirty bits for each 4kb
> page, i.e. the two-pass approach is guaranteed to work so long as KVM can
> cleanly zap a 1gb shadow page.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
> softirq=15759/15759 fqs=5058
> (t=21016 jiffies g=66453 q=238577)
> NMI backtrace for cpu 52
> Call Trace:
> ...
> mark_page_accessed+0x266/0x2f0
> kvm_set_pfn_accessed+0x31/0x40
> handle_removed_tdp_mmu_page+0x259/0x2e0
> __handle_changed_spte+0x223/0x2c0
> handle_removed_tdp_mmu_page+0x1c1/0x2e0
> __handle_changed_spte+0x223/0x2c0
> handle_removed_tdp_mmu_page+0x1c1/0x2e0
> __handle_changed_spte+0x223/0x2c0
> zap_gfn_range+0x141/0x3b0
> kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
> kvm_mmu_zap_all_fast+0x121/0x190
> kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
> kvm_page_track_flush_slot+0x5c/0x80
> kvm_arch_flush_shadow_memslot+0xe/0x10
> kvm_set_memslot+0x172/0x4e0
> __kvm_set_memory_region+0x337/0x590
> kvm_vm_ioctl+0x49c/0xf80
>
> Reported-by: David Matlack <dmatlack@...gle.com>
> Cc: Ben Gardon <bgardon@...gle.com>
> Cc: Mingwei Zhang <mizhang@...gle.com>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
Reviewed-by: Ben Gardon <bgardon@...gle.com>
Nice. This is very well explained in the comments and commit
description. Thanks for fixing this.
> ---
> arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++++++++++++-------------
> 1 file changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b838cfa984ad..ec28a88c6376 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -802,14 +802,36 @@ static inline gfn_t tdp_mmu_max_gfn_host(void)
>  	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
>  }
>
> -static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> -			     bool shared)
> +static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +			       bool shared, int zap_level)
>  {
>  	struct tdp_iter iter;
>
>  	gfn_t end = tdp_mmu_max_gfn_host();
>  	gfn_t start = 0;
>
> +	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
> +retry:
> +		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		if (iter.level > zap_level)
> +			continue;
> +
> +		if (!shared)
> +			tdp_mmu_set_spte(kvm, &iter, 0);
> +		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
> +			goto retry;
> +	}
> +}
> +
> +static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +			     bool shared)
> +{
> +
>  	/*
>  	 * The root must have an elevated refcount so that it's reachable via
>  	 * mmu_notifier callbacks, which allows this path to yield and drop
> @@ -827,22 +849,17 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	rcu_read_lock();
>
>  	/*
> -	 * No need to try to step down in the iterator when zapping an entire
> -	 * root, zapping an upper-level SPTE will recurse on its children.
> +	 * To avoid RCU stalls due to recursively removing huge swaths of SPs,
> +	 * split the zap into two passes.  On the first pass, zap at the 1gb
> +	 * level, and then zap top-level SPs on the second pass.  "1gb" is not
> +	 * arbitrary, as KVM must be able to zap a 1gb shadow page without
> +	 * inducing a stall to allow in-place replacement with a 1gb hugepage.
> +	 *
> +	 * Because zapping a SP recurses on its children, stepping down to
> +	 * PG_LEVEL_4K in the iterator itself is unnecessary.
>  	 */
> -	for_each_tdp_pte_min_level(iter, root, root->role.level, start, end) {
> -retry:
> -		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> -			continue;
> -
> -		if (!is_shadow_present_pte(iter.old_spte))
> -			continue;
> -
> -		if (!shared)
> -			tdp_mmu_set_spte(kvm, &iter, 0);
> -		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
> -			goto retry;
> -	}
> +	__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);
> +	__tdp_mmu_zap_root(kvm, root, shared, root->role.level);
>
>  	rcu_read_unlock();
>  }
> --
> 2.35.1.574.g5d30c73bfb-goog
>