Message-ID: <CAOUHufbAKpv95k6rVedstjD_7JzP0RrbOD652gyZh2vbAjGPOg@mail.gmail.com>
Date: Thu, 23 Feb 2023 11:08:21 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Paolo Bonzini <pbonzini@...hat.com>,
Jonathan Corbet <corbet@....net>,
Michael Larabel <michael@...haellarabel.com>,
kvmarm@...ts.linux.dev, kvm@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, linuxppc-dev@...ts.ozlabs.org, x86@...nel.org,
linux-mm@...gle.com
Subject: Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Thu, Feb 16, 2023, Yu Zhao wrote:
> > An existing selftest can quickly demonstrate the effectiveness of this
> > patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM:
>
> Not my area of maintenance, but a non-existent changelog (for all intents and
> purposes) for a change of this size and complexity is not acceptable.
Will fix.
> > $ sudo max_guest_memory_test -c 64 -m 250 -s 250
> >
> >   MGLRU      run2
> >   ---------------
> >   Before    ~600s
> >   After      ~50s
> >   Off       ~250s
> >
> >   kswapd (MGLRU before)
> >     100.00% balance_pgdat
> >       100.00% shrink_node
> >         100.00% shrink_one
> >           99.97% try_to_shrink_lruvec
> >             99.06% evict_folios
> >               97.41% shrink_folio_list
> >                 31.33% folio_referenced
> >                   31.06% rmap_walk_file
> >                     30.89% folio_referenced_one
> >                       20.83% __mmu_notifier_clear_flush_young
> >                         20.54% kvm_mmu_notifier_clear_flush_young
> >   =>                      19.34% _raw_write_lock
> >
> >   kswapd (MGLRU after)
> >     100.00% balance_pgdat
> >       100.00% shrink_node
> >         100.00% shrink_one
> >           99.97% try_to_shrink_lruvec
> >             99.51% evict_folios
> >               71.70% shrink_folio_list
> >                 7.08% folio_referenced
> >                   6.78% rmap_walk_file
> >                     6.72% folio_referenced_one
> >                       5.60% lru_gen_look_around
> >   =>                    1.53% __mmu_notifier_test_clear_young
>
> Do you happen to know how much of the improvement is due to batching, and how
> much is due to using a walkless walk?
No. I have three benchmarks running at the moment:
1. Windows SQL Server guest on x86 host,
2. Apache Spark guest on arm64 host, and
3. Memcached guest on ppc64 host.

If you are really interested in that breakdown, I can reprioritize -- I
would need to stop 1) and use that machine to get the numbers for you.
> > @@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
> > if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
> > caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
> >
> > + if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
> > + caps |= BIT(LRU_GEN_SPTE_WALK);
>
> As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
> a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
> I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or
> if we want to avoid batching when there are no mmu_notifier listeners, probe
> mmu_notifiers. But don't call into KVM directly.
I'm not sure I fully understand. Let me lay out the problem from the MM
side: even assuming KVM supports lockless walks, batching can still be
worse (though very unlikely), because the GFNs in a batch can exhibit
no memory locality at all. So this option lets userspace disable
batching.
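
To make that concrete, the gate boils down to something like the sketch
below (illustrative, not the exact patch; get_cap() and
kvm_arch_has_test_clear_young() are the names from the diff above, and
the helper name here is made up):

  /* Batch SPTE walks only if the arch supports lockless walks and
   * userspace has left the capability enabled. */
  static bool should_walk_sptes(void)
  {
          return kvm_arch_has_test_clear_young() &&
                 get_cap(LRU_GEN_SPTE_WALK);
  }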
I fully understand why you don't want MM to call into KVM directly. Is
there any acceptable way to set up a clean interface between MM and KVM
other than the MMU notifier?
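
If probing is the way to go, I assume you mean something along these
lines (mm_has_notifiers() is the existing helper in
include/linux/mmu_notifier.h; the call site below is illustrative):

  /* Skip batching when nobody is subscribed to this mm. */
  if (!mm_has_notifiers(mm))
          return false;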