linux-kernel - Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables during page fault

Message-ID: <CALzav=cKY37njK0=jsw7fiUsqLWVw0ir0LEYr6O=R+NPk2nVHw@mail.gmail.com>
Date:   Tue, 28 Mar 2023 17:28:08 -0700
From:   David Matlack <dmatlack@...gle.com>
To:     Vipin Sharma <vipinsh@...gle.com>
Cc:     seanjc@...gle.com, pbonzini@...hat.com, bgardon@...gle.com,
        jmattson@...gle.com, mizhang@...gle.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [Patch v4 16/18] KVM: x86/mmu: Allocate numa aware page tables
 during page fault

On Tue, Mar 28, 2023 at 5:21 PM David Matlack <dmatlack@...gle.com> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:25PM -0800, Vipin Sharma wrote:
> > Allocate page tables on the preferred NUMA node via memory cache during
> > page faults. If memory cache doesn't have a preferred NUMA node (node
> > value is set to NUMA_NO_NODE) then fall back to the default logic where
> > pages are selected based on thread's mempolicy. Also, free NUMA aware
> > page caches, mmu_shadow_page_cache, when memory shrinker is invoked.
> >
> > Allocate root pages based on the current thread's NUMA node as there is
> > no way to know which will be the ideal NUMA node in the long run.
> >
> > This commit allocates page tables to be on the same NUMA node as the
> > physical page pointed by them, even if a vCPU causing page fault is on a
> > different NUMA node. If memory is not available on the requested NUMA
> > node then the other nearest NUMA node is selected by default. NUMA aware
> > page tables can be beneficial in cases where a thread touches a lot of far
> > memory initially and then divides work among multiple threads. VMs
> > generally take advantage of NUMA architecture for faster memory access
> > by moving threads to the NUMA node of the memory they are accessing.
> > This change will help them in accessing pages faster.
> >
> > Downside of this change is that an experimental workload can be created
> > where guest threads are always accessing remote memory and not the one
> > local to them. This will cause performance to degrade compared to VMs
> > where numa aware page tables are not enabled. Ideally, these VMs when
> > using non-uniform memory access machine should generally be taking
> > advantage of NUMA architecture to improve their performance in the first
> > place.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@...gle.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 +-
> >  arch/x86/kvm/mmu/mmu.c          | 63 ++++++++++++++++++++++++---------
> >  arch/x86/kvm/mmu/mmu_internal.h | 24 ++++++++++++-
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  4 +--
> >  arch/x86/kvm/mmu/tdp_mmu.c      | 14 +++++---
> >  include/linux/kvm_types.h       |  6 ++++
> >  virt/kvm/kvm_main.c             |  2 +-
> >  7 files changed, 88 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 64de083cd6b9..77d3aa368e5e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -787,7 +787,7 @@ struct kvm_vcpu_arch {
> >       struct kvm_mmu *walk_mmu;
> >
> >       struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > -     struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > +     struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
>
> I think we need an abstraction for a NUMA-aware mmu cache, since there
> is more than one by the end of this series.
>
> e.g. A wrapper struct (struct kvm_mmu_numa_memory_cache) or make
> NUMA-awareness an optional feature within kvm_mmu_memory_cache, plus
> common helper functions for operations like initializing, topping-up,
> and freeing.
>
> I have some ideas I want to try but I ran out of time today.
>
> >       struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> >       struct kvm_mmu_memory_cache mmu_page_header_cache;
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d96afc849ee8..86f0d74d35ed 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -702,7 +702,7 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache)
> >
> >  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >  {
> > -     int r;
> > +     int r, nid = KVM_MMU_DEFAULT_CACHE_INDEX;
> >
> >       /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> >       r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > @@ -710,7 +710,16 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> >       if (r)
> >               return r;
> >
> > -     r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL);
> > +     if (kvm_numa_aware_page_table_enabled(vcpu->kvm)) {
> > +             for_each_online_node(nid) {
>
> Blegh. This is going to potentially waste a lot of memory. Yes the
> shrinker can free it, but the next fault will re-allocate all the online
> node caches.
>
> The reason we have to top-up all nodes is because KVM tops up caches
> before faulting in the PFN, and there is concern that changing this will
> increase the rate of guest page-fault retries [1].
>
> I think we should revisit that concern. Can we do any testing to
> validate that hypothesis? Or can we convince ourselves that re-ordering
> is ok?
>
> [1] https://lore.kernel.org/kvm/CAHVum0cjqsdG2NEjRF3ZRtUY2t2=Tb9H4OyOz9wpmsrN--Sjhg@mail.gmail.com/

Ah, I forgot about patch 18 reducing the default cache size. So at the
end of this series, even with topping up every node, the maximum
number of objects per cache will be 4 * num_online_nodes. That means
only hosts with more than 10 online NUMA nodes would have larger caches
than today (40). That seems more reasonable to me.
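
To make the arithmetic concrete, here is a minimal sketch of the comparison. The constants 4 and 40 come from the numbers above; the macro and function names are hypothetical and exist only for illustration, they are not identifiers from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the message above; the names are hypothetical. */
#define OBJS_PER_NODE_CACHE 4   /* per-node cache capacity after patch 18 */
#define CURRENT_CACHE_OBJS  40  /* capacity of the single cache today */

/* Worst-case object count when every online node's cache is topped up. */
unsigned int numa_cache_max_objs(unsigned int num_online_nodes)
{
	assert(num_online_nodes > 0);
	return OBJS_PER_NODE_CACHE * num_online_nodes;
}

/* True only when the per-node caches together exceed today's capacity. */
bool exceeds_current_cache(unsigned int num_online_nodes)
{
	return numa_cache_max_objs(num_online_nodes) > CURRENT_CACHE_OBJS;
}
```

With these assumptions, a 2-node host holds at most 8 objects, a 10-node host breaks even at 40, and only an 11-node host or larger ends up above the current size.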