Message-ID: <YuhoJUoPBOu5eMz8@google.com>
Date: Mon, 1 Aug 2022 23:56:21 +0000
From: Sean Christopherson <seanjc@...gle.com>
To: David Matlack <dmatlack@...gle.com>
Cc: Vipin Sharma <vipinsh@...gle.com>, pbonzini@...hat.com,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] KVM: x86/mmu: Make page tables for eager page splitting
NUMA aware
On Mon, Aug 01, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.
Userspace can solve this by setting the NUMA policy on a VMA or shared-object
basis. E.g. create dedicated memslots for each NUMA node, then bind each of the
backing stores to the appropriate host node.
If there is a gap, e.g. a backing store we want to use doesn't properly support
mempolicy for shared mappings, then we should enhance the backing store.
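E.g. with plain anonymous memory as the backing store, the userspace side could
look something like the below.  This is only an untested sketch; the helper
names, slot numbering, and GPA layout are made up for illustration, and error
handling is omitted.

  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>
  #include <numaif.h>           /* mbind(), MPOL_BIND; link with -lnuma */

  /* Back one slice of guest memory with pages from a specific host node. */
  static void *alloc_backing_on_node(size_t size, int node)
  {
          unsigned long nodemask = 1UL << node;
          void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* Bind the VMA so faults on this range allocate from 'node'. */
          mbind(mem, size, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0);
          return mem;
  }

  /* One memslot per host node, e.g. slot 0 => node 0, slot 1 => node 1. */
  static void add_node_memslot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
                               int node)
  {
          void *hva = alloc_backing_on_node(size, node);
          struct kvm_userspace_memory_region region = {
                  .slot = slot,
                  .guest_phys_addr = gpa,
                  .memory_size = size,
                  .userspace_addr = (unsigned long)hva,
          };

          ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }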
> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> > huge page, allocate the new lower level page tables on the same node as the
> > huge page.
> >
> > __get_free_page() is replaced by alloc_pages_node(). This introduces two
> > functional changes.
> >
> > 1. __get_free_page() removes the gfp flag __GFP_HIGHMEM via a call to
> > __get_free_pages(). This should not be an issue as the __GFP_HIGHMEM flag
> > is not passed in tdp_mmu_alloc_sp_for_split() anyway.
> >
> > 2. __get_free_page() calls alloc_pages(), which uses the thread's
> > mempolicy for the NUMA node allocation. With this commit, the thread's
> > mempolicy will not be used; the first preference will be to allocate on
> > the node where the huge page is present.
>
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
>
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables.
As above, if userspace cares about vNUMA, then it already needs to be aware of
some KVM/kernel details. Separate memslots aren't strictly necessary, e.g.
userspace could stitch together contiguous VMAs to create a single mega-memslot,
but that seems like it'd be more work than just creating separate memslots.
And because eager page splitting for dirty logging runs with mmu_lock held for
read, userspace might also benefit from per-node memslots as it can do the
splitting on multiple tasks/CPUs.
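E.g. with per-node memslots, toggling KVM_MEM_LOG_DIRTY_PAGES on each slot from
its own thread would let the splits for the different nodes run concurrently.
Again just a rough, untested sketch; the struct and the threading harness are
made up:

  #include <pthread.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  struct slot_info {
          int vm_fd;
          struct kvm_userspace_memory_region region;   /* as registered */
  };

  /* Re-register one memslot with dirty logging enabled.  The eager split for
   * this slot (if enabled) happens inside the ioctl, so one thread per
   * (per-node) slot splits each node's huge pages in parallel. */
  static void *enable_dirty_logging(void *arg)
  {
          struct slot_info *si = arg;

          si->region.flags |= KVM_MEM_LOG_DIRTY_PAGES;
          ioctl(si->vm_fd, KVM_SET_USER_MEMORY_REGION, &si->region);
          return NULL;
  }

  /* e.g. pthread_create(&tids[i], NULL, enable_dirty_logging, &slots[i]); */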
Regardless of what we do, the behavior needs to be documented, i.e. KVM details will
bleed into userspace. E.g. if KVM is overriding the per-task NUMA policy, then
that should be documented.
> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for userspace to
> prefer a different policy.
I can't think of a reason why userspace would want to have a different policy for
the task that's enabling dirty logging, but I also can't think of a reason why
KVM should go out of its way to ignore that policy.
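For reference, the per-task alternative David mentions would amount to a couple
of lines in whatever task issues the memslot update, something like the below
(untested, helper name made up):

  #include <numaif.h>           /* set_mempolicy(), MPOL_PREFERRED */

  static void prefer_node_for_eager_split(int node)
  {
          unsigned long nodemask = 1UL << node;

          /* Per-task policy; the page table allocations KVM does while
           * handling the subsequent memslot update inherit it. */
          set_mempolicy(MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask));
  }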
IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea to
document how to effectively configure vNUMA-aware memslots.