Message-ID:
<AS2PR08MB978689001CDED54ABD33FAF7F75AA@AS2PR08MB9786.eurprd08.prod.outlook.com>
Date: Mon, 28 Jul 2025 02:54:42 +0000
From: Justin He <Justin.He@....com>
To: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
CC: "Rafael J. Wysocki" <rafael@...nel.org>, Danilo Krummrich
<dakr@...nel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping
to reduce max_distance
Hi Greg,
> -----Original Message-----
> From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> Sent: Tuesday, July 22, 2025 1:45 PM
> To: Justin He <Justin.He@....com>
> Cc: Rafael J. Wysocki <rafael@...nel.org>; Danilo Krummrich
> <dakr@...nel.org>; linux-kernel@...r.kernel.org
> Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node
> mapping to reduce max_distance
>
> On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> > pcpu_embed_first_chunk() allocates the first percpu chunk via
> > pcpu_fc_alloc(), which is used as-is without being mapped into the
> > vmalloc area. On NUMA systems, this can lead to a sparse CPU->unit
> > mapping, resulting in a large physical address span (max_distance)
> > and excessive vmalloc space requirements.
>
> Why is the subject line "mm: percpu:" when this is driver-core code?
>
> And if it is mm code, please cc: the mm maintainers and list please.
>
Ok, thanks.
> > For example, on an arm64 N2 server with 256 CPUs, the memory layout
> > includes:
> > [ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
> >
> > With the following NUMA distance matrix:
> > node distances:
> > node 0 1 2 3
> > 0: 10 16 22 22
> > 1: 16 10 22 22
> > 2: 22 22 10 16
> > 3: 22 22 16 10
> >
> > In this configuration, pcpu_embed_first_chunk() computes a large
> > max_distance:
> > percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
> >
> > As a result, the allocator falls back to pcpu_page_first_chunk(),
> > which uses page-by-page allocation with nr_groups = 1, leading to
> > degraded performance.
>
> But that's intentional, you don't want to go across the nodes, right?
My intention is not to spread allocations across distant nodes, but to
reduce max_distance so that the embed first chunk allocator can still be
used instead of falling back to the slower page-by-page path.
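For context, the fallback is triggered by this check in mm/percpu.c's
pcpu_embed_first_chunk() (paraphrased from my reading of the code, so
treat it as a sketch rather than an exact quote):

    /* warn if maximum distance is further than 75% of vmalloc space */
    if (max_distance > VMALLOC_TOTAL * 3 / 4) {
        pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
                max_distance, VMALLOC_TOTAL);
    #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
        /* a page-by-page fallback exists, so fail the embed attempt */
        rc = -EINVAL;
        goto out_free_areas;
    #endif
    }

Since arm64 selects NEED_PER_CPU_PAGE_FIRST_CHUNK, failing here is what
sends us down the slower pcpu_page_first_chunk() path.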
>
> > This patch introduces a normalized CPU-to-NUMA node mapping to
> > mitigate the issue. Distances of 10 and 16 are treated as local
> > (LOCAL_DISTANCE),
>
> Why? What is this going to now break on those systems that assumed that
> those were NOT local?
The normalization only affects percpu allocations, and possibly only the
dynamic ones. Other mechanisms, such as cpu_to_node_map, remain
unaffected and continue to behave as before in those contexts.
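For illustration, here is a minimal sketch of the normalization (not the
exact patch code; the distance threshold of 16 and the helper name
build_norm_map() are only for illustration):

    static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
    static bool need_norm;

    static void __init build_norm_map(void)
    {
        int cpu, nid, n;

        for_each_possible_cpu(cpu) {
            int norm;

            nid = cpu_to_node_map[cpu];
            norm = nid;

            /*
             * Pick the lowest-numbered node within "near" distance
             * (<= 16 here) as the normalized node, so CPUs on nearby
             * nodes land in the same percpu group.
             */
            for_each_online_node(n) {
                if (node_distance(n, nid) <= 16) {
                    norm = n;
                    break;
                }
            }

            cpu_to_norm_node_map[cpu] = norm;
            if (norm != nid)
                need_norm = true;
        }
    }

With the distance matrix from the commit message, nodes 0/1 normalize to
node 0 and nodes 2/3 to node 2, so nr_groups becomes 2.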
>
> > allowing CPUs from nearby nodes to be grouped together. Consequently,
> > nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> > allocate memory from a common node.
> >
> > For example:
> > - cpu0 belongs to node 0
> > - cpu64 belongs to node 1
> > Both CPUs are considered local and will allocate memory from node 0.
> > This normalization reduces max_distance:
> > percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
> >
> > In addition, add a flag, need_norm, to indicate that normalization is
> > needed, i.e. only when cpu_to_norm_node_map[] differs from
> > cpu_to_node_map[].
> >
> > Signed-off-by: Jia He <justin.he@....com>
>
> I think this needs a lot of testing and verification and acks from maintainers of
> other arches that can say "this also works for us" before we can take it, as it
> has the potential to make major changes to systems.
Ok, understood.
>
> What did you test this on?
>
Testing was conducted on an Arm64 N2 server with 256 CPUs and 64 GB of memory.
(Apologies, but I am not authorized to disclose the exact hardware specifications.)
> > ---
> >  drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 46 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> > index c99f2ab105e5..f746d88239e9 100644
> > --- a/drivers/base/arch_numa.c
> > +++ b/drivers/base/arch_numa.c
> > @@ -17,6 +17,8 @@
> > #include <asm/sections.h>
> >
> >  static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> > +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> > +static bool need_norm;
>
> Shouldn't these be marked __initdata as you don't touch them afterward?
Yes
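Something like this in v2 (a sketch, assuming nothing reads the maps
after early init):

    static int cpu_to_norm_node_map[NR_CPUS] __initdata = {
        [0 ... NR_CPUS-1] = NUMA_NO_NODE
    };
    static bool need_norm __initdata;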
---
Cheers,
Justin He(Jia He)