Message-ID: <2025072202-june-cable-d658@gregkh>
Date: Tue, 22 Jul 2025 07:45:06 +0200
From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To: Jia He <justin.he@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Danilo Krummrich <dakr@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node
mapping to reduce max_distance
On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> pcpu_embed_first_chunk() allocates the first percpu chunk via
> pcpu_fc_alloc(); the chunk is used as-is, without being mapped into the
> vmalloc area. On NUMA systems, this can lead to a sparse CPU->unit
> mapping, resulting in a large physical address span (max_distance) and
> excessive vmalloc space requirements.
Why is the subject line "mm: percpu:" when this is driver-core code?
And if it is mm code, please cc: the mm maintainers and list.
> For example, on an arm64 N2 server with 256 CPUs, the memory layout
> includes:
> [ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
>
> With the following NUMA distance matrix:
> node distances:
> node 0 1 2 3
> 0: 10 16 22 22
> 1: 16 10 22 22
> 2: 22 22 10 16
> 3: 22 22 16 10
>
> In this configuration, pcpu_embed_first_chunk() computes a large
> max_distance:
> percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
>
> As a result, the allocator falls back to pcpu_page_first_chunk(), which
> uses page-by-page allocation with nr_groups = 1, leading to degraded
> performance.
But that's intentional, you don't want to go across the nodes, right?
> This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
> the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),
Why? What is this going to now break on those systems that assumed that
those were NOT local?
> allowing CPUs from nearby nodes to be grouped together. Consequently,
> nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> allocate memory from a common node.
>
> For example:
> - cpu0 belongs to node 0
> - cpu64 belongs to node 1
> Both CPUs are considered local and will allocate memory from node 0.
> This normalization reduces max_distance:
> percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
>
> In addition, add a flag _need_norm_ to indicate that normalization is
> needed, i.e. when cpu_to_norm_node_map[] differs from cpu_to_node_map[].
>
> Signed-off-by: Jia He <justin.he@....com>
I think this needs a lot of testing and verification and acks from
maintainers of other arches that can say "this also works for us" before
we can take it, as it has the potential to make major changes to
systems.
What did you test this on?
> ---
> drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> index c99f2ab105e5..f746d88239e9 100644
> --- a/drivers/base/arch_numa.c
> +++ b/drivers/base/arch_numa.c
> @@ -17,6 +17,8 @@
> #include <asm/sections.h>
>
> static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static bool need_norm;
Shouldn't these be marked __initdata as you don't touch them afterward?
thanks,
greg k-h