Message-ID: <2025072202-june-cable-d658@gregkh>
Date: Tue, 22 Jul 2025 07:45:06 +0200
From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To: Jia He <justin.he@....com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
	Danilo Krummrich <dakr@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node
 mapping to reduce max_distance

On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> pcpu_embed_first_chunk() allocates the first percpu chunk via
> pcpu_fc_alloc() and uses it as-is, without mapping it into the vmalloc
> area. On NUMA systems, this can lead to a sparse CPU->unit mapping,
> resulting in a large physical address span (max_distance) and excessive
> vmalloc space requirements.

Why is the subject line "mm: percpu:" when this is driver-core code?

And if it is mm code, please cc: the mm maintainers and list.

> For example, on an arm64 N2 server with 256 CPUs, the memory layout
> includes:
> [    0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> [    0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
> 
> With the following NUMA distance matrix:
> node distances:
> node   0   1   2   3
>   0:  10  16  22  22
>   1:  16  10  22  22
>   2:  22  22  10  16
>   3:  22  22  16  10
> 
> In this configuration, pcpu_embed_first_chunk() computes a large
> max_distance:
> percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
> 
> As a result, the allocator falls back to pcpu_page_first_chunk(), which
> uses page-by-page allocation with nr_groups = 1, leading to degraded
> performance.

But that's intentional; you don't want to go across the nodes, right?

> This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
> the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),

Why?  What is this now going to break on systems that assumed those
distances were NOT local?

> allowing CPUs from nearby nodes to be grouped together. Consequently,
> nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> allocate memory from a common node.
> 
> For example:
> - cpu0 belongs to node 0
> - cpu64 belongs to node 1
> Both CPUs are considered local and will allocate memory from node 0.
> This normalization reduces max_distance:
> percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
> 
> In addition, add a flag _need_norm_ to indicate that normalization is
> needed, i.e. when cpu_to_norm_node_map[] differs from cpu_to_node_map[].
> 
> Signed-off-by: Jia He <justin.he@....com>
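
Just to make sure I am reading the description right: the normalization
above amounts to something like the sketch below, correct?  (The function
name and the hard-coded threshold of 16 are mine, for illustration only,
not taken from the patch.)

	static void __init normalize_cpu_node_map(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			int nid = cpu_to_node_map[cpu], n;

			/* collapse onto the lowest-numbered "near" node */
			for_each_online_node(n) {
				if (node_distance(nid, n) <= 16) { /* 10/16 => local */
					nid = n;
					break;
				}
			}

			cpu_to_norm_node_map[cpu] = nid;
			if (nid != cpu_to_node_map[cpu])
				need_norm = true;
		}
	}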

I think this needs a lot of testing and verification and acks from
maintainers of other arches that can say "this also works for us" before
we can take it, as it has the potential to make major changes to
systems.

What did you test this on?


> ---
>  drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> index c99f2ab105e5..f746d88239e9 100644
> --- a/drivers/base/arch_numa.c
> +++ b/drivers/base/arch_numa.c
> @@ -17,6 +17,8 @@
>  #include <asm/sections.h>
>  
>  static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static bool need_norm;

Shouldn't these be marked __initdata as you don't touch them afterward?
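
I.e. something like this, assuming nothing looks at these after early
boot:

	static int cpu_to_norm_node_map[NR_CPUS] __initdata =
		{ [0 ... NR_CPUS-1] = NUMA_NO_NODE };
	static bool need_norm __initdata;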

thanks,

greg k-h
