Message-Id: <20250722041418.2024870-1-justin.he@arm.com>
Date: Tue, 22 Jul 2025 04:14:18 +0000
From: Jia He <justin.he@....com>
To: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	"Rafael J. Wysocki" <rafael@...nel.org>,
	Danilo Krummrich <dakr@...nel.org>
Cc: linux-kernel@...r.kernel.org,
	Jia He <justin.he@....com>
Subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance

pcpu_embed_first_chunk() allocates the first percpu chunk via
pcpu_fc_alloc() and uses it as-is, without remapping it into the vmalloc
area. On NUMA systems, this can lead to a sparse CPU->unit mapping,
resulting in a large physical address span (max_distance) and excessive
vmalloc space requirements.

For example, on an arm64 N2 server with 256 CPUs, the memory layout
includes:
[    0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]

With the following NUMA distance matrix:
node distances:
node   0   1   2   3
  0:  10  16  22  22
  1:  16  10  22  22
  2:  22  22  10  16
  3:  22  22  16  10

In this configuration, pcpu_embed_first_chunk() computes a large
max_distance:
percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000

As a result, the allocator falls back to pcpu_page_first_chunk(), which
uses page-by-page allocation with nr_groups = 1, leading to degraded
performance.

This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),
allowing CPUs from nearby nodes to be grouped together. Consequently,
nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
allocate memory from a common node.

For example:
- cpu0 belongs to node 0
- cpu64 belongs to node 1
Both CPUs are considered local and will allocate memory from node 0.
This normalization reduces max_distance:
percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000

In addition, add a need_norm flag to indicate that normalization is
required, i.e. it is set only when cpu_to_norm_node_map[] would differ
from cpu_to_node_map[].

Signed-off-by: Jia He <justin.he@....com>
---
 drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..f746d88239e9 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -17,6 +17,8 @@
 #include <asm/sections.h>
 
 static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static bool need_norm;
 
 bool numa_off;
 
@@ -149,9 +151,40 @@ int early_cpu_to_node(int cpu)
 	return cpu_to_node_map[cpu];
 }
 
+int __init early_cpu_to_norm_node(int cpu)
+{
+	return cpu_to_norm_node_map[cpu];
+}
+
 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-	return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+	int distance = node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+
+	if (distance > LOCAL_DISTANCE && distance < REMOTE_DISTANCE && !need_norm)
+		need_norm = true;
+
+	return distance;
+}
+
+static int __init pcpu_cpu_norm_distance(unsigned int from, unsigned int to)
+{
+	int distance = pcpu_cpu_distance(from, to);
+
+	if (distance >= REMOTE_DISTANCE)
+		return REMOTE_DISTANCE;
+
+	/*
+	 * If the distance is in the range [LOCAL_DISTANCE, REMOTE_DISTANCE),
+	 * normalize the node map, choose the first local numa node id as its
+	 * normalized node id.
+	 */
+	if (cpu_to_norm_node_map[from] == NUMA_NO_NODE)
+		cpu_to_norm_node_map[from] = cpu_to_node_map[from];
+
+	if (cpu_to_norm_node_map[to] == NUMA_NO_NODE)
+		cpu_to_norm_node_map[to] = cpu_to_norm_node_map[from];
+
+	return LOCAL_DISTANCE;
 }
 
 void __init setup_per_cpu_areas(void)
@@ -169,6 +202,18 @@ void __init setup_per_cpu_areas(void)
 					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
 					    pcpu_cpu_distance,
 					    early_cpu_to_node);
+
+		if (rc < 0 && need_norm) {
+			/* Try the normalized node distance again */
+			pr_info("PERCPU: %s allocator, trying the normalization mode\n",
+				   pcpu_fc_names[pcpu_chosen_fc]);
+
+			rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+						    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+						    pcpu_cpu_norm_distance,
+						    early_cpu_to_norm_node);
+		}
+
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
 		if (rc < 0)
 			pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
-- 
2.34.1

