Message-ID: <20241109145628.112617-1-arighi@nvidia.com>
Date: Sat,  9 Nov 2024 15:56:28 +0100
From: Andrea Righi <arighi@...dia.com>
To: Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Juri Lelli <juri.lelli@...hat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Tejun Heo <tj@...nel.org>,
	David Vernet <void@...ifault.com>,
	linux-kernel@...r.kernel.org
Subject: [PATCH] sched/topology: Correctly propagate NUMA flag to scheduling domains

A scheduling domain can degenerate a parent NUMA domain when both span
exactly the same CPUs. The parent is then destroyed, but the child does
not inherit its SD_NUMA flag.

As a result, the only remaining NUMA domain may include all CPUs, even
when the CPUs are spread across multiple NUMA nodes, which can lead to
sub-optimal scheduling decisions.
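
The effect can be sketched with a small userspace model. This is only an
illustration under stated assumptions, not the kernel code: the toy_*
names, the flag values and the four-level hierarchy are hypothetical,
and a parent is treated as degenerate purely when its span equals the
child's span, whereas the kernel's check has further conditions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value for this toy model (not the kernel's). */
#define TOY_SD_NUMA 0x2u

struct toy_domain {
	uint32_t span;             /* bit i set => CPU i is in the domain */
	unsigned int flags;
	struct toy_domain *parent; /* next larger domain, or NULL */
};

/*
 * Walk up from the base domain and unlink every parent whose span equals
 * its child's span. When transfer_numa is non-zero, the child inherits
 * TOY_SD_NUMA from the parent that is being removed.
 */
static void toy_collapse(struct toy_domain *sd, int transfer_numa)
{
	struct toy_domain *tmp = sd;

	while (tmp && tmp->parent) {
		struct toy_domain *parent = tmp->parent;

		if (parent->span == tmp->span) {
			tmp->parent = parent->parent;
			if (transfer_numa && (parent->flags & TOY_SD_NUMA))
				tmp->flags |= TOY_SD_NUMA;
		} else {
			tmp = parent;
		}
	}
}

/* Return the smallest domain carrying flag, or NULL if there is none. */
static struct toy_domain *toy_lowest_flag_domain(struct toy_domain *sd,
						 unsigned int flag)
{
	for (; sd; sd = sd->parent)
		if (sd->flags & flag)
			return sd;
	return NULL;
}

/*
 * Build an invented hierarchy for CPU 0 (0-3, 0-7, 0-7 with NUMA,
 * 0-15 with NUMA), collapse it, and return the span of the lowest
 * domain that still carries TOY_SD_NUMA.
 */
static uint32_t toy_numa_span(int transfer_numa)
{
	struct toy_domain top  = { 0xffffu, TOY_SD_NUMA, NULL };
	struct toy_domain node = { 0x00ffu, TOY_SD_NUMA, &top };
	struct toy_domain mid  = { 0x00ffu, 0, &node };
	struct toy_domain llc  = { 0x000fu, 0, &mid };
	struct toy_domain *sd;

	toy_collapse(&llc, transfer_numa);
	sd = toy_lowest_flag_domain(&llc, TOY_SD_NUMA);
	return sd ? sd->span : 0;
}
```

In this model toy_numa_span(0) yields a mask covering CPUs 0-15, while
toy_numa_span(1) yields a mask covering CPUs 0-7 only, because the
surviving child keeps the flag of the removed parent.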

Example:

$ vng -v --cpu 16,sockets=4,cores=2,threads=2 \
      -m 4G --numa 2G,cpus=0-7 --numa 2G,cpus=8-15
 ...
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0          yes
  1    0      0    0 0:0:0:0          yes
  2    0      0    1 1:1:1:0          yes
  3    0      0    1 1:1:1:0          yes
  4    0      1    2 2:2:2:1          yes
  5    0      1    2 2:2:2:1          yes
  6    0      1    3 3:3:3:1          yes
  7    0      1    3 3:3:3:1          yes
  8    1      2    4 4:4:4:2          yes
  9    1      2    4 4:4:4:2          yes
 10    1      2    5 5:5:5:2          yes
 11    1      2    5 5:5:5:2          yes
 12    1      3    6 6:6:6:3          yes
 13    1      3    6 6:6:6:3          yes
 14    1      3    7 7:7:7:3          yes
 15    1      3    7 7:7:7:3          yes

Without this change:
  sd_llc[cpu0] spans cpus=0-3
  sd_numa[cpu0] spans cpus=0-15
  ...
  sd_llc[cpu15] spans cpus=12-15
  sd_numa[cpu15] spans cpus=0-15

With this change:
  sd_llc[cpu0] spans cpus=0-3
  sd_numa[cpu0] spans cpus=0-7
  ...
  sd_llc[cpu15] spans cpus=12-15
  sd_numa[cpu15] spans cpus=8-15
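
The expected spans can be cross-checked against the lscpu output above
with a short userspace sketch. The ex_* names are hypothetical; the two
tables are copied from the NODE column and the L3 field of that output,
and it is assumed here that the LLC span corresponds to CPUs sharing an
L3 id.

```c
#include <stdint.h>

#define EX_NR_CPUS 16

/* NODE column of the lscpu -e output, indexed by CPU. */
static const int ex_node[EX_NR_CPUS] = {
	0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* L3 field of the L1d:L1i:L2:L3 column, indexed by CPU. */
static const int ex_l3[EX_NR_CPUS] = {
	0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
};

/* Bitmask of all CPUs that share the same id as cpu in the ids table. */
static uint32_t ex_span(const int *ids, int cpu)
{
	uint32_t mask = 0;
	int i;

	for (i = 0; i < EX_NR_CPUS; i++)
		if (ids[i] == ids[cpu])
			mask |= UINT32_C(1) << i;
	return mask;
}
```

With these tables, ex_span(ex_l3, 0) covers CPUs 0-3 and
ex_span(ex_node, 0) covers CPUs 0-7, matching the sd_llc and sd_numa
spans reported with this change.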

This also allows re-using sd_numa from the sched_ext built-in CPU idle
selection policy, instead of relying on the NUMA cpumasks [1].

[1] https://lore.kernel.org/lkml/20241108000136.184909-1-arighi@nvidia.com/

Signed-off-by: Andrea Righi <arighi@...dia.com>
---
 kernel/sched/topology.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..e0fe493b7ae0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -755,6 +755,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			 */
 			if (parent->flags & SD_PREFER_SIBLING)
 				tmp->flags |= SD_PREFER_SIBLING;
+			/*
+			 * Transfer SD_NUMA to the child in case of a
+			 * degenerate NUMA parent.
+			 */
+			if (parent->flags & SD_NUMA)
+				tmp->flags |= SD_NUMA;
+
 			destroy_sched_domain(parent);
 		} else
 			tmp = tmp->parent;
@@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
 	 */
 	tl[i++] = (struct sched_domain_topology_level){
 		.mask = sd_numa_mask,
+		.sd_flags = cpu_numa_flags,
 		.numa_level = 0,
 		SD_INIT_NAME(NODE)
 	};
-- 
2.47.0