Message-ID: <20241109145628.112617-1-arighi@nvidia.com>
Date: Sat, 9 Nov 2024 15:56:28 +0100
From: Andrea Righi <arighi@...dia.com>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>,
linux-kernel@...r.kernel.org
Subject: [PATCH] sched/topology: Correctly propagate NUMA flag to scheduling domains
When a parent NUMA domain spans exactly the same CPUs as its child, the
parent is degenerate and gets removed, but the child does not inherit
its SD_NUMA flag.
This can leave a single NUMA domain that includes all CPUs, even when
the CPUs are spread across multiple NUMA nodes, which may lead to
sub-optimal scheduling decisions.
Example:
$ vng -v --cpu 16,sockets=4,cores=2,threads=2 \
-m 4G --numa 2G,cpus=0-7 --numa 2G,cpus=8-15
...
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 0 0:0:0:0 yes
2 0 0 1 1:1:1:0 yes
3 0 0 1 1:1:1:0 yes
4 0 1 2 2:2:2:1 yes
5 0 1 2 2:2:2:1 yes
6 0 1 3 3:3:3:1 yes
7 0 1 3 3:3:3:1 yes
8 1 2 4 4:4:4:2 yes
9 1 2 4 4:4:4:2 yes
10 1 2 5 5:5:5:2 yes
11 1 2 5 5:5:5:2 yes
12 1 3 6 6:6:6:3 yes
13 1 3 6 6:6:6:3 yes
14 1 3 7 7:7:7:3 yes
15 1 3 7 7:7:7:3 yes
Without this change:
sd_llc[cpu0] spans cpus=0-3
sd_numa[cpu0] spans cpus=0-15
...
sd_llc[cpu15] spans cpus=12-15
sd_numa[cpu15] spans cpus=0-15
With this change:
sd_llc[cpu0] spans cpus=0-3
sd_numa[cpu0] spans cpus=0-7
...
sd_llc[cpu15] spans cpus=12-15
sd_numa[cpu15] spans cpus=8-15
This also allows the sched_ext built-in CPU idle selection policy to
reuse sd_numa instead of relying on the NUMA cpumasks [1].
[1] https://lore.kernel.org/lkml/20241108000136.184909-1-arighi@nvidia.com/
Signed-off-by: Andrea Righi <arighi@...dia.com>
---
kernel/sched/topology.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..e0fe493b7ae0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -755,6 +755,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			 */
 			if (parent->flags & SD_PREFER_SIBLING)
 				tmp->flags |= SD_PREFER_SIBLING;
+			/*
+			 * Transfer SD_NUMA to the child in case of a
+			 * degenerate NUMA parent.
+			 */
+			if (parent->flags & SD_NUMA)
+				tmp->flags |= SD_NUMA;
+
 			destroy_sched_domain(parent);
 		} else
 			tmp = tmp->parent;
@@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
 	 */
 	tl[i++] = (struct sched_domain_topology_level){
 		.mask = sd_numa_mask,
+		.sd_flags = cpu_numa_flags,
 		.numa_level = 0,
 		SD_INIT_NAME(NODE)
 	};
--
2.47.0