Message-Id: <20250113073050.2811925-4-zhouchuyi@bytedance.com>
Date: Mon, 13 Jan 2025 15:30:50 +0800
From: Chuyi Zhou <zhouchuyi@...edance.com>
To: mingo@...hat.com,
	peterz@...radead.org,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	dietmar.eggemann@....com,
	rostedt@...dmis.org,
	bsegall@...gle.com,
	mgorman@...e.de,
	vschneid@...hat.com,
	longman@...hat.com,
	riel@...riel.com
Cc: chengming.zhou@...ux.dev,
	kprateek.nayak@....com,
	linux-kernel@...r.kernel.org,
	Chuyi Zhou <zhouchuyi@...edance.com>
Subject: [PATCH v3 3/3] sched/fair: Take sched_domain into account in task_numa_migrate

When we attempt to migrate a task in task_numa_migrate(), we need to
take the scheduling domain into account. Specifically:

When searching for the best_cpu, we should skip CPUs that are not in the
current scheduling domain, such as isolated CPUs. Currently we only
search for suitable CPUs in p->cpus_ptr, but that is not sufficient.
Cpuset-configured partitions are always reflected in each member task's
cpumask, whereas with the isolcpus= kernel command line option the
isolated CPUs are simply omitted from the sched_domains without any
further restriction on the tasks' cpumasks. If a task's cpumask includes
isolated CPUs, the task may be migrated to an isolated CPU.
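
For illustration only (not part of this patch; numa_cpu_allowed() is a
hypothetical helper used to make the gap concrete), a CPU can pass the
p->cpus_ptr affinity check and still lie outside the root domain span
when it was excluded via isolcpus=:

  /*
   * Hypothetical sketch: affinity alone is not enough, because
   * isolcpus= leaves p->cpus_ptr untouched and only removes the
   * CPU from its root domain's span.
   */
  static bool numa_cpu_allowed(struct task_struct *p, int cpu)
  {
  	return cpumask_test_cpu(cpu, p->cpus_ptr) &&
  	       cpumask_test_cpu(cpu, cpu_rq(task_cpu(p))->rd->span);
  }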

In update_numa_stats(), skip CPUs that are not in the scheduling domain
as well. update_numa_stats() should stay consistent with regular load
balancing, so CPUs that do not participate in load balancing, such as
isolated CPUs, must also be skipped, as restated below.
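
Concretely (a restatement of the first hunk below, assuming src_cpu's
root domain describes the load-balancing scope), the candidate mask
changes from the full node mask to its intersection with the span:

  /* Before: all CPUs of the node, including isolated ones. */
  cpumask_copy(env->cpus, cpumask_of_node(nid));

  /* After: only CPUs that take part in load balancing remain. */
  cpumask_and(env->cpus, cpu_rq(env->src_cpu)->rd->span,
  	      cpumask_of_node(nid));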

Fix this by taking src_rq->rd->span into account in task_numa_migrate().
Note that src_cpu itself may sit in an isolated domain too; its rd then
points to def_root_domain, so the span may not be what we expect. In
that case, bail out early by checking whether sd_numa is NULL.
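
For reference, the bail-out reuses the existing !sd check in
task_numa_migrate(); a condensed sketch of that flow (no new logic,
unrelated lines elided with ...):

  sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
  ...
  /*
   * An isolated src_cpu is attached to def_root_domain and has no
   * NUMA domain, so sd is NULL here and we never consult a
   * misleading rd->span.
   */
  if (unlikely(!sd)) {
  	sched_setnuma(p, task_node(p));
  	return;
  }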

Signed-off-by: Chuyi Zhou <zhouchuyi@...edance.com>
---
 kernel/sched/fair.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53fd95129b48..764797dd3744 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2120,12 +2120,13 @@ static void update_numa_stats(struct task_numa_env *env,
 			      struct numa_stats *ns, int nid,
 			      bool find_idle)
 {
+	cpumask_t *span = cpu_rq(env->src_cpu)->rd->span;
 	int cpu, idle_core = -1;
 
 	memset(ns, 0, sizeof(*ns));
 	ns->idle_cpu = -1;
 
-	cpumask_copy(env->cpus, cpumask_of_node(nid));
+	cpumask_and(env->cpus, span, cpumask_of_node(nid));
 
 	rcu_read_lock();
 	for_each_cpu(cpu, env->cpus) {
@@ -2435,10 +2436,12 @@ static bool task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
+	cpumask_t *span = cpu_rq(env->src_cpu)->rd->span;
 	bool maymove = false;
 	int cpu;
 
 	cpumask_and(env->cpus, cpumask_of_node(env->dst_nid), env->p->cpus_ptr);
+	cpumask_and(env->cpus, env->cpus, span);
 
 	/*
 	 * If dst node has spare capacity, then check if there is an
@@ -2503,10 +2506,10 @@ static void task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	unsigned long taskweight, groupweight;
+	struct rq *best_rq, *src_rq;
 	struct sched_domain *sd;
 	long taskimp, groupimp;
 	struct numa_group *ng;
-	struct rq *best_rq;
 	int nid, ret, dist;
 
 	/*
@@ -2530,6 +2533,9 @@ static void task_numa_migrate(struct task_struct *p)
 	 * balance domains, some of which do not cross NUMA boundaries.
 	 * Tasks that are "trapped" in such domains cannot be migrated
 	 * elsewhere, so there is no point in (re)trying.
+	 *
+	 * Another such situation is when src_cpu itself sits in an
+	 * isolated domain; bail out early in that case as well.
 	 */
 	if (unlikely(!sd)) {
 		sched_setnuma(p, task_node(p));
@@ -2541,6 +2547,7 @@ static void task_numa_migrate(struct task_struct *p)
 	 */
 	preempt_disable();
 
+	src_rq = cpu_rq(env.src_cpu);
 	env.cpus = this_cpu_cpumask_var_ptr(numa_balance_mask);
 	env.dst_nid = p->numa_preferred_nid;
 	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
@@ -2567,6 +2574,10 @@ static void task_numa_migrate(struct task_struct *p)
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
+			if (unlikely(!cpumask_intersects(src_rq->rd->span,
+						cpumask_of_node(nid))))
+				continue;
+
 			dist = node_distance(env.src_nid, env.dst_nid);
 			if (sched_numa_topology_type == NUMA_BACKPLANE &&
 						dist != env.dist) {
-- 
2.20.1

