Message-ID: <20251208092744.32737-9-kprateek.nayak@amd.com>
Date: Mon, 8 Dec 2025 09:26:55 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>, Thomas Gleixner
<tglx@...utronix.de>
CC: <linux-kernel@...r.kernel.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R.
Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>, Chen Yu <yu.c.chen@...el.com>
Subject: [RESEND RFC PATCH v2 09/29] sched/fair: Rotate the CPU responsible for busy load balancing
group_balance_cpu() currently always returns the first CPU from the
group_balance_mask(). This puts the burden of busy balancing on the same
set of CPUs when the system is under heavy load.
Rotate the CPU responsible for busy load balancing across all the CPUs
in group_balance_mask(). The "busy_balance_cpu" in "sg->sgc" tracks the
CPU currently responsible for busy balancing in the group. Since
"sg->sgc" is shared by all the CPUs in group_balance_mask(), all CPUs of
the group will see the same "busy_balance_cpu".
The current "busy_balance_cpu" is responsible for updating the shared
variable with the next CPU on the mask once it is done attempting
balancing.
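
For illustration only (not part of the patch): a minimal userspace C
sketch of the round-robin hand-off described above, assuming a toy 8-bit
mask in place of a real cpumask. The mask contents, CPU numbers and the
next_cpu_wrap() helper are invented for the example; the actual patch
uses cpumask_next_wrap() on group_balance_mask() as shown in the diff
below.

/*
 * Illustrative userspace sketch (not kernel code): round-robin advance
 * of a "balance CPU" over a fixed membership mask, wrapping at the end.
 */
#include <stdio.h>

/* Return the next set bit after 'cpu' in 'mask', wrapping around. */
static int next_cpu_wrap(unsigned long mask, int cpu, int nr_cpus)
{
	for (int i = 1; i < nr_cpus; i++) {
		int next = (cpu + i) % nr_cpus;

		if (mask & (1UL << next))
			return next;
	}
	return cpu;	/* no other CPU set in the mask: keep the same one */
}

int main(void)
{
	unsigned long group_mask = 0xb4;	/* CPUs 2, 4, 5 and 7 */
	int balance_cpu = 2;			/* initially the first CPU */

	/* Each completed busy-balance pass hands off to the next CPU. */
	for (int pass = 0; pass < 6; pass++) {
		printf("pass %d: balance CPU %d\n", pass, balance_cpu);
		balance_cpu = next_cpu_wrap(group_mask, balance_cpu, 8);
	}
	return 0;	/* prints 2, 4, 5, 7, 2, 4 */
}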
Although there is an unlikely chance that the current "busy_balance_cpu"
is unable to perform load balancing in a timely manner, for example if
it is running with softirqs disabled, this is no worse than the current
scenario where the first CPU of group_balance_mask() could also be
unavailable to perform load balancing for a long time.
Any hotplug / cpuset operation will rebuild the sched domain hierarchy,
which resets the "busy_balance_cpu" to the first CPU on the updated
group_balance_mask().
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
kernel/sched/fair.c | 24 +++++++++++++++++++++++-
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 5 ++++-
3 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f5745495974..e3935903d9c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11747,7 +11747,7 @@ static int should_we_balance(struct lb_env *env)
if (idle_smt != -1)
return idle_smt == env->dst_cpu;
- /* Are we the first CPU of this group ? */
+ /* Are we the busy load balancing CPU of this group ? */
return group_balance_cpu(sg) == env->dst_cpu;
}
@@ -11773,6 +11773,22 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
}
}
+static void update_busy_balance_cpu(int this_cpu, struct lb_env *env)
+{
+ struct sched_group *group = env->sd->groups;
+ int balance_cpu = group_balance_cpu(group);
+
+ /*
+ * Only the current CPU responsible for busy load balancing
+ * should update the "busy_balance_cpu" for next instance.
+ */
+ if (this_cpu != balance_cpu)
+ return;
+
+ balance_cpu = cpumask_next_wrap(balance_cpu, group_balance_mask(group));
+ WRITE_ONCE(group->sgc->busy_balance_cpu, balance_cpu);
+}
+
/*
* This flag serializes load-balancing passes over large domains
* (above the NODE topology level) - only one load-balancing instance
@@ -12075,6 +12091,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
out:
if (need_unlock)
atomic_set_release(&sched_balance_running, 0);
+ /*
+ * If this was a successful busy balancing attempt,
+ * update the "busy_balance_cpu" of the group.
+ */
+ if (!idle && continue_balancing)
+ update_busy_balance_cpu(this_cpu, &env);
return ld_moved;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b419a4d98461..659e712f348f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2100,6 +2100,7 @@ struct sched_group_capacity {
unsigned long max_capacity; /* Max per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
+ int busy_balance_cpu;
int id;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 14be90af9761..8870b38d4072 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -810,7 +810,7 @@ enum s_alloc {
*/
int group_balance_cpu(struct sched_group *sg)
{
- return cpumask_first(group_balance_mask(sg));
+ return READ_ONCE(sg->sgc->busy_balance_cpu);
}
@@ -992,6 +992,8 @@ static void init_overlap_sched_group(struct sched_domain *sd,
cpu = cpumask_first(mask);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+ sg->sgc->busy_balance_cpu = cpu;
+
if (atomic_inc_return(&sg->sgc->ref) == 1)
cpumask_copy(group_balance_mask(sg), mask);
else
@@ -1211,6 +1213,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
sg = *per_cpu_ptr(sdd->sg, cpu);
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+ sg->sgc->busy_balance_cpu = cpu;
/* Increase refcounts for claim_allocations: */
already_visited = atomic_inc_return(&sg->ref) > 1;
--
2.43.0