Message-ID: <20250904041516.3046-17-kprateek.nayak@amd.com>
Date: Thu, 4 Sep 2025 04:15:12 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>, Thomas Gleixner
<tglx@...utronix.de>, <linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>, "Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>
Subject: [RFC PATCH 16/19] sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
Convert the main nohz idle load balancing loop in
sched_balance_nohz_idle() to use the distributed nohz idle tracking
mechanism via "nohz_shared_list".
The nifty trick of balancing the nohz owner at the very end using
for_each_cpu_wrap() is lost in this transition; the balancing CPU is
instead handled explicitly after the list walk. Special care is taken
to ensure nohz.{needs_update,has_blocked} are set correctly for a
reattempt if the balancing CPU turns busy towards the end of nohz
balancing, preserving the current behavior.
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++-----------
1 file changed, 47 insertions(+), 15 deletions(-)
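
Illustrative note (not part of the patch): below is a minimal userspace
sketch of the two-level walk this change introduces, iterating the shared
domains, balancing every idle CPU except the owner, and handling the owner
last with an early bail-out when it turns busy. All names in it
(struct shared_domain, balance_one(), cpu_idle[], NR_CPUS) are hypothetical
stand-ins for illustration only, not kernel APIs.

/*
 * Simplified userspace analogy of the iteration pattern introduced by
 * this patch. Names below are hypothetical stand-ins, not kernel APIs.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

struct shared_domain {
	int nr_idle_cpus;		/* analogous to sds->nr_idle_cpus */
	bool idle_mask[NR_CPUS];	/* analogous to sds->idle_cpus_mask */
};

/* Stand-in for sched_balance_idle_rq(): pretend to pull load for @cpu. */
static unsigned int balance_one(int cpu)
{
	printf("balancing CPU %d\n", cpu);
	return 0;
}

/*
 * Walk every shared domain, balance all idle CPUs except the owner, and
 * balance the owner last so the other idle CPUs get a chance to pull
 * load first. Return -1 (mirroring -EBUSY) if the owner turns busy.
 */
static int balance_all(struct shared_domain *doms, int nr_doms,
		       const bool *cpu_idle, int owner)
{
	unsigned int update_flags = 0;

	for (int d = 0; d < nr_doms; d++) {
		struct shared_domain *sd = &doms[d];

		if (!sd->nr_idle_cpus)		/* no idle CPUs here, skip */
			continue;

		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			if (cpu == owner || !sd->idle_mask[cpu])
				continue;

			if (!cpu_idle[owner])	/* owner has work: stop early */
				return -1;

			update_flags |= balance_one(cpu);
		}
	}

	/* Owner last; if it went busy meanwhile, ask the caller to retry. */
	if (!cpu_idle[owner])
		return -1;

	update_flags |= balance_one(owner);
	return update_flags;
}

int main(void)
{
	bool idle[NR_CPUS] = { [1] = true, [2] = true, [5] = true };
	struct shared_domain doms[2] = {
		{ .nr_idle_cpus = 2, .idle_mask = { [1] = true, [2] = true } },
		{ .nr_idle_cpus = 1, .idle_mask = { [5] = true } },
	};

	balance_all(doms, 2, idle, /*owner=*/1);
	return 0;
}

With owner = 1 in the example, CPUs 2 and 5 are balanced first and CPU 1
last, mirroring the owner-last behavior described in the changelog.
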
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d309cb73d428..c7ac8e7094ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12685,27 +12685,59 @@ static int sched_balance_nohz_idle(int balancing_cpu, unsigned int flags, unsign
 {
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = start + 60*HZ;
+	struct sched_domain_shared *sds;
 	unsigned int update_flags = 0;
-	int target_cpu;
 
-	/*
-	 * Start with the next CPU after the balancing CPU so we will end with
-	 * balancing CPU and let a chance for other idle cpu to pull load.
-	 */
-	for_each_cpu_wrap(target_cpu, nohz.idle_cpus_mask, balancing_cpu + 1) {
-		if (!idle_cpu(target_cpu))
+	rcu_read_lock();
+	list_for_each_entry_rcu(sds, &nohz_shared_list, nohz_list_node) {
+		int target_cpu;
+
+		/* No idle CPUs in this domain */
+		if (!atomic_read(&sds->nr_idle_cpus))
 			continue;
 
-		/*
-		 * If balancing CPU gets work to do, stop the load balancing
-		 * work being done for other CPUs. Next load balancing owner
-		 * will pick it up.
-		 */
-		if (!idle_cpu(balancing_cpu) && need_resched())
-			return -EBUSY;
+		for_each_cpu(target_cpu, sds->idle_cpus_mask) {
+			/* Deal with the balancing CPU at the end. */
+			if (balancing_cpu == target_cpu)
+				continue;
+
+			if (!idle_cpu(target_cpu))
+				continue;
 
-		update_flags |= sched_balance_idle_rq(cpu_rq(target_cpu), flags, &next_balance);
+			/*
+			 * If balancing CPU gets work to do, stop the load balancing
+			 * work being done for other CPUs. Next load balancing owner
+			 * will pick it up.
+			 */
+			if (!idle_cpu(balancing_cpu) && need_resched()) {
+				rcu_read_unlock();
+				return -EBUSY;
+			}
+
+			update_flags |= sched_balance_idle_rq(cpu_rq(target_cpu),
+							      flags, &next_balance);
+		}
 	}
+	rcu_read_unlock();
+
+	/*
+	 * If we reach here, all CPUs have been balanced and it is time
+	 * to balance the balancing_cpu.
+	 *
+	 * If coincidentally the balancing CPU turns busy at this point
+	 * and is the only nohz idle CPU, we still need to set
+	 * nohz.{needs_update,has_blocked} since the CPU can transition
+	 * back to nohz idle before the tick hits.
+	 *
+	 * In the above case, rq->nohz_tick_stopped is never cleared and
+	 * nohz_balance_enter_idle() skips setting nohz.has_blocked.
+	 * Return -EBUSY instructing the caller to reset the nohz
+	 * signals allowing a reattempt.
+	 */
+	if (!idle_cpu(balancing_cpu) && need_resched())
+		return -EBUSY;
+
+	update_flags |= sched_balance_idle_rq(cpu_rq(balancing_cpu), flags, &next_balance);
 
 	/*
 	 * next_balance will be updated only when there is a need.
--
2.34.1