linux-kernel - [PATCH] sched/fair: reduce false sharing on sched_balance

Message-ID: <20250423174634.3009657-1-edumazet@google.com>
Date: Wed, 23 Apr 2025 17:46:34 +0000
From: Eric Dumazet <edumazet@...gle.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>, Eric Dumazet <eric.dumazet@...il.com>, 
	Eric Dumazet <edumazet@...gle.com>, Yafang Shao <laoar.shao@...il.com>, 
	Sean Christopherson <seanjc@...gle.com>, Josh Don <joshdon@...gle.com>
Subject: [PATCH] sched/fair: reduce false sharing on sched_balance_running

sched_balance_domains() can attempt to change sched_balance_running
more than 350,000 times per second on our servers.

If sched_clock_irqtime and sched_balance_running share the
same cache line, every such attempt invalidates the line that
interrupt accounting reads, and we see a very high cost on hosts
with 480 threads dealing with many interrupts.

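This patch only attempts to acquire sched_balance_running once
sd->last_balance is at least one balance interval old, so the common
not-yet-due case no longer writes to the shared atomic.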

It also moves sched_balance_running into a dedicated cache line on SMP.

Signed-off-by: Eric Dumazet <edumazet@...gle.com>
Cc: Yafang Shao <laoar.shao@...il.com>
Cc: Sean Christopherson <seanjc@...gle.com>
Cc: Josh Don <joshdon@...gle.com>
---
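Note: the reordering described above (check the time first, touch the
atomic only when a balance is due) can be illustrated with a minimal
user-space sketch. It is not the kernel code: it assumes C11
<stdatomic.h>, and balance_running, tick_after_eq(), try_balance() and
count_attempts() are hypothetical stand-ins for sched_balance_running,
time_after_eq() and the loop body in sched_balance_domains().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for sched_balance_running (hypothetical user-space copy). */
static atomic_int balance_running;

/* Wraparound-safe "a is at or after b", modeled on time_after_eq(). */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Try to balance at tick 'now'. Returns 1 if the cmpxchg was attempted,
 * 0 if the cheap time check filtered the call out first.
 */
static int try_balance(unsigned long now, unsigned long *last_balance,
		       unsigned long interval)
{
	int expected = 0;

	/* Read-only check first: no write to the shared atomic if not due. */
	if (!tick_after_eq(now, *last_balance + interval))
		return 0;

	if (atomic_compare_exchange_strong_explicit(&balance_running,
						    &expected, 1,
						    memory_order_acquire,
						    memory_order_relaxed)) {
		/* The balancing work would go here. */
		*last_balance = now;
		atomic_store_explicit(&balance_running, 0,
				      memory_order_release);
	}
	return 1;
}

/* Count cmpxchg attempts over 'ticks' consecutive ticks (sketch helper). */
static unsigned long count_attempts(unsigned long ticks,
				    unsigned long interval)
{
	unsigned long last = 0, attempts = 0;

	for (unsigned long now = 1; now <= ticks; now++)
		attempts += try_balance(now, &last, interval);
	return attempts;
}
```

In this sketch, count_attempts(1000, 100) performs 10 cmpxchg attempts
rather than 1000, because the other 990 calls return after the
read-only comparison.
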
 kernel/sched/fair.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)
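
Note: the effect of giving the atomic its own cache line can be shown
with a small layout sketch. It assumes 64-byte cache lines and uses
C11 alignas() in place of the kernel's __cacheline_aligned_in_smp.
The structs, CACHE_LINE and the helper names are hypothetical; the
real variables are file-scope globals, and structs are used here only
to make the layout deterministic.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumption: 64-byte cache lines */

/* Read-mostly flag and contended atomic packed next to each other. */
struct packed_layout {
	int irqtime_flag;		/* stand-in for sched_clock_irqtime */
	atomic_int balance_running;	/* stand-in for sched_balance_running */
};

/* Same fields, but the atomic starts on its own cache line. */
struct padded_layout {
	int irqtime_flag;
	alignas(CACHE_LINE) atomic_int balance_running;
};

/* Index of the cache line that a given byte offset falls into. */
static size_t line_of(size_t offset)
{
	return offset / CACHE_LINE;
}

/* Returns 1 if both fields of packed_layout land on the same line. */
static int packed_shares_line(void)
{
	return line_of(offsetof(struct packed_layout, irqtime_flag)) ==
	       line_of(offsetof(struct packed_layout, balance_running));
}

/* Returns 1 if both fields of padded_layout land on the same line. */
static int padded_shares_line(void)
{
	return line_of(offsetof(struct padded_layout, irqtime_flag)) ==
	       line_of(offsetof(struct padded_layout, balance_running));
}
```

Under these assumptions packed_shares_line() returns 1 and
padded_shares_line() returns 0: writes to the aligned atomic no longer
invalidate the line holding the read-mostly flag.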

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e5807eaffcacaf761c289e8adb354cfd..460008d0dc459b3ca60209565e89c419ea32a4e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12144,7 +12144,7 @@ static int active_load_balance_cpu_stop(void *data)
  *   execution, as non-SD_SERIALIZE domains will still be
  *   load-balanced in parallel.
  */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
+static __cacheline_aligned_in_smp atomic_t sched_balance_running = ATOMIC_INIT(0);
 
 /*
  * Scale the max sched_balance_rq interval with the number of CPUs in the system.
@@ -12220,25 +12220,25 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 		interval = get_sd_balance_interval(sd, busy);
 
+		if (!time_after_eq(jiffies, sd->last_balance + interval))
+			goto out;
+
 		need_serialize = sd->flags & SD_SERIALIZE;
 		if (need_serialize) {
 			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
 				goto out;
 		}
-
-		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
-				/*
-				 * The LBF_DST_PINNED logic could have changed
-				 * env->dst_cpu, so we can't know our idle
-				 * state even if we migrated tasks. Update it.
-				 */
-				idle = idle_cpu(cpu);
-				busy = !idle && !sched_idle_cpu(cpu);
-			}
-			sd->last_balance = jiffies;
-			interval = get_sd_balance_interval(sd, busy);
+		if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
+			/*
+			 * The LBF_DST_PINNED logic could have changed
+			 * env->dst_cpu, so we can't know our idle
+			 * state even if we migrated tasks. Update it.
+			 */
+			idle = idle_cpu(cpu);
+			busy = !idle && !sched_idle_cpu(cpu);
 		}
+		sd->last_balance = jiffies;
+		interval = get_sd_balance_interval(sd, busy);
 		if (need_serialize)
 			atomic_set_release(&sched_balance_running, 0);
 out:
-- 
2.49.0.805.g082f7c87e0-goog