Message-ID: <20251013142638.GM3245006@noisy.programming.kicks-ass.net>
Date: Mon, 13 Oct 2025 16:26:38 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tim Chen <tim.c.chen@...ux.intel.com>
Cc: Ingo Molnar <mingo@...nel.org>, Chen Yu <yu.c.chen@...el.com>,
Doug Nelson <doug.nelson@...el.com>,
Mohini Narkhede <mohini.narkhede@...el.com>,
linux-kernel@...r.kernel.org,
Vincent Guittot <vincent.guittot@...aro.org>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [RESEND PATCH] sched/fair: Skip sched_balance_running cmpxchg
when balance is not due

On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
> During load balancing, balancing at the LLC level and above must be
> serialized.

I would argue with the wording here: there is no *must*, they *are*. Per
the current rules SD_NUMA and up get SD_SERIALIZE.

This is a *very* old thing, done by Christoph Lameter back when he was
at SGI. I'm not sure that default is still valid. Someone would have to
investigate. An alternative would be moving it into node_reclaim_distance
or somesuch.
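
For reference, this default comes out of sd_init() in
kernel/sched/topology.c; roughly (sketch from memory, not verbatim, go
read the real thing) the NUMA levels get:

	} else if (sd->flags & SD_NUMA) {
		/* NUMA levels: serialize balancing, don't prefer siblings */
		sd->flags &= ~SD_PREFER_SIBLING;
		sd->flags |= SD_SERIALIZE;

		/* far-away nodes additionally lose wake/fork/exec balancing */
		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}
	}

That is, everything from SD_NUMA up is serialized unconditionally; the
node_reclaim_distance check only gates the other flags.
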
> The scheduler currently checks the atomic
> `sched_balance_running` flag before verifying whether a balance is
> actually due. This causes high contention, as multiple CPUs may attempt
> to acquire the flag concurrently.

Right.

> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
> operations for `sched_balance_running`. In most cases, the attempt
> aborts immediately after acquisition because the load balance time is
> not yet due.

So I'm not sure I understand the situation; @continue_balancing should
limit this concurrency to however many groups there are on this domain --
your Granite Rapids thing with SNC on would have something like 6 groups?
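
(What I mean -- sketch from memory, not verbatim: sched_balance_rq()
bails out on every CPU that should_we_balance() does not designate for
its group,

	if (!should_we_balance(&env)) {
		*continue_balancing = 0;
		goto out_balanced;
	}

and sched_balance_domains() stops walking up the domain hierarchy once
@continue_balancing is 0, so on a serialized domain I'd expect at most
one contender per group.)
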
> Fix this by checking whether a balance is due *before* trying to
> acquire `sched_balance_running`. This avoids many wasted acquisitions
> and reduces the cmpxchg overhead in `sched_balance_domain()` from 7.6%
> to 0.05%. As a result, OLTP throughput improves by 11%.

Yeah, I see no harm flipping this, but the Changelog needs help.

> Reviewed-by: Chen Yu <yu.c.chen@...el.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@...aro.org>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..bedd785c4a39 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12126,13 +12126,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>  
>  		interval = get_sd_balance_interval(sd, busy);
>  
> -		need_serialize = sd->flags & SD_SERIALIZE;
> -		if (need_serialize) {
> -			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> -				goto out;
> -		}
> -
>  		if (time_after_eq(jiffies, sd->last_balance + interval)) {
> +			need_serialize = sd->flags & SD_SERIALIZE;
> +			if (need_serialize) {
> +				if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> +					goto out;
> +			}
> +
>  			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>  				/*
>  				 * The LBF_DST_PINNED logic could have changed
> @@ -12144,9 +12144,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>  			}
>  			sd->last_balance = jiffies;
>  			interval = get_sd_balance_interval(sd, busy);
> +			if (need_serialize)
> +				atomic_set_release(&sched_balance_running, 0);
>  		}
> -		if (need_serialize)
> -			atomic_set_release(&sched_balance_running, 0);
>  out:
>  		if (time_after(next_balance, sd->last_balance + interval)) {
>  			next_balance = sd->last_balance + interval;

Instead of making the indenting worse, could we make it better?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e743d9d0576c..6318834ff42a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12215,6 +12215,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 
 		interval = get_sd_balance_interval(sd, busy);
+		if (time_before(jiffies, sd->last_balance + interval))
+			goto out;
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 		if (need_serialize) {
@@ -12222,19 +12224,18 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			goto out;
 		}
 
-		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
-				/*
-				 * The LBF_DST_PINNED logic could have changed
-				 * env->dst_cpu, so we can't know our idle
-				 * state even if we migrated tasks. Update it.
-				 */
-				idle = idle_cpu(cpu);
-				busy = !idle && !sched_idle_cpu(cpu);
-			}
-			sd->last_balance = jiffies;
-			interval = get_sd_balance_interval(sd, busy);
+		if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
+			/*
+			 * The LBF_DST_PINNED logic could have changed
+			 * env->dst_cpu, so we can't know our idle
+			 * state even if we migrated tasks. Update it.
+			 */
+			idle = idle_cpu(cpu);
+			busy = !idle && !sched_idle_cpu(cpu);
 		}
+		sd->last_balance = jiffies;
+		interval = get_sd_balance_interval(sd, busy);
+
 		if (need_serialize)
 			atomic_set_release(&sched_balance_running, 0);
 out: