linux-kernel - Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250626070057.GE1613200@noisy.programming.kicks-ass.net>
Date: Thu, 26 Jun 2025 09:00:57 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Chris Mason <clm@...com>
Cc: vincent.guittot@...aro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when
 newidle balance fails

On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
> schbench (https://github.com/masoncl/schbench.git) is showing a
> regression from previous production kernels that bisected down to:
> 
> sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
> 
> The schbench command line was:
> 
> schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> 
> This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
> threads spread across the rest of the CPUs.  Neither the worker threads
> or the message threads do any work, they just wake each other up and go
> back to sleep as soon as possible.
> 
> The end result is the first 4 CPUs are pegged waking up those 1024
> workers, and the rest of the CPUs are constantly banging in and out of
> idle.  If I take a v6.9 Linus kernel and revert that one commit,
> performance goes from 3.4M RPS to 5.4M RPS.
> 
> schedstat shows there are ~100x  more new idle balance operations, and
> profiling shows the worker threads are spending ~20% of their CPU time
> on new idle balance.  schedstats also shows that almost all of these new
> idle balance attemps are failing to find busy groups.
> 
> The fix used here is to crank up the cost of the balance whenever it
> fails to find a busy group or busy queue.
> 
> Signed-off-by: Chris Mason <clm@...com>
> ---
>  kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
>  1 file changed, 28 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb2..0c4701564ce01 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
>  	}
>  }
>  
> +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> +{
> +	if (cost > sd->max_newidle_lb_cost) {
> +		/*
> +		 * Track max cost of a domain to make sure to not delay the
> +		 * next wakeup on the CPU.   Add a bit to the min so we don't
> +		 * have to worry about <= vs <, and to handle the sysctl set at
> +		 * zero.
> +		 */
> +		sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
> +		sd->last_decay_max_lb_cost = jiffies;
> +	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> +		/*
> +		 * Decay the newidle max times by ~1% per second to ensure that
> +		 * it is not outdated and the current max cost is actually
> +		 * shorter.
> +		 */
> +		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
> +		sd->last_decay_max_lb_cost = jiffies;
> +
> +		return true;
> +	}
> +
> +	return false;
> +}
> +

For the non-RFC version, please split this into a code move and a code
change -- I had to stare waaay to long to spot the difference (if we
keep this code movement at all).

>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  
>  	group = sched_balance_find_src_group(&env);
>  	if (!group) {
> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>  		schedstat_inc(sd->lb_nobusyg[idle]);
>  		goto out_balanced;
>  	}
>  
>  	busiest = sched_balance_find_src_rq(&env, group);
>  	if (!busiest) {
> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>  		schedstat_inc(sd->lb_nobusyq[idle]);
>  		goto out_balanced;
>  	}

So sched_balance_rq() is used for pretty much all load-balancing, not
just newidle.

Either make this conditional like:

	if (idle == CPU_NEWLY_IDLE)
		update_newidle_cost(...);

or do it all the callsite, where we find !pulled_task (ie failure).

Specifically, we already do update_newidle_cost() there, perhaps inflate
the cost there instead?

	if (!pulled_tasks)
		domain_cost += sysctl_sched_migration_cost;

or whatever.