Message-ID: <20250626070057.GE1613200@noisy.programming.kicks-ass.net>
Date: Thu, 26 Jun 2025 09:00:57 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Chris Mason <clm@...com>
Cc: vincent.guittot@...aro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when
newidle balance fails

On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
> schbench (https://github.com/masoncl/schbench.git) is showing a
> regression from previous production kernels that bisected down to:
>
> sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
>
> The schbench command line was:
>
> schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
>
> This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
> threads spread across the rest of the CPUs. Neither the worker threads
> nor the message threads do any work; they just wake each other up and go
> back to sleep as soon as possible.
>
> The end result is the first 4 CPUs are pegged waking up those 1024
> workers, and the rest of the CPUs are constantly banging in and out of
> idle. If I take a v6.9 Linus kernel and revert that one commit,
> performance goes from 3.4M RPS to 5.4M RPS.
>
> schedstat shows there are ~100x more new idle balance operations, and
> profiling shows the worker threads are spending ~20% of their CPU time
> on new idle balance. schedstats also shows that almost all of these new
> idle balance attempts are failing to find busy groups.
>
> The fix used here is to crank up the cost of the balance whenever it
> fails to find a busy group or busy queue.
>
> Signed-off-by: Chris Mason <clm@...com>
> ---
> kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
> 1 file changed, 28 insertions(+), 24 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb2..0c4701564ce01 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
>  	}
>  }
>  
> +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> +{
> +	if (cost > sd->max_newidle_lb_cost) {
> +		/*
> +		 * Track max cost of a domain to make sure to not delay the
> +		 * next wakeup on the CPU. Add a bit to the min so we don't
> +		 * have to worry about <= vs <, and to handle the sysctl set at
> +		 * zero.
> +		 */
> +		sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
> +		sd->last_decay_max_lb_cost = jiffies;
> +	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> +		/*
> +		 * Decay the newidle max times by ~1% per second to ensure that
> +		 * it is not outdated and the current max cost is actually
> +		 * shorter.
> +		 */
> +		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
> +		sd->last_decay_max_lb_cost = jiffies;
> +
> +		return true;
> +	}
> +
> +	return false;
> +}
> +

For the non-RFC version, please split this into a code move and a code
change -- I had to stare waaay too long to spot the difference (if we
keep this code movement at all).
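
(For reference -- if I'm reading the diff right, the only functional
change relative to the current update_newidle_cost() is the comment plus
this one line:

	-	sd->max_newidle_lb_cost = cost;
	+	sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);

the rest is movement.)
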
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> 
>  	group = sched_balance_find_src_group(&env);
>  	if (!group) {
> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>  		schedstat_inc(sd->lb_nobusyg[idle]);
>  		goto out_balanced;
>  	}
> 
>  	busiest = sched_balance_find_src_rq(&env, group);
>  	if (!busiest) {
> +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
>  		schedstat_inc(sd->lb_nobusyq[idle]);
>  		goto out_balanced;
>  	}

So sched_balance_rq() is used for pretty much all load-balancing, not
just newidle.

Either make this conditional like:

	if (idle == CPU_NEWLY_IDLE)
		update_newidle_cost(...);
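
That is, in context -- sketch only, reusing the bump from your patch:

	group = sched_balance_find_src_group(&env);
	if (!group) {
		/* only the newidle path should pay the penalty */
		if (idle == CPU_NEWLY_IDLE)
			update_newidle_cost(sd, sd->max_newidle_lb_cost +
						sd->max_newidle_lb_cost / 2);
		schedstat_inc(sd->lb_nobusyg[idle]);
		goto out_balanced;
	}

(and the same for the !busiest case).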

or do it at the callsite, where we find !pulled_task (i.e. failure).

Specifically, we already do update_newidle_cost() there, perhaps inflate
the cost there instead?

	if (!pulled_task)
		domain_cost += sysctl_sched_migration_cost;

or whatever.
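
Untested sketch of that variant; the surrounding lines are the newidle
loop from sched_balance_newidle() as I remember it:

	if (sd->flags & SD_BALANCE_NEWIDLE) {
		pulled_task = sched_balance_rq(this_cpu, this_rq,
					       sd, CPU_NEWLY_IDLE,
					       &continue_balancing);

		t1 = sched_clock_cpu(this_cpu);
		domain_cost = t1 - t0;
		/* failed to pull anything; inflate the cost so we back off */
		if (!pulled_task)
			domain_cost += sysctl_sched_migration_cost;
		update_newidle_cost(sd, domain_cost);

		curr_cost += domain_cost;
		t0 = t1;
	}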