Message-ID: <CAKfTPtCiyXCxAiU8YZUq6fVOY5MmKN7wXeW8Jm1=GWOEghcVBw@mail.gmail.com>
Date: Thu, 26 Jun 2025 09:44:45 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Chris Mason <clm@...com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle
balance fails
On Thu, 26 Jun 2025 at 09:00, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
> > schbench (https://github.com/masoncl/schbench.git) is showing a
> > regression from previous production kernels that bisected down to:
> >
> > sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
> >
> > The schbench command line was:
> >
> > schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> >
> > This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
> > threads spread across the rest of the CPUs. Neither the worker threads
> > nor the message threads do any work; they just wake each other up and go
> > back to sleep as soon as possible.
> >
> > The end result is the first 4 CPUs are pegged waking up those 1024
> > workers, and the rest of the CPUs are constantly banging in and out of
> > idle. If I take a v6.9 Linus kernel and revert that one commit,
> > performance goes from 3.4M RPS to 5.4M RPS.
> >
> > schedstat shows there are ~100x more new idle balance operations, and
> > profiling shows the worker threads are spending ~20% of their CPU time
> > on new idle balance. schedstats also shows that almost all of these new
> > idle balance attempts are failing to find busy groups.
> >
> > The fix used here is to crank up the cost of the balance whenever it
> > fails to find a busy group or busy queue.
> >
> > Signed-off-by: Chris Mason <clm@...com>
> > ---
> > kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
> > 1 file changed, 28 insertions(+), 24 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7a14da5396fb2..0c4701564ce01 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> >  	}
> >  }
> >
> > +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> > +{
> > +	if (cost > sd->max_newidle_lb_cost) {
> > +		/*
> > +		 * Track max cost of a domain to make sure to not delay the
> > +		 * next wakeup on the CPU. Add a bit to the min so we don't
> > +		 * have to worry about <= vs <, and to handle the sysctl set at
> > +		 * zero.
> > +		 */
> > +		sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
> > +		sd->last_decay_max_lb_cost = jiffies;
> > +	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> > +		/*
> > +		 * Decay the newidle max times by ~1% per second to ensure that
> > +		 * it is not outdated and the current max cost is actually
> > +		 * shorter.
> > +		 */
> > +		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
> > +		sd->last_decay_max_lb_cost = jiffies;
> > +
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +
>
> For the non-RFC version, please split this into a code move and a code
> change -- I had to stare waaay too long to spot the difference (if we
> keep this code movement at all).
>
> >  /*
> >   * Check this_cpu to ensure it is balanced within domain. Attempt to move
> >   * tasks if there is an imbalance.
> > @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >
> >  	group = sched_balance_find_src_group(&env);
> >  	if (!group) {
> > +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >  		schedstat_inc(sd->lb_nobusyg[idle]);
> >  		goto out_balanced;
> >  	}
> >
> >  	busiest = sched_balance_find_src_rq(&env, group);
> >  	if (!busiest) {
> > +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >  		schedstat_inc(sd->lb_nobusyq[idle]);
> >  		goto out_balanced;
> >  	}
>
> So sched_balance_rq() is used for pretty much all load-balancing, not
> just newidle.
>
> Either make this conditional like:
>
> 	if (idle == CPU_NEWLY_IDLE)
> 		update_newidle_cost(...);
>
> or do it at the callsite, where we find !pulled_task (i.e. failure).
>
> Specifically, we already do update_newidle_cost() there, perhaps inflate
> the cost there instead?
>
> 	if (!pulled_task)
> 		domain_cost += sysctl_sched_migration_cost;
Yeah, sched_balance_newidle() looks like a better place to
artificially increase the cost; rough sketch at the end of this mail.
>
> or whatever.
>
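
Something along those lines, e.g. (completely untested, and from
memory of the SD_BALANCE_NEWIDLE loop in sched_balance_newidle(), so
the exact locals -- t0/t1/domain_cost/curr_cost -- may differ
depending on the tree):

	if (sd->flags & SD_BALANCE_NEWIDLE) {

		pulled_task = sched_balance_rq(this_cpu, this_rq,
					       sd, CPU_NEWLY_IDLE,
					       &continue_balancing);

		t1 = sched_clock_cpu(this_cpu);
		domain_cost = t1 - t0;

		/*
		 * Nothing was pulled: inflate the recorded cost so we
		 * back off from newidle balancing on this domain for a
		 * while instead of retrying it on every idle entry.
		 */
		if (!pulled_task)
			domain_cost += sysctl_sched_migration_cost;

		update_newidle_cost(sd, domain_cost);

		curr_cost += domain_cost;
		t0 = t1;
	}

That keeps sched_balance_rq() unchanged for the periodic and idle
balance paths and only penalizes the newidle case that is failing
here.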