Message-ID: <CAKfTPtCiyXCxAiU8YZUq6fVOY5MmKN7wXeW8Jm1=GWOEghcVBw@mail.gmail.com>
Date: Thu, 26 Jun 2025 09:44:45 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Chris Mason <clm@...com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC] sched/fair: bump sd->max_newidle_lb_cost when newidle
balance fails
On Thu, 26 Jun 2025 at 09:00, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Tue, Jun 24, 2025 at 01:48:08PM -0700, Chris Mason wrote:
> > schbench (https://github.com/masoncl/schbench.git) is showing a
> > regression from previous production kernels that bisected down to:
> >
> > sched/fair: Remove sysctl_sched_migration_cost condition (c5b0a7eefc)
> >
> > The schbench command line was:
> >
> > schbench -L -m 4 -M auto -t 256 -n 0 -r 0 -s 0
> >
> > This creates 4 message threads pinned to CPUs 0-3, and 256x4 worker
> > threads spread across the rest of the CPUs. Neither the worker threads
> > nor the message threads do any work; they just wake each other up and go
> > back to sleep as soon as possible.
> >
> > The end result is the first 4 CPUs are pegged waking up those 1024
> > workers, and the rest of the CPUs are constantly banging in and out of
> > idle. If I take a v6.9 Linus kernel and revert that one commit,
> > performance goes from 3.4M RPS to 5.4M RPS.
> >
> > schedstat shows there are ~100x more new idle balance operations, and
> > profiling shows the worker threads are spending ~20% of their CPU time
> > on new idle balance. schedstats also shows that almost all of these new
> > idle balance attempts are failing to find busy groups.
> >
> > The fix used here is to crank up the cost of the balance whenever it
> > fails to find a busy group or busy queue.
> >
> > Signed-off-by: Chris Mason <clm@...com>
> > ---
> > kernel/sched/fair.c | 52 ++++++++++++++++++++++++---------------------
> > 1 file changed, 28 insertions(+), 24 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7a14da5396fb2..0c4701564ce01 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11744,6 +11744,32 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> >  	}
> >  }
> >
> > +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> > +{
> > +	if (cost > sd->max_newidle_lb_cost) {
> > +		/*
> > +		 * Track max cost of a domain to make sure to not delay the
> > +		 * next wakeup on the CPU. Add a bit to the min so we don't
> > +		 * have to worry about <= vs <, and to handle the sysctl set at
> > +		 * zero.
> > +		 */
> > +		sd->max_newidle_lb_cost = min(cost, sysctl_sched_migration_cost + 200);
> > +		sd->last_decay_max_lb_cost = jiffies;
> > +	} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> > +		/*
> > +		 * Decay the newidle max times by ~1% per second to ensure that
> > +		 * it is not outdated and the current max cost is actually
> > +		 * shorter.
> > +		 */
> > +		sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
> > +		sd->last_decay_max_lb_cost = jiffies;
> > +
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +
>
> For the non-RFC version, please split this into a code move and a code
> change -- I had to stare waaay too long to spot the difference (if we
> keep this code movement at all).
>
> >  /*
> >   * Check this_cpu to ensure it is balanced within domain. Attempt to move
> >   * tasks if there is an imbalance.
> > @@ -11782,12 +11808,14 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >
> >  	group = sched_balance_find_src_group(&env);
> >  	if (!group) {
> > +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >  		schedstat_inc(sd->lb_nobusyg[idle]);
> >  		goto out_balanced;
> >  	}
> >
> >  	busiest = sched_balance_find_src_rq(&env, group);
> >  	if (!busiest) {
> > +		update_newidle_cost(sd, sd->max_newidle_lb_cost + sd->max_newidle_lb_cost / 2);
> >  		schedstat_inc(sd->lb_nobusyq[idle]);
> >  		goto out_balanced;
> >  	}
>
> So sched_balance_rq() is used for pretty much all load-balancing, not
> just newidle.
>
> Either make this conditional like:
>
> 	if (idle == CPU_NEWLY_IDLE)
> 		update_newidle_cost(...);
>
> or do it at the callsite, where we find !pulled_task (i.e. failure).
>
> Specifically, we already do update_newidle_cost() there, perhaps inflate
> the cost there instead?
>
> 	if (!pulled_task)
> 		domain_cost += sysctl_sched_migration_cost;
Yeah, sched_balance_newidle() looks like a better place to
artificially increase the cost; rough sketch at the end of this mail.
>
> or whatever.
>
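
Something along those lines, e.g. (completely untested, and from
memory of the SD_BALANCE_NEWIDLE loop in sched_balance_newidle(), so
the exact locals -- t0/t1/domain_cost/curr_cost -- may differ
depending on the tree):

	if (sd->flags & SD_BALANCE_NEWIDLE) {

		pulled_task = sched_balance_rq(this_cpu, this_rq,
					       sd, CPU_NEWLY_IDLE,
					       &continue_balancing);

		t1 = sched_clock_cpu(this_cpu);
		domain_cost = t1 - t0;

		/*
		 * Nothing was pulled: inflate the recorded cost so we
		 * back off from newidle balancing on this domain for a
		 * while instead of retrying it on every idle entry.
		 */
		if (!pulled_task)
			domain_cost += sysctl_sched_migration_cost;

		update_newidle_cost(sd, domain_cost);

		curr_cost += domain_cost;
		t0 = t1;
	}

That keeps sched_balance_rq() unchanged for the periodic and idle
balance paths and only penalizes the newidle case that is failing
here.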