Message-ID: <23e05939e7a19151d9b17d011e48a85d650b4e8a.camel@linux.intel.com>
Date: Wed, 16 Apr 2025 09:19:30 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>, "Chen, Yu C"
 <yu.c.chen@...el.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>, Doug Nelson
 <doug.nelson@...el.com>, Mohini Narkhede <mohini.narkhede@...el.com>, 
 linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>, Ingo
 Molnar <mingo@...nel.org>
Subject: Re: [PATCH] sched: Skip useless sched_balance_running acquisition
 if load balance is not due

On Wed, 2025-04-16 at 14:46 +0530, Shrikanth Hegde wrote:
> 
> On 4/16/25 11:58, Chen, Yu C wrote:
> > Hi Shrikanth,
> > 
> > On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
> > > 
> > > 
> > > On 4/16/25 09:28, Tim Chen wrote:
> > > > At load balance time, balancing of the last-level-cache domains and
> > > > above needs to be serialized. The scheduler checks the atomic var
> > > > sched_balance_running first and then sees if the time is due for a
> > > > load balance. This is an expensive operation, as multiple CPUs can
> > > > attempt to acquire sched_balance_running at the same time.
> > > > 
> > > > On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> > > > running OLTP workloads, 7.6% of CPU cycles are spent on the cmpxchg of
> > > > sched_balance_running.  Most of the time, a balance attempt is aborted
> > > > immediately after acquiring sched_balance_running because the load
> > > > balance time is not due.
> > > > 
> > > > Instead, check whether the balance is due before acquiring
> > > > sched_balance_running. This skips many useless acquisitions
> > > > of sched_balance_running and knocks the 7.6% CPU overhead in
> > > > sched_balance_domains() down to 0.05%.  Throughput of the OLTP
> > > > workload improved by 11%.
> > > > 
> > > 
> > > Hi Tim.
> > > 
> > > The time check makes sense, especially on large systems, mainly due
> > > to NEWIDLE balance.
> 
> scratch the NEWLY_IDLE part from that comment.
> 
> > > 
> > 
> > Could you elaborate a little on this statement? There is no timeout
> > mechanism for NEWLY_IDLE like there is for the periodic load balancer,
> > right?
> 
> Yes. NEWLY_IDLE is very opportunistic.
> 
> > 
> > > One more point to add: a lot of the time, the CPU which acquired
> > > sched_balance_running does not end up doing the load balance, since
> > > it is not the CPU meant to do the load balance.
> > > 
> > > This thread:
> > > https://lore.kernel.org/all/1e43e783-55e7-417f-a1a7-503229eb163a@...ux.ibm.com/
> > > 
> > > 
> > > The best thing probably is to acquire it only if this CPU has passed
> > > the time check and is also actually going to do the load balance.
> > > 
> > > 
> > 
> > This is a good point, and we might only want to deal with the periodic
> > load balancer rather than NEWLY_IDLE balance, because the latter is too
> > frequent and contention on sched_balance_running might introduce high
> > cache contention.
> > 
> 
> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
> end up consuming a lot of cycles. And if we did serialize it using
> sched_balance_running, it would definitely cause a lot of contention
> as is.
> 
> 
> The point was, before acquiring it, it would be better if this CPU were
> definitely going to do the load balance. Otherwise there are chances of
> missing the actual load balance.
> 
You mean doing a should_we_balance() check?  I think we should not
even consider that if the balance time is not due, so the balance-due
check should come first.

Do you have any objection to switching the order of the time-due check
and the serialization on sched_balance_running, as in this patch?
Adding a check for whether this is the right balancing CPU could be an
orthogonal change.
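
To make the ordering concrete, here is a rough standalone sketch of the
combined flow (purely illustrative, not the kernel code: time_is_due()
and is_designated_balance_cpu() are made-up stand-ins for the jiffies
check and a should_we_balance()-style check, and C11 atomics stand in
for the kernel's atomic_* helpers):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int sched_balance_running;

extern bool time_is_due(void);               /* sd->last_balance + interval */
extern bool is_designated_balance_cpu(void); /* should_we_balance()-style   */
extern void do_balance(void);                /* sched_balance_rq() stand-in */

void balance_attempt(bool need_serialize)
{
        /* 1. Cheap, mostly-read check first: bail out if balance is not due. */
        if (!time_is_due())
                return;

        /* 2. The orthogonal idea from this thread: only the CPU that would
         *    actually do the balance should try for the serialization.
         */
        if (!is_designated_balance_cpu())
                return;

        /* 3. Only now pay for the contended atomic acquisition. */
        if (need_serialize) {
                int expected = 0;

                if (!atomic_compare_exchange_strong_explicit(
                                &sched_balance_running, &expected, 1,
                                memory_order_acquire, memory_order_relaxed))
                        return;
        }

        do_balance();

        if (need_serialize)
                atomic_store_explicit(&sched_balance_running, 0,
                                      memory_order_release);
}

Step 2 is the orthogonal part; this patch only moves step 1 ahead of
step 3.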

In the OLTP workload we tested, 97% of the CPU cycles in
sched_balance_domains() are not spent doing useful load balancing work,
but simply on the acquisition of sched_balance_running:

         :
         : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
         : 105              {
         : 106              return arch_cmpxchg(&v->counter, old, new);
    0.00 :   ffffffff81138f8e:       xor    %eax,%eax
    0.00 :   ffffffff81138f90:       mov    $0x1,%ecx
    0.00 :   ffffffff81138f95:       lock cmpxchg %ecx,0x2577d33(%rip)        # ffffffff836b0cd0 <sched_balance_running>
         : 110              sched_balance_domains():
         : 12146            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
   97.01 :   ffffffff81138f9d:       test   %eax,%eax
    0.00 :   ffffffff81138f9f:       jne    ffffffff81138fbb <sched_balance_domains+0x20b>
         : 12150            if (time_after_eq(jiffies, sd->last_balance + interval)) {
    0.00 :   ffffffff81138fa1:       mov    0x16cfa18(%rip),%rax        # ffffffff828089c0 <jiffies_64>
    0.00 :   ffffffff81138fa8:       sub    0x48(%r14),%rax
    0.00 :   ffffffff81138fac:       cmp    %rdx,%rax
    0.00 :   ffffffff81138faf:       jns    ffffffff8113900f <sched_balance_domains+0x25f>
         : 12155            raw_atomic_set_release():

So this patch skips the unnecessary acquisition and considers load
balancing only when the time is due.
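
As a toy illustration of the effect outside the kernel, the sketch below
(an assumption-laden demo, not kernel code: work_due stands in for the
balance-due check, lock_var for sched_balance_running) lets you compare
the two orderings. With check_first set, a relaxed load filters out
almost all of the contended lock cmpxchg attempts; with it cleared,
every iteration bounces the cache line. Build with something like
"cc -O2 -pthread" and time the two variants.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    10000000L

static atomic_int  lock_var;     /* stands in for sched_balance_running */
static atomic_bool work_due;     /* stands in for the balance-due check */
static atomic_long acquisitions;

static void *worker(void *arg)
{
        bool check_first = *(bool *)arg;

        for (long i = 0; i < ITERS; i++) {
                /*
                 * Cheap shared read: most of the time work is not due, so
                 * this skips the contended cmpxchg on lock_var entirely.
                 */
                if (check_first &&
                    !atomic_load_explicit(&work_due, memory_order_relaxed))
                        continue;

                int expected = 0;

                if (atomic_compare_exchange_strong(&lock_var, &expected, 1)) {
                        atomic_fetch_add(&acquisitions, 1);
                        atomic_store(&lock_var, 0);
                }
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        bool check_first = true;   /* flip to false for the contended case */
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, &check_first);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        printf("acquisitions: %ld\n", atomic_load(&acquisitions));
        return 0;
}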

Tim

> 
> > thanks,
> > Chenyu
> > 
> > > > Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> > > > Reported-by: Mohini Narkhede <mohini.narkhede@...el.com>
> > > > Tested-by: Mohini Narkhede <mohini.narkhede@...el.com>
> > > > ---
> > > >   kernel/sched/fair.c | 16 ++++++++--------
> > > >   1 file changed, 8 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index e43993a4e580..5e5f7a770b2f 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > > >           interval = get_sd_balance_interval(sd, busy);
> > > > -        need_serialize = sd->flags & SD_SERIALIZE;
> > > > -        if (need_serialize) {
> > > > -            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > > > -                goto out;
> > > > -        }
> > > > -
> > > >           if (time_after_eq(jiffies, sd->last_balance + interval)) {
> > > > +            need_serialize = sd->flags & SD_SERIALIZE;
> > > > +            if (need_serialize) {
> > > > +            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > > > +                    goto out;
> > > > +            }
> > > > +
> > > >               if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> > > >                   /*
> > > >                    * The LBF_DST_PINNED logic could have changed
> > > > @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > > >               }
> > > >               sd->last_balance = jiffies;
> > > >               interval = get_sd_balance_interval(sd, busy);
> > > > +            if (need_serialize)
> > > > +                atomic_set_release(&sched_balance_running, 0);
> > > >           }
> > > > -        if (need_serialize)
> > > > -            atomic_set_release(&sched_balance_running, 0);
> > > >   out:
> > > >           if (time_after(next_balance, sd->last_balance + interval)) {
> > > >               next_balance = sd->last_balance + interval;
> > > 
> 
> 

