linux-kernel - Re: [PATCH] sched/fair: reduce false sharing on sched_balance

Message-ID: <CANn89iLuGSZFrhfQGMRo579CCv4Cie9Vq3SNkcvYM9XPjmzccA@mail.gmail.com>
Date: Thu, 24 Apr 2025 08:49:51 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Yafang Shao <laoar.shao@...il.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, linux-kernel <linux-kernel@...r.kernel.org>, 
	Eric Dumazet <eric.dumazet@...il.com>, Sean Christopherson <seanjc@...gle.com>, 
	Josh Don <joshdon@...gle.com>
Subject: Re: [PATCH] sched/fair: reduce false sharing on sched_balance_running

On Thu, Apr 24, 2025 at 7:46 AM Yafang Shao <laoar.shao@...il.com> wrote:
>
> On Thu, Apr 24, 2025 at 1:46 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > rebalance_domains() can attempt to change sched_balance_running
> > more than 350,000 times per second on our servers.
> >
> > If sched_clock_irqtime and sched_balance_running share the
> > same cache line, we see a very high cost on hosts with 480 threads
> > dealing with many interrupts.
>
> CONFIG_IRQ_TIME_ACCOUNTING is enabled on your systems, right?
> Have you observed any impact on task CPU utilization measurements due
> to this configuration? [0]
>
> If cache misses on sched_clock_irqtime are indeed the bottleneck, why
> not align it to improve performance?

What exactly do you mean by "align it"? Once sched_clock_irqtime sits in
a read-mostly location, everything is fine.

The main bottleneck is the false sharing on these Intel 6980P CPUs.

On a dual-socket system, this false sharing consumes something like 4%
of the total memory bandwidth, and it makes other parts of the kernel
appear expensive.
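
To illustrate the layout problem, here is a minimal userspace sketch (not
kernel code). It assumes 64-byte cache lines; the struct names, field
names, and the same_cacheline() helper are hypothetical stand-ins, with
read_mostly_flag playing the role of sched_clock_irqtime and hot_counter
the role of sched_balance_running. C11 alignas is used here in place of
the kernel's __cacheline_aligned_in_smp.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
#define CACHELINE_SIZE 64

/* Before: a read-mostly flag and a heavily written counter share a line. */
struct shared_layout {
	int read_mostly_flag;   /* stands in for sched_clock_irqtime */
	atomic_int hot_counter; /* stands in for sched_balance_running */
};

/* After: the heavily written counter starts its own cache line. */
struct split_layout {
	int read_mostly_flag;
	alignas(CACHELINE_SIZE) atomic_int hot_counter;
};

/* Returns nonzero if two byte offsets fall within the same cache line. */
static int same_cacheline(size_t a, size_t b)
{
	return a / CACHELINE_SIZE == b / CACHELINE_SIZE;
}
```

With the first layout, every write to hot_counter invalidates the line
that readers of read_mostly_flag need; with the second, readers of the
flag are no longer disturbed by those writes.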

>
> [0]. https://lore.kernel.org/all/20250103022409.2544-1-laoar.shao@gmail.com/

What part should I look at, and how is this related to my patch?

>
> >
> > This patch only acquires sched_balance_running when sd->last_balance
> > is old enough.
> >
> > It also moves sched_balance_running into a dedicated cache line on SMP.
> >
> > Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> > Cc: Yafang Shao <laoar.shao@...il.com>
> > Cc: Sean Christopherson <seanjc@...gle.com>
> > Cc: Josh Don <joshdon@...gle.com>
> > ---
> >  kernel/sched/fair.c | 28 ++++++++++++++--------------
> >  1 file changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e43993a4e5807eaffcacaf761c289e8adb354cfd..460008d0dc459b3ca60209565e89c419ea32a4e2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12144,7 +12144,7 @@ static int active_load_balance_cpu_stop(void *data)
> >   *   execution, as non-SD_SERIALIZE domains will still be
> >   *   load-balanced in parallel.
> >   */
> > -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> > +static __cacheline_aligned_in_smp atomic_t sched_balance_running = ATOMIC_INIT(0);
> >
> >  /*
> >   * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> > @@ -12220,25 +12220,25 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> >
> >                 interval = get_sd_balance_interval(sd, busy);
> >
> > +               if (!time_after_eq(jiffies, sd->last_balance + interval))
> > +                       goto out;
> > +
> >                 need_serialize = sd->flags & SD_SERIALIZE;
> >                 if (need_serialize) {
> >                         if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >                                 goto out;
> >                 }
> > -
> > -               if (time_after_eq(jiffies, sd->last_balance + interval)) {
> > -                       if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> > -                               /*
> > -                                * The LBF_DST_PINNED logic could have changed
> > -                                * env->dst_cpu, so we can't know our idle
> > -                                * state even if we migrated tasks. Update it.
> > -                                */
> > -                               idle = idle_cpu(cpu);
> > -                               busy = !idle && !sched_idle_cpu(cpu);
> > -                       }
> > -                       sd->last_balance = jiffies;
> > -                       interval = get_sd_balance_interval(sd, busy);
> > +               if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> > +                       /*
> > +                        * The LBF_DST_PINNED logic could have changed
> > +                        * env->dst_cpu, so we can't know our idle
> > +                        * state even if we migrated tasks. Update it.
> > +                        */
> > +                       idle = idle_cpu(cpu);
> > +                       busy = !idle && !sched_idle_cpu(cpu);
> >                 }
> > +               sd->last_balance = jiffies;
> > +               interval = get_sd_balance_interval(sd, busy);
> >                 if (need_serialize)
> >                         atomic_set_release(&sched_balance_running, 0);
> >  out:
> > --
> > 2.49.0.805.g082f7c87e0-goog
> >
>
>
> --
> Regards
> Yafang