Message-ID: <CABk29NsmE6ovcJ9O8W+SMS1sQ6h_D=MOXgqk9Hi_OfeyZPJCFA@mail.gmail.com>
Date: Thu, 24 Apr 2025 13:06:52 -0700
From: Josh Don <joshdon@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, linux-kernel <linux-kernel@...r.kernel.org>,
Eric Dumazet <eric.dumazet@...il.com>, Yafang Shao <laoar.shao@...il.com>,
Sean Christopherson <seanjc@...gle.com>
Subject: Re: [PATCH] sched/fair: reduce false sharing on sched_balance_running
On Wed, Apr 23, 2025 at 10:46 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> sched_balance_domains() can attempt to change sched_balance_running
> more than 350,000 times per second on our servers.
>
> If sched_clock_irqtime and sched_balance_running share the
> same cache line, we see a very high cost on hosts with 480 threads
> dealing with many interrupts: sched_clock_irqtime is read in the
> irq time accounting path, so every cmpxchg on the neighboring
> sched_balance_running bounces that cache line across these CPUs
> (false sharing).
>
> This patch acquires sched_balance_running only when sd->last_balance
> is old enough, so most calls skip the cmpxchg (and the write to the
> shared cache line) entirely.
>
> It also moves sched_balance_running into a dedicated cache line on SMP.
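
For context, a minimal userspace sketch of the false-sharing problem and
of what __cacheline_aligned_in_smp achieves. This is illustration only:
the 64-byte line size and the struct/field names are assumptions, not
kernel code.

#include <stdatomic.h>

/* Packed together: every write to hot_flag invalidates the cache line
 * that other CPUs read irq_time from, even though the two variables
 * are logically unrelated.
 */
struct shared_bad {
        atomic_int  hot_flag;   /* written ~350,000 times/sec */
        atomic_long irq_time;   /* read on every interrupt */
};

/* Each variable on its own cache line: writes to hot_flag no longer
 * evict irq_time from other CPUs' caches.
 */
struct shared_good {
        _Alignas(64) atomic_int  hot_flag;
        _Alignas(64) atomic_long irq_time;
};

The kernel's __cacheline_aligned_in_smp does the same thing using the
architecture's real L1 cache line size, and compiles to nothing on
non-SMP builds.
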
Thanks Eric, looks good to me.
Reviewed-by: Josh Don <joshdon@...gle.com>
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Yafang Shao <laoar.shao@...il.com>
> Cc: Sean Christopherson <seanjc@...gle.com>
> Cc: Josh Don <joshdon@...gle.com>
> ---
> kernel/sched/fair.c | 28 ++++++++++++++--------------
> 1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e5807eaffcacaf761c289e8adb354cfd..460008d0dc459b3ca60209565e89c419ea32a4e2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12144,7 +12144,7 @@ static int active_load_balance_cpu_stop(void *data)
> * execution, as non-SD_SERIALIZE domains will still be
> * load-balanced in parallel.
> */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> +static __cacheline_aligned_in_smp atomic_t sched_balance_running = ATOMIC_INIT(0);
>
> /*
> * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> @@ -12220,25 +12220,25 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> interval = get_sd_balance_interval(sd, busy);
>
> + if (!time_after_eq(jiffies, sd->last_balance + interval))
> + goto out;
> +
> need_serialize = sd->flags & SD_SERIALIZE;
> if (need_serialize) {
> if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> goto out;
> }
> -
> - if (time_after_eq(jiffies, sd->last_balance + interval)) {
> - if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> - /*
> - * The LBF_DST_PINNED logic could have changed
> - * env->dst_cpu, so we can't know our idle
> - * state even if we migrated tasks. Update it.
> - */
> - idle = idle_cpu(cpu);
> - busy = !idle && !sched_idle_cpu(cpu);
> - }
> - sd->last_balance = jiffies;
> - interval = get_sd_balance_interval(sd, busy);
> + if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> + /*
> + * The LBF_DST_PINNED logic could have changed
> + * env->dst_cpu, so we can't know our idle
> + * state even if we migrated tasks. Update it.
> + */
> + idle = idle_cpu(cpu);
> + busy = !idle && !sched_idle_cpu(cpu);
> }
> + sd->last_balance = jiffies;
> + interval = get_sd_balance_interval(sd, busy);
> if (need_serialize)
> atomic_set_release(&sched_balance_running, 0);
> out:
> --
> 2.49.0.805.g082f7c87e0-goog
>
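
The restructuring above is the classic "test cheaply before touching
the shared word" pattern. A self-contained C11 sketch of that pattern,
with hypothetical names (due(), do_balance(), balance_running) standing
in for the kernel's jiffies test, sched_balance_rq() and
sched_balance_running:

#include <stdatomic.h>
#include <stdbool.h>

static _Alignas(64) atomic_int balance_running; /* 0 = free, 1 = held */

/* Placeholder stubs so the sketch compiles on its own. */
static bool due(void) { return true; }  /* the last_balance + interval test */
static void do_balance(void) { }        /* stands in for sched_balance_rq() */

void balance_once(void)
{
        int expected = 0;

        /* Read-only test first: most callers return here without
         * writing the shared cache line at all.
         */
        if (!due())
                return;

        /* Only now contend on the serialization word. */
        if (!atomic_compare_exchange_strong_explicit(&balance_running,
                        &expected, 1,
                        memory_order_acquire, memory_order_relaxed))
                return;

        do_balance();

        atomic_store_explicit(&balance_running, 0, memory_order_release);
}

Acquire on the winning cmpxchg and release on the clearing store mirror
the patch's atomic_cmpxchg_acquire()/atomic_set_release() pair: they
order the balancing work inside the critical section without paying for
stronger barriers.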