Message-ID: <37f0813b205c49a8ab0e604fc1983f9f827b4454.camel@linux.intel.com>
Date: Thu, 17 Apr 2025 10:12:30 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>, Doug Nelson
<doug.nelson@...el.com>, Mohini Narkhede <mohini.narkhede@...el.com>,
linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>, Ingo
Molnar <mingo@...nel.org>, "Chen, Yu C" <yu.c.chen@...el.com>
Subject: Re: [PATCH] sched: Skip useless sched_balance_running acquisition
if load balance is not due
On Thu, 2025-04-17 at 14:49 +0530, Shrikanth Hegde wrote:
>
> On 4/16/25 21:49, Tim Chen wrote:
> > On Wed, 2025-04-16 at 14:46 +0530, Shrikanth Hegde wrote:
> > >
>
> > >
> > You mean doing a should_we_balance() check? I think we should not
> > even consider that if the balance time is not due; the balance-due check
> > should come first.
> >
> >
> Hi Tim.
>
> This is the code I was suggesting.
Thanks for suggesting this alternative patch.

The way I read it, it moves the serialization check into sched_balance_rq(),
so that the serialization check comes after the balance-due check. Functionally
it is equivalent to my proposed patch.

However, I think your original goal was to give the CPU that could
actually balance the sched domain a chance to run in the current balance
period, and not have it skipped over when the current CPU cannot balance.
With your patch, if should_we_balance() fails, we still advance
sd->last_balance, so the CPU that could balance has to wait for its
chance in the next period. I think you'd need to pass the outcome of
should_we_balance() from sched_balance_rq() back to sched_balance_domains(),
and not advance sd->last_balance when should_we_balance() fails. That
would give the CPU that could balance a chance to run in the current
balance period, along the lines of the sketch below.
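
Something like this untested sketch (the "sd_balanced" out-parameter is a
name I'm inventing here, just to illustrate the idea):

	static int sched_balance_rq(int this_cpu, struct rq *this_rq,
				    struct sched_domain *sd, enum cpu_idle_type idle,
				    int *continue_balancing, int *sd_balanced)
	{
		...
		if (!should_we_balance(&env)) {
			*continue_balancing = 0;
			/* tell the caller this CPU never attempted the balance */
			*sd_balanced = 0;
			goto out_balanced;
		}
		...
	}

and in sched_balance_domains():

		if (time_after_eq(jiffies, sd->last_balance + interval)) {
			int sd_balanced = 1;

			if (sched_balance_rq(cpu, rq, sd, idle,
					     &continue_balancing, &sd_balanced)) {
				...
			}
			/* keep the window open if no CPU attempted the balance */
			if (sd_balanced)
				sd->last_balance = jiffies;
			interval = get_sd_balance_interval(sd, busy);
		}
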
That said, I'm not actually proposing that we do the above. For many large
systems, the overhead of balancing the top domains is already high, and
increasing the balance frequency is not necessarily desirable.

Thanks.
Tim
>
> ---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c19459c8042..e712934973e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11723,6 +11723,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> 		break;
> 	}
> }
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (above the NODE topology level) - only one load-balancing instance
> + * may run at a time, to reduce overhead on very large systems with
> + * lots of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize rebalance_domains()
> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
> +
>
> /*
>  * Check this_cpu to ensure it is balanced within domain. Attempt to move
> @@ -11738,6 +11753,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> 	struct rq *busiest;
> 	struct rq_flags rf;
> 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	int need_serialize = 0;
> 	struct lb_env env = {
> 		.sd		= sd,
> 		.dst_cpu	= this_cpu,
> @@ -11760,6 +11776,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> 		goto out_balanced;
> 	}
>
> +	need_serialize = (idle != CPU_NEWLY_IDLE) && (sd->flags & SD_SERIALIZE);
> +	if (need_serialize) {
> +		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> +			return 0;
> +	}
> +
> 	group = sched_balance_find_src_group(&env);
> 	if (!group) {
> 		schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11884,6 +11906,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
> 				env.loop = 0;
> 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> +				if (need_serialize)
> +					atomic_set_release(&sched_balance_running, 0);
> 				goto redo;
> 			}
> 			goto out_all_pinned;
> @@ -12000,6 +12024,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> 	    sd->balance_interval < sd->max_interval)
> 		sd->balance_interval *= 2;
> out:
> +	if (need_serialize)
> +		atomic_set_release(&sched_balance_running, 0);
> +
> 	return ld_moved;
> }
>
> @@ -12124,21 +12151,6 @@ static int active_load_balance_cpu_stop(void *data)
> 	return 0;
> }
>
> -/*
> - * This flag serializes load-balancing passes over large domains
> - * (above the NODE topology level) - only one load-balancing instance
> - * may run at a time, to reduce overhead on very large systems with
> - * lots of CPUs and large NUMA distances.
> - *
> - * - Note that load-balancing passes triggered while another one
> - * is executing are skipped and not re-tried.
> - *
> - * - Also note that this does not serialize rebalance_domains()
> - * execution, as non-SD_SERIALIZE domains will still be
> - * load-balanced in parallel.
> - */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> -
> /*
>  * Scale the max sched_balance_rq interval with the number of CPUs in the system.
>  * This trades load-balance latency on larger machines for less cross talk.
> @@ -12188,7 +12200,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> 	/* Earliest time when we have to do rebalance again */
> 	unsigned long next_balance = jiffies + 60*HZ;
> 	int update_next_balance = 0;
> -	int need_serialize, need_decay = 0;
> +	int need_decay = 0;
> 	u64 max_cost = 0;
>
> 	rcu_read_lock();
> @@ -12213,12 +12225,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> 		interval = get_sd_balance_interval(sd, busy);
>
> -		need_serialize = sd->flags & SD_SERIALIZE;
> -		if (need_serialize) {
> -			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> -				goto out;
> -		}
> -
> 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
> 			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> 				/*
> @@ -12232,9 +12238,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> 			sd->last_balance = jiffies;
> 			interval = get_sd_balance_interval(sd, busy);
> 		}
> -		if (need_serialize)
> -			atomic_set_release(&sched_balance_running, 0);
> -out:
> +
> 		if (time_after(next_balance, sd->last_balance + interval)) {
> 			next_balance = sd->last_balance + interval;
> 			update_next_balance = 1;
>