Message-ID: <aO5VK4PO_REXNhnN@linux.ibm.com>
Date: Tue, 14 Oct 2025 19:20:35 +0530
From: Srikar Dronamraju <srikar@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>, Ingo Molnar <mingo@...nel.org>,
        Chen Yu <yu.c.chen@...el.com>, Doug Nelson <doug.nelson@...el.com>,
        Mohini Narkhede <mohini.narkhede@...el.com>,
        linux-kernel@...r.kernel.org,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Shrikanth Hegde <sshegde@...ux.ibm.com>,
        K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [RESEND PATCH] sched/fair: Skip sched_balance_running cmpxchg
 when balance is not due

* Peter Zijlstra <peterz@...radead.org> [2025-10-14 11:24:36]:

> On Mon, Oct 13, 2025 at 02:54:19PM -0700, Tim Chen wrote:
> 
> 
> Right, Yu Chen said something like that as well, should_we_balance() is
> too late.
> 
> Should we instead move the whole serialize thing inside
> sched_balance_rq() like so:
> 
> @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
>  	return 0;
>  }
>  
> -/*
> - * This flag serializes load-balancing passes over large domains
> - * (above the NODE topology level) - only one load-balancing instance
> - * may run at a time, to reduce overhead on very large systems with
> - * lots of CPUs and large NUMA distances.
> - *
> - * - Note that load-balancing passes triggered while another one
> - *   is executing are skipped and not re-tried.
> - *
> - * - Also note that this does not serialize rebalance_domains()
> - *   execution, as non-SD_SERIALIZE domains will still be
> - *   load-balanced in parallel.
> - */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> -
>  /*
>   * Scale the max sched_balance_rq interval with the number of CPUs in the system.
>   * This trades load-balance latency on larger machines for less cross talk.
> @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>  	/* Earliest time when we have to do rebalance again */
>  	unsigned long next_balance = jiffies + 60*HZ;
>  	int update_next_balance = 0;
> -	int need_serialize, need_decay = 0;
> +	int need_decay = 0;
>  	u64 max_cost = 0;
>  
>  	rcu_read_lock();
> @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>  		}
>  
>  		interval = get_sd_balance_interval(sd, busy);
> -
> -		need_serialize = sd->flags & SD_SERIALIZE;
> -		if (need_serialize) {
> -			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> -				goto out;
> -		}
> -
>  		if (time_after_eq(jiffies, sd->last_balance + interval)) {
>  			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>  				/*
> @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>  			sd->last_balance = jiffies;
>  			interval = get_sd_balance_interval(sd, busy);
>  		}
> -		if (need_serialize)
> -			atomic_set_release(&sched_balance_running, 0);
> -out:
> +
>  		if (time_after(next_balance, sd->last_balance + interval)) {
>  			next_balance = sd->last_balance + interval;
>  			update_next_balance = 1;

I think this is better, since previously a CPU that was not supposed to do
the balancing could still set the atomic variable. If the CPU that was
supposed to do the balance then tried, it could fail because the variable
had not yet been cleared.
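
The window described above can be modelled roughly in user-space C11
atomics. This is only an illustrative sketch, not kernel code:
balance_running, try_serialize(), end_serialize() and the two scenario
functions are hypothetical names, and the two CPUs are simulated as
sequential steps rather than real threads. It assumes that, with the change,
only a CPU whose balance interval has expired attempts to take the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's sched_balance_running flag. */
static atomic_int balance_running;

/* Try to move the flag 0 -> 1; returns true if this caller now owns it. */
static bool try_serialize(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&balance_running,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Clear the flag so that another caller may take it. */
static void end_serialize(void)
{
	atomic_store_explicit(&balance_running, 0, memory_order_release);
}

/*
 * Old ordering: CPU A is not due, but takes the flag before the interval
 * check. CPU B is due and tries while A still holds the flag.
 * Returns whether CPU B got to balance.
 */
static bool old_order_due_cpu_balances(void)
{
	atomic_store(&balance_running, 0);

	bool a_holds = try_serialize();	/* A: not due, takes the flag anyway */
	bool b_holds = try_serialize();	/* B: due, finds the flag already set */

	if (b_holds)
		end_serialize();
	if (a_holds)
		end_serialize();
	return b_holds;
}

/*
 * Assumed new ordering: the interval check comes first, so CPU A (not due)
 * never touches the flag and CPU B (due) can take it.
 */
static bool new_order_due_cpu_balances(void)
{
	bool a_due = false;

	atomic_store(&balance_running, 0);

	bool a_holds = a_due && try_serialize();	/* A: skipped entirely */
	bool b_holds = try_serialize();			/* B: due, succeeds */

	if (b_holds)
		end_serialize();
	if (a_holds)
		end_serialize();
	return b_holds;
}

/* Runs both scenarios and checks the outcome described in the text. */
static void check_scenarios(void)
{
	assert(!old_order_due_cpu_balances());
	assert(new_order_due_cpu_balances());
}
```

Under these assumptions, the due CPU is skipped in the first scenario and
gets to balance in the second, which is the behaviour the reordering is
meant to achieve.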

-- 
Thanks and Regards
Srikar Dronamraju