Message-ID: <20251112134438.GF4068168@noisy.programming.kicks-ass.net>
Date: Wed, 12 Nov 2025 14:44:38 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Srikar Dronamraju <srikar@...ux.ibm.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Ingo Molnar <mingo@...nel.org>, Chen Yu <yu.c.chen@...el.com>,
Doug Nelson <doug.nelson@...el.com>,
Mohini Narkhede <mohini.narkhede@...el.com>,
linux-kernel@...r.kernel.org,
Vincent Guittot <vincent.guittot@...aro.org>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when
balance is not due
On Wed, Nov 12, 2025 at 02:39:37PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:
>
> > If the CPU that was doing the balance was not the first CPU of the domain
> > span, but was doing the balance because the first CPU was busy, and the
> > first CPU now happens to be idle at redo, the scheduler would have chosen
> > the first CPU to do the balance. However, it will now choose the CPU that
> > already holds the atomic.
> >
> > I think this is better because
> > - The first CPU may have tried just before this CPU dropped the atomic and
> > hence we may miss the balance opportunity.
> > - The first CPU and the other CPU may not share a cache, and hence there
> > may be a cache miss, which we avoid by doing this.
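
The selection referred to above is should_we_balance(): roughly, the first
idle CPU in the group's balance mask gets to do the balance, and only when
nobody is idle does it fall back to the first CPU of the group. A
much-simplified illustration (the _simplified name is only for this sketch;
the newly-idle and SMT handling are omitted):

static int should_we_balance_simplified(struct lb_env *env)
{
	struct sched_group *sg = env->sd->groups;
	int cpu;

	/* The first idle CPU in the balance mask gets to do the balance. */
	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
		if (idle_cpu(cpu))
			return cpu == env->dst_cpu;
	}

	/* Nobody idle: fall back to the first CPU of the group. */
	return group_balance_cpu(sg) == env->dst_cpu;
}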
>
> I'm not sure I understand what you're arguing for. Are you saying it
> would be better to retain the lock where possible?

FWIW this is also the behaviour we had before this patch: with the lock taken
in the caller, it would have covered the whole redo case as well.
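
To spell that out, before the patch the serialization lived in the caller,
sched_balance_domains(), around the whole sched_balance_rq() call. A
simplified sketch of that per-domain loop body (some bookkeeping elided):

	need_serialize = sd->flags & SD_SERIALIZE;
	if (need_serialize) {
		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
			goto out;
	}

	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		/*
		 * sched_balance_rq(), including any internal redo, ran
		 * entirely under the serialization taken above.
		 */
		sched_balance_rq(cpu, rq, sd, idle, &continue_balancing);
		sd->last_balance = jiffies;
	}

	if (need_serialize)
		atomic_set_release(&sched_balance_running, 0);
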
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,23 +11717,24 @@ static int sched_balance_rq(int this_cpu
>  		.fbq_type	= all,
>  		.tasks		= LIST_HEAD_INIT(env.tasks),
>  	};
> -	bool need_unlock;
> +	bool need_unlock = false;
>
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
>  	schedstat_inc(sd->lb_count[idle]);
>
>  redo:
> -	need_unlock = false;
>  	if (!should_we_balance(&env)) {
>  		*continue_balancing = 0;
>  		goto out_balanced;
>  	}
>
> -	if (sd->flags & SD_SERIALIZE) {
> -		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> +	if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> +		int zero = 0;
> +
> +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
>  			goto out_balanced;
> -		}
> +
>  		need_unlock = true;
>  	}
>
> @@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
>  			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>  				env.loop = 0;
>  				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> -				if (need_unlock)
> -					atomic_set_release(&sched_balance_running, 0);
> -
>  				goto redo;
>  			}
>  			goto out_all_pinned;
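
With need_unlock initialised once and the redo path no longer dropping the
lock, sched_balance_running stays held across a redo and is released a single
time on the way out; assuming the v4 patch drops it at the common exit of
sched_balance_rq(), roughly:

out:
	/* One release, whether or not we went through redo. */
	if (need_unlock)
		atomic_set_release(&sched_balance_running, 0);

	return ld_moved;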