Message-ID: <0c9ceccb-8f77-4777-a352-090d29421394@intel.com>
Date: Tue, 14 Oct 2025 00:43:27 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>, Peter Zijlstra
	<peterz@...radead.org>, Tim Chen <tim.c.chen@...ux.intel.com>
CC: Ingo Molnar <mingo@...nel.org>, Doug Nelson <doug.nelson@...el.com>,
	Mohini Narkhede <mohini.narkhede@...el.com>, <linux-kernel@...r.kernel.org>,
	Vincent Guittot <vincent.guittot@...aro.org>, K Prateek Nayak
	<kprateek.nayak@....com>
Subject: Re: [RESEND PATCH] sched/fair: Skip sched_balance_running cmpxchg
 when balance is not due

On 10/14/2025 12:41 AM, Shrikanth Hegde wrote:
> 
> 
> On 10/13/25 10:02 PM, Chen, Yu C wrote:
>> On 10/13/2025 10:26 PM, Peter Zijlstra wrote:
>>> On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
>>>
>>>> During load balancing, balancing at the LLC level and above must be
>>>> serialized.
>>>
>>> I would argue the wording here, there is no *must*, they *are*. Per the
>>> current rules SD_NUMA and up get SD_SERIALIZE.
>>>
>>> This is a *very* old thing, done by Christoph Lameter back when he was
>>> at SGI. I'm not sure this default is still valid or not. Someone would
>>> have to investigate. An alternative would be moving it into
>>> node_reclaim_distance or somesuch.
>>>
>>
>> Do you mean the following:
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 444bdfdab731..436c899d8da2 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
>>                  sd->cache_nice_tries = 2;
>>
>>                  sd->flags &= ~SD_PREFER_SIBLING;
>> -               sd->flags |= SD_SERIALIZE;
>>                  if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
>>                          sd->flags &= ~(SD_BALANCE_EXEC |
>>                                         SD_BALANCE_FORK |
>>                                         SD_WAKE_AFFINE);
>> +                       /*
>> +                        * Nodes that are far away need to be serialized to
>> +                        * reduce the overhead of long-distance task migration
>> +                        * caused by load balancing.
>> +                        */
>> +                       sd->flags |= SD_SERIALIZE;
>>                  }
>>
>> We can launch some tests to see if removing SD_SERIALIZE would
>> bring any impact.
>>
>>>> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
>>>> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
>>>> operations for `sched_balance_running`. In most cases, the attempt
>>>> aborts immediately after acquisition because the load balance time is
>>>> not yet due.
>>>
>>> So I'm not sure I understand the situation, @continue_balancing should
>>> limit this concurrency to however many groups are on this domain -- your
>>> granite thing with SNC on would have something like 6 groups?
>>>
>>
>> My understanding is that continue_balancing is set to false only after
>> atomic_cmpxchg(sched_balance_running), so if sched_balance_domains()
>> is invoked by many CPUs in parallel, the atomic operations still compete?
>>
> 
> From what I can remember:
> 
> This almost always happens at SMT, after which continue_balancing = 0.
> Since multiple CPUs end up calling it (especially on a busy system),
> it causes a lot of cacheline bouncing and ends up showing in the perf profile.
> 

I see. By the time the NUMA domain is reached, continue_balancing should
already be set to 0. Thanks for pointing it out.

thanks,
Chenyu
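
For readers following the thread, the idea in the patch subject (skip the
sched_balance_running cmpxchg when balance is not due) can be sketched in
userspace C11. This is only an illustration under stated assumptions, not
the kernel code: struct domain, try_balance(), balance_running and
cmpxchg_attempts are hypothetical names, time is a plain counter rather
than jiffies, and counter wraparound is not handled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global serialization flag. */
static atomic_int balance_running;

/* Counts cmpxchg attempts so the effect of the early check is visible. */
static atomic_ulong cmpxchg_attempts;

/* Hypothetical, heavily simplified view of a scheduling domain. */
struct domain {
	unsigned long last_balance;	/* time of the last balance pass */
	unsigned long interval;		/* minimum time between passes */
	bool serialize;			/* models a domain with SD_SERIALIZE */
};

/* Returns true if a balance pass ran for this domain at time 'now'. */
static bool try_balance(struct domain *sd, unsigned long now)
{
	/*
	 * Read-only check first: if the balance is not due, return
	 * without touching the shared flag, so no cacheline is bounced.
	 */
	if (now < sd->last_balance + sd->interval)
		return false;

	if (sd->serialize) {
		int expected = 0;

		atomic_fetch_add_explicit(&cmpxchg_attempts, 1,
					  memory_order_relaxed);
		/* Only one caller at a time may balance a serialized domain. */
		if (!atomic_compare_exchange_strong_explicit(&balance_running,
				&expected, 1,
				memory_order_acquire, memory_order_relaxed))
			return false;
	}

	/* The actual load balancing work would go here. */
	sd->last_balance = now;

	if (sd->serialize)
		atomic_store_explicit(&balance_running, 0,
				      memory_order_release);
	return true;
}
```

With the time check placed after the cmpxchg instead, every caller would
write to the shared flag even when there is nothing to do, which is the
contention pattern described in the quoted changelog.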