Message-ID: <0c9ceccb-8f77-4777-a352-090d29421394@intel.com>
Date: Tue, 14 Oct 2025 00:43:27 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>, Peter Zijlstra
<peterz@...radead.org>, Tim Chen <tim.c.chen@...ux.intel.com>
CC: Ingo Molnar <mingo@...nel.org>, Doug Nelson <doug.nelson@...el.com>,
Mohini Narkhede <mohini.narkhede@...el.com>, <linux-kernel@...r.kernel.org>,
Vincent Guittot <vincent.guittot@...aro.org>, K Prateek Nayak
<kprateek.nayak@....com>
Subject: Re: [RESEND PATCH] sched/fair: Skip sched_balance_running cmpxchg
when balance is not due
On 10/14/2025 12:41 AM, Shrikanth Hegde wrote:
>
>
> On 10/13/25 10:02 PM, Chen, Yu C wrote:
>> On 10/13/2025 10:26 PM, Peter Zijlstra wrote:
>>> On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
>>>
>>>> During load balancing, balancing at the LLC level and above must be
>>>> serialized.
>>>
>>> I would argue the wording here, there is no *must*, they *are*. Per the
>>> current rules SD_NUMA and up get SD_SERIALIZE.
>>>
>>> This is a *very* old thing, done by Christoph Lameter back when he was
>>> at SGI. I'm not sure this default is still valid or not. Someone would
>>> have to investigate. An alternative would be moving it into
>>> node_reclaim_distance or somesuch.
>>>
>>
>> Do you mean the following:
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 444bdfdab731..436c899d8da2 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
>> sd->cache_nice_tries = 2;
>>
>> sd->flags &= ~SD_PREFER_SIBLING;
>> - sd->flags |= SD_SERIALIZE;
>> if (sched_domains_numa_distance[tl->numa_level] >
>> node_reclaim_distance) {
>> sd->flags &= ~(SD_BALANCE_EXEC |
>> SD_BALANCE_FORK |
>> SD_WAKE_AFFINE);
>> + /*
>> + * Nodes that are far away need to be serialized to
>> + * reduce the overhead of long-distance task migration
>> + * caused by load balancing.
>> + */
>> + sd->flags |= SD_SERIALIZE;
>> }
>>
>> We can launch some tests to see if removing SD_SERIALIZE would
>> bring any impact.
>>
>>>> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
>>>> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
>>>> operations for `sched_balance_running`. In most cases, the attempt
>>>> aborts immediately after acquisition because the load balance time is
>>>> not yet due.
>>>
>>> So I'm not sure I understand the situation, @continue_balancing should
>>> limit this concurrency to however many groups are on this domain -- your
>>> granite thing with SNC on would have something like 6 groups?
>>>
>>
>> My understanding is that continue_balancing is set to false after
>> atomic_cmpxchg(sched_balance_running), so if sched_balance_domains()
>> is invoked by many CPUs in parallel, do the atomic operations still compete?
>>
>
> From what I can remember,
>
> This almost always happens at SMT, after which continue_balancing = 0.
> Since multiple CPUs end up calling it (especially on a busy system),
> it causes a lot of cacheline bouncing and ends up showing in the perf profile.
>
I see. By the time the NUMA domain is reached, continue_balancing should
already be set to 0. Thanks for pointing it out.
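
For readers following the thread, the idea in the patch subject (check whether the balance is due before touching the shared flag) can be illustrated with a small userspace sketch using C11 atomics. This is not the kernel code: balance_running, struct domain, time_after_eq_ul() and try_balance() are hypothetical stand-ins, and the balancing work itself is elided.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global sched_balance_running flag. */
static atomic_int balance_running;

/* Hypothetical, simplified per-domain state. */
struct domain {
	unsigned long next_balance;	/* time (in ticks) the next balance is due */
	bool serialize;			/* stand-in for SD_SERIALIZE */
};

/* Wrap-safe "a is at or after b" comparison on tick counters. */
static bool time_after_eq_ul(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Returns true if a balance pass ran for this domain. */
static bool try_balance(struct domain *sd, unsigned long now,
			unsigned long interval)
{
	/*
	 * Read-only check first: if the balance is not due, return
	 * without writing to the shared flag, so its cacheline is not
	 * bounced between CPUs.
	 */
	if (!time_after_eq_ul(now, sd->next_balance))
		return false;

	if (sd->serialize) {
		int expected = 0;

		/* Only one caller at a time may balance a serialized domain. */
		if (!atomic_compare_exchange_strong_explicit(&balance_running,
							     &expected, 1,
							     memory_order_acquire,
							     memory_order_relaxed))
			return false;
	}

	/* ... the actual load balancing would happen here ... */
	sd->next_balance = now + interval;

	if (sd->serialize)
		atomic_store_explicit(&balance_running, 0,
				      memory_order_release);
	return true;
}
```

With this ordering, callers that arrive before next_balance never issue the compare-and-exchange, so only callers with a balance actually due contend on the flag.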
thanks,
Chenyu