linux-kernel - Re: [PATCH] sched: Skip useless sched_balance_running acquisition if load balance is not due


Message-ID: <87728994-b928-45b3-a6a0-258af6e81294@amd.com>
Date: Fri, 18 Apr 2025 10:56:04 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Vincent Guittot <vincent.guittot@...aro.org>, Shrikanth Hegde
	<sshegde@...ux.ibm.com>, "Chen, Yu C" <yu.c.chen@...el.com>, Tim Chen
	<tim.c.chen@...ux.intel.com>, Ingo Molnar <mingo@...nel.org>, Doug Nelson
	<doug.nelson@...el.com>, Mohini Narkhede <mohini.narkhede@...el.com>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched: Skip useless sched_balance_running acquisition if
 load balance is not due

Hello Peter,

On 4/17/2025 5:31 PM, Peter Zijlstra wrote:
>> o Since this is a single flag across the entire system, it also implies
>>    CPUs cannot concurrently do load balancing across different NUMA
>>    domains which seems reasonable since a load balance at lower NUMA
>>    domain can potentially change the "nr_numa_running" and
>>    "nr_preferred_running" stats for the higher domain but if this is the
>>    case, a newidle balance at lower NUMA domain can interfere with a
>>    concurrent busy / newidle load balancing at higher NUMA domain.
>>    Is this expected? Should newidle balance be serialized too?
> 
> Serializing new-idle might create too much idle time.

In the context of busy and idle balancing, what are your thoughts on a
per-sd "serialize" flag?
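
For illustration only, a per-sd flag could look roughly like the
userspace sketch below. It uses C11 atomics rather than the kernel's
atomic API, and "struct demo_domain", "demo_balance_trylock()" and
"demo_balance_unlock()" are hypothetical names, not existing kernel
symbols. The point is just that two different serialized domains can
be balanced concurrently, while a second pass over the same domain
backs off.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sched domain; not the kernel's layout. */
struct demo_domain {
	bool serialize;        /* models SD_SERIALIZE being set on this domain */
	atomic_int balancing;  /* assumed per-domain "balance in progress" flag */
};

/*
 * Try to claim this domain for one balancing pass.
 * Returns false if another CPU is already balancing the same domain.
 */
static bool demo_balance_trylock(struct demo_domain *sd)
{
	int expected = 0;

	if (!sd->serialize)
		return true;  /* nothing to serialize on this domain */

	return atomic_compare_exchange_strong_explicit(&sd->balancing,
						       &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Release the claim taken by a successful demo_balance_trylock(). */
static void demo_balance_unlock(struct demo_domain *sd)
{
	if (sd->serialize)
		atomic_store_explicit(&sd->balancing, 0, memory_order_release);
}
```

With a single system-wide flag, the second domain in such a setup
would also have to wait; with one flag per domain only passes over the
same domain exclude each other.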

> 
>>    (P.S. I copied over the serialize logic from sched_balance_domains()
>>     into sched_balance_newidle() and did not see any difference in my
>>     testing but perhaps there are benchmarks out there that care for
>>     this)
>>
>> o If the intention of SD_SERIALIZE was to actually "serializes
>>    load-balancing passes over large domains (above the NODE topology
>>    level)" as the comment above "sched_balance_running" states, and
>>    this question is specific to x86 - when enabling SNC on Intel or
>>    NPS on AMD servers, the first NUMA domain is in fact as big as the
>>    NODE (now PKG domain) if not smaller. Is it okay to clear
>>    SD_SERIALIZE for these domains since they are small enough now?
> 
> You'll have to dive into the history here, but IIRC it was from SGI back
> in the day, where NUMA factors were quite large and load-balancing
> across numa was a pain.

Let me dig up the git history and see if any interesting details hide
there.

> 
> Small really isn't the criteria, but inter-node latency might be, we
> also have this node_reclaim_distance thing.
> 
> Not quite sure what makes sense, someone should tinker I suppose, see
> what works with today's hardware.

I'll try some experiments over the weekend to see if my machine is
happy with non-serialized lb for inter-PKG load balancing with NPS
turned on. I'll probably piggyback on the "node_reclaim_distance"
heuristics.
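
One possible shape for such a heuristic, purely as a sketch of an
experiment and not a tested patch: only serialize a NUMA domain when
its distance exceeds the reclaim distance. "demo_should_serialize()"
is a hypothetical helper, and both arguments are plain integers here
rather than values read from the topology code.

```c
#include <stdbool.h>

/*
 * Assumed rule: domains whose NUMA distance is within the reclaim
 * distance are treated as "close enough" and are not serialized.
 * Both parameters are illustrative inputs, not kernel variables.
 */
static bool demo_should_serialize(int numa_distance, int reclaim_distance)
{
	return numa_distance > reclaim_distance;
}
```

For example, with an arbitrary reclaim distance of 30, a domain at
distance 20 would skip serialization while one at distance 40 would
keep it.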

-- 
Thanks and Regards,
Prateek