Message-ID: <5f140e59-23f9-46dd-bf5e-7bef0d897cd0@intel.com>
Date: Wed, 15 Oct 2025 12:54:15 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Tim Chen
<tim.c.chen@...ux.intel.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "K
Prateek Nayak" <kprateek.nayak@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
Lelli" <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Tim Chen <tim.c.chen@...el.com>,
<linux-kernel@...r.kernel.org>, <haoxing990@...il.com>
Subject: Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load
balancing
On 10/15/2025 3:12 AM, Madadi Vineeth Reddy wrote:
> On 11/10/25 23:54, Tim Chen wrote:
>> From: "Peter Zijlstra (Intel)" <peterz@...radead.org>
>>
>> Cache-aware load balancing aims to aggregate tasks with potential
>> shared resources into the same cache domain. This approach enhances
>> cache locality, thereby optimizing system performance by reducing
>> cache misses and improving data access efficiency.
>>
[snip]
>> +static void __no_profile task_cache_work(struct callback_head *work)
>> +{
>> +	struct task_struct *p = current;
>> +	struct mm_struct *mm = p->mm;
>> +	unsigned long m_a_occ = 0;
>> +	unsigned long curr_m_a_occ = 0;
>> +	int cpu, m_a_cpu = -1, cache_cpu,
>> +	    pref_nid = NUMA_NO_NODE, curr_cpu;
>> +	cpumask_var_t cpus;
>> +
>> +	WARN_ON_ONCE(work != &p->cache_work);
>> +
>> +	work->next = work;
>> +
>> +	if (p->flags & PF_EXITING)
>> +		return;
>> +
>> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>> +		return;
>> +
>> +	curr_cpu = task_cpu(p);
>> +	cache_cpu = mm->mm_sched_cpu;
>> +#ifdef CONFIG_NUMA_BALANCING
>> +	if (static_branch_likely(&sched_numa_balancing))
>> +		pref_nid = p->numa_preferred_nid;
>> +#endif
>> +
>> +	scoped_guard (cpus_read_lock) {
>> +		get_scan_cpumasks(cpus, cache_cpu,
>> +				  pref_nid, curr_cpu);
>> +
>
> IIUC, `get_scan_cpumasks()` ORs together the preferred NUMA node, the cache
> CPU's node, and the current CPU's node. This could result in scanning multiple
> nodes rather than preferring the NUMA preferred node.
>
Yes, that is possible; please see the comments below.
>> +		for_each_cpu(cpu, cpus) {
>> +			/* XXX sched_cluster_active */
>> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
>> +			unsigned long occ, m_occ = 0, a_occ = 0;
>> +			int m_cpu = -1, i;
>> +
>> +			if (!sd)
>> +				continue;
>> +
>> +			for_each_cpu(i, sched_domain_span(sd)) {
>> +				occ = fraction_mm_sched(cpu_rq(i),
>> +							per_cpu_ptr(mm->pcpu_sched, i));
>> +				a_occ += occ;
>> +				if (occ > m_occ) {
>> +					m_occ = occ;
>> +					m_cpu = i;
>> +				}
>> +			}
>> +
>> +			/*
>> +			 * Compare the accumulated occupancy of each LLC. The
>> +			 * reason for using accumulated occupancy rather than average
>> +			 * per CPU occupancy is that it works better in asymmetric LLC
>> +			 * scenarios.
>> +			 * For example, if there are 2 threads in a 4CPU LLC and 3
>> +			 * threads in an 8CPU LLC, it might be better to choose the one
>> +			 * with 3 threads. However, this would not be the case if the
>> +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
>> +			 * if average per CPU occupancy is used).
>> +			 * Besides, NUMA balancing fault statistics behave similarly:
>> +			 * the total number of faults per node is compared rather than
>> +			 * the average number of faults per CPU. This strategy is also
>> +			 * followed here.
>> +			 */
>> +			if (a_occ > m_a_occ) {
>> +				m_a_occ = a_occ;
>> +				m_a_cpu = m_cpu;
>> +			}
>> +
>> +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
>> +				curr_m_a_occ = a_occ;
>> +
>> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>> +		}
>
> This means the NUMA preference has no effect on the selection, except in the
> unlikely case of exactly equal occupancy across LLCs on different nodes
> (where iteration order determines the winner).
>
> How is the case handled where cache locality and memory locality conflict?
> Shouldn't the NUMA preferred node get preference? Also, scanning multiple
> nodes adds overhead, so would it be better to restrict the scan to the NUMA
> preferred node and scan the others only when there is no preferred node?
>
>
Basically, yes, you're right. Ideally, the NUMA preferred node should be
the top priority. There is one case I find hard to handle: the NUMA
preferred node is per task rather than per process. Different threads of
the same process can have different preferred nodes; as a result, the
process-wide preferred LLC could bounce between nodes, which might cause
costly task migrations across nodes. As a workaround, we keep the scan
CPU mask covering the process's current preferred LLC, so that the old
preferred LLC remains among the candidates. After all, there is a 2X
occupancy threshold for switching the preferred LLC.
thanks,
Chenyu