Message-ID: <da4d350862807bcf18626009b6fae248475acb1e.camel@linux.intel.com>
Date: Wed, 15 Oct 2025 12:32:40 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>, Madadi Vineeth Reddy
	 <vineethr@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, K
 Prateek Nayak <kprateek.nayak@....com>, "Gautham R . Shenoy"
 <gautham.shenoy@....com>, Vincent Guittot	 <vincent.guittot@...aro.org>,
 Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall	 <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
 Schneider	 <vschneid@...hat.com>, Hillf Danton <hdanton@...a.com>,
 Shrikanth Hegde	 <sshegde@...ux.ibm.com>, Jianyong Wu
 <jianyong.wu@...look.com>, Yangyu Chen	 <cyy@...self.name>, Tingyin Duan
 <tingyin.duan@...il.com>, Vern Hao	 <vernhao@...cent.com>, Len Brown
 <len.brown@...el.com>, Aubrey Li	 <aubrey.li@...el.com>, Zhao Liu
 <zhao1.liu@...el.com>, Chen Yu	 <yu.chen.surf@...il.com>, Adam Li
 <adamli@...amperecomputing.com>, Tim Chen	 <tim.c.chen@...el.com>,
 linux-kernel@...r.kernel.org, haoxing990@...il.com
Subject: Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware
 load balancing

On Wed, 2025-10-15 at 12:54 +0800, Chen, Yu C wrote:
> On 10/15/2025 3:12 AM, Madadi Vineeth Reddy wrote:
> > On 11/10/25 23:54, Tim Chen wrote:
> > > From: "Peter Zijlstra (Intel)" <peterz@...radead.org>
> > > 
> > > Cache-aware load balancing aims to aggregate tasks with potential
> > > shared resources into the same cache domain. This approach enhances
> > > cache locality, thereby optimizing system performance by reducing
> > > cache misses and improving data access efficiency.
> > > 
> 
> [snip]
> 
> > > +static void __no_profile task_cache_work(struct callback_head *work)
> > > +{
> > > +	struct task_struct *p = current;
> > > +	struct mm_struct *mm = p->mm;
> > > +	unsigned long m_a_occ = 0;
> > > +	unsigned long curr_m_a_occ = 0;
> > > +	int cpu, m_a_cpu = -1, cache_cpu,
> > > +	    pref_nid = NUMA_NO_NODE, curr_cpu;
> > > +	cpumask_var_t cpus;
> > > +
> > > +	WARN_ON_ONCE(work != &p->cache_work);
> > > +
> > > +	work->next = work;
> > > +
> > > +	if (p->flags & PF_EXITING)
> > > +		return;
> > > +
> > > +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> > > +		return;
> > > +
> > > +	curr_cpu = task_cpu(p);
> > > +	cache_cpu = mm->mm_sched_cpu;
> > > +#ifdef CONFIG_NUMA_BALANCING
> > > +	if (static_branch_likely(&sched_numa_balancing))
> > > +		pref_nid = p->numa_preferred_nid;
> > > +#endif
> > > +
> > > +	scoped_guard (cpus_read_lock) {
> > > +		get_scan_cpumasks(cpus, cache_cpu,
> > > +				  pref_nid, curr_cpu);
> > > +
> > 
> > IIUC, `get_scan_cpumasks` ORs together the cpumasks of the preferred
> > NUMA node, the cache CPU's node, and the current CPU's node. This could
> > result in scanning multiple nodes rather than preferring the NUMA
> > preferred node.
> > 
> 
> Yes, it is possible, please see comments below.
> 
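For reference, a minimal sketch of the behavior described above, assuming
get_scan_cpumasks() simply ORs together the three node masks (this is only
an illustration of the described behavior, not the patch's implementation):

	/* Illustrative sketch only -- not the patch's implementation. */
	static void get_scan_cpumasks(struct cpumask *cpus, int cache_cpu,
				      int pref_nid, int curr_cpu)
	{
		/* LLC node of the process's preferred cache CPU, if set */
		if (cache_cpu >= 0)
			cpumask_or(cpus, cpus,
				   cpumask_of_node(cpu_to_node(cache_cpu)));

		/* NUMA balancing preferred node, if any */
		if (pref_nid != NUMA_NO_NODE)
			cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));

		/* Node of the CPU the task currently runs on */
		cpumask_or(cpus, cpus,
			   cpumask_of_node(cpu_to_node(curr_cpu)));
	}
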
> > > +		for_each_cpu(cpu, cpus) {
> > > +			/* XXX sched_cluster_active */
> > > +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
> > > +			unsigned long occ, m_occ = 0, a_occ = 0;
> > > +			int m_cpu = -1, i;
> > > +
> > > +			if (!sd)
> > > +				continue;
> > > +
> > > +			for_each_cpu(i, sched_domain_span(sd)) {
> > > +				occ = fraction_mm_sched(cpu_rq(i),
> > > +							per_cpu_ptr(mm->pcpu_sched, i));
> > > +				a_occ += occ;
> > > +				if (occ > m_occ) {
> > > +					m_occ = occ;
> > > +					m_cpu = i;
> > > +				}
> > > +			}
> > > +
> > > +			/*
> > > +			 * Compare the accumulated occupancy of each LLC. The
> > > +			 * reason for using accumulated occupancy rather than average
> > > +			 * per CPU occupancy is that it works better in asymmetric LLC
> > > +			 * scenarios.
> > > +			 * For example, if there are 2 threads in a 4CPU LLC and 3
> > > +			 * threads in an 8CPU LLC, it might be better to choose the one
> > > +			 * with 3 threads. However, this would not be the case if the
> > > +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
> > > +			 * if average per CPU occupancy is used).
> > > +			 * Besides, NUMA balancing fault statistics behave similarly:
> > > +			 * the total number of faults per node is compared rather than
> > > +			 * the average number of faults per CPU. This strategy is also
> > > +			 * followed here.
> > > +			 */
> > > +			if (a_occ > m_a_occ) {
> > > +				m_a_occ = a_occ;
> > > +				m_a_cpu = m_cpu;
> > > +			}
> > > +
> > > +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
> > > +				curr_m_a_occ = a_occ;
> > > +
> > > +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> > > +		}
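
To put numbers on the example in the comment above: with 2 busy threads
on a 4-CPU LLC and 3 busy threads on an 8-CPU LLC, accumulated occupancy
compares 2 vs. 3 and picks the larger LLC, while average per-CPU
occupancy would compare 2/4 = 0.5 vs. 3/8 = 0.375 and pick the smaller
one.
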
> > 
> > This means NUMA preference has no effect on the selection, except in the
> > unlikely case of exactly equal occupancy across LLCs on different nodes
> > (where iteration order determines the winner).
> > 
> > What happens when cache locality and memory locality conflict?
> > Shouldn't the NUMA preferred node get preference? Scanning multiple
> > nodes also adds overhead, so would it be better to restrict the scan
> > to the NUMA preferred node and scan the others only when there is no
> > preferred node?
> > 
> 
> Basically, yes, you're right. Ideally, the NUMA preferred node should
> get top priority. There's one case I find hard to handle: the NUMA
> preferred node is per task rather than per process. Different threads
> of the same process can have different preferred nodes; as a result,
> the process-wide preferred LLC could bounce between nodes, which might
> cause costly cross-node task migrations. As a workaround, we tried
> keeping the scan CPU mask covering the process's current preferred LLC,
> so that the old preferred LLC is always among the candidates. After
> all, we have a 2X threshold for switching the preferred LLC.
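
A minimal sketch of that 2X threshold, using the variables from
task_cache_work() quoted above (the exact condition in the patch may
differ):

	/*
	 * Sketch: switch the process's preferred LLC only if the best
	 * candidate LLC has at least twice the accumulated occupancy
	 * of the current preferred LLC.
	 */
	if (m_a_occ > 2 * curr_m_a_occ)
		mm->mm_sched_cpu = m_a_cpu;
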

If tasks in a process had different preferred nodes, they would
belong to different numa_groups, and the majority of their data would
live on different NUMA nodes.

To resolve such a conflict, we'll need to change from aggregating tasks
by process to aggregating tasks by numa_group when NUMA balancing is
enabled.  This probably makes more sense, as tasks in a numa_group share
more data and would benefit from co-locating in the same cache.
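
A minimal sketch of that direction, assuming the aggregation key becomes
the numa_group rather than the mm (task_numa_group_id() exists in
kernel/sched/fair.c today; cache_group_id() here is a hypothetical
helper):

	/*
	 * Hypothetical helper: aggregate by numa_group when NUMA
	 * balancing has grouped the tasks; fall back to aggregating
	 * by process otherwise.
	 */
	static inline pid_t cache_group_id(struct task_struct *p)
	{
		pid_t gid = task_numa_group_id(p); /* 0 if no numa_group */

		return gid ? gid : p->tgid;
	}
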

Thanks.

Tim


