Message-ID: <5f140e59-23f9-46dd-bf5e-7bef0d897cd0@intel.com>
Date: Wed, 15 Oct 2025 12:54:15 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Tim Chen
<tim.c.chen@...ux.intel.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "K
Prateek Nayak" <kprateek.nayak@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
Lelli" <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Tim Chen <tim.c.chen@...el.com>,
<linux-kernel@...r.kernel.org>, <haoxing990@...il.com>
Subject: Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load
balancing
On 10/15/2025 3:12 AM, Madadi Vineeth Reddy wrote:
> On 11/10/25 23:54, Tim Chen wrote:
>> From: "Peter Zijlstra (Intel)" <peterz@...radead.org>
>>
>> Cache-aware load balancing aims to aggregate tasks with potential
>> shared resources into the same cache domain. This approach enhances
>> cache locality, thereby optimizing system performance by reducing
>> cache misses and improving data access efficiency.
>>
[snip]
>> +static void __no_profile task_cache_work(struct callback_head *work)
>> +{
>> +	struct task_struct *p = current;
>> +	struct mm_struct *mm = p->mm;
>> +	unsigned long m_a_occ = 0;
>> +	unsigned long curr_m_a_occ = 0;
>> +	int cpu, m_a_cpu = -1, cache_cpu,
>> +	    pref_nid = NUMA_NO_NODE, curr_cpu;
>> +	cpumask_var_t cpus;
>> +
>> +	WARN_ON_ONCE(work != &p->cache_work);
>> +
>> +	work->next = work;
>> +
>> +	if (p->flags & PF_EXITING)
>> +		return;
>> +
>> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>> +		return;
>> +
>> +	curr_cpu = task_cpu(p);
>> +	cache_cpu = mm->mm_sched_cpu;
>> +#ifdef CONFIG_NUMA_BALANCING
>> +	if (static_branch_likely(&sched_numa_balancing))
>> +		pref_nid = p->numa_preferred_nid;
>> +#endif
>> +
>> +	scoped_guard (cpus_read_lock) {
>> +		get_scan_cpumasks(cpus, cache_cpu,
>> +				  pref_nid, curr_cpu);
>> +
>
> IIUC, `get_scan_cpumasks()` ORs together the preferred NUMA node, the cache
> CPU's node, and the current CPU's node. This could result in scanning multiple
> nodes rather than preferring the NUMA preferred node.
>
Yes, that is possible; please see the comments below.
>> +		for_each_cpu(cpu, cpus) {
>> +			/* XXX sched_cluster_active */
>> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
>> +			unsigned long occ, m_occ = 0, a_occ = 0;
>> +			int m_cpu = -1, i;
>> +
>> +			if (!sd)
>> +				continue;
>> +
>> +			for_each_cpu(i, sched_domain_span(sd)) {
>> +				occ = fraction_mm_sched(cpu_rq(i),
>> +							per_cpu_ptr(mm->pcpu_sched, i));
>> +				a_occ += occ;
>> +				if (occ > m_occ) {
>> +					m_occ = occ;
>> +					m_cpu = i;
>> +				}
>> +			}
>> +
>> +			/*
>> +			 * Compare the accumulated occupancy of each LLC. The
>> +			 * reason for using accumulated occupancy rather than average
>> +			 * per CPU occupancy is that it works better in asymmetric LLC
>> +			 * scenarios.
>> +			 * For example, if there are 2 threads in a 4CPU LLC and 3
>> +			 * threads in an 8CPU LLC, it might be better to choose the one
>> +			 * with 3 threads. However, this would not be the case if the
>> +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
>> +			 * if average per CPU occupancy is used).
>> +			 * Besides, NUMA balancing fault statistics behave similarly:
>> +			 * the total number of faults per node is compared rather than
>> +			 * the average number of faults per CPU. This strategy is also
>> +			 * followed here.
>> +			 */
>> +			if (a_occ > m_a_occ) {
>> +				m_a_occ = a_occ;
>> +				m_a_cpu = m_cpu;
>> +			}
>> +
>> +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
>> +				curr_m_a_occ = a_occ;
>> +
>> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>> +		}
>
> This means the NUMA preference has no effect on the selection, except in the
> unlikely case of exactly equal occupancy across LLCs on different nodes
> (where iteration order determines the winner).
>
> How is the case handled where cache locality and memory locality conflict?
> Shouldn't the NUMA preferred node get preference? Also, scanning multiple
> nodes adds overhead, so would it be better to restrict the scan to the NUMA
> preferred node and scan the others only when there is no preferred node?
>
>
Basically, yes, you're right. Ideally, the NUMA preferred node should be
the top priority. There is one case I find hard to handle: the NUMA
preferred node is per task rather than per process. Different threads of
the same process can have different preferred nodes; as a result, the
process-wide preferred LLC could bounce between nodes, which might cause
costly task migrations across nodes. As a workaround, we keep the scan
CPU mask covering the process's current preferred LLC, so that the old
preferred LLC remains among the candidates. After all, there is a 2X
occupancy threshold for switching the preferred LLC.
thanks,
Chenyu