Message-ID: <7e4640a2-f79f-4f14-b099-d97bfd842b37@intel.com>
Date: Tue, 16 Dec 2025 14:12:11 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Vern Hao <haoxing990@...il.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>, Vincent Guittot <vincent.guittot@...aro.org>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Tim Chen
<tim.c.chen@...ux.intel.com>
Subject: Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for
cache-aware load balancing
On 12/11/2025 5:03 PM, Vern Hao wrote:
> Hi, Peter, Chen Yu and Tim:
>
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: "Peter Zijlstra (Intel)" <peterz@...radead.org>
>>
>> Adds infrastructure to enable cache-aware load balancing,
>> which improves cache locality by grouping tasks that share resources
>> within the same cache domain. This reduces cache misses and improves
>> overall data access efficiency.
>>
>> In this initial implementation, threads belonging to the same process
>> are treated as entities that likely share working sets. The mechanism
>> tracks per-process CPU occupancy across cache domains and attempts to
>> migrate threads toward cache-hot domains where their process already
>> has active threads, thereby enhancing locality.
>>
>> This provides a basic model for cache affinity. While the current code
>> targets the last-level cache (LLC), the approach could be extended to
>> other domain types such as clusters (L2) or node-internal groupings.
>>
>> At present, the mechanism selects the CPU within an LLC that has the
>> highest recent runtime. Subsequent patches in this series will use this
>> information in the load-balancing path to guide task placement toward
>> preferred LLCs.
>>
>> In the future, more advanced policies could be integrated through NUMA
>> balancing, for example, migrating a task to its preferred LLC when spare
>> capacity exists, or swapping tasks across LLCs to improve cache affinity.
>> Grouping of tasks could also be generalized from that of a process
>> to be that of a NUMA group, or be user configurable.
>>
>> Originally-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> ---
>>
>> Notes:
>> v1->v2:
>> Restore the original CPU scan to cover all online CPUs,
>> rather than scanning within the preferred NUMA node.
>> (Peter Zijlstra)
>> Use rq->curr instead of rq->donor. (K Prateek Nayak)
>> Minor fix in task_tick_cache() to use
>> if (mm->mm_sched_epoch >= rq->cpu_epoch)
>> to avoid mm_sched_epoch going backwards.
>>
>> include/linux/mm_types.h | 44 +++++++
>> include/linux/sched.h | 11 ++
>> init/Kconfig | 11 ++
>> kernel/fork.c | 6 +
>> kernel/sched/core.c | 6 +
>> kernel/sched/fair.c | 258 +++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 8 ++
>> 7 files changed, 344 insertions(+)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 90e5790c318f..1ea16ef90566 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -939,6 +939,11 @@ typedef struct {
>> DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
>> } __private mm_flags_t;
>> +struct mm_sched {
>> + u64 runtime;
>> + unsigned long epoch;
>> +};
>> +
>> struct kioctx_table;
>> struct iommu_mm_data;
>> struct mm_struct {
>> @@ -1029,6 +1034,17 @@ struct mm_struct {
>> */
>> raw_spinlock_t cpus_allowed_lock;
>> #endif
>> +#ifdef CONFIG_SCHED_CACHE
>> + /*
>> + * Track per-cpu-per-process occupancy as a proxy for cache residency.
>> + * See account_mm_sched() and ...
>> + */
>> + struct mm_sched __percpu *pcpu_sched;
>> + raw_spinlock_t mm_sched_lock;
>> + unsigned long mm_sched_epoch;
>> + int mm_sched_cpu;
> As we discussed earlier, I continue to believe that dedicating
> 'mm_sched_cpu' to handle the aggregated hotspots of all threads is
> inappropriate, as the multiple threads lack a necessary correlation in
> our real application.
>
> So, I was wondering if we could put this variable into struct
> task_struct. That would allow us to better monitor the hotspot CPU of
> each thread, despite some details needing consideration.
>
I suppose you are suggesting finer-grained control over a set of tasks.
Process-scope aggregation could be a starting point as the default
strategy (conservative: it benefits multi-threaded workloads that share
data within a process and should not introduce regressions).
On top of that, I wonder if we could provide task-scope control through
something like sched_setattr(), similar to the core-scheduling cookie
mechanism, for users that want aggressive aggregation. But before doing
that, we need a mechanism that leverages a monitoring facility (like the
PMU) to figure out whether putting these tasks together would bring a
benefit (if I understood Steven's suggestion at LPC correctly), or that
detects tasks sharing resources, and then maybe leverage QoS interfaces
to enable the cache-aware aggregation (something Qias mentioned at LPC).
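To make the two scopes concrete, here is a rough user-space model, not
the code from this patch. It assumes occupancy is kept per group and per
CPU as a (runtime, epoch) pair like struct mm_sched, that stale runtime
is halved once per elapsed epoch (an assumption for illustration), and
that a task with a cookie is accounted to the cookie's group instead of
its process. All names below (NR_MODEL_CPUS, struct group_sched,
struct model_task, task_group(), group_account(), group_hottest_cpu())
are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_MODEL_CPUS 8	/* hypothetical CPU count for this model */

/* Same two fields as struct mm_sched in the patch. */
struct occ {
	uint64_t runtime;
	unsigned long epoch;
};

/* One aggregation group: a whole process, or a set of tasks sharing a cookie. */
struct group_sched {
	struct occ pcpu[NR_MODEL_CPUS];
};

/* Hypothetical task: always has a process group, optionally a cookie group. */
struct model_task {
	struct group_sched *proc_group;
	struct group_sched *cookie_group;	/* NULL means process scope */
};

/* Task-scope control wins when set; otherwise fall back to the process. */
static struct group_sched *task_group(const struct model_task *t)
{
	return t->cookie_group ? t->cookie_group : t->proc_group;
}

/* Assumed decay rule: halve the runtime once per elapsed epoch. */
static void occ_decay(struct occ *o, unsigned long now_epoch)
{
	unsigned long lag;

	if (now_epoch <= o->epoch)
		return;

	lag = now_epoch - o->epoch;
	o->runtime = lag >= 64 ? 0 : o->runtime >> lag;
	o->epoch = now_epoch;
}

/* Charge delta units of runtime to the group on the given CPU. */
static void group_account(struct group_sched *g, int cpu, uint64_t delta,
			  unsigned long now_epoch)
{
	occ_decay(&g->pcpu[cpu], now_epoch);
	g->pcpu[cpu].runtime += delta;
}

/* Return the CPU with the highest decayed runtime, or -1 if all are zero. */
static int group_hottest_cpu(struct group_sched *g, unsigned long now_epoch)
{
	uint64_t best_runtime = 0;
	int best = -1;

	for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
		occ_decay(&g->pcpu[cpu], now_epoch);
		if (g->pcpu[cpu].runtime > best_runtime) {
			best_runtime = g->pcpu[cpu].runtime;
			best = cpu;
		}
	}
	return best;
}
```

In this model the default path behaves like the per-process tracking,
and only tasks that were explicitly given a cookie are aggregated
separately, so uncorrelated threads of one process would not have to
share a single hot CPU.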
thanks,
Chenyu