Message-ID: <7d5bb7c4-abc5-470e-84fe-72a3b1d3a2f4@gmail.com>
Date: Wed, 17 Dec 2025 09:17:20 +0800
From: Vern Hao <haoxing990@...il.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
Adam Li <adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Tim Chen <tim.c.chen@...ux.intel.com>
Subject: Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for
cache-aware load balancing
On 2025/12/16 14:12, Chen, Yu C wrote:
> On 12/11/2025 5:03 PM, Vern Hao wrote:
>> Hi, Peter, Chen Yu and Tim:
>>
>> On 2025/12/4 07:07, Tim Chen wrote:
>>> From: "Peter Zijlstra (Intel)" <peterz@...radead.org>
>>>
>>> Adds infrastructure to enable cache-aware load balancing,
>>> which improves cache locality by grouping tasks that share resources
>>> within the same cache domain. This reduces cache misses and improves
>>> overall data access efficiency.
>>>
>>> In this initial implementation, threads belonging to the same process
>>> are treated as entities that likely share working sets. The mechanism
>>> tracks per-process CPU occupancy across cache domains and attempts to
>>> migrate threads toward cache-hot domains where their process already
>>> has active threads, thereby enhancing locality.
>>>
>>> This provides a basic model for cache affinity. While the current code
>>> targets the last-level cache (LLC), the approach could be extended to
>>> other domain types such as clusters (L2) or node-internal groupings.
>>>
>>> At present, the mechanism selects the CPU within an LLC that has the
>>> highest recent runtime. Subsequent patches in this series will use this
>>> information in the load-balancing path to guide task placement toward
>>> preferred LLCs.
>>>
>>> In the future, more advanced policies could be integrated through NUMA
>>> balancing, for example migrating a task to its preferred LLC when spare
>>> capacity exists, or swapping tasks across LLCs to improve cache
>>> affinity.
>>> Grouping of tasks could also be generalized from that of a process
>>> to be that of a NUMA group, or be user configurable.
>>>
>>> Originally-by: Peter Zijlstra (Intel) <peterz@...radead.org>
>>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>>> ---
>>>
>>> Notes:
>>> v1->v2:
>>> Restore the original CPU scan to cover all online CPUs,
>>> rather than scanning within the preferred NUMA node.
>>> (Peter Zijlstra)
>>> Use rq->curr instead of rq->donor. (K Prateek Nayak)
>>> Minor fix in task_tick_cache() to use
>>> if (mm->mm_sched_epoch >= rq->cpu_epoch)
>>> to avoid mm_sched_epoch going backwards.
>>>
>>> include/linux/mm_types.h | 44 +++++++
>>> include/linux/sched.h | 11 ++
>>> init/Kconfig | 11 ++
>>> kernel/fork.c | 6 +
>>> kernel/sched/core.c | 6 +
>>> kernel/sched/fair.c | 258 +++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/sched.h | 8 ++
>>> 7 files changed, 344 insertions(+)
>>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 90e5790c318f..1ea16ef90566 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -939,6 +939,11 @@ typedef struct {
>>> DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
>>> } __private mm_flags_t;
>>> +struct mm_sched {
>>> + u64 runtime;
>>> + unsigned long epoch;
>>> +};
>>> +
>>> struct kioctx_table;
>>> struct iommu_mm_data;
>>> struct mm_struct {
>>> @@ -1029,6 +1034,17 @@ struct mm_struct {
>>> */
>>> raw_spinlock_t cpus_allowed_lock;
>>> #endif
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + /*
>>> + * Track per-cpu-per-process occupancy as a proxy for cache residency.
>>> + * See account_mm_sched() and ...
>>> + */
>>> + struct mm_sched __percpu *pcpu_sched;
>>> + raw_spinlock_t mm_sched_lock;
>>> + unsigned long mm_sched_epoch;
>>> + int mm_sched_cpu;
>> As we discussed earlier, I still believe that dedicating
>> 'mm_sched_cpu' to the aggregated hotspot of all threads is
>> inappropriate, because the threads in our real applications are not
>> necessarily correlated.
>>
>> So I was wondering if we could move this variable into struct
>> task_struct. That would let us track the hotspot CPU of each thread
>> more accurately, although some details still need consideration.
>>
>
> I suppose you are suggesting fine-grained control for a set of tasks.
> Process-scope aggregation could be a start as the default strategy
> (conservative, benefits multi-threaded workloads that share data per
> process, does not introduce regressions).
Yes. In our real-world business scenarios at Tencent, I have indeed
encountered this issue: the threads of one process are divided into
several categories that handle different transactions, so they do not
share hot data, and 'mm_sched_cpu' does not represent all of the tasks.
Adding a control interface, such as a cgroup knob or something similar,
would be a good idea.
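
To illustrate the concern, here is a minimal userspace sketch, not kernel
code. It assumes a toy topology of 2 LLCs with 4 CPUs each, and every name
in it (demo_occ, demo_account, demo_hot_cpu) is hypothetical. The selection
rule is a simplification of "busiest CPU in the most occupied LLC" and is
not claimed to match what the patch does in fair.c.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS      8	/* assumption: 2 LLCs with 4 CPUs each */
#define DEMO_CPUS_PER_LLC 4

/* Hypothetical stand-in for per-CPU occupancy; not the kernel's mm_sched. */
struct demo_occ {
	uint64_t runtime[DEMO_NR_CPUS];
};

/* Map a CPU number to its LLC index in the toy topology. */
static int demo_cpu_to_llc(int cpu)
{
	return cpu / DEMO_CPUS_PER_LLC;
}

/* Charge delta_ns of runtime on cpu to the occupancy record. */
static void demo_account(struct demo_occ *o, int cpu, uint64_t delta_ns)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	o->runtime[cpu] += delta_ns;
}

/*
 * Pick the busiest CPU inside the LLC with the largest summed runtime.
 * Returns -1 if nothing has been accounted yet.
 */
static int demo_hot_cpu(const struct demo_occ *o)
{
	uint64_t best_sum = 0;
	int best_cpu = -1;

	for (int base = 0; base < DEMO_NR_CPUS; base += DEMO_CPUS_PER_LLC) {
		uint64_t sum = 0, max = 0;
		int hot = -1;

		for (int cpu = base; cpu < base + DEMO_CPUS_PER_LLC; cpu++) {
			sum += o->runtime[cpu];
			if (o->runtime[cpu] > max) {
				max = o->runtime[cpu];
				hot = cpu;
			}
		}
		if (sum > best_sum) {
			best_sum = sum;
			best_cpu = hot;
		}
	}
	return best_cpu;
}
```

If thread category A runs on CPUs 0-1 and category B runs somewhat longer
on CPUs 5-6, one record shared by the whole process yields a single hot CPU
in LLC 1, even though category A's hot data sits in LLC 0. Keeping one
record per category gives each category a hot CPU in its own LLC.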
>
> On top of that, I wonder if we could provide task-scope control like
> sched_setattr(), similar to the core-scheduling cookie mechanism, for
> users that want aggressive aggregation. But before doing that, we need a
> mechanism that leverages a monitoring system (like the PMU) to figure out
That may be a problem if the environment is running in a VM. We could
use tags to differentiate these tasks and run some tests to verify the
performance difference between unifying 'mm_sched_cpu' and not
unifying it.
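
A rough sketch of what such a test switch could look like, again as
userspace C rather than kernel code. All types and names here (demo_mm,
demo_group, demo_task, demo_preferred_cpu, the "unify" flag, and tag 0
meaning "untagged") are assumptions made for illustration only; nothing
like this exists in the posted series.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NO_TAG 0UL		/* assumption: tag 0 means "untagged" */

/* Hypothetical stand-in for the process-wide hint (mm_sched_cpu). */
struct demo_mm {
	int mm_sched_cpu;
};

/* Hypothetical per-tag record shared by tasks carrying the same tag. */
struct demo_group {
	unsigned long tag;
	int hot_cpu;		/* -1 while no hot CPU is known */
};

struct demo_task {
	struct demo_mm *mm;
	struct demo_group *grp;	/* NULL for untagged tasks */
};

/*
 * Return the CPU a task should be pulled toward.
 * unify != 0: always use the process-wide hint (current behaviour).
 * unify == 0: prefer the tagged group's hot CPU when one is known.
 */
static int demo_preferred_cpu(const struct demo_task *t, int unify)
{
	assert(t != NULL && t->mm != NULL);

	if (!unify && t->grp && t->grp->tag != DEMO_NO_TAG &&
	    t->grp->hot_cpu >= 0)
		return t->grp->hot_cpu;

	return t->mm->mm_sched_cpu;
}
```

Running the same workload with the flag on and off would give the
comparison between unified and per-tag hints.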
> if putting these tasks together would bring benefit (if I understand
> Steven's suggestion at LPC correctly), or to detect tasks that share
> resources, and then maybe leverage QoS interfaces to enable cache-aware
> aggregation (something Qias mentioned at LPC).
>
> thanks,
> Chenyu
>