linux-kernel - Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing

Message-ID: <d188e11a610cd652ad18a83cf325db54f4938537.camel@linux.intel.com>
Date: Thu, 15 Jan 2026 13:47:06 -0800
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Vern Hao <haoxing990@...il.com>, "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann	
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall	 <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
 Schneider	 <vschneid@...hat.com>, Madadi Vineeth Reddy
 <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, Shrikanth Hegde
 <sshegde@...ux.ibm.com>, Jianyong Wu	 <jianyong.wu@...look.com>, Yangyu
 Chen <cyy@...self.name>, Tingyin Duan	 <tingyin.duan@...il.com>, Vern Hao
 <vernhao@...cent.com>, Len Brown	 <len.brown@...el.com>, Aubrey Li
 <aubrey.li@...el.com>, Zhao Liu	 <zhao1.liu@...el.com>, Chen Yu
 <yu.chen.surf@...il.com>, Adam Li	 <adamli@...amperecomputing.com>, Aaron
 Lu <ziqianlu@...edance.com>, Tim Chen	 <tim.c.chen@...el.com>,
 linux-kernel@...r.kernel.org, Peter Zijlstra	 <peterz@...radead.org>, Ingo
 Molnar <mingo@...hat.com>, K Prateek Nayak	 <kprateek.nayak@....com>,
 Vincent Guittot <vincent.guittot@...aro.org>,  "Gautham R . Shenoy"
 <gautham.shenoy@....com>
Subject: Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for
 cache-aware load balancing

On Wed, 2025-12-17 at 09:17 +0800, Vern Hao wrote:
> On 2025/12/16 14:12, Chen, Yu C wrote:
> > On 12/11/2025 5:03 PM, Vern Hao wrote:
> > > Hi, Peter, Chen Yu and Tim:
> > > 
> > > On 2025/12/4 07:07, Tim Chen wrote:
> > > > From: "Peter Zijlstra (Intel)" <peterz@...radead.org>
> > > > 
> > > > Adds infrastructure to enable cache-aware load balancing,
> > > > which improves cache locality by grouping tasks that share resources
> > > > within the same cache domain. This reduces cache misses and improves
> > > > overall data access efficiency.
> > > > 
> > > > In this initial implementation, threads belonging to the same process
> > > > are treated as entities that likely share working sets. The mechanism
> > > > tracks per-process CPU occupancy across cache domains and attempts to
> > > > migrate threads toward cache-hot domains where their process already
> > > > has active threads, thereby enhancing locality.
> > > > 
> > > > This provides a basic model for cache affinity. While the current code
> > > > targets the last-level cache (LLC), the approach could be extended to
> > > > other domain types such as clusters (L2) or node-internal groupings.
> > > > 
> > > > At present, the mechanism selects the CPU within an LLC that has the
> > > > highest recent runtime. Subsequent patches in this series will use this
> > > > information in the load-balancing path to guide task placement toward
> > > > preferred LLCs.
> > > > 
> > > > In the future, more advanced policies could be integrated through NUMA
> > > > balancing, for example, migrating a task to its preferred LLC when spare
> > > > capacity exists, or swapping tasks across LLCs to improve cache
> > > > affinity. Grouping of tasks could also be generalized from that of a
> > > > process to that of a NUMA group, or be user configurable.
> > > > 
> > > > Originally-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> > > > Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> > > > Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> > > > ---
> > > > 
> > > > Notes:
> > > >      v1->v2:
> > > >         Restore the original CPU scan to cover all online CPUs,
> > > >         rather than scanning within the preferred NUMA node.
> > > >         (Peter Zijlstra)
> > > >         Use rq->curr instead of rq->donor. (K Prateek Nayak)
> > > >         Minor fix in task_tick_cache() to use
> > > >         if (mm->mm_sched_epoch >= rq->cpu_epoch)
> > > >         to avoid mm_sched_epoch going backwards.
> > > > 
> > > >   include/linux/mm_types.h |  44 +++++++
> > > >   include/linux/sched.h    |  11 ++
> > > >   init/Kconfig             |  11 ++
> > > >   kernel/fork.c            |   6 +
> > > >   kernel/sched/core.c      |   6 +
> > > >   kernel/sched/fair.c      | 258 +++++++++++++++++++++++++++++++++++++++
> > > >   kernel/sched/sched.h     |   8 ++
> > > >   7 files changed, 344 insertions(+)
> > > > 
> > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > > index 90e5790c318f..1ea16ef90566 100644
> > > > --- a/include/linux/mm_types.h
> > > > +++ b/include/linux/mm_types.h
> > > > @@ -939,6 +939,11 @@ typedef struct {
> > > >       DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
> > > >   } __private mm_flags_t;
> > > > +struct mm_sched {
> > > > +    u64 runtime;
> > > > +    unsigned long epoch;
> > > > +};
> > > > +
> > > >   struct kioctx_table;
> > > >   struct iommu_mm_data;
> > > >   struct mm_struct {
> > > > @@ -1029,6 +1034,17 @@ struct mm_struct {
> > > >            */
> > > >           raw_spinlock_t cpus_allowed_lock;
> > > >   #endif
> > > > +#ifdef CONFIG_SCHED_CACHE
> > > > +        /*
> > > > +         * Track per-cpu-per-process occupancy as a proxy for cache residency.
> > > > +         * See account_mm_sched() and ...
> > > > +         */
> > > > +        struct mm_sched __percpu *pcpu_sched;
> > > > +        raw_spinlock_t mm_sched_lock;
> > > > +        unsigned long mm_sched_epoch;
> > > > +        int mm_sched_cpu;
> > > As we discussed earlier, I continue to believe that dedicating
> > > 'mm_sched_cpu' to handle the aggregated hotspots of all threads is
> > > inappropriate, as the multiple threads lack a necessary correlation
> > > in our real application.
> > >
> > > So, I was wondering if we could put this variable into struct
> > > task_struct. That would allow us to better monitor the hotspot CPU of
> > > each thread, despite some details needing consideration.
> > > 
> > 
> > I suppose you are suggesting fine-grained control for a set of tasks.
> > Process-scope aggregation could be a start as the default strategy
> > (conservative, benefits multi-threaded workloads that share data per
> > process, does not introduce regressions).
> 
> Yes, in our real-world business scenarios at Tencent, I have indeed
> encountered this issue: multiple threads are divided into several
> categories to handle different transactions, so they do not share
> hot data, and 'mm_sched_cpu' does not represent all of their tasks.
> Adding a control interface such as cgroup or others would be a good idea.
> 

Yes, the grouping and aggregating of tasks by process will not cover
your usage scenario. Chen Yu and I had quite a bit of discussion between
us, and here are our thoughts.

In the initial version of cache-aware scheduling, process-based aggregation
is a sensible default. Once this basic option is merged in mainline, we will
consider adding other options for task grouping, for example, setting a flag
in a cgroup cpu controller to indicate that tasks in a cgroup could benefit
from being consolidated in an LLC.

We think that you can put the threads of each category in their own
cgroup.  Will that meet your need?
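
For illustration only, here is a minimal userspace sketch of the grouping
step, assuming a cgroup v2 hierarchy where the target cgroups already exist
and are in threaded mode, so that writing a TID to cgroup.threads moves that
thread. The helper name cgroup_add_thread() and any directory passed to it
are hypothetical. The cache-aware flag mentioned above does not exist yet,
so nothing here enables it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Move thread `tid` into the cgroup at `cgroup_dir` by writing its TID to
 * that cgroup's cgroup.threads file (cgroup v2 threaded mode).
 * Hypothetical helper, not part of the patch series.
 * Returns 0 on success, -1 on failure.
 */
static int cgroup_add_thread(const char *cgroup_dir, long tid)
{
	char path[512];
	FILE *f;
	int n;

	assert(cgroup_dir != NULL);

	/* Build "<cgroup_dir>/cgroup.threads", rejecting truncated paths. */
	n = snprintf(path, sizeof(path), "%s/cgroup.threads", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fprintf(f, "%ld\n", tid) < 0) {
		fclose(f);
		return -1;
	}

	/* A rejected write is only reported when the buffer is flushed. */
	return fclose(f) == 0 ? 0 : -1;
}
```

Each worker thread of one category would call this once with its own TID
and the directory of that category's cgroup.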

Things like mm_sched_cpu etc. will be abstracted out, with the grouping
structure in mm abstracted as cache_group.  So we will have something like
cache_group_sched_cpu instead of mm_sched_cpu.
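
A rough userspace sketch of what such an abstraction might look like, to
make the idea concrete. Everything here is an assumption for illustration:
struct cache_group, its field names, NR_CPUS_SKETCH and the helper functions
are hypothetical, and the epoch-based decay from the patch is left out. The
point is only that each group (a process today, possibly a cgroup later)
carries its own occupancy counters and its own preferred CPU.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count for the sketch */

/*
 * Hypothetical grouping object. In the posted patch the equivalent state
 * (pcpu_sched, mm_sched_cpu) lives directly in struct mm_struct.
 */
struct cache_group {
	uint64_t runtime[NR_CPUS_SKETCH];	/* per-CPU occupancy */
	int sched_cpu;				/* preferred CPU, -1 if none */
};

static void cache_group_init(struct cache_group *cg)
{
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		cg->runtime[cpu] = 0;
	cg->sched_cpu = -1;
}

/* Charge delta_ns of runtime on `cpu` to the group. */
static void cache_group_account(struct cache_group *cg, int cpu,
				uint64_t delta_ns)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	cg->runtime[cpu] += delta_ns;
}

/*
 * Pick the CPU with the highest accumulated runtime for this group and
 * record it as the group's preferred CPU. Returns -1 if the group has
 * not run anywhere yet.
 */
static int cache_group_pick_cpu(struct cache_group *cg)
{
	uint64_t best_runtime = 0;
	int best = -1;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (cg->runtime[cpu] > best_runtime) {
			best_runtime = cg->runtime[cpu];
			best = cpu;
		}
	}
	cg->sched_cpu = best;
	return best;
}
```

With one cache_group per thread category, two categories of the same
process can end up with different preferred CPUs instead of sharing a
single mm-wide value.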

Tim

> > 
> > On top of that, I wonder if we could provide task-scope control like
> > sched_setattr(), similar to the core-scheduling cookie mechanism, for
> > users that want aggressive aggregation. But before doing that, we need a
> > mechanism that leverages a monitoring system (like the PMU) to figure out
> That may be a problem if the environment is running in a VM. We
> could use tags to differentiate these tasks and do some tests to verify
> the performance difference between unifying the mm_sched_cpu and not
> unifying.
> > if putting these tasks together would bring benefit (if I understand
> > Steven's suggestion at LPC correctly), or to detect tasks that share
> > resources, then maybe leverage QoS interfaces to enable the cache-aware
> > aggregation (something Qias mentioned at LPC).
> > 
> > thanks,
> > Chenyu
> >