linux-kernel - Re: 回复：[Internet]Re: 回复：[PATCH 06/19] sched/fair: Assign preferred LLC ID to processes

Message-ID: <8e1c24c3e0e4cd13935beb8c1cef4b24e642f22b.camel@linux.intel.com>
Date: Tue, 14 Oct 2025 13:13:07 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: vernhao(郝信) <vernhao@...cent.com>, 
 Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, K
 Prateek Nayak <kprateek.nayak@....com>,  "Gautham R . Shenoy"
 <gautham.shenoy@....com>, haoxing990@...il.com
Cc: Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli	
 <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel
 Gorman <mgorman@...e.de>,  Valentin Schneider	 <vschneid@...hat.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Hillf Danton
 <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>, Jianyong Wu	
 <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>, Tingyin Duan	
 <tingyin.duan@...il.com>, Len Brown <len.brown@...el.com>, Aubrey Li	
 <aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>, Chen Yu	
 <yu.chen.surf@...il.com>, Chen Yu <yu.c.chen@...el.com>, Libo Chen	
 <libo.chen@...cle.com>, Adam Li <adamli@...amperecomputing.com>, Tim Chen	
 <tim.c.chen@...el.com>, linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: 回复：[Internet]Re: 回复：[PATCH 06/19] sched/fair: Assign
 preferred LLC ID to processes

On Tue, 2025-10-14 at 15:07 +0800, vernhao(郝信) wrote:
> Hi Tim, 
> 
> > 
> > 
[snip]

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 61c129bde8b6..d6167a029c47 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1312,6 +1312,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >  	struct mm_struct *mm = p->mm;
> >  	struct mm_sched *pcpu_sched;
> >  	unsigned long epoch;
> > +	int mm_sched_llc = -1;
> >  
> >  	if (!sched_cache_enabled())
> >  		return;
> > @@ -1342,6 +1343,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >  		if (mm->mm_sched_cpu != -1)
> >  			mm->mm_sched_cpu = -1;
> >  	}
> > +
> > +	if (mm->mm_sched_cpu != -1)
> > +		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
> > 
> > In high-concurrency multi-threaded scenarios, not all threads handle the same events, so their hot data in the LLC is not completely shared.
> > Therefore, if every thread's preferred LLC is set to the LLC pointed to by mm->mm_sched_cpu, this would lead to the incorrect
> > assumption that all threads prefer the same LLC, thereby intensifying competition between LLCs.
> 
> Yes, that's the reason why we stop aggregating to the preferred LLC once the utilization of the
> LLC becomes too high relative to the other LLCs.
> 
> But this approach is only a compensatory measure after the fact. The threads have already been incorrectly migrated to an LLC that is not their preferred one.
> Is there a better way to handle this situation?

The threads would stay where they were instead of migrating to a preferred
LLC that's overloaded.
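
To make the "stay where they were" behaviour concrete, here is a minimal
user-space sketch of such a gate. It is not the check used in the patch
series: struct llc_stat, LLC_IMB_PCT and can_migrate_to_preferred() are
hypothetical names, and the 20% margin is an assumed value chosen only for
illustration.

```c
#include <stdbool.h>

/* Hypothetical per-LLC summary; field names are illustrative only. */
struct llc_stat {
	unsigned long util;      /* summed utilization of the CPUs in this LLC */
	unsigned long capacity;  /* summed capacity of the CPUs in this LLC */
};

/* Assumed tunable: tolerated utilization gap, in percent of capacity. */
#define LLC_IMB_PCT 20

/*
 * Return true if a task may be pulled to its preferred LLC.
 *
 * The move is refused when the preferred LLC's utilization ratio exceeds
 * the current LLC's ratio by more than LLC_IMB_PCT percentage points, in
 * which case the task simply stays on its current LLC.
 */
static bool can_migrate_to_preferred(const struct llc_stat *pref,
				     const struct llc_stat *cur)
{
	/* Compare util/capacity ratios by cross-multiplying (no division). */
	unsigned long long pref_scaled =
		(unsigned long long)pref->util * cur->capacity * 100;
	unsigned long long cur_scaled =
		(unsigned long long)cur->util * pref->capacity * 100;
	unsigned long long margin =
		(unsigned long long)LLC_IMB_PCT * pref->capacity * cur->capacity;

	return pref_scaled <= cur_scaled + margin;
}
```

With two LLCs of capacity 4096, a preferred LLC at util 3600 and a current
LLC at util 1000 would refuse the move under these assumptions, while a
preferred LLC at util 1200 would allow it.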

> 
> If you know beforehand which of your threads share data with
> each other, you can probably use cgroup/cpuset from user space
> to separate out the threads.
> 
> Yes, this is a solution, and I am trying to implement it.
> 
> There's not enough info from occupancy data for OS to group
> the threads by data sharing. Perhaps an alternative if NUMA balancing
> is on is to group tasks by their task numa group instead of by mm.  
> 
> This may not be a good solution either, especially for virtual machine scenarios, which have no NUMA.

If you are in a VM, the cache topology may not correspond to
real CPU cache topology and you probably should not enable cache
aware scheduling inside, unless you are doing some explicit
binding of VCPUs.

> 
> That would incur the page scanning overhead etc. and make
> cache aware scheduling dependent on NUMA balancing.
>  
> 
> > 
> > So I'm wondering, why not move ‘mm->mm_sched_cpu’ to ‘task_struct’, so that each thread can individually track its preferred LLC? What are the losses in doing so?
> 
> You would need a way to group related tasks together and put them
> on the same LLC.  Either group them by mm or some other means.
> 
> Yes, you are right. How about this: besides 'mm', add cgroup support too?

Doing cgroup may not solve the original issue you brought
up, where a process may have a group of tasks wanting to go
into one cache and another group of tasks going to another cache.
I could be wrong but I don't think you can split up tasks in a process
in cgroup v2 to different cgroups.

Also the cgroup folks are quite resistant to adding new knobs.
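
For the user-space separation mentioned earlier in this thread, per-thread
CPU affinity is one route that needs no new cgroup knobs. The sketch below
is only an illustration: parse_cpulist() and pin_self_to_llc_of() are
hypothetical helpers, and the code assumes cache/index3 is the last-level
cache, which is common on x86 but should be checked against
cache/indexN/level on the target machine.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a kernel-style CPU list such as "0-3,8-11" into a cpu_set_t.
 * Returns the number of CPUs in the set, or -1 on malformed input.
 */
static int parse_cpulist(const char *s, cpu_set_t *set)
{
	CPU_ZERO(set);
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= CPU_SETSIZE)
			return -1;
		for (long cpu = lo; cpu <= hi; cpu++)
			CPU_SET((int)cpu, set);
		if (*end == ',')
			end++;
		s = end;
	}
	return CPU_COUNT(set);
}

/*
 * Restrict the calling thread to the CPUs sharing a cache with @cpu.
 * Assumption: index3 is the LLC on this machine.
 */
static int pin_self_to_llc_of(int cpu)
{
	char path[128], buf[256];
	cpu_set_t set;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
		 cpu);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	if (parse_cpulist(buf, &set) <= 0)
		return -1;
	/* A pid of 0 means the calling thread. */
	return sched_setaffinity(0, sizeof(set), &set);
}
```

Each worker thread would call pin_self_to_llc_of() with a CPU from the LLC
its group should share. Since the affinity mask is per thread, two groups
of threads in one process can be steered to different caches this way.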

Tim

> 
> > 
> > +
> > +	if (p->preferred_llc != mm_sched_llc)
> > +		p->preferred_llc = mm_sched_llc;
> >  }
> >  
> >  static void task_tick_cache(struct rq *rq, struct task_struct *p)
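
As a rough illustration of what the two added hunks do, the following
user-space model mirrors their logic. struct mm_model, struct task_model,
the plain sd_llc_id[] array and update_preferred_llc() are stand-ins
invented for this sketch. The 8-CPU, two-LLC topology is assumed, with each
LLC identified by its first CPU.

```c
#define NR_CPUS 8

/*
 * Stand-in for the kernel's per-CPU sd_llc_id. In this model CPUs 0-3
 * share LLC 0 and CPUs 4-7 share LLC 4.
 */
static const int sd_llc_id[NR_CPUS] = { 0, 0, 0, 0, 4, 4, 4, 4 };

/* Simplified mm: only the preferred CPU, -1 when there is none. */
struct mm_model {
	int mm_sched_cpu;
};

/* Simplified task: its mm and the LLC it currently prefers. */
struct task_model {
	struct mm_model *mm;
	int preferred_llc;
};

/* Mirror of the added hunks: derive the task's LLC from its mm. */
static void update_preferred_llc(struct task_model *p)
{
	int mm_sched_llc = -1;

	if (p->mm->mm_sched_cpu != -1)
		mm_sched_llc = sd_llc_id[p->mm->mm_sched_cpu];

	if (p->preferred_llc != mm_sched_llc)
		p->preferred_llc = mm_sched_llc;
}
```

Because the preference is derived from the shared mm, every thread of a
process ends up with the same preferred_llc in this model, which is the
behaviour questioned in the thread above.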