linux-kernel - Re: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling for process with large RSS

Message-ID: <3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com>
Date: Fri, 26 Sep 2025 16:48:14 +0800
From: Adam Li <adamli@...amperecomputing.com>
To: Chen Yu <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>,
 Ingo Molnar <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
 "Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Libo Chen <libo.chen@...cle.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
 Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
 Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
 Len Brown <len.brown@...el.com>, Tim Chen <tim.c.chen@...ux.intel.com>,
 Aubrey Li <aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>,
 Chen Yu <yu.chen.surf@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling
 for process with large RSS

Hi Chen Yu,

Thanks for your work.
I tested the patch set on an AmpereOne CPU with 192 cores.

With CONFIG_SCHED_CLUSTER enabled, and with a certain firmware setting,
every eight cores are grouped into a 'cluster' sched domain
with the 'SD_SHARE_LLC' flag.
However, these eight cores do *not* share an L3 cache in this setup.

In exceed_llc_capacity() of this patch, we have 'llc = l3_leaf->size';
'llc' will be zero if there is *no* L3 cache.
So exceed_llc_capacity() will always return true and 'Cache Aware
Scheduling' will not work. Please see details below.
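
To illustrate the point, here is a minimal user-space sketch of the final
comparison in exceed_llc_capacity(). It is not kernel code: the function name
model_exceed_llc_capacity() and the MODEL_PAGE_SIZE value of 4 KiB are
assumptions made only for this illustration.

```c
#include <stdbool.h>

/* Assumed stand-in for the kernel's PAGE_SIZE (4 KiB here). */
#define MODEL_PAGE_SIZE 4096UL

/*
 * Hypothetical model of the last line of exceed_llc_capacity():
 *     return (llc <= (rss * PAGE_SIZE));
 * llc_bytes models 'l3_leaf->size', rss_pages models the RSS page count.
 */
bool model_exceed_llc_capacity(unsigned int llc_bytes,
			       unsigned long rss_pages)
{
	/* With llc_bytes == 0 the left side is 0, so this is always true. */
	return llc_bytes <= rss_pages * MODEL_PAGE_SIZE;
}
```

With llc_bytes set to 0 the model returns true for every RSS value,
including an RSS of 0 pages, which matches the behaviour I see.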

I read in patch 01/28 "sched: Cache aware load-balancing" [1],
Peter mentioned:
"It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node".

Do you have any idea how we can apply the cache aware load-balancing
to clusters? The cores in the cluster may share L2 or LLC tags.

[1]: https://lore.kernel.org/all/9157186cf9e3fd541f62c637579ff736b3704c51.1754712565.git.tim.c.chen@linux.intel.com/

On 8/9/2025 1:08 PM, Chen Yu wrote:
> It has been reported that when running memory-intensive workloads
> such as stream, sched_cache may saturate the memory bandwidth on
> the preferred LLC.
> 
> To prevent this from happening, evaluate the process's memory
> footprint by checking the size of RSS (anonymous pages and shared
> pages) and comparing it to the size of the LLC. If the former is
> larger, skip cache-aware scheduling. This is because if tasks
> do not actually share data, aggregating tasks with large RSS will
> likely result in cache contention and performance depredation.
> 
> However, in theory, RSS is not the same as memory footprint.
> This is just an estimated approach to prevent over-aggregation.
> The default behavior is to strictly compare the size of RSS with
> the size of the LLC. The next patch will introduce a user-provided
> hint to customize this comparison.
> 
> Reported-by: K Prateek Nayak <kprateek.nayak@....com>
> Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
>  kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4bf794f170cf..cbda7dad1305 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1205,6 +1205,34 @@ static inline int pref_llc_idx(struct task_struct *p)
>  	return llc_idx(p->preferred_llc);
>  }
>  
> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> +{
> +	struct cpu_cacheinfo *this_cpu_ci;
> +	struct cacheinfo *l3_leaf;
> +	unsigned long rss;
> +	unsigned int llc;
> +
> +	/*
> +	 * get_cpu_cacheinfo_level() can not be used
> +	 * because it requires the cpu_hotplug_lock
> +	 * to be held. Use get_cpu_cacheinfo()
> +	 * directly because the 'cpu' can not be
> +	 * offlined at the moment.
> +	 */
> +	this_cpu_ci = get_cpu_cacheinfo(cpu);
> +	if (!this_cpu_ci->info_list ||
> +	    this_cpu_ci->num_leaves < 3)
> +		return true;
> +
> +	l3_leaf = this_cpu_ci->info_list + 3;
> +	llc = l3_leaf->size;
> +
For some arm64 CPU topologies, cores can be grouped into a 'cluster'.
Cores in a cluster may not share an L3 cache, in which case
'l3_leaf->size' will be 0.

It looks like we assume the LLC is the L3 cache?

Can we skip the exceed_llc_capacity() check if there is no L3?
Something like this draft patch:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,8 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)

        l3_leaf = this_cpu_ci->info_list + 3;
        llc = l3_leaf->size;
+       if (!llc)
+               return false;

        rss = get_mm_counter(mm, MM_ANONPAGES) +
                get_mm_counter(mm, MM_SHMEMPAGES);
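
As a rough user-space sketch of what the draft guard changes (again not
kernel code; sketch_exceed_llc_capacity_guarded() and SKETCH_PAGE_SIZE are
hypothetical names, and a 4 KiB page size is assumed):

```c
#include <stdbool.h>

/* Assumed stand-in for the kernel's PAGE_SIZE (4 KiB here). */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical model of exceed_llc_capacity() with the draft guard:
 * a reported L3 size of 0 no longer vetoes cache aware scheduling.
 */
bool sketch_exceed_llc_capacity_guarded(unsigned int llc_bytes,
					unsigned long rss_pages)
{
	/* Draft guard: no L3 size reported, so skip the RSS check. */
	if (!llc_bytes)
		return false;

	/* Otherwise keep the original comparison unchanged. */
	return llc_bytes <= rss_pages * SKETCH_PAGE_SIZE;
}
```

Note that under this sketch the RSS check is bypassed entirely when no L3
size is reported, so the protection against over-aggregation would not
apply on such systems. That may or may not be what we want.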


> +	rss = get_mm_counter(mm, MM_ANONPAGES) +
> +		get_mm_counter(mm, MM_SHMEMPAGES);
> +
> +	return (llc <= (rss * PAGE_SIZE));

If 'llc' is 0, exceed_llc_capacity() will always return true.

> +}
> +
>  static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>  {
>  	int smt_nr = 1;
> @@ -1363,7 +1391,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  	 */
>  	if (epoch - READ_ONCE(mm->mm_sched_epoch) > sysctl_llc_old ||
>  	    get_nr_threads(p) <= 1 ||
> -	    exceed_llc_nr(mm, cpu_of(rq))) {
> +	    exceed_llc_nr(mm, cpu_of(rq)) ||
> +	    exceed_llc_capacity(mm, cpu_of(rq))) {
>  		mm->mm_sched_cpu = -1;
>  		pcpu_sched->occ = 0;
>  	}
> @@ -1448,6 +1477,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		return;
>  	}
>  
> +	/*
> +	 * Do not check exceed_llc_nr() because
> +	 * the active number of threads needs to
> +	 * been updated anyway.
> +	 */
> +	if (exceed_llc_capacity(mm, curr_cpu))
> +		return;
> +
>  	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>  		return;
>  
> @@ -9113,8 +9150,12 @@ static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cp
>  	if (cpu < 0)
>  		return mig_allow;
>  
> -	 /* skip cache aware load balance for single/too many threads */
> -	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
> +	/*
> +	 * skip cache aware load balance for single/too many threads
> +	 * and large footprint.
> +	 */
> +	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
> +	    exceed_llc_capacity(mm, dst_cpu))
>  		return mig_allow;
>  
>  	if (cpus_share_cache(dst_cpu, cpu))

Thanks,
-adam