Message-ID: <3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com>
Date: Fri, 26 Sep 2025 16:48:14 +0800
From: Adam Li <adamli@...amperecomputing.com>
To: Chen Yu <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Libo Chen <libo.chen@...cle.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
Len Brown <len.brown@...el.com>, Tim Chen <tim.c.chen@...ux.intel.com>,
Aubrey Li <aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>,
Chen Yu <yu.chen.surf@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling
for process with large RSS
Hi Chen Yu,
Thanks for your work.
I tested the patch set on an AmpereOne CPU with 192 cores.
With CONFIG_SCHED_CLUSTER enabled, and with a certain firmware setting,
every eight cores are grouped into a 'cluster' scheduling domain
with the 'SD_SHARE_LLC' flag.
However, these eight cores do *not* share an L3 cache in this setup.
In exceed_llc_capacity() of this patch, we have 'llc = l3_leaf->size',
and 'llc' will be zero if there is *no* L3 cache.
So exceed_llc_capacity() will return true and 'Cache Aware Scheduling'
will not work. Please see details below.
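
To make the zero-size case concrete, here is a minimal user-space sketch
of the comparison. It is not the kernel code: the function names, the
MODEL_PAGE_SIZE constant and the 4 KiB page size are assumptions made for
illustration. The first function mirrors the patch's
'llc <= rss * PAGE_SIZE' check, the second adds the '!llc' guard from the
draft patch further down in this mail.

```c
#include <stdbool.h>

/* Assumption for this sketch only: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096UL

/*
 * Hypothetical model of the check in the quoted patch:
 * the process "exceeds" the LLC when llc <= rss * PAGE_SIZE.
 * With llc_bytes == 0 this is true for every rss_pages value,
 * including 0.
 */
static bool exceed_llc_capacity_model(unsigned int llc_bytes,
				      unsigned long rss_pages)
{
	return llc_bytes <= rss_pages * MODEL_PAGE_SIZE;
}

/*
 * Same comparison with the draft guard: if no L3 size is
 * reported, do not treat the process as exceeding the LLC.
 */
static bool exceed_llc_capacity_guarded(unsigned int llc_bytes,
					unsigned long rss_pages)
{
	if (!llc_bytes)
		return false;

	return llc_bytes <= rss_pages * MODEL_PAGE_SIZE;
}
```

With llc_bytes == 0, the first function returns true even for a process
with a single resident page, while the guarded one returns false. For a
non-zero size both behave identically.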
I read in patch 01/28 "sched: Cache aware load-balancing" [1],
Peter mentioned:
"It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node".
Do you have any idea how we can apply the cache aware load-balancing
to clusters? The cores in the cluster may share L2 or LLC tags.
[1]: https://lore.kernel.org/all/9157186cf9e3fd541f62c637579ff736b3704c51.1754712565.git.tim.c.chen@linux.intel.com/
On 8/9/2025 1:08 PM, Chen Yu wrote:
> It has been reported that when running memory-intensive workloads
> such as stream, sched_cache may saturate the memory bandwidth on
> the preferred LLC.
>
> To prevent this from happening, evaluate the process's memory
> footprint by checking the size of RSS (anonymous pages and shared
> pages) and comparing it to the size of the LLC. If the former is
> larger, skip cache-aware scheduling. This is because if tasks
> do not actually share data, aggregating tasks with large RSS will
> likely result in cache contention and performance degradation.
>
> However, in theory, RSS is not the same as memory footprint.
> This is just an estimated approach to prevent over-aggregation.
> The default behavior is to strictly compare the size of RSS with
> the size of the LLC. The next patch will introduce a user-provided
> hint to customize this comparison.
>
> Reported-by: K Prateek Nayak <kprateek.nayak@....com>
> Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
> kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4bf794f170cf..cbda7dad1305 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1205,6 +1205,34 @@ static inline int pref_llc_idx(struct task_struct *p)
> return llc_idx(p->preferred_llc);
> }
>
> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> +{
> + struct cpu_cacheinfo *this_cpu_ci;
> + struct cacheinfo *l3_leaf;
> + unsigned long rss;
> + unsigned int llc;
> +
> + /*
> + * get_cpu_cacheinfo_level() can not be used
> + * because it requires the cpu_hotplug_lock
> + * to be held. Use get_cpu_cacheinfo()
> + * directly because the 'cpu' can not be
> + * offlined at the moment.
> + */
> + this_cpu_ci = get_cpu_cacheinfo(cpu);
> + if (!this_cpu_ci->info_list ||
> + this_cpu_ci->num_leaves < 3)
> + return true;
> +
> + l3_leaf = this_cpu_ci->info_list + 3;
> + llc = l3_leaf->size;
> +
On some arm64 CPU topologies, cores can be grouped into a 'cluster'.
Cores in a cluster may not share an L3 cache, in which case
'l3_leaf->size' will be 0.
It looks like we assume the LLC is the L3 cache?
Can we skip the exceed_llc_capacity() check if there is no L3?
Something like this draft patch:
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,8 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
l3_leaf = this_cpu_ci->info_list + 3;
llc = l3_leaf->size;
+ if (!llc)
+ return false;
rss = get_mm_counter(mm, MM_ANONPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES);
> + rss = get_mm_counter(mm, MM_ANONPAGES) +
> + get_mm_counter(mm, MM_SHMEMPAGES);
> +
> + return (llc <= (rss * PAGE_SIZE));
If 'llc' is 0, exceed_llc_capacity() will always return true.
> +}
> +
> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> {
> int smt_nr = 1;
> @@ -1363,7 +1391,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> */
> if (epoch - READ_ONCE(mm->mm_sched_epoch) > sysctl_llc_old ||
> get_nr_threads(p) <= 1 ||
> - exceed_llc_nr(mm, cpu_of(rq))) {
> + exceed_llc_nr(mm, cpu_of(rq)) ||
> + exceed_llc_capacity(mm, cpu_of(rq))) {
> mm->mm_sched_cpu = -1;
> pcpu_sched->occ = 0;
> }
> @@ -1448,6 +1477,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
> return;
> }
>
> + /*
> + * Do not check exceed_llc_nr() because
> + * the active number of threads needs to
> + * been updated anyway.
> + */
> + if (exceed_llc_capacity(mm, curr_cpu))
> + return;
> +
> if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> return;
>
> @@ -9113,8 +9150,12 @@ static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cp
> if (cpu < 0)
> return mig_allow;
>
> - /* skip cache aware load balance for single/too many threads */
> - if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
> + /*
> + * skip cache aware load balance for single/too many threads
> + * and large footprint.
> + */
> + if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
> + exceed_llc_capacity(mm, dst_cpu))
> return mig_allow;
>
> if (cpus_share_cache(dst_cpu, cpu))
Thanks,
-adam