[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <881a665a94858d4fb6f13491f4dffe58c8fc3870.1754712565.git.tim.c.chen@linux.intel.com>
Date: Sat, 9 Aug 2025 13:08:49 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Libo Chen <libo.chen@...cle.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>,
Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>,
Vern Hao <vernhao@...cent.com>,
Len Brown <len.brown@...el.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Chen Yu <yu.c.chen@...el.com>,
linux-kernel@...r.kernel.org
Subject: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling for process with large RSS
It has been reported that when running memory-intensive workloads
such as stream, sched_cache may saturate the memory bandwidth on
the preferred LLC.
To prevent this from happening, evaluate the process's memory
footprint by checking the size of RSS (anonymous pages and shared
pages) and comparing it to the size of the LLC. If the former is
larger, skip cache-aware scheduling. This is because if tasks
do not actually share data, aggregating tasks with large RSS will
likely result in cache contention and performance degradation.
However, in theory, RSS is not the same as the memory footprint.
This is only an approximation, used to prevent over-aggregation.
The default behavior is to strictly compare the size of RSS with
the size of the LLC. The next patch will introduce a user-provided
hint to customize this comparison.
Reported-by: K Prateek Nayak <kprateek.nayak@....com>
Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4bf794f170cf..cbda7dad1305 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1205,6 +1205,34 @@ static inline int pref_llc_idx(struct task_struct *p)
return llc_idx(p->preferred_llc);
}
+/*
+ * Return true when the process's estimated memory footprint
+ * (anon + shmem RSS) is at least the size of cpu's L3 cache,
+ * or when the L3 size cannot be determined. Callers use this
+ * to bail out of cache-aware scheduling.
+ */
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cpu_cacheinfo *this_cpu_ci;
+	struct cacheinfo *l3_leaf = NULL;
+	unsigned long rss;
+	unsigned int llc, i;
+
+	/*
+	 * get_cpu_cacheinfo_level() cannot be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use get_cpu_cacheinfo()
+	 * directly because 'cpu' cannot be
+	 * offlined at the moment.
+	 */
+	this_cpu_ci = get_cpu_cacheinfo(cpu);
+	if (!this_cpu_ci->info_list)
+		return true;
+
+	/*
+	 * Locate the L3 leaf by its cache level instead of a
+	 * fixed index: 'info_list + 3' reads one past the array
+	 * when num_leaves == 3, and the L3 position in info_list
+	 * varies with topology.
+	 */
+	for (i = 0; i < this_cpu_ci->num_leaves; i++) {
+		if (this_cpu_ci->info_list[i].level == 3) {
+			l3_leaf = &this_cpu_ci->info_list[i];
+			break;
+		}
+	}
+
+	/* No L3 found: be conservative, skip cache-aware scheduling. */
+	if (!l3_leaf)
+		return true;
+
+	llc = l3_leaf->size;
+
+	/* RSS is only an approximation of the true footprint. */
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+	      get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
{
int smt_nr = 1;
@@ -1363,7 +1391,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
*/
if (epoch - READ_ONCE(mm->mm_sched_epoch) > sysctl_llc_old ||
get_nr_threads(p) <= 1 ||
- exceed_llc_nr(mm, cpu_of(rq))) {
+ exceed_llc_nr(mm, cpu_of(rq)) ||
+ exceed_llc_capacity(mm, cpu_of(rq))) {
mm->mm_sched_cpu = -1;
pcpu_sched->occ = 0;
}
@@ -1448,6 +1477,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
return;
}
+ /*
+ * Do not check exceed_llc_nr() here because
+ * the active number of threads needs to
+ * be updated anyway.
+ */
+ if (exceed_llc_capacity(mm, curr_cpu))
+ return;
+
if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
return;
@@ -9113,8 +9150,12 @@ static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cp
if (cpu < 0)
return mig_allow;
- /* skip cache aware load balance for single/too many threads */
- if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+ /*
+ * skip cache aware load balance for single/too many threads
+ * and large footprint.
+ */
+ if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+ exceed_llc_capacity(mm, dst_cpu))
return mig_allow;
if (cpus_share_cache(dst_cpu, cpu))
--
2.25.1
Powered by blists - more mailing lists