Message-Id: <2c45f6db1efef84c6c1ed514a8d24a9bc4a2ca4b.1745199017.git.yu.c.chen@intel.com>
Date: Mon, 21 Apr 2025 11:25:18 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Tim Chen <tim.c.chen@...el.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Libo Chen <libo.chen@...cle.com>,
	Abel Wu <wuyun.abel@...edance.com>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	Hillf Danton <hdanton@...a.com>,
	linux-kernel@...r.kernel.org,
	Chen Yu <yu.c.chen@...el.com>
Subject: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated

It is found that when a process's preferred LLC is saturated by too many
threads, task contention becomes very frequent and causes performance
regressions.

Save the per-LLC statistics calculated by the periodic load balance: the
average utilization and the average number of runnable tasks. The cache
aware wakeup path consults these statistics and inhibits cache aware
scheduling when the preferred LLC is too busy, to avoid such regressions.
The cache aware wakeup is disabled when either the average utilization of
the preferred LLC has reached 25% of its capacity, or the average number
of runnable tasks has reached 1/3 of the LLC weight. This restriction is
only applied when the process has no fewer threads than the LLC weight
(the number of CPUs in the LLC).
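
Expressed as a minimal sketch (the helper name and parameter names below are
only illustrative; the actual check is the valid_target_cpu() added by this
patch, with 'util' and 'nr' being the per-LLC averages saved by load balance
and 'llc_weight' the number of CPUs in the LLC):

  /* Illustrative sketch only; the real check is valid_target_cpu() below. */
  static bool cache_aware_wakeup_allowed(struct task_struct *p,
                                         unsigned long util, int nr,
                                         int llc_weight)
  {
          /* small processes always get the cache aware wakeup */
          if (get_nr_threads(p) < llc_weight)
                  return true;

          /* average utilization reached 25% of the LLC capacity */
          if (util * 4 >= (unsigned long)llc_weight * SCHED_CAPACITY_SCALE)
                  return false;

          /* average runnable tasks reached 1/3 of the LLC CPUs */
          if (nr * 3 >= llc_weight)
                  return false;

          return true;
  }

For example, with an LLC of 120 CPUs, the cache aware wakeup would be
inhibited once the saved average utilization reaches 120 * 1024 / 4 = 30720,
or the average number of runnable tasks reaches 40.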

schbench was run via mmtests on a Xeon platform with 2 sockets, each socket
having 60 cores/120 CPUs. DRAM interleaving across the NUMA nodes is enabled
in the BIOS, so the 2 "LLCs" end up in 1 NUMA node.

compare-mmtests.pl --directory work/log --benchmark schbench --names baseline,sched_cache
                                    baselin             sched_cach
                                   baseline            sched_cache
Lat 50.0th-qrtle-1          6.00 (   0.00%)        6.00 (   0.00%)
Lat 90.0th-qrtle-1         10.00 (   0.00%)        9.00 (  10.00%)
Lat 99.0th-qrtle-1         29.00 (   0.00%)       13.00 (  55.17%)
Lat 99.9th-qrtle-1         35.00 (   0.00%)       21.00 (  40.00%)
Lat 20.0th-qrtle-1        266.00 (   0.00%)      266.00 (   0.00%)
Lat 50.0th-qrtle-2          8.00 (   0.00%)        6.00 (  25.00%)
Lat 90.0th-qrtle-2         10.00 (   0.00%)       10.00 (   0.00%)
Lat 99.0th-qrtle-2         19.00 (   0.00%)       18.00 (   5.26%)
Lat 99.9th-qrtle-2         27.00 (   0.00%)       29.00 (  -7.41%)
Lat 20.0th-qrtle-2        533.00 (   0.00%)      507.00 (   4.88%)
Lat 50.0th-qrtle-4          6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
Lat 99.0th-qrtle-4         14.00 (   0.00%)        9.00 (  35.71%)
Lat 99.9th-qrtle-4         22.00 (   0.00%)       14.00 (  36.36%)
Lat 20.0th-qrtle-4       1070.00 (   0.00%)      995.00 (   7.01%)
Lat 50.0th-qrtle-8          5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-8         12.00 (   0.00%)       11.00 (   8.33%)
Lat 99.9th-qrtle-8         19.00 (   0.00%)       16.00 (  15.79%)
Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2140.00 (   0.00%)
Lat 50.0th-qrtle-16         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-16         7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-16        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.9th-qrtle-16        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4200.00 (   2.23%)
Lat 50.0th-qrtle-32         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-32         8.00 (   0.00%)        6.00 (  25.00%)
Lat 99.0th-qrtle-32        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8528.00 (  -0.38%)
Lat 50.0th-qrtle-64         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-64         8.00 (   0.00%)        8.00 (   0.00%)
Lat 99.0th-qrtle-64        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.9th-qrtle-64        17.00 (   0.00%)       17.00 (   0.00%)
Lat 20.0th-qrtle-64     17120.00 (   0.00%)    17120.00 (   0.00%)
Lat 50.0th-qrtle-128        7.00 (   0.00%)        7.00 (   0.00%)
Lat 90.0th-qrtle-128        9.00 (   0.00%)        9.00 (   0.00%)
Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
Lat 99.9th-qrtle-128       20.00 (   0.00%)       20.00 (   0.00%)
Lat 20.0th-qrtle-128    31776.00 (   0.00%)    30496.00 (   4.03%)
Lat 50.0th-qrtle-239        9.00 (   0.00%)        9.00 (   0.00%)
Lat 90.0th-qrtle-239       14.00 (   0.00%)       18.00 ( -28.57%)
Lat 99.0th-qrtle-239       43.00 (   0.00%)       56.00 ( -30.23%)
Lat 99.9th-qrtle-239      106.00 (   0.00%)      483.00 (-355.66%)
Lat 20.0th-qrtle-239    30176.00 (   0.00%)    29984.00 (   0.64%)

We can see an overall latency improvement, with some throughput degradation
when the system gets saturated.

schbench (old version) was also run on an EPYC 7543 system, which has 4 NUMA
nodes, each with 4 LLCs. The 99.0th percentile latency is monitored:

case                    load            baseline(std%)  compare%( std%)
normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)

When the LLC is underloaded, a latency improvement is observed; when the LLC
gets saturated, some degradation is observed.

The aggregation of tasks moves them towards the preferred LLC fairly quickly
during wakeups, whereas load balance tends to move tasks away from the
aggregated LLC. The two migrations work in opposite directions and tend to
bounce tasks between LLCs. Such migrations away from the preferred (home) LLC
should be impeded during load balancing as long as the home LLC is not
overloaded. We're working on fixing up the load balance path to address this
issue.

Co-developed-by: Tim Chen <tim.c.chen@...el.com>
Signed-off-by: Tim Chen <tim.c.chen@...el.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
 include/linux/sched/topology.h |   4 ++
 kernel/sched/fair.c            | 101 ++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..9625d9d762f5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,6 +78,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	u64		nr_avg;
+#endif
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1733eb83042c..f74d8773c811 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8791,6 +8791,58 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 #ifdef CONFIG_SCHED_CACHE
 static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
 
+/* expected to be protected by rcu_read_lock() */
+static bool get_llc_stats(int cpu, int *nr, int *weight, unsigned long *util)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*nr = READ_ONCE(sd_share->nr_avg);
+	*util = READ_ONCE(sd_share->util_avg);
+	*weight = per_cpu(sd_llc_size, cpu);
+
+	return true;
+}
+
+static bool valid_target_cpu(int cpu, struct task_struct *p)
+{
+	int nr_running, llc_weight;
+	unsigned long util, llc_cap;
+
+	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
+			   &util))
+		return false;
+
+	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
+
+	/*
+	 * If this process has many threads, avoid stacking tasks
+	 * on the preferred LLC by checking the LLC's utilization
+	 * and number of runnable tasks. Otherwise, if the process
+	 * does not have many threads, honor the cache aware
+	 * wakeup.
+	 */
+	if (get_nr_threads(p) < llc_weight)
+		return true;
+
+	/*
+	 * Check if the average utilization has reached 25% of the
+	 * LLC capacity, or the average number of runnable tasks has
+	 * reached 33% of the CPUs. These are magic numbers that did
+	 * not cause heavy cache contention on Xeon or Zen.
+	 */
+	if (util * 4 >= llc_cap)
+		return false;
+
+	if (nr_running * 3 >= llc_weight)
+		return false;
+
+	return true;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
@@ -8813,6 +8865,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpus_share_cache(prev_cpu, cpu))
 		return prev_cpu;
 
+	if (!valid_target_cpu(cpu, p))
+		return prev_cpu;
+
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
 		/*
@@ -9564,7 +9619,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	 */
 	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >= 0 &&
 	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
-	    !cpus_share_cache(env->src_cpu, env->dst_cpu))
+	    !cpus_share_cache(env->src_cpu, env->dst_cpu) &&
+	     !valid_target_cpu(env->dst_cpu, p))
 		return 1;
 #endif
 
@@ -10634,6 +10690,48 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Save this sched group's statistics for later use:
+ * the task wakeup and load balance paths can make
+ * better decisions based on these statistics.
+ */
+static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+			     struct sched_group *group)
+{
+	/* Find the sched domain that spans this group. */
+	struct sched_domain *sd = env->sd->child;
+	struct sched_domain_shared *sd_share;
+	u64 last_nr;
+
+	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* only care about the sched domain that spans exactly 1 LLC */
+	if (!sd || !(sd->flags & SD_SHARE_LLC) ||
+	    !sd->parent || (sd->parent->flags & SD_SHARE_LLC))
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+				   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	last_nr = READ_ONCE(sd_share->nr_avg);
+	update_avg(&last_nr, sgs->sum_nr_running);
+
+	if (likely(READ_ONCE(sd_share->util_avg) != sgs->group_util))
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	WRITE_ONCE(sd_share->nr_avg, last_nr);
+}
+#else
+static inline void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+				    struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10723,6 +10821,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	update_sg_if_llc(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.25.1

