linux-kernel - [PATCH 3/3] sched/fair: Remove magic hardcoded margin in fits

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20240205223344.2280519-4-qyousef@layalina.io>
Date: Mon,  5 Feb 2024 22:33:44 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>
Cc: linux-kernel@...r.kernel.org,
	Lukasz Luba <lukasz.luba@....com>,
	Wei Wang <wvw@...gle.com>,
	Rick Yiu <rickyiu@...gle.com>,
	Chung-Kai Mei <chungkai@...gle.com>,
	Qais Yousef <qyousef@...alina.io>
Subject: [PATCH 3/3] sched/fair: Remove magic hardcoded margin in fits_capacity()

Replace hardcoded margin value in fits_capacity() with better dynamic
logic.

80% margin is a magic value that has served its purpose for now, but it
no longer fits the variety of systems that exist today. If a system is
over powered specifically, this 80% will mean we leave a lot of capacity
unused before we decide to upmigrate on HMP system.

On many systems the little cores are under powered and ability to
migrate faster away from them is desired.

There are two factors that must define when a task is misfit:

1. Due to invariance, time stretches and the ability of a task to reach
   certain util value requires longer runtime on smaller CPUs.

   This means tasks running on biggest core will require shorter amount
   of time to reach max performance point compared to the smaller cores.
   To counter the impact of this invariance, we must ensure that the
   task migrate faster so that they can reach the max performance point
   at a constant rate regardless where they're running on.

2. Misfit migration relies on TICK to help it to migrate. But as a worst
   case scenario this TICK can happen after we cross the fits_capacity
   threshold. So we must cater for this worst case scenario when
   calculating the threshold so tasks don't end up stuck on the wrong
   CPU which can cause performance and latency problems.

Below table shows the mapping for the misfit threshold without and with
taking the tick (4ms) into account. Note how the margin needs to be very
large at the bottom end, but very small at the higher end.

It is still large in the 600-750 range where many mid cores lie. So
there are contradiction requirements of enabling tasks to experience
coherent ramp-up time across all CPUs and reduce the latency to achieve
max performance point vs slowing down to allow mid clusters to be more
utilized in 'not so busy' situations.

Not sure if the answer lies in misfit definition. But most likely in
better task placement and load balancing strategies that considers other
factors beside misfit.

cap		threshold	%		threshold-tick	%
0		0		0		0		0
16		0		0		0		0
32		1		3.12		0		0
48		3		6.25		2		4.16
64		4		6.25		2		3.12
80		6		7.5		5		6.25
96		10		10.41		8		8.33
112		14		12.5		11		9.82
128		18		14.06		16		12.5
144		21		14.58		18		12.5
160		26		16.25		23		14.37
176		33		18.75		29		16.47
192		39		20.31		35		18.22
208		47		22.59		43		20.67
224		55		24.55		50		22.32
240		63		26.25		59		24.58
256		73		28.51		68		26.56
272		82		30.14		77		28.30
288		93		32.29		87		30.20
304		103		33.88		97		31.90
320		114		35.62		108		33.75
336		126		37.5		120		35.71
352		138		39.20		132		37.5
368		151		41.03		144		39.13
384		163		42.44		157		40.88
400		177		44.25		170		42.5
416		197		47.35		190		45.67
432		212		49.07		204		47.22
448		226		50.44		218		48.66
464		241		51.93		233		50.21
480		263		54.79		255		53.12
496		278		56.04		271		54.63
512		294		57.42		286		55.85
528		317		60.03		309		58.52
544		333		61.21		325		59.74
560		356		63.57		348		62.14
576		380		65.97		372		64.58
592		396		66.89		388		65.54
608		419		68.91		412		67.76
624		443		70.99		435		69.71
640		466		72.81		459		71.71
656		489		74.54		482		73.47
672		512		76.19		505		75.14
688		534		77.61		528		76.74
704		557		79.11		550		78.12
720		578		80.27		572		79.44
736		606		82.33		600		81.52
752		633		84.17		627		83.37
768		653		85.02		647		84.24
784		678		86.47		672		85.71
800		707		88.37		701		87.62
816		729		89.33		724		88.72
832		756		90.86		751		90.26
848		780		91.98		776		91.50
864		803		92.93		799		92.47
880		828		94.09		824		93.63
896		850		94.86		847		94.53
912		874		95.83		871		95.50
928		897		96.65		894		96.33
944		921		97.56		919		97.35
960		943		98.22		941		98.02
976		964		98.77		963		98.66
992		984		99.19		983		99.09
1008		1004		99.60		1004		99.60
1024		1022		99.80		1022		99.80

Signed-off-by: Qais Yousef <qyousef@...alina.io>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |  1 +
 3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index db4be4921e7f..def7af257270 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10004,6 +10004,7 @@ void __init sched_init(void)
 		rq->sd = NULL;
 		rq->rd = NULL;
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
 		rq->balance_callback = &balance_push_callback;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b803030c3a03..630eae0470ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -101,11 +101,22 @@ int __weak arch_asym_cpu_priority(int cpu)
 }
 
 /*
- * The margin used when comparing utilization with CPU capacity.
+ * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
+ * lower capacity for too long. This can happen because of two reasons:
  *
- * (default: ~20%)
+ * 1. Capacity and frequency invariance means the runtime on each CPU is not
+ *    the same. We want the task to experience the same ramp-up time to reach
+ *    max performance point of the system as if they were running on the
+ *    biggest CPU.
+ *
+ * 2. A misfit migration will require a tick as a worst case scenario to
+ *    migrate it to a bigger CPU. So we must cater for this time and make sure
+ *    it migrates before it becomes a true misfit.
  */
-#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, int cpu)
+{
+	return util < cpu_rq(cpu)->fits_capacity_threshold;
+}
 
 /*
  * The margin used when comparing CPU capacities.
@@ -4965,13 +4976,12 @@ static inline int util_fits_cpu(unsigned long util,
 				int cpu)
 {
 	unsigned long capacity_orig, capacity_orig_thermal;
-	unsigned long capacity = capacity_of(cpu);
 	bool fits, uclamp_max_fits;
 
 	/*
 	 * Check if the real util fits without any uclamp boost/cap applied.
 	 */
-	fits = fits_capacity(util, capacity);
+	fits = fits_capacity(util, cpu);
 
 	if (!uclamp_is_used())
 		return fits;
@@ -9522,12 +9532,30 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
+	struct rq *rq = cpu_rq(cpu);
+	u64 limit;
 
 	if (!capacity)
 		capacity = 1;
 
-	cpu_rq(cpu)->cpu_capacity = capacity;
-	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+	rq->cpu_capacity = capacity;
+	trace_sched_cpu_capacity_tp(rq);
+
+	/*
+	 * Calculate the util at which the task must be considered a misfit.
+	 *
+	 * We must ensure that a task experiences the same ramp-up time to
+	 * reach max performance point of the system regardless of the CPU it
+	 * is running on (due to invariance, time will stretch and task will
+	 * take longer to achieve the same util value compared to a task
+	 * running on a big CPU) and a delay in misfit migration which depends
+	 * on TICK doesn't end up hurting it as it can happen after we would
+	 * have crossed this threshold.
+	 */
+	limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
+	limit -= TICK_USEC; /* sd->balance_interval is more accurate */
+	limit = cap_scale(limit, arch_scale_cpu_capacity(cpu));
+	rq->fits_capacity_threshold = approximate_util_avg(0, limit);
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d078e6481bd0..92c29b6d3cc5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1054,6 +1054,7 @@ struct rq {
 	struct sched_domain __rcu	*sd;
 
 	unsigned long		cpu_capacity;
+	unsigned long		fits_capacity_threshold;
 
 	struct balance_callback *balance_callback;
 
-- 
2.34.1