Message-Id: <20241219091207.2001051-1-pierre.gondois@arm.com>
Date: Thu, 19 Dec 2024 10:12:06 +0100
From: Pierre Gondois <pierre.gondois@....com>
To: linux-kernel@...r.kernel.org
Cc: Christian Loehle <christian.loehle@....com>,
Hongyan Xia <hongyan.xia2@....com>,
Pierre Gondois <pierre.gondois@....com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: [PATCH] sched/fair: Decrease util_est in presence of idle time
The util_est signal does not decay if the task's utilization is lower
than its runnable signal by more than a margin of ~10. This was done to
keep the util_est signal high in case a task shares a rq with another
task and doesn't obtain the running time it desires.
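For reference, the current check in util_est_update() is (simplified;
UTIL_EST_MARGIN is SCHED_CAPACITY_SCALE / 100, i.e. ~10 on the 1024
capacity scale):
    if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
            goto done;  /* skip the EWMA update, keep util_est high */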
However, tasks sharing a rq do obtain the running time they desire
provided that the rq has some idle time (the 'no idle time' test is
sketched after the list below). Indeed, either:
- a CPU is always running. The utilization signal of tasks reflects
  the running time they obtained, and this running time depends on the
  niceness of the tasks. A decreasing utilization signal doesn't
  reflect a decrease in the task's activity, so the util_est signal
  should not be decayed in this case.
- a CPU is not always running (i.e. there is some idle time). Tasks
  might be waiting to run, increasing their runnable signal, but they
  eventually run to completion. A decreasing utilization signal does
  reflect a decrease in the task's activity, so the util_est signal
  should be decayed in this case.
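The 'no idle time' test used below (rq_no_idle_pelt() in the diff)
amounts to checking that the summed PELT util_sum of the CFS, RT and
DL classes is pegged at the value a permanently running rq converges
to. Assuming LOAD_AVG_MAX = 47742 (the PELT geometric series maximum),
the threshold is roughly:
    ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX
        = 46718 * 1024 - 47742
       ~= 4.78e7
which corresponds to a rq util_avg of ~1024, i.e. no idle time was
observed in the recent past.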
------------
Running on a 1024 capacity CPU:
- TaskA:
- duty_cycle=60%, period=16ms, duration=2s
- TaskB:
- sleep=2ms (to let TaskA run first), duration=1s
- first phase: duty_cycle=20%, period=16ms, duration=1s
- second phase: duty_cycle=5%, period=16ms, duration=1s
When TaskB starts the second phase, the util_est signal can take up to
~150ms before starting to decrease. Indeed, if TaskB runs after
TaskA, its runnable signal will be higher than its util signal by more
than 10 units.
This creates unnecessary frequency spikes: upon enqueuing TaskB, the
cfs_rq's util_est is increased by TaskB's old util_est value, which
impacts frequency selection through:
  sugov_get_util()
  \-cpu_util_cfs_boost()
    \-cpu_util()
        util_est = READ_ONCE(cfs_rq->avg.util_est);
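The relevant part of cpu_util() can be sketched as follows (heavily
simplified; uclamp and capacity clamping omitted). A stale, inflated
util_est directly raises the utilization seen by schedutil:
    util     = READ_ONCE(cfs_rq->avg.util_avg);
    util_est = READ_ONCE(cfs_rq->avg.util_est);
    util     = max(util, util_est);  /* the overestimated util_est wins */
    /* sugov then requests a frequency proportional to this value */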
------------
Energy impact can also be measured as suggested by Hongyan at [1].
On a Pixel6, the following workload is run 10 times:
- TaskA:
- duty_cycle=20%, duration=0.4s
- one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
- Used to increase the runnable signals of TaskB
- TaskB:
- sleep=2ms (to let TaskA run first)
- first phase: duty_cycle=20%, duration=0.2s
- second phase: duty_cycle=5%, duration=0.2s
- 4 occurrences of TaskB
The duration of the workload is kept low (0.4s) to emphasize the
impact of continuing to run at an overestimated frequency.
Energy measured with energy counters:
┌────────────┬────────────┬───────────┬───────────┬───────────┐
│ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
│ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆ -9.16 │
└────────────┴────────────┴───────────┴───────────┴───────────┘
Energy computed from util signals and energy model:
┌────────────┬────────────┬───────────┬───────────┬───────────┐
│ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
│ 4.8318e9 ┆ 4.0823e9 ┆ 5.1044e8 ┆ 7.5558e8 ┆ -15.51 │
└────────────┴────────────┴───────────┴───────────┴───────────┘
------------
The initial patch [2] aimed to solve an issue detected while running
speedometer 2.0 [3]. Three versions are compared while running
speedometer 2.0 on a Pixel6:
- base: the current version
- patch: the new version, with this patch applied
- revert: the initial version, with commit [2] reverted
Score (higher is better):
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 0.57 ┆ 0.49 ┆ 0.58 │
└───────────┴───────────┴────────────┘
Energy measured with energy counters:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 1347.13 ┆ 2431.67 ┆ 510.88 │
└───────────┴───────────┴────────────┘
Energy computed from util signals and energy model:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
└───────────┴───────────┴────────────┘
OU ratio in % (ratio of time being overutilized over total time).
The test lasts ~65s:
┌────────────┬────────────┬─────────────┐
│ base mean ┆ patch mean ┆ revert mean │
╞════════════╪════════════╪═════════════╡
│ 63.39% ┆ 12.48% ┆ 12.28% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┬─────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪═════════════╡
│ 0.97 ┆ 0.28 ┆ 0.88 │
└───────────┴───────────┴─────────────┘
The energy gain can be explained by the fact that, with the base
version, the system is overutilized during most of the test.
During the test, the base condition evaluates to true ~40% of the
time, while the new condition evaluates to true ~2% of the time.
Preventing util_est signals from decaying with the base condition
has a significant impact on the overutilized state, due to an
overestimation of the resulting utilization of tasks.
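For context, the overutilized check is roughly the following
(simplified sketch, uclamp ignored). Since cpu_util_cfs() includes
util_est, an overestimated util_est pushes CPUs over the ~80%
threshold and disables EAS task placement:
    #define fits_capacity(cap, max)  ((cap) * 1280 < (max) * 1024)
    if (!fits_capacity(cpu_util_cfs(cpu), arch_scale_cpu_capacity(cpu)))
            /* the CPU is considered overutilized */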
The score is impacted by the patch, but:
- slightly lower scores are expected with EAS running more often
- the base version makes the system run at higher frequencies by
  overestimating task utilization, so it is expected to yield higher
  scores
------------
Let util_est decay when the rq has some idle time, instead of relying
on the delta between the util and runnable signals.
[1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@arm.com/
[2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@linaro.org/
[3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
Signed-off-by: Pierre Gondois <pierre.gondois@....com>
---
kernel/sched/fair.c | 2 +-
kernel/sched/pelt.h | 23 ++++++++++++++++-------
2 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..d058ab29e52e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
+ if (rq_no_idle_pelt(rq_of(cfs_rq)))
goto done;
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index f4f6a0875c66..eaf34f95fd93 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
}
/*
- * When rq becomes idle, we have to check if it has lost idle time
- * because it was fully busy. A rq is fully used when the /Sum util_sum
- * is greater or equal to:
+ * A rq is fully used (or is considered to not have enough idle time)
+ * when the /Sum util_sum is greater or equal to:
* (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
* For optimization and computing rounding purpose, we don't take into account
* the position in the current window (period_contrib) and we use the higher
* bound of util_sum to decide.
*/
-static inline void update_idle_rq_clock_pelt(struct rq *rq)
+static inline bool rq_no_idle_pelt(struct rq *rq)
{
- u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
- u32 util_sum = rq->cfs.avg.util_sum;
+ u32 util_sum, divider;
+
+ divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
+ util_sum = rq->cfs.avg.util_sum;
util_sum += rq->avg_rt.util_sum;
util_sum += rq->avg_dl.util_sum;
+ return util_sum >= divider;
+}
+/*
+ * When rq becomes idle, we have to check if it has lost idle time
+ * because it was fully busy.
+ */
+static inline void update_idle_rq_clock_pelt(struct rq *rq)
+{
/*
* Reflecting stolen time makes sense only if the idle
* phase would be present at max capacity. As soon as the
@@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
* this case. We keep track of this lost idle time compare to
* rq's clock_task.
*/
- if (util_sum >= divider)
+ if (rq_no_idle_pelt(rq))
rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
_update_idle_rq_clock_pelt(rq);
--
2.25.1