linux-kernel - [PATCH V2] sched: Forward deadline for early tick

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20241225054030.77765-1-15645113830zzh@gmail.com>
Date: Wed, 25 Dec 2024 13:40:31 +0800
From: zihan zhou <15645113830zzh@...il.com>
To: mingo@...hat.com,
	peterz@...radead.org,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	dietmar.eggemann@....com,
	rostedt@...dmis.org,
	bsegall@...gle.com,
	mgorman@...e.de,
	vschneid@...hat.com
Cc: linux-kernel@...r.kernel.org,
	zihan zhou <15645113830zzh@...il.com>,
	yaozhenguo <yaozhenguo@...com>,
	yaowenchao <yaowenchao@...com>
Subject: [PATCH V2] sched: Forward deadline for early tick

Due to the tick error, the eevdf scheduler exhibits unexpected behavior.
For example, a machine with sysctl_sched_base_slice=3ms, CONFIG_HZ=1000
 should trigger a tick every 1ms. A se (sched_entity) with default weight
 1024 should theoretically reach its deadline on the third tick.

However, the tick often arrives a little faster than expected. In this
 case, the se can only wait until the next tick to consider that it has
 reached its deadline, and will run 1ms longer.

vruntime + sysctl_sched_base_slice =     deadline
        |-----------|-----------|-----------|-----------|
             1ms          1ms         1ms         1ms
                   ^           ^           ^           ^
                 tick1       tick2       tick3       tick4(nearly 4ms)

Here is a simple example of this scenario,
where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
and "while :; do :; done &" is run twice with pids 72112 and 72113.
According to the design of eevdf,
both should run for 3ms each, but often they run for 4ms.

         time    cpu  task name      wait time  sch delay   run time
                      [tid/pid]         (msec)     (msec)     (msec)
------------- ------  -------------  ---------  ---------  ---------
 56696.846136 [0001]  perf[72368]        0.000      0.000      0.000
 56696.849378 [0001]  bash[72112]        0.000      0.000      3.241
 56696.852379 [0001]  bash[72113]        0.000      0.000      3.000
 56696.852964 [0001]  sleep[72369]       0.000      6.261      0.584
 56696.856378 [0001]  bash[72112]        3.585      0.000      3.414
 56696.860379 [0001]  bash[72113]        3.999      0.000      4.000 <
 56696.864379 [0001]  bash[72112]        4.000      0.000      4.000 <
 56696.868377 [0001]  bash[72113]        4.000      0.000      3.997 <
 56696.871378 [0001]  bash[72112]        3.997      0.000      3.000
 56696.874377 [0001]  bash[72113]        3.000      0.000      2.999
 56696.877377 [0001]  bash[72112]        2.999      0.000      2.999
 56696.881377 [0001]  bash[72113]        2.999      0.000      3.999 <

There are two reasons for tick error: clockevent precision and the 
CONFIG_IRQ_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick
 will less than 1ms, but even without it, because of clockevent precision,
tick still often less than 1ms. In the system above, there is no such
 config, but the task still often takes more than 3ms.

To solve this problem, we add a sched feature FORWARD_DEADLINE,
consider forwarding the deadline appropriately. When vruntime is very
close to the deadline, and the task is ineligible, we consider that task
 should be resched, the tolerance is set to min(vslice/128, tick/2).

when open FORWARD_DEADLINE,
 the task will run once every 3ms as designed by eevdf:

       time    cpu  task name         wait time  sch delay   run time
                    [tid/pid]            (msec)     (msec)     (msec)
----------- ------  ----------------  ---------  ---------  ---------
 207.207293 [0001]  bash[1699]            3.998      0.000      3.000 
 207.210294 [0001]  bash[1694]            3.000      0.000      3.000 
 207.213300 [0001]  bash[1699]            3.000      0.000      3.006 
 207.216293 [0001]  bash[1694]            3.006      0.000      2.993 
 207.219293 [0001]  bash[1699]            2.993      0.000      3.000 
 207.222293 [0001]  bash[1694]            3.000      0.000      2.999 
 207.225293 [0001]  bash[1699]            2.999      0.000      3.000 
 207.228293 [0001]  bash[1694]            3.000      0.000      3.000 
 207.231293 [0001]  bash[1699]            3.000      0.000      3.000 
 207.234293 [0001]  bash[1694]            3.000      0.000      2.999 
 207.237292 [0001]  bash[1699]            2.999      0.000      2.999 
 207.240293 [0001]  bash[1694]            2.999      0.000      3.000 
 207.243293 [0001]  bash[1699]            3.000      0.000      3.000 


Signed-off-by: zihan zhou <15645113830zzh@...il.com>
Signed-off-by: yaozhenguo <yaozhenguo@...com>
Signed-off-by: yaowenchao <yaowenchao@...com>
---
v2: 1. just forward deadline for ineligible task.
    2. for ineligible task, vd_i = vd_i + r_i / w_i, but for eligible task,
       vd_i = ve_i + r_i / w_i, which is the same as before.
    3. tolerance = min(vslice>>7, TICK_NSEC/2), prevent scheduling errors 
       from increasing when vslice is too large relative to tick.
---
 kernel/sched/fair.c     | 42 +++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |  7 +++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..9cc52f632bb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if ((s64)(se->vruntime - se->deadline) < 0)
-		return false;
+
+	u64 vslice;
+	u64 tolerance = 0;
+	u64 next_deadline;
 
 	/*
 	 * For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1018,43 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (!se->custom_slice)
 		se->slice = sysctl_sched_base_slice;
+	vslice = calc_delta_fair(se->slice, se);
+
+	/*
+	 * vd_i = ve_i + r_i / w_i
+	 */
+	next_deadline = se->vruntime + vslice;
+
+	if (sched_feat(FORWARD_DEADLINE))
+		tolerance = min(vslice>>7, TICK_NSEC/2);
+
+	if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+		return false;
 
 	/*
-	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 * when se->vruntime + tolerance - se->deadline >= 0
+	 * but se->vruntime - se->deadline < 0,
+	 * there is two case: if entity is eligible?
+	 * if entity is not eligible, we don't need wait deadline, because
+	 * eevdf don't guarantee
+	 * an ineligible entity can exec its request time in one go.
+	 * but when entity eligible, just let it run, which is the
+	 * same processing logic as before.
 	 */
-	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+	if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
+
+		if (entity_eligible(cfs_rq, se))
+			return false;
+
+		/*
+		 * vd_i = vd_i + r_i / w_i
+		 */
+		next_deadline = se->deadline + vslice;
+	}
+
+
+	se->deadline = next_deadline;
 
 	/*
 	 * The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..5c74deec7209 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
  */
 SCHED_FEAT(PREEMPT_SHORT, true)
 
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
 /*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
-- 
2.39.3 (Apple Git-146)