Message-Id: <20241126062350.88183-1-zhouzihan30@jd.com>
Date: Tue, 26 Nov 2024 14:23:50 +0800
From: zhouzihan30 <15645113830zzh@...il.com>
To: mingo@...hat.com,
	peterz@...radead.org,
	juri.lelli@...hat.com,
	vincent.guittot@...aro.org,
	dietmar.eggemann@....com,
	rostedt@...dmis.org,
	bsegall@...gle.com,
	mgorman@...e.de,
	vschneid@...hat.com
Cc: linux-kernel@...r.kernel.org,
	zhouzihan30 <zhouzihan30@...com>,
	yaozhenguo <yaozhenguo@...com>
Subject: [PATCH] sched: Forward deadline for early tick

Because ticks are not perfectly accurate,
the EEVDF scheduler exhibits unexpected behavior.
For example, a machine with sysctl_sched_base_slice=3ms and CONFIG_HZ=1000
should trigger a tick every 1ms.
A se (sched_entity) with the default weight of 1024 should
theoretically reach its deadline on the third tick.
However, the tick often arrives slightly earlier than expected.
In that case, the se is not considered to have reached its deadline
until the next tick, so it runs 1ms longer.

vruntime + sysctl_sched_base_slice =     deadline
        |-----------|-----------|-----------|-----------|
             1ms          1ms         1ms         1ms
                   ^           ^           ^           ^
                 tick1       tick2       tick3       tick4(nearly 4ms)
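
The effect in the diagram can be reproduced with a small userspace sketch.
This is not kernel code: ticks_until_deadline() is a hypothetical helper,
it assumes a nice-0 entity whose vruntime advances at wall-clock rate,
and the 999900ns tick period in the usage note is an illustrative value,
not a measurement.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: count how many ticks pass
 * before the accumulated runtime reaches the slice, given that the
 * deadline is only checked at tick boundaries.
 */
static unsigned int ticks_until_deadline(uint64_t slice_ns, uint64_t tick_ns)
{
	uint64_t runtime_ns = 0;
	unsigned int ticks = 0;

	if (tick_ns == 0)	/* avoid looping forever on a bogus period */
		return 0;

	while (runtime_ns < slice_ns) {
		runtime_ns += tick_ns;	/* one tick of runtime is accounted */
		ticks++;
	}
	return ticks;
}
```

With an exact 1000000ns tick, a 3000000ns slice ends on tick 3.
With a 999900ns tick, only 2999700ns has been accounted at tick 3,
so the slice ends on tick 4.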

Here is a simple example of this scenario,
where sysctl_sched_base_slice=3ms and CONFIG_HZ=1000,
the CPU is an Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
and "while :; do :; done &" is run twice, with pids 72112 and 72113.
According to the design of EEVDF,
each should run for 3ms at a time, but they often run for 4ms.

         time    cpu  task name      wait time  sch delay   run time
                      [tid/pid]         (msec)     (msec)     (msec)
------------- ------  -------------  ---------  ---------  ---------
 56696.846136 [0001]  perf[72368]        0.000      0.000      0.000
 56696.849378 [0001]  bash[72112]        0.000      0.000      3.241
 56696.852379 [0001]  bash[72113]        0.000      0.000      3.000
 56696.852964 [0001]  sleep[72369]       0.000      6.261      0.584
 56696.856378 [0001]  bash[72112]        3.585      0.000      3.414
 56696.860379 [0001]  bash[72113]        3.999      0.000      4.000
 56696.864379 [0001]  bash[72112]        4.000      0.000      4.000
 56696.868377 [0001]  bash[72113]        4.000      0.000      3.997
 56696.871378 [0001]  bash[72112]        3.997      0.000      3.000
 56696.874377 [0001]  bash[72113]        3.000      0.000      2.999
 56696.877377 [0001]  bash[72112]        2.999      0.000      2.999
 56696.881377 [0001]  bash[72113]        2.999      0.000      3.999

The cause of this problem is that
the actual time between ticks is slightly less than 1ms.
We believe there are two reasons:
Reason 1:
Hardware error.
The clock of an ordinary computer cannot guarantee perfectly accurate time.
Reason 2:
Calculation error.
Many clock devices are programmed indirectly, with a number of cycles.
The conversion from time to cycles rounds down,
so the programmed interval can be slightly shorter than the requested one.
The kernel code is:

clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
dev->set_next_event((unsigned long) clc, dev);
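
The truncation in that conversion can be shown in isolation.
The sketch below is userspace code with a hypothetical helper,
ns_to_cycles(), and the mult/shift values in the usage note are toy numbers
chosen for easy arithmetic, not the parameters of any real clock device.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: the same multiply-and-shift
 * conversion as above. The right shift discards the fractional part,
 * so the result is never larger than the exact quotient.
 * The caller must pass shift < 64.
 */
static uint64_t ns_to_cycles(uint64_t delta_ns, uint32_t mult, uint32_t shift)
{
	return (delta_ns * (uint64_t)mult) >> shift;
}
```

With mult=3 and shift=1 (1.5 cycles per ns), a delta of 5ns is exactly
7.5 cycles, but ns_to_cycles(5, 3, 1) returns 7, so the event would be
programmed slightly early.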

To solve this problem,
we add a sched feature, FORWARD_DEADLINE,
which moves the deadline forward slightly.
When vruntime is very close to the deadline,
we consider the deadline to have been reached.
The tolerance is set to vslice/128.
On our machine, the tick error is never greater than this tolerance,
and an error of less than 1%
should not affect the fairness of the scheduler.
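
As a worked example of the tolerance, here is a userspace sketch of the
check. deadline_reached() is a hypothetical helper that only mirrors the
comparison described above, and the numbers in the usage note assume a
nice-0 entity with vslice = 3000000ns.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: treat the deadline as reached
 * when vruntime is within vslice/128 of it (forward == true), or only
 * when vruntime has actually reached it (forward == false).
 */
static bool deadline_reached(uint64_t vruntime, uint64_t deadline,
			     uint64_t vslice, bool forward)
{
	uint64_t tolerance = forward ? vslice >> 7 : 0;

	/* Signed difference, so the comparison survives u64 wraparound. */
	return (int64_t)(vruntime + tolerance - deadline) >= 0;
}
```

For vslice = 3000000ns the tolerance is 3000000 >> 7 = 23437ns, about 0.78%.
A vruntime of 2999700 against a deadline of 3000000 is not past the deadline
without the tolerance, but is treated as reached with it;
a vruntime of 2970000 is still outside the tolerance.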

With FORWARD_DEADLINE enabled,
each task runs for 3ms at a time, as designed by EEVDF:

         time    cpu  task name         wait time  sch delay   run time
                      [tid/pid]            (msec)     (msec)     (msec)
------------- ------  ----------------  ---------  ---------  ---------
 57110.032369 [0001]  perf[72752]           0.000      0.000      0.000
 57110.035805 [0001]  bash[72112]           0.000      0.000      3.436
 57110.036400 [0001]  sleep[72755]          0.000      3.456      0.594
 57110.039806 [0001]  bash[72113]           0.000      0.000      3.405
 57110.042805 [0001]  bash[72112]           4.000      0.000      2.999
 57110.045811 [0001]  bash[72113]           2.999      0.000      3.005
 57110.045816 [0001]  ksoftirqd/1[98]       0.000      0.001      0.005
 57110.048804 [0001]  bash[72112]           3.010      0.000      2.987
 57110.051805 [0001]  bash[72113]           2.993      0.000      3.001
 57110.054804 [0001]  bash[72112]           3.001      0.000      2.998
 57110.057805 [0001]  bash[72113]           2.998      0.000      3.000
 57110.060804 [0001]  bash[72112]           3.000      0.000      2.999
 57110.063805 [0001]  bash[72113]           2.999      0.000      3.001
 57110.066804 [0001]  bash[72112]           3.001      0.000      2.999
 57110.069805 [0001]  bash[72113]           2.999      0.000      3.000
 57110.072804 [0001]  bash[72112]           3.000      0.000      2.999
---
 kernel/sched/fair.c     | 21 ++++++++++++++++++---
 kernel/sched/features.h |  7 +++++++
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..ea342ef8db26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if ((s64)(se->vruntime - se->deadline) < 0)
-		return false;
+
+	u64 vslice;
+	u64 tolerance = 0;
 
 	/*
 	 * For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1017,25 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (!se->custom_slice)
 		se->slice = sysctl_sched_base_slice;
+	vslice = calc_delta_fair(se->slice, se);
+
+
+	/*
+	 * When the deadline is only 1/128 away,
+	 * it is considered to have been reached.
+	 */
+	if (sched_feat(FORWARD_DEADLINE))
+		tolerance = vslice>>7;
+
+
+
+	if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+		return false;
 
 	/*
 	 * EEVDF: vd_i = ve_i + r_i / w_i
 	 */
-	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	se->deadline = se->vruntime + vslice;
 
 	/*
 	 * The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..fad8e2bbf4ed 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
  */
 SCHED_FEAT(PREEMPT_SHORT, true)
 
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward.
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
 /*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
-- 
2.39.3 (Apple Git-146)
Signed-off-by: zhouzihan30 <zhouzihan30@...com>
Signed-off-by: yaozhenguo <yaozhenguo@...com>

