Message-ID: <20241126103513.GK38837@noisy.programming.kicks-ass.net>
Date: Tue, 26 Nov 2024 11:35:13 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: zhouzihan30 <15645113830zzh@...il.com>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
zhouzihan30 <zhouzihan30@...com>, yaozhenguo <yaozhenguo@...com>
Subject: Re: [PATCH] sched: Forward deadline for early tick
On Tue, Nov 26, 2024 at 02:23:50PM +0800, zhouzihan30 wrote:
> Due to limited tick-time accuracy,
> the EEVDF scheduler exhibits unexpected behavior.
> For example, a machine with sysctl_sched_base_slice=3ms and CONFIG_HZ=1000
> should trigger a tick every 1ms.
> A se (sched_entity) with the default weight of 1024 should
> theoretically reach its deadline on the third tick.
> However, each tick often arrives slightly earlier than expected.
> In that case, the se has to wait until the next tick before it is
> considered to have reached its deadline, and so runs 1ms longer.
>
>  vruntime + sysctl_sched_base_slice = deadline
>  |-----------|-----------|-----------|-----------|
>        1ms         1ms         1ms         1ms
>         ^           ^           ^           ^
>       tick1       tick2       tick3       tick4 (nearly 4ms)
>
> Here is a simple example of this scenario,
> where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
> the CPU is an Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
> and "while :; do :; done &" is run twice, with pids 72112 and 72113.
> According to the EEVDF design, each should run for 3ms at a time,
> but they often run for 4ms.
>
> The root cause is that the actual length of each tick is less than 1ms.
> We believe there are two reasons for this:
> Reason 1:
> Hardware error.
> The clock of an ordinary computer cannot guarantee perfectly accurate timing.
> Reason 2:
> Calculation error.
> Many clocks derive time indirectly from a cycle count; the conversion
> necessarily has an error, and the result comes out smaller than the
> actual value. The kernel code is:
>
> clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
> dev->set_next_event((unsigned long) clc, dev);
>
> To solve this problem,
> we add a sched feature, FORWARD_DEADLINE,
> which forwards the deadline when appropriate:
> when vruntime is very close to the deadline,
> we treat the deadline as having been reached.
> The tolerance is set to vslice/128.
> On our machine the tick error is never greater than this tolerance,
> and an error of less than 1%
> should not affect the fairness of the scheduler.
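
For illustration, a minimal user-space sketch of the floor-style check
described above; the 0.997ms tick length is an assumed figure, not a
measurement from this thread:

#include <stdio.h>

int main(void)
{
	const double slice = 3.0;	/* sysctl_sched_base_slice, in ms */
	const double tick  = 0.997;	/* actual tick length, in ms      */
	double runtime = 0.0;
	int ticks = 0;

	/*
	 * Current behaviour: the deadline only counts as reached once
	 * runtime >= slice, so the slightly early ticks push the task
	 * into a fourth period (~3.988ms instead of 3ms).
	 */
	while (runtime < slice) {
		ticks++;
		runtime += tick;
	}
	printf("preempted on tick %d after %.3f ms (requested %.1f ms)\n",
	       ticks, runtime, slice);

	return 0;
}
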
*sigh*

So yeah, discrete system, we get to quantize things. Timing has an
error proportional to the quantum chosen.

The current setup is the trivial whole-integer or floor function. As such
the absolute error e is 0 < e < q -- for a reason, see below.

You then introduce an additional error of r/128 in order to minimize
the total error, except you made it worse. Consider the case where
r/128 > q.

The only semi-sane thing to do here is to replace floor with rounding;
then the absolute error changes to 0 < e < q/2.
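
As a worked example of those two bounds (numbers chosen for illustration,
not taken from this thread): with an actual tick q just under 1ms and a
request r of 3ms, running on to the next tick boundary overshoots by
almost a full tick, while stopping at the nearest boundary keeps the
error within half a tick either way.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int64_t q = 997000;	/* actual tick, in ns (a bit under 1ms) */
	const int64_t r = 3000000;	/* request, in ns (nominally 3 ticks)   */

	int64_t up      = ((r + q - 1) / q) * q;	/* current: run to the next tick */
	int64_t nearest = ((r + q / 2) / q) * q;	/* rounded: stop at the nearest  */

	printf("next tick:    %lld ns, error %+lld (0 <= e < q)\n",
	       (long long)up, (long long)(up - r));
	printf("nearest tick: %lld ns, error %+lld (|e| <= q/2)\n",
	       (long long)nearest, (long long)(nearest - r));

	return 0;
}
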
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..d01b73e3f5aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,7 +1006,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if ((s64)(se->vruntime - se->deadline) < 0)
+	s64 v_half_tick = calc_delta_fair(TICK_NSEC/2, se);
+
+	if ((s64)(se->vruntime - se->deadline) < -v_half_tick)
 		return false;
 
 	/*
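
For reference: calc_delta_fair() scales a wall-clock delta by
NICE_0_LOAD/weight into the entity's virtual time, so v_half_tick above
is half a tick expressed in virtual time -- for a nice-0 task it is
simply TICK_NSEC/2. A rough user-space rendering of that scaling, not
the kernel's fixed-point implementation (HZ=1000 assumed, weights from
the sched_prio_to_weight[] table):

#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD	1024ULL
#define TICK_NSEC	1000000ULL	/* assumes HZ=1000 */

/* delta scaled into the entity's virtual time: delta * NICE_0_LOAD / weight */
static uint64_t calc_delta_fair_approx(uint64_t delta, uint64_t weight)
{
	return delta * NICE_0_LOAD / weight;
}

int main(void)
{
	printf("nice  0 (weight 1024): v_half_tick = %llu ns\n",
	       (unsigned long long)calc_delta_fair_approx(TICK_NSEC / 2, 1024));
	printf("nice -5 (weight 3121): v_half_tick = %llu ns\n",
	       (unsigned long long)calc_delta_fair_approx(TICK_NSEC / 2, 3121));
	printf("nice  5 (weight  335): v_half_tick = %llu ns\n",
	       (unsigned long long)calc_delta_fair_approx(TICK_NSEC / 2, 335));

	return 0;
}
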
Now the consequence of doing this is that you will end up pushing a
task's (virtual) deadline into the future before it has had time
to consume its full request, meaning its effective priority drops before
completion.

While that does not harm fairness, it does harm the guarantees provided
by EEVDF, such as they are.

Additionally, since update_deadline() is not proper (see its comment),
you're making the cumulative error significantly worse. A better version
would be something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..377e61be2cf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1000,13 +1000,12 @@ int sched_update_scaling(void)
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
-/*
- * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
- * this is probably good enough.
- */
 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if ((s64)(se->vruntime - se->deadline) < 0)
+	s64 v_half_tick = calc_delta_fair(TICK_NSEC/2, se);
+	s64 v_slice;
+
+	if ((s64)(se->vruntime - se->deadline) < -v_half_tick)
 		return false;
 
 	/*
@@ -1017,10 +1016,14 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (!se->custom_slice)
 		se->slice = sysctl_sched_base_slice;
 
+	v_slice = calc_delta_fair(se->slice, se);
+
 	/*
-	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 * EEVDF: vd_i += r_i / w_i
 	 */
-	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+	do {
+		se->deadline += v_slice;
+	} while ((s64)(se->vruntime - se->deadline) >= -v_half_tick);
 
 	/*
 	 * The task has consumed its request, reschedule.
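
To see why the += form matters, a stand-alone illustration (not kernel
code; the 0.997ms tick is again an assumed figure): re-anchoring the
deadline to vruntime forgets however much the task fell short of its
request each period, so the shortfall grows without bound, while
accumulating the deadline keeps it on a fixed grid and the error stays
within about half a tick.

#include <stdio.h>

#define PERIODS	200

int main(void)
{
	const double tick = 0.997, half = tick / 2, slice = 3.0;	/* ms */
	double vruntime, deadline;
	int p;

	/* Rule A: deadline = vruntime + slice (re-anchored every period). */
	vruntime = 0.0;
	deadline = slice;
	for (p = 0; p < PERIODS; p++) {
		while (vruntime - deadline < -half)	/* run until "reached" */
			vruntime += tick;
		deadline = vruntime + slice;
	}
	printf("re-anchored: ran %8.3f ms, ideal %8.3f ms\n",
	       vruntime, PERIODS * slice);

	/* Rule B: deadline += slice until it leads vruntime again. */
	vruntime = 0.0;
	deadline = slice;
	for (p = 0; p < PERIODS; p++) {
		while (vruntime - deadline < -half)
			vruntime += tick;
		do {
			deadline += slice;
		} while (vruntime - deadline >= -half);
	}
	printf("accumulated: ran %8.3f ms, ideal %8.3f ms\n",
	       vruntime, PERIODS * slice);

	return 0;
}
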
Now, whether all this is worth it, I've no idea.

Since you clearly do not care about the request size, your simplest
solution is to simply decrease it.
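
For completeness, the per-task way to do that is the custom-slice
interface the diff above already tests for (se->custom_slice):
sched_setattr() with sched_attr::sched_runtime set for a SCHED_OTHER
task, on kernels that have the custom slice support. A rough, untested
sketch, using an arbitrary 2ms slice and a locally defined struct
mirroring the uapi layout:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sched.h>		/* SCHED_OTHER */
#include <sys/syscall.h>	/* SYS_sched_setattr */

/* Mirrors struct sched_attr from linux/sched/types.h (uapi). */
struct my_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* for SCHED_OTHER: slice suggestion, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	struct my_sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size          = sizeof(attr);
	attr.sched_policy  = SCHED_OTHER;
	attr.sched_runtime = 2 * 1000 * 1000;	/* ask for a 2ms slice */

	if (syscall(SYS_sched_setattr, 0, &attr, 0))
		perror("sched_setattr");

	return 0;
}

The global default can likewise be lowered through the debugfs
base_slice_ns knob on SCHED_DEBUG kernels.
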
Also, none of this has been near a compiler, so please double-check the
math if you want to proceed with any of this; I've given it about as
much thought as compile time :-)