linux-kernel - Re: [PATCH V2] sched: Forward deadline for early tick

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtAKmLw4WqeZ0uJEd33e2mJnaWM8ni3=J7YTSRBVNVymTg@mail.gmail.com>
Date: Thu, 2 Jan 2025 11:00:31 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: zihan zhou <15645113830zzh@...il.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org, 
	yaozhenguo <yaozhenguo@...com>, yaowenchao <yaowenchao@...com>
Subject: Re: [PATCH V2] sched: Forward deadline for early tick

On Wed, 25 Dec 2024 at 06:41, zihan zhou <15645113830zzh@...il.com> wrote:
>
> Due to the tick error, the eevdf scheduler exhibits unexpected behavior.
> For example, a machine with sysctl_sched_base_slice=3ms, CONFIG_HZ=1000
>  should trigger a tick every 1ms. A se (sched_entity) with default weight
>  1024 should theoretically reach its deadline on the third tick.
>
> However, the tick often arrives a little faster than expected. In this
>  case, the se can only wait until the next tick to consider that it has
>  reached its deadline, and will run 1ms longer.
>
> vruntime + sysctl_sched_base_slice =     deadline
>         |-----------|-----------|-----------|-----------|
>              1ms          1ms         1ms         1ms
>                    ^           ^           ^           ^
>                  tick1       tick2       tick3       tick4(nearly 4ms)
>
> Here is a simple example of this scenario,
> where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
> the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
> and "while :; do :; done &" is run twice with pids 72112 and 72113.
> According to the design of eevdf,
> both should run for 3ms each, but often they run for 4ms.
>
>          time    cpu  task name      wait time  sch delay   run time
>                       [tid/pid]         (msec)     (msec)     (msec)
> ------------- ------  -------------  ---------  ---------  ---------
>  56696.846136 [0001]  perf[72368]        0.000      0.000      0.000
>  56696.849378 [0001]  bash[72112]        0.000      0.000      3.241
>  56696.852379 [0001]  bash[72113]        0.000      0.000      3.000
>  56696.852964 [0001]  sleep[72369]       0.000      6.261      0.584
>  56696.856378 [0001]  bash[72112]        3.585      0.000      3.414
>  56696.860379 [0001]  bash[72113]        3.999      0.000      4.000 <
>  56696.864379 [0001]  bash[72112]        4.000      0.000      4.000 <
>  56696.868377 [0001]  bash[72113]        4.000      0.000      3.997 <
>  56696.871378 [0001]  bash[72112]        3.997      0.000      3.000
>  56696.874377 [0001]  bash[72113]        3.000      0.000      2.999
>  56696.877377 [0001]  bash[72112]        2.999      0.000      2.999
>  56696.881377 [0001]  bash[72113]        2.999      0.000      3.999 <
>
> There are two reasons for tick error: clockevent precision and the
> CONFIG_IRQ_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick
>  will less than 1ms, but even without it, because of clockevent precision,
> tick still often less than 1ms. In the system above, there is no such
>  config, but the task still often takes more than 3ms.
>
> To solve this problem, we add a sched feature FORWARD_DEADLINE,
> consider forwarding the deadline appropriately. When vruntime is very
> close to the deadline, and the task is ineligible, we consider that task
>  should be resched, the tolerance is set to min(vslice/128, tick/2).

I'm worried with this approximation because the task didn't get the
slice it has requested because of the stolen time by irq or a shorter
tick duration

>
> when open FORWARD_DEADLINE,
>  the task will run once every 3ms as designed by eevdf:
>
>        time    cpu  task name         wait time  sch delay   run time
>                     [tid/pid]            (msec)     (msec)     (msec)
> ----------- ------  ----------------  ---------  ---------  ---------
>  207.207293 [0001]  bash[1699]            3.998      0.000      3.000
>  207.210294 [0001]  bash[1694]            3.000      0.000      3.000
>  207.213300 [0001]  bash[1699]            3.000      0.000      3.006
>  207.216293 [0001]  bash[1694]            3.006      0.000      2.993
>  207.219293 [0001]  bash[1699]            2.993      0.000      3.000
>  207.222293 [0001]  bash[1694]            3.000      0.000      2.999
>  207.225293 [0001]  bash[1699]            2.999      0.000      3.000
>  207.228293 [0001]  bash[1694]            3.000      0.000      3.000
>  207.231293 [0001]  bash[1699]            3.000      0.000      3.000
>  207.234293 [0001]  bash[1694]            3.000      0.000      2.999
>  207.237292 [0001]  bash[1699]            2.999      0.000      2.999
>  207.240293 [0001]  bash[1694]            2.999      0.000      3.000
>  207.243293 [0001]  bash[1699]            3.000      0.000      3.000
>
>
> Signed-off-by: zihan zhou <15645113830zzh@...il.com>
> Signed-off-by: yaozhenguo <yaozhenguo@...com>
> Signed-off-by: yaowenchao <yaowenchao@...com>
> ---
> v2: 1. just forward deadline for ineligible task.
>     2. for ineligible task, vd_i = vd_i + r_i / w_i, but for eligible task,
>        vd_i = ve_i + r_i / w_i, which is the same as before.
>     3. tolerance = min(vslice>>7, TICK_NSEC/2), prevent scheduling errors
>        from increasing when vslice is too large relative to tick.
> ---
>  kernel/sched/fair.c     | 42 +++++++++++++++++++++++++++++++++++++----
>  kernel/sched/features.h |  7 +++++++
>  2 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d16c8545c71..9cc52f632bb1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>   */
>  static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       if ((s64)(se->vruntime - se->deadline) < 0)
> -               return false;
> +
> +       u64 vslice;
> +       u64 tolerance = 0;
> +       u64 next_deadline;
>
>         /*
>          * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -1016,11 +1018,43 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>          */
>         if (!se->custom_slice)
>                 se->slice = sysctl_sched_base_slice;
> +       vslice = calc_delta_fair(se->slice, se);
> +
> +       /*
> +        * vd_i = ve_i + r_i / w_i
> +        */
> +       next_deadline = se->vruntime + vslice;
> +
> +       if (sched_feat(FORWARD_DEADLINE))
> +               tolerance = min(vslice>>7, TICK_NSEC/2);
> +
> +       if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
> +               return false;
>
>         /*
> -        * EEVDF: vd_i = ve_i + r_i / w_i
> +        * when se->vruntime + tolerance - se->deadline >= 0
> +        * but se->vruntime - se->deadline < 0,
> +        * there is two case: if entity is eligible?
> +        * if entity is not eligible, we don't need wait deadline, because
> +        * eevdf don't guarantee
> +        * an ineligible entity can exec its request time in one go.

Yes but it also didn't say that you can move forward its deadline
before it has consumed its slice request. A task is only ensured to
get a discrete time quanta but can be preempted after each quanta even
if its slice has not elapsed.

What you want, it's to trigger a need_resched after a minimum time
quanta has elapsed but not to update the deadline before the slice has
elapsed.

Now the question is what is the minimum time quanta for us. Should it
be a tick whatever it's real duration for the task ? Should it be
longer ?


> +        * but when entity eligible, just let it run, which is the
> +        * same processing logic as before.
>          */
> -       se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +
> +       if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
> +
> +               if (entity_eligible(cfs_rq, se))
> +                       return false;
> +
> +               /*
> +                * vd_i = vd_i + r_i / w_i
> +                */
> +               next_deadline = se->deadline + vslice;
> +       }
> +
> +
> +       se->deadline = next_deadline;
>
>         /*
>          * The task has consumed its request, reschedule.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 290874079f60..5c74deec7209 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
>   */
>  SCHED_FEAT(PREEMPT_SHORT, true)
>
> +/*
> + * For some cases where the tick is faster than expected,
> + * move the deadline forward
> + */
> +SCHED_FEAT(FORWARD_DEADLINE, true)
> +
> +
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed
>   * wakeup-preemption), since its likely going to consume data we
> --
> 2.39.3 (Apple Git-146)
>