Message-ID: <CAKfTPtDcQH==23tWEXNz=dFujBvBC88NESp4KcwEPf83_s6PjA@mail.gmail.com>
Date: Wed, 11 Dec 2024 14:14:37 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: zhouzihan30 <15645113830zzh@...il.com>
Cc: peterz@...radead.org, bsegall@...gle.com, dietmar.eggemann@....com, 
	juri.lelli@...hat.com, linux-kernel@...r.kernel.org, mgorman@...e.de, 
	mingo@...hat.com, rostedt@...dmis.org, vschneid@...hat.com, yaozhenguo@...com, 
	zhouzihan30@...com
Subject: Re: [PATCH] sched: Forward deadline for early tick

On Wed, 27 Nov 2024 at 08:06, zhouzihan30 <15645113830zzh@...il.com> wrote:
>
> From: zhouzihan <zhouzihan30@...com>
>
> Thank you very much for your reply; it has given me a lot to
>  think about.
>
> I have reconsidered this issue and believe that the root cause is that
>  it is difficult, and unnecessary, for the kernel to implement ideal
>  EEVDF on real hardware:
> for example,
> ideal EEVDF may cause frequent context switching, whose cost means the
>  kernel cannot really do it.
>
> So I see that the kernel has a very concise and clever implementation:
>  update_deadline, which lets each task use up its request size
>  in one go as far as possible.
>
> Here the kernel actually slightly violates EEVDF: for the time being,
>  we are no longer concerned with whether a task is eligible.
>
> In the previous patch, it was mentioned that due to tick errors, some
>  tasks run longer than requested. So we could do this: when a task's vruntime

Could you give more details about the tick error?

Could it be that you have irq accounting enabled? In that case
delta_exec will always be lower than the tick because of the time spent
in irq context. As an example, the delta of rq_clock_task is always
less than 1ms on my system, but the delta of rq_clock is sometimes
above and sometimes below 1ms.

This means that the task didn't effectively get its full slice because
of time spent in IRQ context. Would it be better to set a default slice
slightly lower than an integer number of ticks?
>  approaches the deadline, we check whether the task is eligible.
> If it is eligible, we follow the previous logic and do not reschedule it.
> However, if it is ineligible, we reschedule it, because EEVDF has the
>  responsibility not to run ineligible tasks.
>
> In other words, the kernel has granted the task a "benefit" based on the
>  actual situation, and we still have the right to revoke that benefit.
>
> In this way, it actually brings the kernel closer to ideal EEVDF,
> and at the same time, your reply made me realize my mistake:
> the deadline should be updated using the following function,
> which is more reasonable:
>     vd_i += r_i / w_i
> Using it, our scheduler is still fair,
>  and each task obtains its time according to its weight.
>
> About the tolerance, I think min(vslice/128, tick/2) is better;
> as your reply points out, vslice may be too big, so cap it.
>
> However, there is still a new issue here:
> if se 1 has its deadline forwarded early because it is ineligible,
> and then se 2 runs, then after se 2 finishes,
> se 1 becomes eligible, but its deadline has already been pushed back,
> so it is no longer the earliest eligible entity,
> and se 1 cannot run. This seems not to comply with the EEVDF rules.
>
> But I think this is acceptable. In the past, the logic caused a task to
>  run an extra tick (ms), which meant other tasks had to wait an
>  extra tick. Now a task may run slightly less time (us), but it will be
>  compensated in the next scheduling round. In terms of overall accuracy,
>  I think this is an improvement.
>
> By the way, we might try to solve this by delaying the deadline update,
> i.e. schedule the task but do not update its deadline
> until the next time it runs.
> I am not sure it is necessary to implement such complex logic here.
> I think it is actually unnecessary,
> because ideal EEVDF may not be suitable for the kernel,
> and it is not worth spending so much to correct such a small error.
> A truly accurate system would have no tick error and could simply
>  disable the FORWARD_DEADLINE feature.
> Of course, if you think it is necessary, I am willing to try to implement it.
>
> Signed-off-by: zhouzihan30 <zhouzihan30@...com>
> ---
>  kernel/sched/fair.c     | 41 +++++++++++++++++++++++++++++++++++++----
>  kernel/sched/features.h |  7 +++++++
>  2 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d16c8545c71..e6e58c7d6d4c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>   */
>  static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       if ((s64)(se->vruntime - se->deadline) < 0)
> -               return false;
> +
> +       u64 vslice;
> +       u64 tolerance = 0;
> +       u64 next_deadline;
>
>         /*
>          * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -1016,11 +1018,42 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>          */
>         if (!se->custom_slice)
>                 se->slice = sysctl_sched_base_slice;
> +       vslice = calc_delta_fair(se->slice, se);
> +
> +       next_deadline = se->vruntime + vslice;
> +
> +       if (sched_feat(FORWARD_DEADLINE))
> +               tolerance = min(vslice>>7, TICK_NSEC/2);
> +
> +       if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
> +               return false;
>
>         /*
> -        * EEVDF: vd_i = ve_i + r_i / w_i
> +        * When se->vruntime + tolerance - se->deadline >= 0
> +        * but se->vruntime - se->deadline < 0,
> +        * there are two cases, depending on whether the entity
> +        * is eligible:
> +        * if the entity is not eligible, we need not wait for the
> +        * deadline, because EEVDF does not guarantee that an
> +        * ineligible entity can run its full request in one go.
> +        * If it is eligible, just let it run, as before.
>          */
> -       se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +
> +       if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
> +               if (entity_eligible(cfs_rq, se)) {
> +                       return false;
> +               } else {
> +                        /*
> +                         * Here the entity has not used up its request,
> +                         * but since it is not eligible we need not run it.
> +                         * Let vd_i += r_i / w_i to keep the scheduler fair.
> +                         */
> +                       next_deadline = se->deadline + vslice;
> +               }
> +       }
> +
> +
> +       se->deadline = next_deadline;
>
>         /*
>          * The task has consumed its request, reschedule.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 290874079f60..5c74deec7209 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
>   */
>  SCHED_FEAT(PREEMPT_SHORT, true)
>
> +/*
> + * For cases where the tick fires earlier than expected,
> + * forward the deadline.
> + */
> +SCHED_FEAT(FORWARD_DEADLINE, true)
> +
> +
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed
>   * wakeup-preemption), since its likely going to consume data we
> --
> 2.39.3 (Apple Git-146)
>
