Open Source and information security mailing list archives
Message-ID: <CA+q576OCK24VSp+s4OLD2ogO48i95y39_JO=zV=TwHSEg3_b1w@mail.gmail.com>
Date: Mon, 25 Mar 2024 10:13:17 -0500
From: Youssef Esmat <youssefesmat@...omium.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org, 
	Tobias Huschle <huschle@...ux.ibm.com>, Luis Machado <luis.machado@....com>, 
	Chen Yu <yu.c.chen@...el.com>, Abel Wu <wuyun.abel@...edance.com>, 
	Tianchen Ding <dtcccc@...ux.alibaba.com>, Xuewen Yan <xuewen.yan94@...il.com>, 
	"Gautham R. Shenoy" <gautham.shenoy@....com>
Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current
 entity during wakeup preemption

On Mon, Mar 25, 2024 at 1:03 AM K Prateek Nayak <kprateek.nayak@...com> wrote:
>
> With the curr entity's eligibility check, a wakeup preemption is very
> likely when an entity with positive lag joins the runqueue, pushing the
> avg_vruntime of the runqueue backwards and making the vruntime of the
> current entity ineligible. This leads to aggressive wakeup preemption
> that was previously guarded by wakeup_granularity_ns in legacy CFS.
> The figure below depicts one such aggressive preemption scenario with
> EEVDF in DeathStarBench [1]:
>
>                                 deadline for Nginx
>                    |
>         +-------+  |                    |
>     /-- | Nginx | -|------------------> |
>     |   +-------+  |                    |
>     |              |
>     |   -----------|-------------------------------> vruntime timeline
>     |              \--> rq->avg_vruntime
>     |
>     |   wakes service on the same runqueue since system is busy
>     |
>     |   +---------+|
>     \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
>         +---------+|
>           |        |
>    wakeup |     +--|-----+                       |
>  preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
>                 +--|-----+                       |
>            (Nginx ineligible)
>         -----------|-------------------------------> vruntime timeline
>                    \--> rq->avg_vruntime
>
> When the NGINX server is involuntarily switched out, it cannot accept
> any incoming requests, leading to longer turnaround times for the
> clients and thus a loss in DeathStarBench throughput.
>
>     ==================================================================
>     Test          : DeathStarBench
>     Units         : Normalized latency
>     Interpretation: Lower is better
>     Statistic     : Mean
>     ==================================================================
>     tip         1.00
>     eevdf       1.14 (+14.61%)
>
> For the currently running task, skip the eligibility check in
> pick_eevdf() if it has not yet exhausted the slice promised to it at
> selection, even though the situation has changed since. The behavior is
> guarded by the RUN_TO_PARITY_WAKEUP sched_feat to simplify testing.
> With RUN_TO_PARITY_WAKEUP enabled, the performance loss seen with
> DeathStarBench since the merge of EEVDF disappears. The following are
> results from testing on a dual-socket 3rd Generation EPYC server
> (2 x 64C/128T):
>
>     ==================================================================
>     Test          : DeathStarBench
>     Units         : Normalized throughput
>     Interpretation: Higher is better
>     Statistic     : Mean
>     ==================================================================
>     Pinning      scaling     tip            run-to-parity-wakeup(pct imp)
>      1CCD           1       1.00            1.16 (%diff: 16%)
>      2CCD           2       1.00            1.03 (%diff: 3%)
>      4CCD           4       1.00            1.12 (%diff: 12%)
>      8CCD           8       1.00            1.05 (%diff: 6%)
>
> With spec_rstack_overflow=off, the DeathStarBench performance with the
> proposed solution is same as the performance on v6.5 release before
> EEVDF was merged.

Thanks for sharing this, Prateek.
We noticed that we could also gain performance by disabling the
eligibility checks (though we disabled them on all paths, not just
wakeup). The following are a few threads we had on the topic:

Discussion around eligibility:
https://lore.kernel.org/lkml/CA+q576MS0-MV1Oy-eecvmYpvNT3tqxD8syzrpxQ-Zk310hvRbw@mail.gmail.com/
Some of our results:
https://lore.kernel.org/lkml/CA+q576Mov1jpdfZhPBoy_hiVh3xSWuJjXdP3nS4zfpqfOXtq7Q@mail.gmail.com/
Sched feature to disable eligibility:
https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/

>
> This may lead to a newly woken task waiting longer for its turn on the
> CPU; however, testing on the same system did not reveal any consistent
> regressions with the standard benchmarks.
>
> Link: https://github.com/delimitrou/DeathStarBench/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
>  kernel/sched/fair.c     | 24 ++++++++++++++++++++----
>  kernel/sched/features.h |  1 +
>  2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..a9b145a4eab0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>   *
>   * Which allows tree pruning through eligibility.
>   */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
>  {
>         struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>         struct sched_entity *se = __pick_first_entity(cfs_rq);
> @@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>         if (cfs_rq->nr_running == 1)
>                 return curr && curr->on_rq ? curr : se;
>
> -       if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> +       if (curr && !curr->on_rq)
> +               curr = NULL;
> +
> +       /*
> +        * When an entity with positive lag wakes up, it pushes the
> +        * avg_vruntime of the runqueue backwards. This may cause the
> +        * current entity to become ineligible soon into its run,
> +        * leading to wakeup preemption.
> +        *
> +        * To prevent such aggressive preemption of the current running
> +        * entity during task wakeups, skip the eligibility check if the
> +        * slice promised to the entity since its selection has not yet
> +        * elapsed.
> +        */
> +       if (curr &&
> +           !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
> +           !entity_eligible(cfs_rq, curr))
>                 curr = NULL;
>
>         /*
> @@ -5460,7 +5476,7 @@ pick_next_entity(struct cfs_rq *cfs_rq)
>             cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
>                 return cfs_rq->next;
>
> -       return pick_eevdf(cfs_rq);
> +       return pick_eevdf(cfs_rq, false);
>  }
>
>  static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -8340,7 +8356,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>         /*
>          * XXX pick_eevdf(cfs_rq) != se ?
>          */
> -       if (pick_eevdf(cfs_rq) == pse)
> +       if (pick_eevdf(cfs_rq, true) == pse)
>                 goto preempt;
>
>         return;
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..027bab5b4031 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>  SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed
> --
> 2.34.1
>
