Message-ID: <CAKfTPtBRf7-Kt4xxLsNHbjM=n_txHWePNKssV3Sxepqh6oM4WA@mail.gmail.com>
Date: Fri, 17 Jan 2025 14:25:58 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, linux-kernel@...r.kernel.org,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, "Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with
delayed dequeue
Hi Prateek,
On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
> entity is delayed irrespective of whether the entity corresponds to a
> task or a cfs_rq.
>
> Consider the following scenario:
>
> root
> / \
> A B (*) delayed since B is no longer eligible on root
> | |
> Task0 Task1 <--- dequeue_task_fair() - task blocks
>
> When Task1 blocks (dequeue_entity() for task's se returns true),
> dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
> hierarchy of Task1. However, when the sched_entity corresponding to
> cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
> hierarchy too, leading to both dequeue_entity() and set_delayed()
> decrementing h_nr_runnable for the dequeue of the same task.
>
> A SCHED_WARN_ON() inspecting h_nr_runnable after its update in
> dequeue_entities(), like the one below:
>
> cfs_rq->h_nr_runnable -= h_nr_runnable;
> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
>
> is consistently tripped when running wakeup intensive workloads like
> hackbench in a cgroup.
>
> This error is self-correcting since cfs_rqs are per-cpu and cannot
> migrate. The entity is either picked for full dequeue or is requeued
> when a task wakes up below it. Both those paths call clear_delayed(),
> which again increments h_nr_runnable of the hierarchy without
> considering whether the entity corresponds to a task or not.
>
> h_nr_runnable will eventually reflect the correct value; however, in
> the interim, the incorrect values can still influence the PELT
> calculation, which uses se->runnable_weight or cfs_rq->h_nr_runnable.
>
> Since only delayed tasks take the early return path in
> dequeue_entities() and enqueue_task_fair(), which skips the h_nr_*
> update loops, adjust h_nr_runnable in {set,clear}_delayed() only when
> the delayed entity is a task.
>
> For entities corresponding to cfs_rq, the h_nr_* update loop in the
> caller will do the right thing.
>
> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable")
Before that, we were tracking the opposite quantity, h_nr_delayed. Did
you see the problem only on tip/sched/core, or also before the rework
that added h_nr_runnable and removed h_nr_delayed?
I'm going to have a closer look.
> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@....com>
> Tested-by: Swapnil Sapkal <swapnil.sapkal@....com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98ac49ce78ea..0fe6c6e65e55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5481,6 +5481,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> static void set_delayed(struct sched_entity *se)
> {
> se->sched_delayed = 1;
> +
> + /*
> + * Delayed se of cfs_rq have no tasks queued on them.
> + * Do not adjust h_nr_runnable since dequeue_entities()
> + * will account it for blocked tasks.
> + */
> + if (!entity_is_task(se))
> + return;
> +
> for_each_sched_entity(se) {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> @@ -5493,6 +5502,16 @@ static void set_delayed(struct sched_entity *se)
> static void clear_delayed(struct sched_entity *se)
> {
> se->sched_delayed = 0;
> +
> + /*
> + * Delayed se of cfs_rq have no tasks queued on them.
> + * Do not adjust h_nr_runnable since a dequeue has
> + * already accounted for it or an enqueue of a task
> + * below it will account for it in enqueue_task_fair().
> + */
> + if (!entity_is_task(se))
> + return;
> +
> for_each_sched_entity(se) {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>
> base-commit: 7d9da040575b343085287686fa902a5b2d43c7ca
> --
> 2.34.1
>