Message-ID: <20250929103836.GK3419281@noisy.programming.kicks-ass.net>
Date: Mon, 29 Sep 2025 12:38:36 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: John Stultz <jstultz@...gle.com>, Matt Fleming <matt@...dmodwrite.com>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org, kernel-team@...udflare.com,
Matt Fleming <mfleming@...udflare.com>,
Oleg Nesterov <oleg@...hat.com>,
Chris Arges <carges@...udflare.com>, stable@...r.kernel.org
Subject: Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force
dequeue sched_delayed tasks"
On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:
> > Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> > little odd, as the desired dequeue didn't really complete, but
> > dequeue_task_fair() will still return true indicating success - not
> > that too many places are checking the dequeue_task return. Is that
> > right?
Bah, I'm forever confused by the throttle cases there :/
> I think for the most part it has been harmless until now, since we
> cannot pick from a throttled hierarchy and other callers of
> dequeue_task(DEQUEUE_DELAYED) would later do a:
>
> queued = task_on_rq_queued(p);
> ...
> if (queued)
> enqueue_task(p)
>
> which would either spuriously run a blocked task (which then simply
> blocks again), or a wakeup would properly wake the queued task via
> ttwu_runnable(). wait_task_inactive() is the interesting case: it
> expects the dequeue to result in a block, which never happens with
> throttled hierarchies. I'm impressed the double dequeue doesn't result
> in any major splats!
>
> Matt, if possible, can you try the patch attached below to check
> whether the bailout for throttled hierarchies is indeed the root
> cause? Thanks in advance.
>
> P.S. the per-task throttle in tip:sched/core would get rid of all this
> but it would be good to have a fix via tip:sched/urgent to get it
> backported to v6.12 LTS and the newer stable kernels.
Yes, good riddance to that code :-)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..f0a4d9d7424d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> int h_nr_runnable = 0;
> struct cfs_rq *cfs_rq;
> u64 slice = 0;
> + int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
Right, we don't appear to really need that.
>
> if (entity_is_task(se)) {
> p = task_of(se);
> @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(cfs_rq))
> - return 0;
> + goto out;
>
> /* Don't dequeue parent if it has other entities besides us */
> if (cfs_rq->load.weight) {
> @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
> /* end evaluation on encountering a throttled cfs_rq */
> if (cfs_rq_throttled(cfs_rq))
> - return 0;
> + goto out;
> }
>
> sub_nr_running(rq, h_nr_queued);
> @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> rq->next_balance = jiffies;
>
> + ret = 1;
> +out:
> if (p && task_delayed) {
> WARN_ON_ONCE(!task_sleep);
> WARN_ON_ONCE(p->on_rq != 1);
> @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> __block_task(rq, p);
> }
>
> - return 1;
> + return ret;
> }
So the difference is that we also do __block_task() when we get
throttled somewhere in the hierarchy. IIRC when I was looking at this, I
thought it wouldn't matter since it won't get picked anyway, on account
of the cfs_rq being blocked/detached, so who cares.
But yeah, this makes sense.
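The control-flow change in the patch can be sketched as a userspace toy (again, not kernel code): instead of "return 0" on a throttled cfs_rq, dequeue_entities() jumps to a common exit so the delayed-task tail (__block_task() et al.) still runs, with the throttled case reported via ret. Normal dequeue bookkeeping is elided; names mirror the kernel's.

```c
#include <stdbool.h>

struct task {
	int on_rq;
	bool delayed;	/* models p->se.sched_delayed */
	bool throttled;
};

static void block_task(struct task *p)
{
	p->on_rq = 0;	/* models __block_task() */
}

static int dequeue_entities(struct task *p)
{
	int ret = 0;

	if (p->throttled)
		goto out;	/* was: return 0, skipping the tail below */

	/* ... normal dequeue bookkeeping elided ... */
	ret = 1;
out:
	if (p->delayed)
		block_task(p);	/* now runs even on the throttled path */
	return ret;
}
```

With this shape, a sched_delayed task on a throttled hierarchy still gets blocked (->on_rq cleared), so wait_task_inactive() can make progress, while the return value still signals that the dequeue did not complete.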
Patch logistics are going to be a pain -- .17 is closed and the merge
window is open, which means Linus will have the per-task throttle and
/urgent no longer works.
At this point the best we can do is a patch to stable with a note that
upstream is no longer affected due to the rework, or something along
those lines.