linux-kernel - Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force dequeue sched

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250929103836.GK3419281@noisy.programming.kicks-ass.net>
Date: Mon, 29 Sep 2025 12:38:36 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: John Stultz <jstultz@...gle.com>, Matt Fleming <matt@...dmodwrite.com>,
	Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	linux-kernel@...r.kernel.org, kernel-team@...udflare.com,
	Matt Fleming <mfleming@...udflare.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Chris Arges <carges@...udflare.com>, stable@...r.kernel.org
Subject: Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force
 dequeue sched_delayed tasks"

On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:

> > Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> > little odd, as the desired dequeue didn't really complete, but
> > dequeue_task_fair() will still return true indicating success - not
> > that too many places are checking the dequeue_task return. Is that
> > right?

Bah, i'm forever confused on the throttle cases there :/

> I think for most part until now it was harmless as we couldn't pick on
> a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED)
> would later do a:
> 
>     queued = task_on_rq_queued(p);
>     ...
>     if (queued)
>         enqueue_task(p)
> 
> which would either lead to spuriously running a blocked task and it
> would block back again, or a wakeup would properly wakeup the queued
> task via ttwu_runnable() but wait_task_inactive() is interesting as
> it expects the dequeue will result in a block which never happens with
> throttled hierarchies. I'm impressed double dequeue doesn't result in
> any major splats!
> 
> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.
> 
> P.S. the per-task throttle in tip:sched/core would get rid of all this
> but it would be good to have a fix via tip:sched/urgent to get it
> backported to v6.12 LTS and the newer stable kernels.

Yes, good riddance to that code :-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..f0a4d9d7424d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	int h_nr_runnable = 0;
>  	struct cfs_rq *cfs_rq;
>  	u64 slice = 0;
> +	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */

Right, we don't appear to really need that.

>  
>  	if (entity_is_task(se)) {
>  		p = task_of(se);
> @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  
>  		/* Don't dequeue parent if it has other entities besides us */
>  		if (cfs_rq->load.weight) {
> @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  	}
>  
>  	sub_nr_running(rq, h_nr_queued);
> @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
>  
> +	ret = 1;
> +out:
>  	if (p && task_delayed) {
>  		WARN_ON_ONCE(!task_sleep);
>  		WARN_ON_ONCE(p->on_rq != 1);
> @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  		__block_task(rq, p);
>  	}
>  
> -	return 1;
> +	return ret;
>  }

So the difference is that we also do __block_task() when we get
throttled somewhere in the hierarchy. IIRC when I was looking at this, I
thought it wouldn't matter since it won't get picked anyway, on account
of the cfs_rq being blocked/detached, so who cares.

But yeah, this makes sense.

Patch logistics are going to be a pain -- .17 is closed and merge window
is open, which means Linus will have per-task throttle and /urgent don't
work no more.

At this point best we can do is a patch to stable with a note that
upstream is no longer affected due to rework or something.