Message-ID: <105ae6f1-f629-4fe7-9644-4242c3bed035@amd.com>
Date: Fri, 26 Sep 2025 08:13:09 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: John Stultz <jstultz@...gle.com>, Matt Fleming <matt@...dmodwrite.com>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
<linux-kernel@...r.kernel.org>, <kernel-team@...udflare.com>, Matt Fleming
<mfleming@...udflare.com>, Oleg Nesterov <oleg@...hat.com>, Chris Arges
<carges@...udflare.com>, <stable@...r.kernel.org>
Subject: Re: [PATCH] Revert "sched/core: Tweak wait_task_inactive() to force
dequeue sched_delayed tasks"
Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:
> On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@...dmodwrite.com> wrote:
>>
>> From: Matt Fleming <mfleming@...udflare.com>
>>
>> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>>
>> If we dequeue a task (task B) that was sched_delayed, then that task
>> is definitely no longer on the rq and is not tracked in the rbtree.
>> Unfortunately, task_on_rq_queued(B) will still return true because
>> dequeue_task() doesn't update p->on_rq.
>
> Hey!
> Sorry again my patch has been causing you trouble. Thanks for your
> persistence in chasing this down!
>
> It's confusing, as this patch uses similar logic to what
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar issues could lurk in those paths as well.
>
> And my impression was that dequeue_task() on a sched_delayed task
> would update p->on_rq via calling __block_task() at the end of
> dequeue_entities().
>
> However, there are two spots where we might exit dequeue_entities()
> early when cfs_rq_throttled(cfs_rq) is true, so maybe that's what's
> catching us here?
That could very likely be it.
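To make the suspected failure mode concrete, here is a minimal
user-space mock I put together (struct task, dequeue_delayed_task(),
etc. below are made-up stand-ins, not the kernel's definitions) of how
the throttled bailout can leave p->on_rq set even though the delayed
task has already left the tree:

#include <stdbool.h>
#include <stdio.h>

#define TASK_ON_RQ_QUEUED	1

/* Toy stand-in for struct task_struct; not the kernel's layout. */
struct task {
	int on_rq;		/* TASK_ON_RQ_QUEUED or 0 */
	bool sched_delayed;	/* delay-dequeued, waiting to be blocked */
	bool in_tree;		/* actually present in the cfs_rq rbtree */
};

static bool task_on_rq_queued(struct task *p)
{
	return p->on_rq == TASK_ON_RQ_QUEUED;
}

/*
 * Mock of the dequeue path: on a throttled hierarchy we bail out
 * before reaching the __block_task() equivalent that clears p->on_rq.
 */
static void dequeue_delayed_task(struct task *p, bool throttled)
{
	p->in_tree = false;	/* the entity does leave the tree */
	if (throttled)
		return;		/* early exit: on_rq is never cleared */
	p->on_rq = 0;		/* what __block_task() would have done */
}

int main(void)
{
	struct task p = {
		.on_rq = TASK_ON_RQ_QUEUED,
		.sched_delayed = true,
		.in_tree = true,
	};

	dequeue_delayed_task(&p, true);	/* hit the throttled bailout */

	/* Prints "in_tree=0, task_on_rq_queued()=1": the inconsistent
	 * state that confuses wait_task_inactive(). */
	printf("in_tree=%d, task_on_rq_queued()=%d\n",
	       p.in_tree, task_on_rq_queued(&p));
	return 0;
}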
>
> Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> little odd, as the desired dequeue didn't really complete, but
> dequeue_task_fair() will still return true, indicating success - not
> that many places check the dequeue_task() return value. Is that
> right?
I think for the most part it has been harmless until now: we couldn't
pick from a throttled hierarchy, and the other callers of
dequeue_task(DEQUEUE_DELAYED) would later do a:

    queued = task_on_rq_queued(p);
    ...
    if (queued)
        enqueue_task(rq, p, ...);

which would either spuriously run a blocked task (which then simply
blocks again), or a wakeup would properly wake the queued task via
ttwu_runnable(). wait_task_inactive() is the interesting case, as it
expects the dequeue to result in a block, which never happens with
throttled hierarchies. I'm impressed the double dequeue doesn't result
in any major splats!
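To contrast the two caller patterns in toy form (again a user-space
sketch with made-up helpers; dequeue_delayed_throttled() is not a
kernel API):

#include <stdbool.h>
#include <stdio.h>

#define TASK_ON_RQ_QUEUED	1

struct task { int on_rq; };	/* toy stand-in, not the kernel struct */

static bool task_on_rq_queued(struct task *p)
{
	return p->on_rq == TASK_ON_RQ_QUEUED;
}

/* Dequeue that hit the throttled bailout: p->on_rq stays set. */
static void dequeue_delayed_throttled(struct task *p)
{
	(void)p;	/* never reaches the __block_task() equivalent */
}

int main(void)
{
	struct task p = { .on_rq = TASK_ON_RQ_QUEUED };

	/* Caller pattern a la __sched_setscheduler(): benign, since it
	 * simply re-enqueues whatever still looks queued. */
	dequeue_delayed_throttled(&p);
	if (task_on_rq_queued(&p))
		printf("benign: caller re-enqueues; task runs spuriously, blocks again\n");

	/* wait_task_inactive()-style expectation: it waits for on_rq to
	 * drop to 0, which never happens here. Bounded for the demo. */
	dequeue_delayed_throttled(&p);
	int tries = 0;
	while (task_on_rq_queued(&p) && ++tries < 3)
		printf("still queued after forced dequeue, retry %d\n", tries);
	printf("real wait_task_inactive() would keep waiting forever\n");
	return 0;
}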
Matt, if possible, could you try the patch attached below to check
whether the bailout for the throttled hierarchy is indeed the root
cause? Thanks in advance.
P.S. The per-task throttle in tip:sched/core would get rid of all this,
but it would be good to have a fix via tip:sched/urgent so it can be
backported to the v6.12 LTS and newer stable kernels.
---
Thanks and Regards,
Prateek
(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_runnable = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*
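
For completeness: the idea above is that the throttled exits now jump
to the new out: label instead of returning early, so a DEQUEUE_DELAYED
dequeue still falls through to the task_delayed block and
__block_task() clears p->on_rq - exactly the transition
wait_task_inactive() is waiting on.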