Message-ID: <20240923154601.GC437832@cmpxchg.org>
Date: Mon, 23 Sep 2024 11:46:01 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Parag W <parag.lkml@...il.com>
Cc: anna-maria@...utronix.de, frederic@...nel.org,
linux-kernel@...r.kernel.org, peterz@...radead.org,
pmenzel@...gen.mpg.de, regressions@...ts.linux.dev,
surenb@...gle.com, tglx@...utronix.de
Subject: Re: Error: psi: inconsistent task state! task=1:swapper/0 cpu=0 psi_flags=4 clear=0 set=4

On Mon, Sep 23, 2024 at 08:03:39AM -0400, Parag W wrote:
> FWIW, moving psi_enqueue to be after ->enqueue_task() in
> sched/core.c made no difference - I still get the inconsistent task
> state error. psi_dequeue() is already before ->dequeue_task() in
> line with uclamp.

Yes, that isn't enough.

AFAICS, psi wants to know when a task gets dequeued from a core POV,
even if the class holds on to it until picked again. If it's later
picked and dequeued by the class, I don't think there is another call
into psi. Lastly, if a sched_delayed task is woken and enqueued from
core, psi wants to know - we should call psi_enqueue() after
->enqueue_task() has cleared sched_delayed.
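
To spell out the lifecycle I have in mind, here is a toy userspace
model (not kernel code; the struct and helpers are made up purely for
illustration):

/* Toy model of which transitions psi should see; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct task {
	bool sched_delayed;	/* class still holds the blocked task */
	bool psi_running;	/* psi's view: TSK_RUNNING set? */
};

static void psi_dequeue(struct task *t) { t->psi_running = false; }

static void psi_enqueue(struct task *t)
{
	if (t->psi_running)
		puts("psi: inconsistent task state! (double enqueue)");
	t->psi_running = true;
}

int main(void)
{
	struct task t = { .psi_running = true };

	/* Task blocks: core-level dequeue. The class may keep it around
	 * as sched_delayed, but psi should account the sleep here. */
	psi_dequeue(&t);
	t.sched_delayed = true;

	/* If the class later picks it and finally drops it, there
	 * shouldn't be another call into psi (see above). */

	/* If it is instead woken while still sched_delayed, core
	 * re-enqueues it; ->enqueue_task() clears sched_delayed first,
	 * and only then should psi_enqueue() run. */
	t.sched_delayed = false;
	psi_enqueue(&t);

	printf("psi sees the task as %s\n",
	       t.psi_running ? "RUNNING" : "sleeping");
	return 0;
}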

I don't think we want the ttwu_runnable() callback: since the task
hasn't been dequeued yet from a core & PSI perspective, we shouldn't
update psi states either. The sched_delayed check in psi_enqueue()
should accomplish that. Oh, but wait: ->enqueue_task() will have
cleared sched_delayed beforehand. We should probably filter on
ENQUEUE_DELAYED instead?
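
In other words, something along these lines - a compilable sketch with
mocked-up flag values and stub functions, just to show where the
filter sits relative to the class callback; the real change is in the
diff below:

#include <stdbool.h>
#include <stdio.h>

/* Mocked-up flag values, only for this sketch. */
#define ENQUEUE_RESTORE	0x02
#define ENQUEUE_DELAYED	0x200

struct task { bool sched_delayed; };

/* Stand-in for p->sched_class->enqueue_task(): completing a delayed
 * requeue clears sched_delayed. */
static void class_enqueue_task(struct task *p, int flags)
{
	if (flags & ENQUEUE_DELAYED)
		p->sched_delayed = false;
}

static void psi_enqueue(struct task *p)
{
	/* By the time psi_enqueue() runs after the class callback,
	 * sched_delayed is already 0 for the ttwu_runnable() requeue,
	 * so a p->sched_delayed check here can't filter that path. */
	printf("psi_enqueue (sched_delayed=%d)\n", p->sched_delayed);
}

static void enqueue_task(struct task *p, int flags)
{
	class_enqueue_task(p, flags);

	/* Filter the delayed requeue on the flag instead. */
	if (!(flags & (ENQUEUE_RESTORE | ENQUEUE_DELAYED)))
		psi_enqueue(p);
}

int main(void)
{
	struct task p = { .sched_delayed = true };

	enqueue_task(&p, ENQUEUE_DELAYED);	/* ttwu_runnable(): no psi update */
	enqueue_task(&p, 0);			/* ordinary wakeup: psi is told */
	return 0;
}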

This leaves me with the below diff. But I'm still getting the double
enqueue with it applied:

[root@ham ~]# dmesg | grep -i psi
[    0.350533] psi: inconsistent task state! task=1:swapper/0 cpu=0 psi_flags=4 clear=0 set=4

Peter, what am I missing here?
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b6cc1cf499d6..4f036c66cf07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,11 +2012,6 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
-	if (!(flags & ENQUEUE_RESTORE)) {
-		sched_info_enqueue(rq, p);
-		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
-	}
-
 	p->sched_class->enqueue_task(rq, p, flags);
 	/*
 	 * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
@@ -2024,6 +2019,11 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	 */
 	uclamp_rq_inc(rq, p);
 
+	if (!(flags & (ENQUEUE_RESTORE|ENQUEUE_DELAYED))) {
+		sched_info_enqueue(rq, p);
+		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+	}
+
 	if (sched_core_enabled(rq))
 		sched_core_enqueue(rq, p);
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 237780aa3c53..138c52c2f2c9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -129,6 +129,9 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (p->se.sched_delayed)
+		return;
+
 	if (p->in_memstall)
 		set |= TSK_MEMSTALL_RUNNING;
 
@@ -148,6 +151,9 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (p->se.sched_delayed)
+		return;
+
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
 	 * avoid walking all ancestors twice, psi_task_switch() handles