[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com>
Date: Tue, 8 Oct 2024 21:54:52 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra
<peterz@...radead.org>
CC: Klaus Kudielka <klaus.kudielka@...il.com>, Chris Bainbridge
<chris.bainbridge@...il.com>, <linux-kernel@...r.kernel.org>,
<bsegall@...gle.com>, <dietmar.eggemann@....com>, <efault@....de>,
<juri.lelli@...hat.com>, <mgorman@...e.de>, <mingo@...hat.com>,
<rostedt@...dmis.org>, <tglx@...utronix.de>, <vincent.guittot@...aro.org>,
<vschneid@...hat.com>, <wuyun.abel@...edance.com>,
<youssefesmat@...omium.org>, <spasswolf@....de>,
<regressions@...ts.linux.dev>, "Linux regression tracking (Thorsten
Leemhuis)" <regressions@...mhuis.info>, "Gautham R. Shenoy"
<gautham.shenoy@....com>
Subject: Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed
dequeue
Hello folks,
On 10/8/2024 9:08 PM, K Prateek Nayak wrote:
> Hello Johannes, Peter,
>
> On 10/4/2024 10:13 PM, K Prateek Nayak wrote:
>> [..snip..]
>>>> Anyway, assuming PSI wants to preserve current semantics, does something
>>>> like the below work?
>>>
>>> This doesn't. But it's a different corruption now:
>>>
>>> [ 2.298408] psi: inconsistent task state! task=24:cpuhp/1 cpu=1 psi_flags=10 clear=14 set=0
>>
>> I hit the same log (clear 14, set 0) and I tried the below changes on
>> top of Peter's diff:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0d766fb9fbc4..9cf3d4359994 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2012,9 +2012,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>> if (!(flags & ENQUEUE_NOCLOCK))
>> update_rq_clock(rq);
>>
>> - if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
>> + if (!(flags & ENQUEUE_RESTORE) && (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))) {
>> sched_info_enqueue(rq, p);
>> - psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
>> + psi_enqueue(p, ((flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) ||
>> + (flags & ENQUEUE_DELAYED));
>> }
>>
>> p->sched_class->enqueue_task(rq, p, flags);
>> --
>>
>> ... but it just changes the warning to:
>>
>> psi: task underflow! cpu=65 t=0 tasks=[0 0 0 0] clear=1 set=4
>> psi: task underflow! cpu=31 t=0 tasks=[0 0 1 0] clear=1 set=0
>>
>> Doing a dump_stack(), I see it come from psi_enqueue() and
>> psi_ttwu_dequeue() and I see "clear=1" as the common theme. I've
>> stared at it for a while but I'm at a loss currently. If something
>> jumps out, I'll update here.
>
> I could narrow down the crux of the matter to the fact that when a task
> is delayed, and the delayed task is then migrated, the wakeup context
> may not have any idea that the task was moved from its previous
> runqueue. This is the same reason psi_enqueue() considers only ...
>
> (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)
>
> ... as a wakeup. In case of a wakeup with migration, PSI forgoes
> clearing the TSK_IOWAIT flag which seems to be the issue I encountered
> in my splat previously.
>
> With that said, the below diff, based on Peter's original approach
> currently seems to work for me in the sense that I have not seen the
> inconsistent state warning for a while now with my stress test.
>
> Two key points of the approach are:
>
> o It uses "p->migration_flags" to indicate a delayed entity has
> migrated to another runqueue and convey the same during psi_enqueue().
>
> o It adds ENQUEUE_WAKEUP flag alongside ENQUEUE_DELAYED for
> enqueue_task() in ttwu_runnable() since psi_enqueue() needs to know of
> a wakeup without migration to clear the TSK_IOWAIT flag it would have
> set during psi_task_switch() for blocking task and going down the
> stack for enqueue_task_fair(), there seem to be no other observer of
> the ENQUEUE_WAKEUP flag other than psi_enqueue() in the requeue path.
>
> If there are no obvious objections, I'll send a clean patch soon.
> In the meantime, here is the diff:
> [..snip..]
That last diff was malformed! Sorry for the noise. Here is the corrected
diff:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..885801432e9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2009,12 +2009,19 @@ unsigned long get_wchan(struct task_struct *p)
void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
+ bool wakee_not_migrated = (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED);
+
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
if (!(flags & ENQUEUE_RESTORE)) {
sched_info_enqueue(rq, p);
- psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+
+ /* Notify PSI that the task was migrated in a delayed state before wakeup. */
+ if ((p->migration_flags & DELAYED_MIGRATED) && !task_on_rq_migrating(p)) {
+ wakee_not_migrated = false;
+ p->migration_flags &= ~DELAYED_MIGRATED;
+ }
}
p->sched_class->enqueue_task(rq, p, flags);
@@ -2023,6 +2030,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
* ->sched_delayed.
*/
uclamp_rq_inc(rq, p);
+ if (!(flags & ENQUEUE_RESTORE))
+ psi_enqueue(p, wakee_not_migrated);
if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
@@ -2042,6 +2051,9 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE)) {
sched_info_dequeue(rq, p);
psi_dequeue(p, flags & DEQUEUE_SLEEP);
+
+ if (p->se.sched_delayed && task_on_rq_migrating(p))
+ p->migration_flags |= DELAYED_MIGRATED;
}
/*
@@ -3733,7 +3745,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
if (task_on_rq_queued(p)) {
update_rq_clock(rq);
if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_WAKEUP | ENQUEUE_DELAYED);
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
@@ -6537,6 +6549,7 @@ static void __sched notrace __schedule(int sched_mode)
* as a preemption by schedule_debug() and RCU.
*/
bool preempt = sched_mode > SM_NONE;
+ bool block = false;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
@@ -6622,6 +6635,7 @@ static void __sched notrace __schedule(int sched_mode)
* After this, schedule() must not care about p->state any more.
*/
block_task(rq, prev, flags);
+ block = true;
}
switch_count = &prev->nvcsw;
}
@@ -6667,7 +6681,7 @@ static void __sched notrace __schedule(int sched_mode)
migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
- psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+ psi_sched_switch(prev, next, block);
trace_sched_switch(preempt, prev, next, prev_state);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b1c3588a8f00..2dc2c4cb4f5f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1326,6 +1326,7 @@ static inline int cpu_of(struct rq *rq)
}
#define MDF_PUSH 0x01
+#define DELAYED_MIGRATED 0x02 /* Task was migrated when in DELAYED_DEQUEUE state */
static inline bool is_migration_disabled(struct task_struct *p)
{
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 237780aa3c53..06a2c6d3ec1e 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -129,6 +129,13 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
if (static_branch_likely(&psi_disabled))
return;
+ /*
+ * Delayed task is not ready to run yet!
+ * Wait for a requeue before accounting.
+ */
+ if (p->se.sched_delayed)
+ return;
+
if (p->in_memstall)
set |= TSK_MEMSTALL_RUNNING;
@@ -148,6 +155,9 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
if (static_branch_likely(&psi_disabled))
return;
+ /* Delayed task can only be dequeued for migration. */
+ WARN_ON_ONCE(p->se.sched_delayed && sleep);
+
/*
* A voluntary sleep is a dequeue followed by a task switch. To
* avoid walking all ancestors twice, psi_task_switch() handles
--
Attaching an RFC patch in case I messed this up as well!
> --
>
> Any and all suggestions are highly appreciated.
>
>>
>> Thank you again both for taking a look.
>>
>
--
Thanks and Regards,
Prateek
View attachment "0001-sched-psi-Fixup-PSI-accounting-with-DELAY_DEQUEUE.patch" of type "text/plain" (5869 bytes)
Powered by blists - more mailing lists