linux-kernel - Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f82def74-a64a-4a05-c8d4-4eeb3e03d0c0@amd.com>
Date: Tue, 8 Oct 2024 21:54:52 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra
	<peterz@...radead.org>
CC: Klaus Kudielka <klaus.kudielka@...il.com>, Chris Bainbridge
	<chris.bainbridge@...il.com>, <linux-kernel@...r.kernel.org>,
	<bsegall@...gle.com>, <dietmar.eggemann@....com>, <efault@....de>,
	<juri.lelli@...hat.com>, <mgorman@...e.de>, <mingo@...hat.com>,
	<rostedt@...dmis.org>, <tglx@...utronix.de>, <vincent.guittot@...aro.org>,
	<vschneid@...hat.com>, <wuyun.abel@...edance.com>,
	<youssefesmat@...omium.org>, <spasswolf@....de>,
	<regressions@...ts.linux.dev>, "Linux regression tracking (Thorsten
 Leemhuis)" <regressions@...mhuis.info>, "Gautham R. Shenoy"
	<gautham.shenoy@....com>
Subject: Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed
 dequeue

Hello folks,

On 10/8/2024 9:08 PM, K Prateek Nayak wrote:
> Hello Johannes, Peter,
> 
> On 10/4/2024 10:13 PM, K Prateek Nayak wrote:
>> [..snip..]
>>>> Anyway, assuming PSI wants to preserve current semantics, does something
>>>> like the below work?
>>>
>>> This doesn't. But it's a different corruption now:
>>>
>>> [    2.298408] psi: inconsistent task state! task=24:cpuhp/1 cpu=1 psi_flags=10 clear=14 set=0
>>
>> I hit the same log (clear 14, set 0) and I tried the below changes on
>> top of Peter's diff:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0d766fb9fbc4..9cf3d4359994 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2012,9 +2012,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>>       if (!(flags & ENQUEUE_NOCLOCK))
>>           update_rq_clock(rq);
>>
>> -    if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
>> +    if (!(flags & ENQUEUE_RESTORE) && (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))) {
>>           sched_info_enqueue(rq, p);
>> -        psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
>> +        psi_enqueue(p, ((flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) ||
>> +                   (flags & ENQUEUE_DELAYED));
>>       }
>>
>>       p->sched_class->enqueue_task(rq, p, flags);
>> -- 
>>
>> ... but it just changes the warning to:
>>
>>      psi: task underflow! cpu=65 t=0 tasks=[0 0 0 0] clear=1 set=4
>>      psi: task underflow! cpu=31 t=0 tasks=[0 0 1 0] clear=1 set=0
>>
>> Doing a dump_stack(), I see it come from psi_enqueue() and
>> psi_ttwu_dequeue() and I see "clear=1" as the common theme. I've
>> stared at it for a while but I'm at a loss currently. If something
>> jumps out, I'll update here.
> 
> I could narrow down the crux of the matter to the fact that when a task
> is delayed, and the delayed task is then migrated, the wakeup context
> may not have any idea that the task was moved from its previous
> runqueue. This is the same reason psi_enqueue() considers only ...
> 
>      (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)
> 
> ... as a wakeup. In case of a wakeup with migration, PSI forgoes
> clearing the TSK_IOWAIT flag which seems to be the issue I encountered
> in my splat previously.
> 
> With that said, the below diff, based on Peter's original approach
> currently seems to work for me in the sense that I have not seen the
> inconsistent state warning for a while now with my stress test.
> 
> Two key points of the approach are:
> 
> o It uses "p->migration_flags" to indicate a delayed entity has
>    migrated to another runqueue and convey the same during psi_enqueue().
> 
> o It adds ENQUEUE_WAKEUP flag alongside ENQUEUE_DELAYED for
>    enqueue_task() in ttwu_runnable() since psi_enqueue() needs to know of
>    a wakeup without migration to clear the TSK_IOWAIT flag it would have
>    set during psi_task_switch() for blocking task and going down the
>    stack for enqueue_task_fair(), there seem to be no other observer of
>    the ENQUEUE_WAKEUP flag other than psi_enqueue() in the requeue path.
> 
> If there are no obvious objections, I'll send a clean patch soon.
> In the meantime, here is the diff:
> [..snip..]

That last diff was malformed! Sorry for the noise. Here is the corrected
diff:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..885801432e9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2009,12 +2009,19 @@ unsigned long get_wchan(struct task_struct *p)
  
  void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
  {
+	bool wakee_not_migrated = (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED);
+
  	if (!(flags & ENQUEUE_NOCLOCK))
  		update_rq_clock(rq);
  
  	if (!(flags & ENQUEUE_RESTORE)) {
  		sched_info_enqueue(rq, p);
-		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+
+		/* Notify PSI that the task was migrated in a delayed state before wakeup. */
+		if ((p->migration_flags & DELAYED_MIGRATED) && !task_on_rq_migrating(p)) {
+			wakee_not_migrated = false;
+			p->migration_flags &= ~DELAYED_MIGRATED;
+		}
  	}
  
  	p->sched_class->enqueue_task(rq, p, flags);
@@ -2023,6 +2030,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
  	 * ->sched_delayed.
  	 */
  	uclamp_rq_inc(rq, p);
+	if (!(flags & ENQUEUE_RESTORE))
+		psi_enqueue(p, wakee_not_migrated);
  
  	if (sched_core_enabled(rq))
  		sched_core_enqueue(rq, p);
@@ -2042,6 +2051,9 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
  	if (!(flags & DEQUEUE_SAVE)) {
  		sched_info_dequeue(rq, p);
  		psi_dequeue(p, flags & DEQUEUE_SLEEP);
+
+		if (p->se.sched_delayed && task_on_rq_migrating(p))
+			p->migration_flags |= DELAYED_MIGRATED;
  	}
  
  	/*
@@ -3733,7 +3745,7 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
  	if (task_on_rq_queued(p)) {
  		update_rq_clock(rq);
  		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_WAKEUP | ENQUEUE_DELAYED);
  		if (!task_on_cpu(rq, p)) {
  			/*
  			 * When on_rq && !on_cpu the task is preempted, see if
@@ -6537,6 +6549,7 @@ static void __sched notrace __schedule(int sched_mode)
  	 * as a preemption by schedule_debug() and RCU.
  	 */
  	bool preempt = sched_mode > SM_NONE;
+	bool block = false;
  	unsigned long *switch_count;
  	unsigned long prev_state;
  	struct rq_flags rf;
@@ -6622,6 +6635,7 @@ static void __sched notrace __schedule(int sched_mode)
  			 * After this, schedule() must not care about p->state any more.
  			 */
  			block_task(rq, prev, flags);
+			block = true;
  		}
  		switch_count = &prev->nvcsw;
  	}
@@ -6667,7 +6681,7 @@ static void __sched notrace __schedule(int sched_mode)
  
  		migrate_disable_switch(rq, prev);
  		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+		psi_sched_switch(prev, next, block);
  
  		trace_sched_switch(preempt, prev, next, prev_state);
  
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b1c3588a8f00..2dc2c4cb4f5f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1326,6 +1326,7 @@ static inline int cpu_of(struct rq *rq)
  }
  
  #define MDF_PUSH		0x01
+#define DELAYED_MIGRATED	0x02 /* Task was migrated when in DELAYED_DEQUEUE state */
  
  static inline bool is_migration_disabled(struct task_struct *p)
  {
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 237780aa3c53..06a2c6d3ec1e 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -129,6 +129,13 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
  	if (static_branch_likely(&psi_disabled))
  		return;
  
+	/*
+	 * Delayed task is not ready to run yet!
+	 * Wait for a requeue before accounting.
+	 */
+	if (p->se.sched_delayed)
+		return;
+
  	if (p->in_memstall)
  		set |= TSK_MEMSTALL_RUNNING;
  
@@ -148,6 +155,9 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
  	if (static_branch_likely(&psi_disabled))
  		return;
  
+	/* Delayed task can only be dequeued for migration. */
+	WARN_ON_ONCE(p->se.sched_delayed && sleep);
+
  	/*
  	 * A voluntary sleep is a dequeue followed by a task switch. To
  	 * avoid walking all ancestors twice, psi_task_switch() handles
--

Attaching an RFC patch in case I messed this up as well!

> -- 
> 
> Any and all suggestions are highly appreciated.
> 
>>
>> Thank you again both for taking a look.
>>
> 

-- 
Thanks and Regards,
Prateek
View attachment "0001-sched-psi-Fixup-PSI-accounting-with-DELAY_DEQUEUE.patch" of type "text/plain" (5869 bytes)