linux-kernel - Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fae14e09-cd35-5feb-c3b4-8318a76b26a3@amd.com>
Date: Fri, 4 Oct 2024 22:13:55 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Johannes Weiner <hannes@...xchg.org>, Peter Zijlstra
	<peterz@...radead.org>
CC: Klaus Kudielka <klaus.kudielka@...il.com>, Chris Bainbridge
	<chris.bainbridge@...il.com>, <linux-kernel@...r.kernel.org>,
	<bsegall@...gle.com>, <dietmar.eggemann@....com>, <efault@....de>,
	<juri.lelli@...hat.com>, <mgorman@...e.de>, <mingo@...hat.com>,
	<rostedt@...dmis.org>, <tglx@...utronix.de>, <vincent.guittot@...aro.org>,
	<vschneid@...hat.com>, <wuyun.abel@...edance.com>,
	<youssefesmat@...omium.org>, <spasswolf@....de>,
	<regressions@...ts.linux.dev>, "Linux regression tracking (Thorsten
 Leemhuis)" <regressions@...mhuis.info>, "Gautham R. Shenoy"
	<gautham.shenoy@....com>
Subject: Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed
 dequeue

Hello Johannes, Peter,

On 10/4/2024 7:27 PM, Johannes Weiner wrote:
> On Fri, Oct 04, 2024 at 02:35:06PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 04, 2024 at 04:40:08PM +0530, K Prateek Nayak wrote:
>>> Hello folks,
>>>
>>> On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
>>>> On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
>>>>> On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
>>>>>> Since linux next-20240820 the following messages appears when booting:
>>>>>>
>>>>>> [    T1] smp: Bringing up secondary CPUs ...
>>>>>> [    T1] smpboot: x86: Booting SMP configuration:
>>>>>> [    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1
>>>>>> This is the line I'm concerend about:
>>>>>> [    T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
>>>>>> [    T1]   #3  #5  #7  #9 #11 #13 #15
>>>>>> [    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
>>>>>> [    T1] smp: Brought up 1 node, 16 CPUs
>>>>>> [    T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)
>>>>>>
>>>>>> I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
>>>>>> Is this normal or is this something I should worry about?
>>>>>>
>>>>>> Bert Karwatzki
>>>>>
>>>>> I am also getting a similar error on boot, and bisected it to the same commit:
>>>>>
>>>>> [    0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4
>>>>>
>>>>> #regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d
>>>>
>>>> Just another data point, while booting 6.12-rc1 on a Turris Omnia:
>>>>
>>>> [    0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct  3 06:59:25 CEST 2024
>>>> [    0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
>>>> [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
>>>> [    0.000000] OF: fdt: Machine model: Turris Omnia
>>>> ...
>>>> [    0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
>>>> [    0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4
>>>>
>>>
>>> Not sure if someone took a stab at this but I haven't seen the "psi:
>>
>> I'm aware of the issue, but since it's just statistics and not
>> anything 'important', I've been spending my time on those crashing bugs.
>>
>>> inconsistent task state" warning with the below diff. I'm not sure if my
>>> approach is right which if why I'm pasting the diff before sending out
>>> an official series. Any comments or testing is greatly appreciated.
> 
> This fixes the bug for me.

Thank you for testing :)

> 
>> Anyway, assuming PSI wants to preserve current semantics, does something
>> like the below work?
> 
> This doesn't. But it's a different corruption now:
> 
> [    2.298408] psi: inconsistent task state! task=24:cpuhp/1 cpu=1 psi_flags=10 clear=14 set=0

I hit the same log (clear 14, set 0) and I tried the below changes on
top of Peter's diff:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d766fb9fbc4..9cf3d4359994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2012,9 +2012,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
  	if (!(flags & ENQUEUE_NOCLOCK))
  		update_rq_clock(rq);
  
-	if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
+	if (!(flags & ENQUEUE_RESTORE) && (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED))) {
  		sched_info_enqueue(rq, p);
-		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+		psi_enqueue(p, ((flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) ||
+			       (flags & ENQUEUE_DELAYED));
  	}
  
  	p->sched_class->enqueue_task(rq, p, flags);
--

... but it just changes the warning to:

     psi: task underflow! cpu=65 t=0 tasks=[0 0 0 0] clear=1 set=4
     psi: task underflow! cpu=31 t=0 tasks=[0 0 1 0] clear=1 set=0

Doing a dump_stack(), I see it come from psi_enqueue() and
psi_ttwu_dequeue() and I see "clear=1" as the common theme. I've
stared at it for a while but I'm at a loss currently. If something
jumps out, I'll update here.

Thank you again both for taking a look.

-- 
Thanks and Regards,
Prateek

> 
> It's psi_sched_switch(.sleep=true) trying to clear the running state
> and the queued state, but finds only the running state set.
> 
> I don't think it's an erroneous dequeue. __schedule() has that
> block_task() dequeue before the switch, but the DEQUEUE_SLEEP makes it
> a no-op and leaves the combined update to psi_sched_switch().
> 
> It looks instead it's missing an enqueue callback. This triggers:
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 020d58967d4e..09a251e3986d 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -922,6 +922,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>   	u64 now = cpu_clock(cpu);
>   
>   	if (next->pid) {
> +		WARN_ON_ONCE(!(next->psi_flags & TSK_RUNNING));
> +
>   		psi_flags_change(next, 0, TSK_ONCPU);
>   		/*
>   		 * Set TSK_ONCPU on @next's cgroups. If @next shares any
> 
>> ---
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 43e453ab7e20..0d766fb9fbc4 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2012,7 +2012,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>>   	if (!(flags & ENQUEUE_NOCLOCK))
>>   		update_rq_clock(rq);
>>   
>> -	if (!(flags & ENQUEUE_RESTORE)) {
>> +	if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
>>   		sched_info_enqueue(rq, p);
>>   		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
>>   	}
> 
> ... we must be skipping here when we shouldn't.
> 
> I tried moving it past ->enqueue_task(), like we did for uclamp, to
> get the enqueue when sched_delayed is cleared by the callback and task
> is considered properly queued again. However, that results in yet
> another problem:
> 
> [    4.624776] psi: inconsistent task state! task=161:systemd-ssh-gen cpu=1 psi_flags=15 clear=14 set=1
> 
> This is a psi_sched_switch() trying to clear running|queued and set
> iowait. Task is already running|queued|iowait. It looks like we had a
> genuine wakeup that was signaled with psi_enqueue(.wakeup=false) (so
> it didn't clear the iowait).
> 
>> @@ -2039,7 +2039,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>>   	if (!(flags & DEQUEUE_NOCLOCK))
>>   		update_rq_clock(rq);
>>   
>> -	if (!(flags & DEQUEUE_SAVE)) {
>> +	if (!(flags & DEQUEUE_SAVE) && !p->se.sched_delayed) {
>>   		sched_info_dequeue(rq, p);
>>   		psi_dequeue(p, flags & DEQUEUE_SLEEP);
>>   	}
>> @@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
>>   	 * as a preemption by schedule_debug() and RCU.
>>   	 */
>>   	bool preempt = sched_mode > SM_NONE;
>> +	bool block = false;
>>   	unsigned long *switch_count;
>>   	unsigned long prev_state;
>>   	struct rq_flags rf;
>> @@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
>>   			 * After this, schedule() must not care about p->state any more.
>>   			 */
>>   			block_task(rq, prev, flags);
>> +			block = true;
>>   		}
>>   		switch_count = &prev->nvcsw;
>>   	}
>> @@ -6667,7 +6669,7 @@ static void __sched notrace __schedule(int sched_mode)
>>   
>>   		migrate_disable_switch(rq, prev);
>>   		psi_account_irqtime(rq, prev, next);
>> -		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
>> +		psi_sched_switch(prev, next, block);
>>   
>>   		trace_sched_switch(preempt, prev, next, prev_state);