linux-kernel - Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8ae45988-9c5b-8a68-b351-5e1d41da59e6@amd.com>
Date: Fri, 4 Oct 2024 22:31:08 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Klaus Kudielka <klaus.kudielka@...il.com>, Chris Bainbridge
	<chris.bainbridge@...il.com>, <linux-kernel@...r.kernel.org>,
	<bsegall@...gle.com>, <dietmar.eggemann@....com>, <efault@....de>,
	<juri.lelli@...hat.com>, <mgorman@...e.de>, <mingo@...hat.com>,
	<rostedt@...dmis.org>, <tglx@...utronix.de>, <vincent.guittot@...aro.org>,
	<vschneid@...hat.com>, <wuyun.abel@...edance.com>,
	<youssefesmat@...omium.org>, <spasswolf@....de>,
	<regressions@...ts.linux.dev>, Johannes Weiner <hannes@...xchg.org>, "Linux
 regression tracking (Thorsten Leemhuis)" <regressions@...mhuis.info>,
	"Gautham R. Shenoy" <gautham.shenoy@....com>
Subject: Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed
 dequeue

Hello Peter,

On 10/4/2024 6:05 PM, Peter Zijlstra wrote:
> On Fri, Oct 04, 2024 at 04:40:08PM +0530, K Prateek Nayak wrote:
>> Hello folks,
>>
>> On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
>>> On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
>>>> On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
>>>>> Since linux next-20240820 the following messages appears when booting:
>>>>>
>>>>> [    T1] smp: Bringing up secondary CPUs ...
>>>>> [    T1] smpboot: x86: Booting SMP configuration:
>>>>> [    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1
>>>>> This is the line I'm concerend about:
>>>>> [    T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
>>>>> [    T1]   #3  #5  #7  #9 #11 #13 #15
>>>>> [    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
>>>>> [    T1] smp: Brought up 1 node, 16 CPUs
>>>>> [    T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)
>>>>>
>>>>> I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
>>>>> Is this normal or is this something I should worry about?
>>>>>
>>>>> Bert Karwatzki
>>>>
>>>> I am also getting a similar error on boot, and bisected it to the same commit:
>>>>
>>>> [    0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4
>>>>
>>>> #regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d
>>>
>>> Just another data point, while booting 6.12-rc1 on a Turris Omnia:
>>>
>>> [    0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct  3 06:59:25 CEST 2024
>>> [    0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
>>> [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
>>> [    0.000000] OF: fdt: Machine model: Turris Omnia
>>> ...
>>> [    0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
>>> [    0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4
>>>
>>
>> Not sure if someone took a stab at this but I haven't seen the "psi:
> 
> I'm aware of the issue, but since it's just statistics and not
> anything 'important', I've been spending my time on those crashing bugs.
> 
>> inconsistent task state" warning with the below diff. I'm not sure if my
>> approach is right which if why I'm pasting the diff before sending out
>> an official series. Any comments or testing is greatly appreciated.
>>
>> The diff is based on:
>>
>>      git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
>>
>> at commit d4ac164bde7a ("sched/eevdf: Fix wakeup-preempt by checking
>> cfs_rq->nr_running")
> 
> Thanks, I just pushed all that out to tip/sched/urgent.
> 
>> My approach was as follows:
>>
>> o psi_dequeue() relied on psi_sched_switch() to set the PSI flags
>>    appropriately for a dequeued task. However, psi_sched_switch() used
>>    "!task_on_rq_queued(prev)" to judge if the prev task is blocked which
>>    is now untrue with DELAYED_DEQUEUE. Fix it by checking
>>    "p->se.sched_delayed" as well. I also added a matching check for
>>    ENQUEUE_DELAYED for psi_enqueue().
> 
> We already determine the whole sleep state earlier, the whole having
> called block_task() is a clue, perhaps we should propagate that state
> instead of trying to divinate it again.

Yup that makes sense!

> 
>> o With the above, the warning was put off for a few more seconds but it
>>    still appeared. I dumped all PSI flag transition along with
>>    "tsk->se.sched_delayed" to see what trips it and I saw the following
>>    state changes for the task that finally tripped it:
>>
>>      psi: task state: task=18:rcu_preempt cpu=0 psi_flags=0 clear=0 set=0 delayed=1
>>      psi: task state: task=18:rcu_preempt cpu=128 psi_flags=0 clear=0 set=4 delayed=1
>>      psi: task state: task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
>>      psi: inconsistent task state! task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
>>
>>    Note that cpu switched with "tsk->se.sched_delayed" still set which
>>    got me looking at the task migration path. The warning added below
>>    in "deactivate_task()" tripped without fail, just before the PSI
>>    warning was logged.
>>
>>    To prevent migration of a delayed entity (XXX: Is it a good idea?)
> 
> It is not. By migrating the entities they can get picked sooner and the
> delayed thing gets removed sooner. Less 'hidden' weight.

True that! I was thinking moving queued load could also potentially help
delayed entities being picked faster on the rq where they were delayed.
Both seem to help in one way or the other but I don't have any solid
data to conclusively say which might be better.

> 
>>    we do a "account_task_dequeue()" in the delayed dequeue case to
>>    remove the task from the "rq->cfs_list", thus removing it from the
>>    purview of the load balancer.
> 
> Anyway, assuming PSI wants to preserve current semantics, does something
> like the below work?

I've updated the details from my testing on the parallel thread by
Johannes.

tl;dr I still see PSI warnings, some more tinkering on top of your
changes altered the warning to "psi: task underflow!". So far, no
luck figuring out how that comes about.

Thank you for taking a look and for the quick patch!

-- 
Thanks and Regards,
Prateek

> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 43e453ab7e20..0d766fb9fbc4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2012,7 +2012,7 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>   	if (!(flags & ENQUEUE_NOCLOCK))
>   		update_rq_clock(rq);
>   
> -	if (!(flags & ENQUEUE_RESTORE)) {
> +	if (!(flags & ENQUEUE_RESTORE) && !p->se.sched_delayed) {
>   		sched_info_enqueue(rq, p);
>   		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
>   	}
> @@ -2039,7 +2039,7 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>   	if (!(flags & DEQUEUE_NOCLOCK))
>   		update_rq_clock(rq);
>   
> -	if (!(flags & DEQUEUE_SAVE)) {
> +	if (!(flags & DEQUEUE_SAVE) && !p->se.sched_delayed) {
>   		sched_info_dequeue(rq, p);
>   		psi_dequeue(p, flags & DEQUEUE_SLEEP);
>   	}
> @@ -6537,6 +6537,7 @@ static void __sched notrace __schedule(int sched_mode)
>   	 * as a preemption by schedule_debug() and RCU.
>   	 */
>   	bool preempt = sched_mode > SM_NONE;
> +	bool block = false;
>   	unsigned long *switch_count;
>   	unsigned long prev_state;
>   	struct rq_flags rf;
> @@ -6622,6 +6623,7 @@ static void __sched notrace __schedule(int sched_mode)
>   			 * After this, schedule() must not care about p->state any more.
>   			 */
>   			block_task(rq, prev, flags);
> +			block = true;
>   		}
>   		switch_count = &prev->nvcsw;
>   	}
> @@ -6667,7 +6669,7 @@ static void __sched notrace __schedule(int sched_mode)
>   
>   		migrate_disable_switch(rq, prev);
>   		psi_account_irqtime(rq, prev, next);
> -		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
> +		psi_sched_switch(prev, next, block);
>   
>   		trace_sched_switch(preempt, prev, next, prev_state);
>