linux-kernel - Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d16cc372-b4ae-473f-bf86-83469fbead99@amd.com>
Date: Fri, 29 Nov 2024 09:58:17 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Adam Li <adamli@...amperecomputing.com>, <peterz@...radead.org>,
	<mingo@...hat.com>, <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>
CC: <dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<patches@...erecomputing.com>, <cl@...ux.com>, <christian.loehle@....com>,
	<vineethr@...ux.ibm.com>
Subject: Re: [PATCH v2 1/3] sched/fair: Fix warning if NEXT_BUDDY enabled

Hello Adam,

On 11/29/2024 8:51 AM, Adam Li wrote:
> On 11/28/2024 3:29 PM, K Prateek Nayak wrote:
>> Hello Adam,
>>
> Hi Prateek,
> Thanks for comments.
> 
>> On 11/27/2024 11:26 AM, Adam Li wrote:
>>> Enabling NEXT_BUDDY triggers warning, and rcu stall:
>>>
>>> [  124.977300] cfs_rq->next->sched_delayed
>>
>> I could reproduce this with a run of "perf bench sched messaging" but
>> given that we hit this warning, it also means that either
>> set_next_buddy() has incorrectly set a delayed entity as next buddy, or
>> clear_next_buddy() did not clear a delayed entity.
>>
> Yes. The logic of this patch is a delayed entity should not be set as next buddy.
> 
>> I also see PSI splats like:
>>
>>      psi: inconsistent task state! task=2524:kworker/u1028:2 cpu=154 psi_flags=10 clear=14 set=0
>>
>> but the PSI flags it has set "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL)" and
>> the flags it is trying to clear
>> "(TSK_MEMSTALL_RUNNING | TSK_MEMSTALL | TSK_RUNNING)" seem to be only
>> possible if you have picked a dequeued entity for running before its
>> wakeup, which is also perhaps why the "nr_running" computation goes awry
>> and pick_eevdf() returns NULL (which it should never since
>> pick_next_entity() is only called when rq->cfs.nr_running is > 0)
> IIUC, one path for pick_eevdf() to return NULL is:
> pick_eevdf():
> <snip>
> 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> 		curr = NULL; <--- curr is set to NULL

"on_rq" is only cleared when the entity is dequeued so "curr" is in fact
going to sleep (proper sleep) and we've reached at pick_eevdf(),
otherwise, if "curr" is not eligible, there is at least one more tasks
on the cfs_rq which implies best has be found and will be non-null.

> <snip>
> found:
> 	if (!best || (curr && entity_before(curr, best)))
> 		best = curr; <--- curr and best are both NULL

Say "curr" is going to sleep, and there is no "best", in which case
"curr" is already blocked and "cfs_rq->nr_running" should be 0 and it
should have not reached pick_eevdf() in the first place since
pick_next_entity() is only called by pick_task_fair() if
"cfs_rq->nr_running" is non-zero.

So as long as "cfs_rq->nr_running" is non-zero, pick_eevdf() should
return a valid runnable entity. Failure to do so perhaps points to
"entity_eligible()" check going sideways somewhere or a bug in
"nr_running" accounting.

Chenyu had proposed a similar fix long back in
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/
but the consensus was it was covering up a larger problem which
then boiled down to avg_vruntime being computed incorrectly
https://lore.kernel.org/lkml/ZiAWTU5xb%2FJMn%2FHs@chenyu5-mobl2/

> 
> 	return best;  <--- return NULL
> 
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index fbdca89c677f..cd1188b7f3df 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8748,6 +8748,8 @@ static void set_next_buddy(struct sched_entity *se)
>>>                return;
>>>            if (se_is_idle(se))
>>>                return;
>>> +        if (se->sched_delayed)
>>> +            return;
>>
>> I tried to put a SCHED_WARN_ON() here to track where this comes from and
>> seems like it is usually from attach_task() in the load balancing path
>> pulling a delayed task which is set as the next buddy in
>> check_preempt_wakeup_fair()
>>
>> Can you please try the following diff instead of the first two patches
>> and see if you still hit these warnings, stalls, and pick_eevdf()
>> returning NULL?
>>
> Tested. Run specjbb with NEXT_BUDDY enabled, warnings, stalls and panic disappear.

Thank you for testing. I'll let Peter come back on which approach he
prefers :)

> 
> Regards,
> -adam

-- 
Thanks and Regards,
Prateek