linux-kernel - Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995 enqueue_dl_entity (task blocked for more than 28262 seconds)

Message-ID: <20241209140108.GL8562@noisy.programming.kicks-ass.net>
Date: Mon, 9 Dec 2024 15:01:08 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Vineeth Remanan Pillai <vineeth@...byteword.org>
Cc: Joel Fernandes <joel@...lfernandes.org>,
	Ilya Maximets <i.maximets@....org>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, vineethrp@...gle.com,
	shraash@...gle.com, marcel.ziswiler@...ethink.co.uk
Subject: Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995
 enqueue_dl_entity (task blocked for more than 28262 seconds)

On Mon, Dec 09, 2024 at 08:56:43AM -0500, Vineeth Remanan Pillai wrote:

> > So the scenario I had in mind was that we were doing something like:
> >
> >         set_current_state(TASK_INTERRUPTIBLE);
> >         schedule();
> >           deactivate_task()
> >             dl_server_stop();
> >           pick_next_task()
> >             pick_next_task_fair()
> >               sched_balance_newidle()
> >                 rq_unlock(this_rq)
> >
> > at which point another CPU can take our RQ-lock and do:
> >
> >         try_to_wake_up()
> >           ttwu_queue()
> >             rq_lock()
> >             ...
> >             activate_task()
> >               dl_server_start()
> >             wakeup_preempt() := check_preempt_wakeup_fair()
> >               update_curr()
> >                 update_curr_task()
> >                   if (current->dl_server)
> >                     dl_server_update()
> >                       enqueue_dl_entity()
> >
> >
> > Which then also goes *bang*. The above can't happen if we clear
> > current->dl_server in dl_server_stop().
> >
> I also thought this could be a possibility but the previous deactivate
> for this task would have cleared the dl_server, no?

That gets cleared in put_prev_set_next_task(), which gets called *after*
pick_next_task() completes. So until that time, current will have
dl_server set.

> Soon after this in
> update_curr() we again call dl_server_update() if p->dl_server !=
> rq->fair_server and this is also another possibility of a double
> enqueue.

Right, there's a few possible paths there, I've not fully mapped them. But
I think clearing ->dl_server in dl_server_stop() is the cleanest option
for this.


> This should work as well. I was planning to send a second patch with
> the dl_server active flag as it was not strictly the root cause of
> this. But the active flag serves the purpose here and this change
> looks good to me :-). I will test this on my end and let you know. It
> takes more than 12 hours to reproduce in my test case ;-)

Urgh... Thanks!