[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO7JXPgSYCzu0mtnWqBaS8ihmoQXX3WE_Yb_rEYuMeQn+B6KJg@mail.gmail.com>
Date: Mon, 9 Dec 2024 07:29:52 -0500
From: Vineeth Remanan Pillai <vineeth@...byteword.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Joel Fernandes <joel@...lfernandes.org>, Ilya Maximets <i.maximets@....org>,
LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, vineethrp@...gle.com,
shraash@...gle.com
Subject: Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995
enqueue_dl_entity (task blocked for more than 28262 seconds)
On Mon, Dec 9, 2024 at 5:55 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Fri, Dec 06, 2024 at 11:57:30AM -0500, Vineeth Remanan Pillai wrote:
>
> > I was able to reproduce this WARN_ON couple of days back with
> > syzkaller. dlserver's dl_se gets enqueued during a update_curr while
> > the dlserver is stopped. And subsequent dlserver start will cause a
> > double enqueue.
>
> Right, I spotted that hole late last week. There is this thread:
>
> https://lore.kernel.org/all/20241209094941.GF21636@noisy.programming.kicks-ass.net/T/#u
>
> Where I just added this thunk:
>
> @@ -1674,6 +1679,12 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>
> void dl_server_stop(struct sched_dl_entity *dl_se)
> {
> + if (current->dl_server == dl_se) {
> + struct rq *rq = rq_of_dl_se(dl_se);
> + trace_printk("stop fair server %d\n", cpu_of(rq));
> + current->dl_server = NULL;
> + }
> +
> if (!dl_se->dl_runtime)
> return;
>
> Which was my attempt at plugging said hole. But since I do not have
> means of reproduction, I'm not at all sure it is sufficient :/
>
I think I was able to get to the root cause last week. So the issue
seems to be that dlserver is stopped in the pick_task path of dlserver
itself when the sched_delayed is set:
__pick_next_task
=> pick_task_dl -> server_pick_task
=> pick_task_fair
=> pick_next_entity (if (sched_delayed))
=> dequeue_entities
=> dl_server_stop
Now server_pick_task returns NULL and then we set dl_yielded and call
update_curr_dl_se. But dl_se is already dequeued and now the code is
confused and it does all sorts of things including setting a timer to
enqueue it back. This ultimately leads to double enqueue when dlserver
is started next time(based on timing of dl_server_start)
I think we should not call update_curr_dl_se when the dlserver is
dequeued. Based on this I have a small patch and it seems to solve the
issue:
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d9d5a702f1a6..a9f3f020e421 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2419,12 +2419,18 @@ static struct task_struct *__pick_task_dl(struct rq *rq)
if (dl_server(dl_se)) {
p = dl_se->server_pick_task(dl_se);
- if (!p) {
+ if (p) {
+ rq->dl_server = dl_se;
+ } else if (WARN_ON_ONCE(on_dl_rq(dl_se))) {
+ /*
+ * If server_pick_task returns NULL and dlserver is
+ * enqueued, we have a problem. Lets yield and do a
+ * pick again.
+ */
dl_se->dl_yielded = 1;
update_curr_dl_se(rq, dl_se, 0);
goto again;
}
- rq->dl_server = dl_se;
} else {
p = dl_task_of(dl_se);
}
I can send a formal patch if this looks okay to you all..
Thanks,
Vineeth
Powered by blists - more mailing lists