[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO7JXPhdv+Jx_UpAq=-aG-BKwiyjZ2kvuZDM4+GLjbya_=ZJsw@mail.gmail.com>
Date: Fri, 6 Dec 2024 11:57:30 -0500
From: Vineeth Remanan Pillai <vineeth@...byteword.org>
To: Joel Fernandes <joel@...lfernandes.org>
Cc: Ilya Maximets <i.maximets@....org>, LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, vineethrp@...gle.com,
shraash@...gle.com
Subject: Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995
enqueue_dl_entity (task blocked for more than 28262 seconds)
On Fri, Dec 6, 2024 at 10:18 AM Joel Fernandes <joel@...lfernandes.org> wrote:
>
> On Wed, Dec 04, 2024 at 01:47:44PM +0100, Ilya Maximets wrote:
> > Hi. It seems like I'm hitting some bug in the scheduler.
> >
> > I'm running some tests with Open vSwitch on v6.12 kernel and some time
> > 5 to 8 hours down the line I'm getting task blocked splats and I also
> > have a WARNING triggered in the scheduler code right before that:
> >
> > Dec 3 22:19:55 kernel: WARNING: CPU: 27 PID: 3391271 at kernel/sched/deadline.c:1995 enqueue_dl_entity
> >
> > I have a lot of processes (kernel threads and userpsace threads) stuck
> > in DN, Ds, D+ and D states. It feels like IO tasks are being scheduled,
> > but scheduler never picks them up or they are not being scheduled at all
> > for whatever reason, and threads waiting on these tasks are stuck.
> >
> > Dec 3 22:22:45 kernel: INFO: task khugepaged:330 blocked for more than 122 seconds.
> > Dec 3 22:22:45 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 122 seconds.
> > Dec 3 22:22:45 kernel: INFO: task mv:3483072 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task khugepaged:330 blocked for more than 245 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 245 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3480383 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3481787 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3482631 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task mv:3483072 blocked for more than 245 seconds.
> > Dec 3 22:26:51 kernel: INFO: task khugepaged:330 blocked for more than 368 seconds.
> > ...
> > Dec 4 06:11:45 kernel: INFO: task khugepaged:330 blocked for more than 28262 seconds.
> >
> > I have two separate instances where this behavior is reproduced. One is mostly
> > around file systems, the other was more severe as multiple kernel threads got
> > stuck in netlink code. The traces do not have much in common, except that most
> > of blocked tasks are in scheduling. The system is also idle, nothing is really
> > running. Some of these tasks are holding resources that make other tasks to
> > block on those resources as well.
> >
> > I seem to be able to reproduce the issue, but it takes 5-8 hours to do so.
> >
>
> CC'ing a few more from my team as well.
>
> We haven't seen such an issue with the DL server, but we are also testing on
> slightly older kernels.
>
> Its coming from:
> WARN_ON_ONCE(on_dl_rq(dl_se));
>
Thanks for including me Joel :-)
I was able to reproduce this WARN_ON couple of days back with
syzkaller. dlserver's dl_se gets enqueued during a update_curr while
the dlserver is stopped. And subsequent dlserver start will cause a
double enqueue. On the peripheral, we don't track where dlserver is
active or not directly and an explicit tracking could solve this
issue. But the root cause is a little more deep and I think I
understood the real cause. I have a potential fix and doing more
testing to verify. Will send the fix out soon after a bit more
verification
Thanks,
Vineeth
Powered by blists - more mailing lists