linux-kernel - Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995 enqueue_dl_entity (task blocked for more than 28262 seconds)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAEXW_YQnh9aD5yOv5HTDPk6LruEhi+2-mZ4SfXzivD1wdgpPvA@mail.gmail.com>
Date: Fri, 6 Dec 2024 12:24:17 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: Vineeth Remanan Pillai <vineeth@...byteword.org>
Cc: Ilya Maximets <i.maximets@....org>, LKML <linux-kernel@...r.kernel.org>, 
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, vineethrp@...gle.com, 
	shraash@...gle.com
Subject: Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995
 enqueue_dl_entity (task blocked for more than 28262 seconds)

On Fri, Dec 6, 2024 at 11:57 AM Vineeth Remanan Pillai
<vineeth@...byteword.org> wrote:
>
> On Fri, Dec 6, 2024 at 10:18 AM Joel Fernandes <joel@...lfernandes.org> wrote:
> >
> > On Wed, Dec 04, 2024 at 01:47:44PM +0100, Ilya Maximets wrote:
> > > Hi.  It seems like I'm hitting some bug in the scheduler.
> > >
> > > I'm running some tests with Open vSwitch on v6.12 kernel and some time
> > > 5 to 8 hours down the line I'm getting task blocked splats and I also
> > > have a WARNING triggered in the scheduler code right before that:
> > >
> > > Dec  3 22:19:55 kernel: WARNING: CPU: 27 PID: 3391271 at kernel/sched/deadline.c:1995 enqueue_dl_entity
> > >
> > > I have a lot of processes (kernel threads and userpsace threads) stuck
> > > in DN, Ds, D+ and D states.  It feels like IO tasks are being scheduled,
> > > but scheduler never picks them up or they are not being scheduled at all
> > > for whatever reason, and threads waiting on these tasks are stuck.
> > >
> > > Dec  3 22:22:45 kernel: INFO: task khugepaged:330 blocked for more than 122 seconds.
> > > Dec  3 22:22:45 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 122 seconds.
> > > Dec  3 22:22:45 kernel: INFO: task mv:3483072 blocked for more than 122 seconds.
> > > Dec  3 22:24:48 kernel: INFO: task khugepaged:330 blocked for more than 245 seconds.
> > > Dec  3 22:24:48 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 245 seconds.
> > > Dec  3 22:24:48 kernel: INFO: task ovs-monitor-ips:3480383 blocked for more than 122 seconds.
> > > Dec  3 22:24:48 kernel: INFO: task ovs-monitor-ips:3481787 blocked for more than 122 seconds.
> > > Dec  3 22:24:48 kernel: INFO: task ovs-monitor-ips:3482631 blocked for more than 122 seconds.
> > > Dec  3 22:24:48 kernel: INFO: task mv:3483072 blocked for more than 245 seconds.
> > > Dec  3 22:26:51 kernel: INFO: task khugepaged:330 blocked for more than 368 seconds.
> > > ...
> > > Dec  4 06:11:45 kernel: INFO: task khugepaged:330 blocked for more than 28262 seconds.
> > >
> > > I have two separate instances where this behavior is reproduced.  One is mostly
> > > around file systems, the other was more severe as multiple kernel threads got
> > > stuck in netlink code.  The traces do not have much in common, except that most
> > > of blocked tasks are in scheduling.  The system is also idle, nothing is really
> > > running.  Some of these tasks are holding resources that make other tasks to
> > > block on those resources as well.
> > >
> > > I seem to be able to reproduce the issue, but it takes 5-8 hours to do so.
> > >
> >
> > CC'ing a few more from my team as well.
> >
> > We haven't seen such an issue with the DL server, but we are also testing on
> > slightly older kernels.
> >
> > Its coming from:
> >         WARN_ON_ONCE(on_dl_rq(dl_se));
> >
>
> Thanks for including me Joel :-)
>
> I was able to reproduce this WARN_ON couple of days back with
> syzkaller. dlserver's dl_se gets enqueued during a update_curr while
> the dlserver is stopped. And subsequent dlserver start will cause a
> double enqueue. On the peripheral, we don't track where dlserver is
> active or not directly and an explicit tracking could solve this
> issue. But the root cause is a little more deep and I think I
> understood the real cause. I have a potential fix and doing more
> testing to verify. Will send the fix out soon after a bit more
> verification

Oh, so we _have_ seen this issue :-). Thanks Vineeth, looking forward
to your fix! By the way, I do remember now some variation of this that
happened a long time ago but I thought it was fixed.

 - Joel