linux-kernel - Re: Bug 218665 - nohz_full=0 prevents kernel from booting

Message-ID: <7kugx5ivbplwwrcq5zp37djtpakl55b3pfy36gpbsbcx43dpcs@uheu6iv7gm7h>
Date: Thu, 11 Apr 2024 21:57:53 -0500
From: Bjorn Andersson <andersson@...nel.org>
To: Linux regressions mailing list <regressions@...ts.linux.dev>
Cc: Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: Bug 218665 - nohz_full=0 prevents kernel from booting

On Wed, Apr 10, 2024 at 11:18:04AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.04.24 00:52, Bjorn Andersson wrote:
> > On Tue, Apr 02, 2024 at 10:17:16AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>
> >> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> >> kernel developers don't keep an eye on it, I decided to forward it by mail.
> >>
> >> Tejun, apparently it's caused by a change of yours.
> >>
> >> Note, you have to use bugzilla to reach the reporter, as I sadly[1]
> >> cannot CC them in mails like this.
> >>
> >> Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=218665 :
> >>
> >>> booting the current kernel (6.9.0-rc1, master/712e1425) on x86_64
> >>> with nohz_full=0 causes a page fault and prevents the kernel from
> >>> booting.
> > [...]
> > In addition to this report, I have finally bisected another regression
> > to the same commit:
> > 
> > I start neovim, send SIGSTOP (i.e. ^Z) to it, start another neovim
> > instance and upon sending SIGSTOP to that instance all of userspace
> > locks up - 100% reproducible.
> > 
> > The kernel seems to continue to operate, and tapping the power button
> > dislodges the lockup and I get a clean shutdown.
> > 
> > This is seen on multiple Arm64 (Qualcomm) machines with upstream
> > defconfig since commit '5797b1c18919 ("workqueue: Implement system-wide
> > nr_active enforcement for unbound workqueues")'.
> 
> Hmmm, I had hoped Tejun would reply and share an opinion if these
> problems are related. But that didn't happen. :-/ So let me at least ask
> one question that might help to answer that question: is the machine
> using CPU isolation, like the two other reports about problems caused by
> this commit do (see
> https://bugzilla.kernel.org/show_bug.cgi?id=218665 and
> https://lore.kernel.org/all/20240402105847.GA24832@redhat.com/ for
> details)?
> 

No, this is a clean SMP system running stock arch/arm64/defconfig,
booted with "clk_ignore_unused pd_ignore_unused audit=0" as the command
line.

Regards,
Bjorn