Message-ID: <YzVvl+rv3iZS9vxk@alley>
Date:   Thu, 29 Sep 2022 12:12:39 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Conor Dooley <conor.dooley@...rochip.com>
Cc:     Thorsten Leemhuis <regressions@...mhuis.info>,
        Conor Dooley <conor@...nel.org>, senozhatsky@...omium.org,
        rostedt@...dmis.org, john.ogness@...utronix.de,
        linux-kernel@...r.kernel.org, regressions@...ts.linux.dev
Subject: Re: [resend][bug] low-probability console lockups since 5.19

On Thu 2022-09-29 10:29:05, Conor Dooley wrote:
> On Thu, Sep 29, 2022 at 11:06:01AM +0200, Thorsten Leemhuis wrote:
> > Hi Conor
> > 
> > On 28.09.22 18:55, Conor Dooley wrote:
> > > On Fri, Sep 23, 2022 at 05:24:17PM +0100, Conor Dooley wrote:
> > >>
> > >> Been bisecting a bug that is causing a boot failure in my CI & have
> > >> ended up here.. The bug in question is a low(ish) probability lock up
> > >> of the serial console, I would estimate about 1-in-5 chance on the
> > >> boards I could actually trigger it on which it has taken me so long
> > >> to realise that this was an actual problem. Thinking back on it, there
> > >> were other failures that I would retroactively attribute to this
> > >> problem too, but I had earlycon disabled
> > 
> > There is one thing I wonder when skimming this thread: was there maybe
> > some other change somewhere in the kernel between the introduction and
> > the revert of the printk console kthreads patches that is the real
> > culprit here that makes existing, older races easier to hit? But I guess
> > in the end that would be very hard to find and it's easier to fix the
> > problem in the console driver... :-/
> 
> Entirely possible that something arrived in the middle, yeah. I've done
> 100s of reboots on that interim section, albeit with the threaded
> printers enabled, as I restarted the bisection several times & never hit
> this failure then.

Interesting. I wonder whether the console driver in use was fixed during
the window when the kthreads were enabled.

> I don't know anything about console/printk/serial drivers unfortunately
> so I will almost certainly not be able to find the problem by
> inspection. I'd rather submit patches than send reports, but I really
> really need some help here. I looked at the two patterns Petr suggested,
> but the former I am not sure applies since the issue is present even
> when earlycon is disabled & the latter appears (to my untrained eye) to
> be accounted for in the 8250 driver.

The problem with the missing port->lock is visible only when the
early console is enabled. But it is really hard to hit without
the kthreads.

The problem with enabled IRQs was visible only with kthreads. The
original code called the console->write() callback with IRQs already
disabled.

The kthreads called the console->write() callback with IRQs enabled.
That made sense: IRQs need to be disabled only when really needed,
and the tested drivers did this correctly.
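
To illustrate the second pattern, here is a minimal userspace model of
a write() callback that does not depend on the caller's IRQ state. It
is only a sketch and not kernel code: every model_* name is made up for
the illustration. A real driver would typically get the same effect
with spin_lock_irqsave() and spin_unlock_irqrestore() on port->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for CPU and port state. Every name is hypothetical. */
static bool model_irqs_enabled = true;
static bool model_port_locked;
static char model_tx_log[64];
static size_t model_tx_len;

/* Models saving the IRQ state: remember the old state, then disable IRQs. */
static bool model_irq_save(void)
{
	bool was_enabled = model_irqs_enabled;

	model_irqs_enabled = false;
	return was_enabled;
}

/* Models restoring the IRQ state: put back whatever the caller had. */
static void model_irq_restore(bool was_enabled)
{
	model_irqs_enabled = was_enabled;
}

/* Models the register access; only valid with IRQs off and the lock held. */
static void model_putchar(char c)
{
	assert(!model_irqs_enabled);
	assert(model_port_locked);
	if (model_tx_len < sizeof(model_tx_log))
		model_tx_log[model_tx_len++] = c;
}

/* Models a write() callback that does not rely on the caller's IRQ state. */
static void model_console_write(const char *s, size_t count)
{
	bool flags = model_irq_save();
	size_t i;

	assert(!model_port_locked);	/* a second holder would deadlock */
	model_port_locked = true;
	for (i = 0; i < count; i++)
		model_putchar(s[i]);
	model_port_locked = false;
	model_irq_restore(flags);
}
```

Calling model_console_write() with model_irqs_enabled set to true
corresponds to the kthread case, and with it set to false to the
original direct printing. In both cases the callback leaves the IRQ
state as it found it.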

Best Regards,
Petr
