lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 24 Jun 2024 13:58:34 -0700
From: Doug Anderson <dianders@...omium.org>
To: Johan Hovold <johan@...nel.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Jiri Slaby <jirislaby@...nel.org>, 
	Yicong Yang <yangyicong@...ilicon.com>, Tony Lindgren <tony@...mide.com>, 
	Andy Shevchenko <andriy.shevchenko@...ux.intel.com>, Johan Hovold <johan+linaro@...nel.org>, 
	John Ogness <john.ogness@...utronix.de>, linux-arm-msm@...r.kernel.org, 
	Bjorn Andersson <andersson@...nel.org>, Konrad Dybcio <konrad.dybcio@...aro.org>, 
	Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>, 
	Stephen Boyd <swboyd@...omium.org>, linux-serial@...r.kernel.org, 
	linux-kernel@...r.kernel.org, 
	Uwe Kleine-König <u.kleine-koenig@...gutronix.de>, 
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v4 7/8] serial: qcom-geni: Fix suspend while active UART xfer

Hi,

On Mon, Jun 24, 2024 at 5:12 AM Johan Hovold <johan@...nel.org> wrote:
>
> On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote:
> > On devices using Qualcomm's GENI UART it is possible to get the UART
> > stuck such that it no longer outputs data. Specifically, logging in
> > via an agetty on the debug serial port (which was _not_ used for
> > kernel console) and running:
> >   cat /var/log/messages
> > ...and then (via an SSH session) forcing a few suspend/resume cycles
> > causes the UART to stop transmitting.
>
> An easier way to trigger this old bug is to just run a command like
> dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the
> command or hitting ctrl-q to restart tx then triggers the soft lockup.
>
> > The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> > which is called as part of the suspend process. Specific problems with
> > that function:
> > - When an in-progress "tx" command is cancelled it doesn't appear to
> >   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
> >   continued to report that the FIFO wasn't empty. The
> >   qcom_geni_serial_start_tx_fifo() function didn't re-enable
> >   interrupts in this case so the driver would never start transferring
> >   again.
> > - When the driver cancelled the current "tx" command but it forgot to
> >   zero out "tx_remaining". This confused logic elsewhere in the
> >   driver.
> > - From experimentation, it appears that cancelling the "tx" command
> >   could drop some of the queued up bytes.
> >
> > While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> > and shut things down properly, stop_tx() isn't supposed to be a slow
> > function. It is run with local interrupts off and is documented to
> > stop transmitting "as soon as possible". Change the function to just
> > stop new bytes from being queued. In order to make this work, change
> > qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> > always safe to enable the watermark interrupt and the IRQ handler will
> > disable it if it's not needed.
> >
> > For system suspend the queue still needs to be drained. Failure to do
> > so means that the hardware won't provide new interrupts until a
> > "cancel" command is sent. Add draining logic (fixing the issues noted
> > above) at suspend time.
>
> So I spent the better part of the weekend looking at this driver and
> this is one of the bits I worry about with your approach as relying on
> draining anything won't work with hardware flow control.
>
> Cancelling commands can result stalled TX in a number of ways and
> there's still at least one that you don't handle. If you end up with
> data in in the FIFO, the watermark interrupt may never fire when you try
> to restart tx.

Ah, that's a good call. Right now it doesn't really happen since
people tend to hook up the debug UART without flow control lines (as
far as I've seen), but it's good to make sure it works.


> I'm leaning towards fixing the immediate hard lockup regression
> separately and then we can address the older bugs and rework driver
> without having to rush things.

Yeah, that's fair. I've responded to your patch with a
counter-proposal to fix the hard lockup regression, but I agree that
should take priority.


> I've prepared a minimal three patch series which fixes most of the
> discussed issues (hard and soft lockup and garbage characters) and that
> should be backportable as well.
>
> Currently, the diffstat is just:
>
>          drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
>          1 file changed, 25 insertions(+), 11 deletions(-)

I'll respond more in dept to your patches, but I suspect that your
patch series won't fix the issues that Nícolas reported [1]. I also
tested and your patch series doesn't fix the kdb issue talked about in
my patch #8. Part of my reworking of stuff also changed the way that
the console and the polling commands worked since they were pretty
broken. Your series doesn't touch them.

We'll probably need something in-between taking advantage of some of
the stuff you figured out with "cancel" but also doing a bigger rework
than you did.

[1] https://lore.kernel.org/r/46f57349-1217-4594-85b2-84fa3a365c0c@notapiano

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ