lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZnlilDj5UrvrVasv@hovoldconsulting.com>
Date: Mon, 24 Jun 2024 14:12:04 +0200
From: Johan Hovold <johan@...nel.org>
To: Douglas Anderson <dianders@...omium.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Jiri Slaby <jirislaby@...nel.org>,
	Yicong Yang <yangyicong@...ilicon.com>,
	Tony Lindgren <tony@...mide.com>,
	Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
	Johan Hovold <johan+linaro@...nel.org>,
	John Ogness <john.ogness@...utronix.de>,
	linux-arm-msm@...r.kernel.org,
	Bjorn Andersson <andersson@...nel.org>,
	Konrad Dybcio <konrad.dybcio@...aro.org>,
	Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
	Stephen Boyd <swboyd@...omium.org>, linux-serial@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Uwe Kleine-König <u.kleine-koenig@...gutronix.de>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v4 7/8] serial: qcom-geni: Fix suspend while active UART
 xfer

On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote:
> On devices using Qualcomm's GENI UART it is possible to get the UART
> stuck such that it no longer outputs data. Specifically, logging in
> via an agetty on the debug serial port (which was _not_ used for
> kernel console) and running:
>   cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles
> causes the UART to stop transmitting.

An easier way to trigger this old bug is to just run a command like
dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the
command or hitting ctrl-q to restart tx then triggers the soft lockup.

> The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> which is called as part of the suspend process. Specific problems with
> that function:
> - When an in-progress "tx" command is cancelled it doesn't appear to
>   fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
>   continued to report that the FIFO wasn't empty. The
>   qcom_geni_serial_start_tx_fifo() function didn't re-enable
>   interrupts in this case so the driver would never start transferring
>   again.
> - When the driver cancelled the current "tx" command but it forgot to
>   zero out "tx_remaining". This confused logic elsewhere in the
>   driver.
> - From experimentation, it appears that cancelling the "tx" command
>   could drop some of the queued up bytes.
> 
> While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO
> and shut things down properly, stop_tx() isn't supposed to be a slow
> function. It is run with local interrupts off and is documented to
> stop transmitting "as soon as possible". Change the function to just
> stop new bytes from being queued. In order to make this work, change
> qcom_geni_serial_start_tx_fifo() to remove some conditions. It's
> always safe to enable the watermark interrupt and the IRQ handler will
> disable it if it's not needed.
> 
> For system suspend the queue still needs to be drained. Failure to do
> so means that the hardware won't provide new interrupts until a
> "cancel" command is sent. Add draining logic (fixing the issues noted
> above) at suspend time.

So I spent the better part of the weekend looking at this driver and
this is one of the bits I worry about with your approach as relying on
draining anything won't work with hardware flow control.

Cancelling commands can result stalled TX in a number of ways and
there's still at least one that you don't handle. If you end up with
data in in the FIFO, the watermark interrupt may never fire when you try
to restart tx.

I'm leaning towards fixing the immediate hard lockup regression
separately and then we can address the older bugs and rework driver
without having to rush things.

I've prepared a minimal three patch series which fixes most of the
discussed issues (hard and soft lockup and garbage characters) and that
should be backportable as well.

Currently, the diffstat is just:

	 drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++-----------
	 1 file changed, 25 insertions(+), 11 deletions(-)

Fixing the hard lockup 6.10-rc1 regression is just a single line.

Johan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ