lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 24 Jun 2024 14:23:52 -0700
From: Doug Anderson <dianders@...omium.org>
To: Johan Hovold <johan+linaro@...nel.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Jiri Slaby <jirislaby@...nel.org>, 
	Konrad Dybcio <konrad.dybcio@...aro.org>, Bjorn Andersson <andersson@...nel.org>, 
	linux-arm-msm@...r.kernel.org, linux-serial@...r.kernel.org, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 2/3] serial: qcom-geni: fix soft lockup on sw flow control
 and suspend

Hi,

On Mon, Jun 24, 2024 at 6:31 AM Johan Hovold <johan+linaro@...nel.org> wrote:
>
> The stop_tx() callback is used to implement software flow control and
> must not discard data as the Qualcomm GENI driver is currently doing
> when there is an active TX command.
>
> Cancelling an active command can also leave data in the hardware FIFO,
> which prevents the watermark interrupt from being enabled when TX is
> later restarted. This results in a soft lockup and is easily triggered
> by stopping TX using software flow control in a serial console but this
> can also happen after suspend.
>
> Fix this by only stopping any active command, and effectively clearing
> the hardware fifo, when shutting down the port. Make sure to temporarily
> raise the watermark level so that the interrupt fires when TX is
> restarted.

Nice! I did quite a few experiments, but it sounds like you found
something that I wasn't able to find. Specifically once I cancelled an
ongoing command I could never manage to get it started back up, but it
must have just been that data was still in the FIFO and thus the
watermark never fired again.

When I was experimenting, I also swore that there were cases where
geni would sometimes fully drop bytes when I tried to "cancel" a
command, but maybe I was mistaken. Everything I figured out was
essentially by running experiments and I could easily have had a bug
in my experiment.


> Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
> Cc: stable@...r.kernel.org      # 4.17
> Signed-off-by: Johan Hovold <johan+linaro@...nel.org>
> ---
>  drivers/tty/serial/qcom_geni_serial.c | 28 +++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 1d5d6045879a..72addeb9f461 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -651,13 +651,8 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
>  {
>         u32 irq_en;
>
> -       if (qcom_geni_serial_main_active(uport) ||
> -           !qcom_geni_serial_tx_empty(uport))
> -               return;
> -
>         irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
>         irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
> -
>         writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
>         writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
>  }
> @@ -665,16 +660,28 @@ static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
>  static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
>  {
>         u32 irq_en;
> -       struct qcom_geni_serial_port *port = to_dev_port(uport);
>
>         irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
>         irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
>         writel(0, uport->membase + SE_GENI_TX_WATERMARK_REG);
>         writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
> -       /* Possible stop tx is called multiple times. */

If qcom_geni_serial_stop_tx_fifo() is supposed to be used for UART
flow control and you have a way to stop the transfer immediately
without losing data (by using geni_se_cancel_m_cmd), maybe we should
do that? If the other side wants us to stop transferring data and we
can stop it right away that would be ideal...


> +}
> +
> +static void qcom_geni_serial_clear_tx_fifo(struct uart_port *uport)
> +{
> +       struct qcom_geni_serial_port *port = to_dev_port(uport);
> +
>         if (!qcom_geni_serial_main_active(uport))
>                 return;
>
> +       /*
> +        * Increase watermark level so that TX can be restarted and wait for
> +        * sequencer to start to prevent lockups.
> +        */
> +       writel(port->tx_fifo_depth, uport->membase + SE_GENI_TX_WATERMARK_REG);
> +       qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> +                                       M_TX_FIFO_WATERMARK_EN, true);

Oh, maybe this "wait for sequencer to start to prevent lockups." is
the part that I was missing? Can you explain more about what's going
on here? Why does waiting for the watermark interrupt to fire prevent
lockups? I would have imagined that the watermark interrupt would be
part of the geni hardware and have nothing to do with the firmware
running on the other end, so I'm not sure why it firing somehow would
prevent a lockup. Was this just by trial and error?


> @@ -684,6 +691,8 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
>                 writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
>         }
>         writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> +
> +       port->tx_remaining = 0;
>  }
>
>  static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
> @@ -1069,11 +1078,10 @@ static void qcom_geni_serial_shutdown(struct uart_port *uport)
>  {
>         disable_irq(uport->irq);
>
> -       if (uart_console(uport))
> -               return;

Can you explain this part of the patch? I'm not saying it's wrong to
remove this special case since this driver seems to have lots of
needless special cases that are already handled by the core or by
other parts of the driver, but this change seems unrelated to the rest
of the patch. Could it be a separate patch?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ