linux-kernel - Re: [PATCH v2] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <158448096503.88485.8894151453752608519@swboyd.mtv.corp.google.com>
Date:   Tue, 17 Mar 2020 14:36:05 -0700
From:   Stephen Boyd <swboyd@...omium.org>
To:     Douglas Anderson <dianders@...omium.org>,
        Mark Brown <broonie@...nel.org>
Cc:     Alok Chauhan <alokc@...eaurora.org>, skakit@...eaurora.org,
        Douglas Anderson <dianders@...omium.org>,
        Andy Gross <agross@...nel.org>,
        Bjorn Andersson <bjorn.andersson@...aro.org>,
        Girish Mahadevan <girishm@...eaurora.org>,
        linux-arm-msm@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-spi@...r.kernel.org
Subject: Re: [PATCH v2] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt

Quoting Douglas Anderson (2020-03-17 13:37:06)
> Every once in a while (like once in a few months on a device) people
> have seen warnings on devices using spi-geni-qcom like:
> 
>   irq ...: nobody cared (try booting with the "irqpoll" option)
> 
> ...where the interrupt number listed matches up with "spi_geni" in
> /proc/interrupts.
> 
> You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
> Once you get into this state the driver will just stop working.
> 
> Auditing the code makes me believe that it's probably not so good
> checking "cur_mcmd" in the interrupt handler not under spinlock.
> Let's move it to be under spinlock.  Looking more closely at the code,
> it looks as if there are some other places that could be under
> spinlock, so let's add.  It looks as if the original code was assuming
> that by the time that the interrupt handler got called that the write
> to "cur_mcmd" (to make it not CMD_NONE anymore) would make it to the
> processor handling the interrupt.  Perhaps with weakly ordered memory
> this sometimes (though very rarely) happened.
> 
> Let's also add a warning (with the IRQ status) in the case that we
> ever end up getting an interrupt when we think we shouldn't.  This
> would give us a better chance to debug if this patch doesn't help the
> issue.  We'll also try our best to clear the interrupt to hopefully
> get us out of this state.
> 
> Patch is marked "speculative" because I have no way to reproduce this
> problem, so I'm just hoping this fixes it.  Weakly ordered memory
> makes my brain hurt.

It could be that. It could also be the poor design of geni_se_init() and
how it enables many interrupts that this driver doesn't look to handle?
Why do we allow the wrapper to enable all those bits in
M_COMMON_GENI_M_IRQ_EN and S_COMMON_GENI_S_IRQ_EN if nobody is going to
handle them?

> 
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@...omium.org>
> ---
> 
> Changes in v2:
> - Detect true spurious interrupt.
> - Still return IRQ_NONE for state machine mismatch, but print warn.
> 
>  drivers/spi/spi-geni-qcom.c | 35 ++++++++++++++++++++++++++++++-----
>  1 file changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index c3972424af71..92e51ccb427f 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
>         struct geni_se *se = &mas->se;
>         unsigned long time_left;
>  
> -       reinit_completion(&mas->xfer_done);
>         pm_runtime_get_sync(mas->dev);
>         if (!(slv->mode & SPI_CS_HIGH))
>                 set_flag = !set_flag;
>  
> +       spin_lock_irq(&mas->lock);

Why is this spin_lock_irq() vs. spin_lock_irqsave()? This isn't possible
to be called from somewhere that hasn't changed irq flags?

> +       reinit_completion(&mas->xfer_done);
> +
>         mas->cur_mcmd = CMD_CS;
>         if (set_flag)
>                 geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
>         else
>                 geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
> +       spin_unlock_irq(&mas->lock);

This will force on interrupts if they were masked.

>  
>         time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
>         if (!time_left)
> @@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
>         u32 spi_tx_cfg, len;
>         struct geni_se *se = &mas->se;
>  
> +       spin_lock_irq(&mas->lock);
> +
>         spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
>         if (xfer->bits_per_word != mas->cur_bits_per_word) {
>                 spi_setup_word_len(mas, mode, xfer->bits_per_word);
> @@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
>          */
>         if (m_cmd & SPI_TX_ONLY)
>                 writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
> +
> +       spin_unlock_irq(&mas->lock);
>  }
>  
>  static int spi_geni_transfer_one(struct spi_master *spi,
> @@ -478,13 +485,29 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
>         struct geni_se *se = &mas->se;
>         u32 m_irq;
>         unsigned long flags;
> -
> -       if (mas->cur_mcmd == CMD_NONE)
> -               return IRQ_NONE;
> +       irqreturn_t ret = IRQ_HANDLED;
>  
>         spin_lock_irqsave(&mas->lock, flags);
>         m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

Does this read need to be inside the lock?

>  
> +       /* Check for spurious interrupt */
> +       if (!m_irq) {
> +               ret = IRQ_NONE;
> +               goto exit;

I ask because it could be simplified by reading the status and then
immediately returning IRQ_NONE if no bits are set without having to do
the lock/unlock dance. And does writing 0 to the irq clear register do
anything?

> +       }
> +
> +       /*
> +        * If we got a real interrupt but software state machine thinks
> +        * we were idle then give a warning.  We'll do our best to clear
> +        * the interrupt but we'll still return IRQ_NONE.  If this keeps
> +        * happening the system will eventually disable us.
> +        */
> +       if (mas->cur_mcmd == CMD_NONE) {
> +               pr_warn("Unexpected SPI interrupt: %#010x\n", m_irq);
> +               ret = IRQ_NONE;
> +               goto exit;
> +       }
> +
>         if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
>                 geni_spi_handle_rx(mas);
>  
> @@ -523,9 +546,11 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
>                 complete(&mas->xfer_done);
>         }
>  
> +exit:
>         writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
>         spin_unlock_irqrestore(&mas->lock, flags);
> -       return IRQ_HANDLED;
> +
> +       return ret;
>  }
>  
>  static int spi_geni_probe(struct platform_device *pdev)