[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=WuQjKC6GHy8d2nuqS-fgsUfxYrJosg3eyC9JU1FPCcjw@mail.gmail.com>
Date: Thu, 10 Dec 2020 15:07:39 -0800
From: Doug Anderson <dianders@...omium.org>
To: Stephen Boyd <swboyd@...omium.org>
Cc: Roja Rani Yarubandi <rojay@...eaurora.org>,
Mark Brown <broonie@...nel.org>,
Andy Gross <agross@...nel.org>,
Bjorn Andersson <bjorn.andersson@...aro.org>,
linux-arm-msm <linux-arm-msm@...r.kernel.org>,
linux-spi <linux-spi@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Akash Asthana <akashast@...eaurora.org>,
msavaliy@....qualcomm.com
Subject: Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr
Hi,
On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@...omium.org> wrote:
>
> Quoting Doug Anderson (2020-12-10 09:14:15)
> >
> > This is my untested belief of what's happening
> >
> > CPU0 CPU1
> > ---- ----
> > setup_fifo_xfer()
> > ...
> > geni_se_setup_m_cmd()
> > <hardware starts transfer>
> > <unrelated interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> <transfer complets in hardware>
> > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN>
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> handle_fifo_timeout()
> > <continued interrupt storm> spin_lock_irq()
> > <continued interrupt storm> mas->cur_xfer = NULL
> > <continued interrupt storm> geni_se_cancel_m_cmd()
> > <continued interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> wait_for_completion_timeout() => timeout
> > <continued interrupt storm> spin_lock_irq()
> > <continued interrupt storm> geni_se_abort_m_cmd()
> > <continued interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> wait_for_completion_timeout() => timeout
> > <interrupt storm ends>
> > geni_spi_isr()
> > spin_lock()
> > if (m_irq & M_RX_FIFO_WATERMARK_EN)
> > geni_spi_handle_rx()
> > mas->cur_xfer NULL derefrence
>
> Ok so the one line summary is "If geni_spi_isr() is sufficiently delayed
> then we may deref NULL in the irq handler because the handler tries to
> deref mas->cur_xfer after the timeout handling code has set it to NULL".
Sure.
> CPU0 CPU1
> ---- ----
> setup_fifo_xfer()
> ...
> geni_se_setup_m_cmd()
> <hardware starts transfer>
> unrelated_irq_handler() spin_unlock_irq()
> ...
> <transfer completes in hardware>
> <hardware sets M_RX_FIFO_WATERMARK_EN>
>
> handle_fifo_timeout()
> spin_lock_irq()
> mas->cur_xfer = NULL
> geni_se_cancel_m_cmd()
> spin_unlock_irq()
> wait_for_completion_timeout() => timeout
> spin_lock_irq()
> geni_se_abort_m_cmd()
> spin_unlock_irq()
> wait_for_completion_timeout() => timeout
> return IRQ_HANDLED;
> gic_handle_irq()
> geni_spi_isr()
> spin_lock()
> if (m_irq & M_RX_FIFO_WATERMARK_EN)
> geni_spi_handle_rx()
> rx_buf = mas->cur_xfer->rx_buf <--- OOPS!
>
> > With my proposed fix, I believe that would transform into:
> >
> > CPU0 CPU1
> > ---- ----
> > setup_fifo_xfer()
> > ...
> > geni_se_setup_m_cmd()
> > <hardware starts transfer>
> > <unrelated interrupt storm> spin_unlock_irq()
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> <transfer complets in hardware>
> > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN>
> > <continued interrupt storm> <time passes>
> > <continued interrupt storm> handle_fifo_timeout()
> > <continued interrupt storm> synchronize_irq()
> > <continued interrupt storm> <time passes>
> > <interrupt storm ends>
> > geni_spi_isr()
> > ...
> > <synchronize_irq() finishes>
> > spin_lock_irq()
> > mas->cur_xfer = NULL
> > geni_se_cancel_m_cmd()
> > spin_unlock_irq()
> > geni_spi_isr()
> > ...
> > wait_for_completion_timeout() => success
> >
> > The extra synchronize_irq() I was suggesting at the end of the
> > function would be an extra bit of paranoia. Maybe a new storm showed
> > up while we were processing the timeout?
>
> Shouldn't we check in the timeout logic to see if m_irq has
> M_RX_FIFO_WATERMARK_EN or M_TX_FIFO_WATERMARK_EN set instead? Similarly
> for the CS assert/deassert stuff. If the timeout hits but one of those
> bits are set then it seems we've run into some poor irqsoff section but
> the hardware is still working. Calling synchronize_irq() wouldn't help
> if the CPU handling the irqs (i.e. CPU0) had irqs off for a long time,
> right? It will only ensure that other irq handlers have completed, which
> may be a problem, but not the only one.
>
> TL;DR: Peek at the irq status register in the timeout logic and skip it
> if the irq is pending?
I don't have tons of experience with synchronize_irq(), but the
function comment seems to indicate that as long as the interrupt is
pending synchronize_irq() will do what we want even if the CPU that
should handle the interrupt is in an irqsoff section. Digging a
little bit I guess it relies upon the interrupt controller being able
to read this state, but (hopefully) the GIC can?
If it doesn't work like I think it does, I'd be OK with peeking in the
IRQ status register, but we shouldn't _skip_ the logic IMO. As long
as we believe that an interrupt could happen in the future we
shouldn't return from handle_fifo_timeout(). It's impossible to
reason about how future transfers would work if the pending interrupt
from the previous transfer could fire at any point.
-Doug
Powered by blists - more mailing lists