[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aWonUKplkzM2unQV@orome>
Date: Fri, 16 Jan 2026 13:06:42 +0100
From: Thierry Reding <thierry.reding@...il.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Breno Leitao <leitao@...ian.org>,
Jonathan Hunter <jonathanh@...dia.com>, Sowjanya Komatineni <skomatineni@...dia.com>,
Laxman Dewangan <ldewangan@...dia.com>, Mark Brown <broonie@...nel.org>, Vishwaroop A <va@...dia.com>,
Thierry Reding <treding@...dia.com>, linux-tegra@...r.kernel.org, linux-spi@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com, puranjay@...nel.org
Subject: Re: [PATCH 1/6] spi: tegra210-quad: Return IRQ_HANDLED when timeout
already processed transfer
On Fri, Jan 16, 2026 at 11:48:44AM +0000, Usama Arif wrote:
>
>
> On 16/01/2026 10:41, Breno Leitao wrote:
> > When the ISR thread wakes up late and finds that the timeout handler
> > has already processed the transfer (curr_xfer is NULL), return
> > IRQ_HANDLED instead of IRQ_NONE.
> >
> > Use a similar approach to tegra_qspi_handle_timeout() by reading
> > QSPI_TRANS_STATUS and checking the QSPI_RDY bit to determine if the
> > hardware actually completed the transfer. If QSPI_RDY is set, the
> > interrupt was legitimate and triggered by real hardware activity.
> > The fact that the timeout path handled it first doesn't make it
> > spurious. Returning IRQ_NONE incorrectly suggests the interrupt
> > wasn't for this device, which can cause issues with shared interrupt
> > lines and interrupt accounting.
> >
> > Fixes: b4e002d8a7ce ("spi: tegra210-quad: Fix timeout handling")
> > Signed-off-by: Breno Leitao <leitao@...ian.org>
> > ---
> > drivers/spi/spi-tegra210-quad.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
>
> Hi,
>
> I am not familiar with tegra SPI specifically, so I might be missing something, but
> I was wondering if its actually better to not handle the transfer in timeout at all.
>
> We have 1000ms before timeout and I feel like that should have been enough.
> Looking at other spi drivers: imx, fsl,.. these dont handle transfers in timeout
> and are therefore lockless. tegra_qspi_handle_timeout also at the end checks
> if for some reason the transfer fails (although it looks like handle_cpu/dma_based_xfer
> cant really fail?), and would just return failure.
>
> Removing the attempt to transfer in timeout will get rid of the spinlock, all the bugs,
> make isr handling quicker as you dont need to acquire a spinlock (which might lead to lesser
> timeouts?) and makes the whole driver much more simpler. Or maybe I am missing something?
>
> I have a potential untested patch below for how it would look like. We can work on testing this
> if it makes sense to the maintainers?
These issues on Tegra are subtle. The primary reason for these latest
rounds of fixes is that under typical circumstances there are no issues
because transfers complete normally.
However, it turns out that if there's enough load on CPU 0, the threaded
IRQ handler ends up running much too late (we've seen something on the
order of several seconds and in extreme cases even 10s of seconds).
Granted, these are extreme circumstances, such as when an erroneous
driver blocks CPU 0 for prolonged amounts of time during boot (in this
case it was a display driver reading EDID and using mdelay() in a loop)
or high CPU load at runtime caused by things like NVMe I/O stress tests
and such.
I think the root problem is that we rely entirely on the threaded IRQ
handler, which was fine for some of the more classic cases for this
driver but it ends up breaking in these new cases. The problem is that
most of these transfers actually do complete at a hardware level, i.e.
the data ends up in the SPI client chip, it's just that we get
notification too late. We've verified this by testing with a hard IRQ
handler and that one is never late. The threaded IRQ is the problem, but
we also need threading here because of all the DMA handling, etc.
Given that the transfers at a hardware level complete, we cannot simply
return failure on timeout. The recovery mechanism implemented is the
simplest way to make sure software and hardware state remain consistent.
Unfortunately it's riddled with bugs right now.
So I don't think getting rid of the failure/timeout handling and the
locking is the right move here. I think maybe a significant improvement
would be to not rely on the threaded IRQ handler as much, but instead
employ a combination of a hard IRQ handler and a work queue to defer the
actual processing to. That might also be a bit overkill because the SPI
transfers are already serialized by the core, so we're only ever going
to have a single transfer in flight.
Thierry
Download attachment "signature.asc" of type "application/pgp-signature" (834 bytes)
Powered by blists - more mailing lists