[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aXyE1kfP4GeOdYs5@orome>
Date: Fri, 30 Jan 2026 11:16:12 +0100
From: Thierry Reding <thierry.reding@...nel.org>
To: Breno Leitao <leitao@...ian.org>
Cc: Thierry Reding <thierry.reding@...il.com>,
Jonathan Hunter <jonathanh@...dia.com>, Sowjanya Komatineni <skomatineni@...dia.com>,
Laxman Dewangan <ldewangan@...dia.com>, Mark Brown <broonie@...nel.org>, Vishwaroop A <va@...dia.com>,
Thierry Reding <treding@...dia.com>, linux-tegra@...r.kernel.org, linux-spi@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com, soto@...dia.com
Subject: Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL
pointer dereference and spurious IRQ
On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
>
> 1) After running for a while, the interrupt gets marked as spurious:
>
> irq 63: nobody cared (try booting with the "irqpoll" option)
> Disabling IRQ #63
>
> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
> responsible for signaling xfer_completion.
> Once the interrupt is disabled, xfer_completion is never completed, causing
> transfers to hit the timeout:
>
> WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
>
> 3) The timeout handler completes the transfer:
>
> tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
>
> 4) Later, the ISR thread finally runs and crashes trying to dereference
> curr_xfer which the timeout handler already set to NULL:
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
> pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
> lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
> Call trace:
> handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
>
> Root cause analysis identified three issues:
>
> 1) Race condition on tqspi->curr_xfer
>
> The curr_xfer pointer can change during ISR execution without proper
> synchronization. The timeout path clears curr_xfer while the ISR
> thread may still be accessing it.
>
> This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
> adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
> at entry and exit - the value changes mid-execution. I've used the
> following test to reproduce this issue:
>
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>
> The existing comment in the ISR acknowledges this race but the
> protection is insufficient:
>
> /*
> * Occasionally the IRQ thread takes a long time to wake up (usually
> * when the CPU that it's running on is excessively busy) and we have
> * already reached the timeout before and cleaned up the timed out
> * transfer. Avoid any processing in that case and bail out early.
> */
>
> This is bad because tqspi->curr_xfer can just get NULLed
>
> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
>
> When the timeout handler processes a transfer before the ISR thread
> runs, tegra_qspi_isr_thread() returns IRQ_NONE.
>
> After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
> and disables it - but these were legitimate interrupts that happened to be
> processed by the timeout path first.
>
> Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
> the interrupt (!?)
>
> 3) Complex locking makes full protection difficult
>
> Ideally the entire tqspi structure would be protected by tqspi->lock,
> but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
> which can sleep, preventing the lock from being held across the entire
> ISR execution.
>
> Usama Arif has some ideas here, and he can share more.
>
> This patchset addresses these issues:
>
> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
> already processed the transfer. Use the QSPI_RDY bit in
> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
> distinguish real interrupts from truly spurious ones.
>
> Protect curr_xfer access with spinlock everywhere in the code, given
> Interrupt handling can run in parallel with timeout and transfer.
> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
> be cleared while being checked.
>
> While this may not provide complete protection for all tqspi fields
> (which might be necessary?!), it fixes the observed crashes and prevents
> the spurious IRQ detection that was disabling the interrupt entirely.
>
> This was tested with a simple TPM application, where the TPM lives
> behind the tegra qspi driver:
>
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>
> A special thanks for Usama Arif for his help investigating the problem
> and helping with the fixes.
>
> Signed-off-by: Breno Leitao <leitao@...ian.org>
> ---
> Changes in v2:
> - Replaced the TODO comment to clarify why the lock is being released.
> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
>
> ---
> Breno Leitao (6):
> spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
> spi: tegra210-quad: Move curr_xfer read inside spinlock
> spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
> spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
> spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
> spi: tegra210-quad: Protect curr_xfer check in IRQ handler
>
> drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 52 insertions(+), 4 deletions(-)
For the series:
Acked-by: Thierry Reding <treding@...dia.com>
Download attachment "signature.asc" of type "application/pgp-signature" (834 bytes)
Powered by blists - more mailing lists