lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aXyE1kfP4GeOdYs5@orome>
Date: Fri, 30 Jan 2026 11:16:12 +0100
From: Thierry Reding <thierry.reding@...nel.org>
To: Breno Leitao <leitao@...ian.org>
Cc: Thierry Reding <thierry.reding@...il.com>, 
	Jonathan Hunter <jonathanh@...dia.com>, Sowjanya Komatineni <skomatineni@...dia.com>, 
	Laxman Dewangan <ldewangan@...dia.com>, Mark Brown <broonie@...nel.org>, Vishwaroop A <va@...dia.com>, 
	Thierry Reding <treding@...dia.com>, linux-tegra@...r.kernel.org, linux-spi@...r.kernel.org, 
	linux-kernel@...r.kernel.org, kernel-team@...a.com, soto@...dia.com
Subject: Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL
 pointer dereference and spurious IRQ

On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
> the following failure sequence:
> 
> 1) After running for a while, the interrupt gets marked as spurious:
> 
>     irq 63: nobody cared (try booting with the "irqpoll" option)
>     Disabling IRQ #63
> 
> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
>    responsible for signaling xfer_completion.
>    Once the interrupt is disabled, xfer_completion is never completed, causing
>    transfers to hit the timeout:
> 
>     WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
> 
> 3) The timeout handler completes the transfer:
> 
>     tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
> 
> 4) Later, the ISR thread finally runs and crashes trying to dereference
>    curr_xfer which the timeout handler already set to NULL:
> 
>     Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>     pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
>     lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
>     Call trace:
>       handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
> 
> Root cause analysis identified three issues:
> 
> 1) Race condition on tqspi->curr_xfer
> 
>    The curr_xfer pointer can change during ISR execution without proper
>    synchronization. The timeout path clears curr_xfer while the ISR
>    thread may still be accessing it.
> 
>    This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
>    adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
>    at entry and exit - the value changes mid-execution. I've used the
>    following test to reproduce this issue:
> 
>    https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
> 
>    The existing comment in the ISR acknowledges this race but the
>    protection is insufficient:
> 
>        /*
>         * Occasionally the IRQ thread takes a long time to wake up (usually
>         * when the CPU that it's running on is excessively busy) and we have
>         * already reached the timeout before and cleaned up the timed out
>         * transfer. Avoid any processing in that case and bail out early.
>         */
> 
>    This is bad because tqspi->curr_xfer can just get NULLed
> 
> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
> 
>    When the timeout handler processes a transfer before the ISR thread
>    runs, tegra_qspi_isr_thread() returns IRQ_NONE.
> 
>    After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
>    and disables it - but these were legitimate interrupts that happened to be
>    processed by the timeout path first.
> 
>    Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
>    the interrupt (!?)
> 
> 3) Complex locking makes full protection difficult
> 
>    Ideally the entire tqspi structure would be protected by tqspi->lock,
>    but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
>    which can sleep, preventing the lock from being held across the entire
>    ISR execution.
> 
>    Usama Arif has some ideas here, and he can share more.
> 
> This patchset addresses these issues:
> 
> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
> already processed the transfer. Use the QSPI_RDY bit in
> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
> distinguish real interrupts from truly spurious ones.
> 
> Protect curr_xfer access with spinlock everywhere in the code, given
> Interrupt handling can run in parallel with timeout and transfer.
> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
> be cleared while being checked.
> 
> While this may not provide complete protection for all tqspi fields
> (which might be necessary?!), it fixes the observed crashes and prevents
> the spurious IRQ detection that was disabling the interrupt entirely.
> 
> This was tested with a simple TPM application, where the TPM lives
> behind the tegra qspi driver:
> 
> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
> 
> A special thanks for Usama Arif for his help investigating the problem
> and helping with the fixes.
> 
> Signed-off-by: Breno Leitao <leitao@...ian.org>
> ---
> Changes in v2:
> - Replaced the TODO comment to clarify why the lock is being released.
> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
> 
> ---
> Breno Leitao (6):
>       spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
>       spi: tegra210-quad: Move curr_xfer read inside spinlock
>       spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
>       spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
>       spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
>       spi: tegra210-quad: Protect curr_xfer check in IRQ handler
> 
>  drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 52 insertions(+), 4 deletions(-)

For the series:

Acked-by: Thierry Reding <treding@...dia.com>

Download attachment "signature.asc" of type "application/pgp-signature" (834 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ