lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20260116-tegra_xfer-v1-0-02d96c790619@debian.org>
Date: Fri, 16 Jan 2026 02:41:40 -0800
From: Breno Leitao <leitao@...ian.org>
To: Thierry Reding <thierry.reding@...il.com>, 
 Jonathan Hunter <jonathanh@...dia.com>, 
 Sowjanya Komatineni <skomatineni@...dia.com>, 
 Laxman Dewangan <ldewangan@...dia.com>, Mark Brown <broonie@...nel.org>, 
 Vishwaroop A <va@...dia.com>
Cc: Thierry Reding <treding@...dia.com>, linux-tegra@...r.kernel.org, 
 linux-spi@...r.kernel.org, linux-kernel@...r.kernel.org, 
 Breno Leitao <leitao@...ian.org>, kernel-team@...a.com, puranjay@...nel.org, 
 usamaarif642@...il.com
Subject: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL
 pointer dereference and spurious IRQ

The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
the following failure sequence:

1) After running for a while, the interrupt gets marked as spurious:

    irq 63: nobody cared (try booting with the "irqpoll" option)
    Disabling IRQ #63

2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
   responsible for signaling xfer_completion.
   Once the interrupt is disabled, xfer_completion is never completed, causing
   transfers to hit the timeout:

    WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0

3) The timeout handler completes the transfer:

    tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete

4) Later, the ISR thread finally runs and crashes trying to dereference
   curr_xfer which the timeout handler already set to NULL:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
    pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
    lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
    Call trace:
      handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)

Root cause analysis identified three issues:

1) Race condition on tqspi->curr_xfer

   The curr_xfer pointer can change during ISR execution without proper
   synchronization. The timeout path clears curr_xfer while the ISR
   thread may still be accessing it.

   This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
   adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
   at entry and exit - the value changes mid-execution. I've used the
   following test to reproduce this issue:

   https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh

   The existing comment in the ISR acknowledges this race but the
   protection is insufficient:

       /*
        * Occasionally the IRQ thread takes a long time to wake up (usually
        * when the CPU that it's running on is excessively busy) and we have
        * already reached the timeout before and cleaned up the timed out
        * transfer. Avoid any processing in that case and bail out early.
        */

   This is bad because tqspi->curr_xfer can just get NULLed

2) Incorrect IRQ_NONE return causing spurious IRQ detection

   When the timeout handler processes a transfer before the ISR thread
   runs, tegra_qspi_isr_thread() returns IRQ_NONE.

   After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
   and disables it - but these were legitimate interrupts that happened to be
   processed by the timeout path first.

   Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
   the interrupt (!?)

3) Complex locking makes full protection difficult

   Ideally the entire tqspi structure would be protected by tqspi->lock,
   but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
   which can sleep, preventing the lock from being held across the entire
   ISR execution.

   Usama Arif has some ideas here, and he can share more.

This patchset addresses these issues:

Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
already processed the transfer. Use the QSPI_RDY bit in
QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
distinguish real interrupts from truly spurious ones.

Protect curr_xfer access with spinlock everywhere in the code, given
Interrupt handling can run in parallel with timeout and transfer.
This prevents the NULL pointer dereference by ensuring curr_xfer cannot
be cleared while being checked.

While this may not provide complete protection for all tqspi fields
(which might be necessary?!), it fixes the observed crashes and prevents
the spurious IRQ detection that was disabling the interrupt entirely.

This was tested with a simple TPM application, where the TPM lives
behind the tegra qspi driver:

https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh

A special thanks for Usama Arif for his help investigating the problem
and helping with the fixes.

Signed-off-by: Breno Leitao <leitao@...ian.org>
---
Breno Leitao (6):
      spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
      spi: tegra210-quad: Move curr_xfer read inside spinlock
      spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
      spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
      spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
      spi: tegra210-quad: Protect curr_xfer check in IRQ handler

 drivers/spi/spi-tegra210-quad.c | 54 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 4 deletions(-)
---
base-commit: 9b7977f9e39b7768c70c2aa497f04e7569fd3e00
change-id: 20260112-tegra_xfer-6acb30a6720f

Best regards,
--  
Breno Leitao <leitao@...ian.org>


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ