[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57a63922-e194-4882-aa52-9447255ea1a3@nvidia.com>
Date: Fri, 30 Jan 2026 13:41:44 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Thierry Reding <thierry.reding@...nel.org>,
Breno Leitao <leitao@...ian.org>
Cc: Thierry Reding <thierry.reding@...il.com>,
Sowjanya Komatineni <skomatineni@...dia.com>,
Laxman Dewangan <ldewangan@...dia.com>, Mark Brown <broonie@...nel.org>,
Vishwaroop A <va@...dia.com>, Thierry Reding <treding@...dia.com>,
linux-tegra@...r.kernel.org, linux-spi@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team@...a.com, soto@...dia.com
Subject: Re: [PATCH v2 0/6] spi: tegra-qspi: Fix race condition causing NULL
pointer dereference and spurious IRQ
On 30/01/2026 10:16, Thierry Reding wrote:
> On Mon, Jan 26, 2026 at 09:50:25AM -0800, Breno Leitao wrote:
>> The tegra-quad-spi driver is crashing on some hosts. Analysis revealed
>> the following failure sequence:
>>
>> 1) After running for a while, the interrupt gets marked as spurious:
>>
>> irq 63: nobody cared (try booting with the "irqpoll" option)
>> Disabling IRQ #63
>>
>> 2) The IRQ handler (tegra_qspi_isr_thread->handle_cpu_based_xfer) is
>> responsible for signaling xfer_completion.
>> Once the interrupt is disabled, xfer_completion is never completed, causing
>> transfers to hit the timeout:
>>
>> WARNING: CPU: 64 PID: 844224 at drivers/spi/spi-tegra210-quad.c:1222 tegra_qspi_transfer_one_message+0x7a0/0x9b0
>>
>> 3) The timeout handler completes the transfer:
>>
>> tegra-qspi NVDA1513:00: QSPI interrupt timeout, but transfer complete
>>
>> 4) Later, the ISR thread finally runs and crashes trying to dereference
>> curr_xfer which the timeout handler already set to NULL:
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
>> pc : handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad]
>> lr : tegra_qspi_handle_timeout+0xb4/0xf0 [spi_tegra210_quad]
>> Call trace:
>> handle_cpu_based_xfer+0x90/0x388 [spi_tegra210_quad] (P)
>>
>> Root cause analysis identified three issues:
>>
>> 1) Race condition on tqspi->curr_xfer
>>
>> The curr_xfer pointer can change during ISR execution without proper
>> synchronization. The timeout path clears curr_xfer while the ISR
>> thread may still be accessing it.
>>
>> This is trivially reproducible by decreasing QSPI_DMA_TIMEOUT and
>> adding instrumentation to tegra_qspi_isr_thread() to check curr_xfer
>> at entry and exit - the value changes mid-execution. I've used the
>> following test to reproduce this issue:
>>
>> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>>
>> The existing comment in the ISR acknowledges this race but the
>> protection is insufficient:
>>
>> /*
>> * Occasionally the IRQ thread takes a long time to wake up (usually
>> * when the CPU that it's running on is excessively busy) and we have
>> * already reached the timeout before and cleaned up the timed out
>> * transfer. Avoid any processing in that case and bail out early.
>> */
>>
>> This is bad because tqspi->curr_xfer can just get NULLed
>>
>> 2) Incorrect IRQ_NONE return causing spurious IRQ detection
>>
>> When the timeout handler processes a transfer before the ISR thread
>> runs, tegra_qspi_isr_thread() returns IRQ_NONE.
>>
>> After enough IRQ_NONE returns, the kernel marks the interrupt as spurious
>> and disables it - but these were legitimate interrupts that happened to be
>> processed by the timeout path first.
>>
>> Interrupt handlers shouldn't return IRQ_NONE, if the driver somehow handled
>> the interrupt (!?)
>>
>> 3) Complex locking makes full protection difficult
>>
>> Ideally the entire tqspi structure would be protected by tqspi->lock,
>> but handle_dma_based_xfer() calls wait_for_completion_interruptible_timeout()
>> which can sleep, preventing the lock from being held across the entire
>> ISR execution.
>>
>> Usama Arif has some ideas here, and he can share more.
>>
>> This patchset addresses these issues:
>>
>> Return IRQ_HANDLED instead of IRQ_NONE when the timeout path has
>> already processed the transfer. Use the QSPI_RDY bit in
>> QSPI_TRANS_STATUS (same approach as tegra_qspi_handle_timeout()) to
>> distinguish real interrupts from truly spurious ones.
>>
>> Protect curr_xfer access with spinlock everywhere in the code, given
>> Interrupt handling can run in parallel with timeout and transfer.
>> This prevents the NULL pointer dereference by ensuring curr_xfer cannot
>> be cleared while being checked.
>>
>> While this may not provide complete protection for all tqspi fields
>> (which might be necessary?!), it fixes the observed crashes and prevents
>> the spurious IRQ detection that was disabling the interrupt entirely.
>>
>> This was tested with a simple TPM application, where the TPM lives
>> behind the tegra qspi driver:
>>
>> https://github.com/leitao/debug/blob/main/arm/tegra/tpm_torture_test.sh
>>
>> A special thanks for Usama Arif for his help investigating the problem
>> and helping with the fixes.
>>
>> Signed-off-by: Breno Leitao <leitao@...ian.org>
>> ---
>> Changes in v2:
>> - Replaced the TODO comment to clarify why the lock is being released.
>> - Link to v1: https://patch.msgid.link/20260116-tegra_xfer-v1-0-02d96c790619@debian.org
>>
>> ---
>> Breno Leitao (6):
>> spi: tegra210-quad: Return IRQ_HANDLED when timeout already processed transfer
>> spi: tegra210-quad: Move curr_xfer read inside spinlock
>> spi: tegra210-quad: Protect curr_xfer assignment in tegra_qspi_setup_transfer_one
>> spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer
>> spi: tegra210-quad: Protect curr_xfer clearing in tegra_qspi_non_combined_seq_xfer
>> spi: tegra210-quad: Protect curr_xfer check in IRQ handler
>>
>> drivers/spi/spi-tegra210-quad.c | 56 ++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 52 insertions(+), 4 deletions(-)
>
> For the series:
>
> Acked-by: Thierry Reding <treding@...dia.com>
This also resolves a NULL pointer deference I see on Tegra194 Jetson
Xavier NX and so ...
Tested-by: Jon Hunter <jonathanh@...dia.com>
Acked-by: Jon Hunter <jonathanh@...dia.com>
Thanks!
Jon
--
nvpublic
Powered by blists - more mailing lists