linux-kernel - Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aW-seUXIJv4Lz7bK@gmail.com>
Date: Tue, 20 Jan 2026 08:49:23 -0800
From: Breno Leitao <leitao@...ian.org>
To: Vishwaroop A <va@...dia.com>
Cc: thierry.reding@...il.com, treding@...dia.com, jonathanh@...dia.com, 
	skomatineni@...dia.com, ldewangan@...dia.com, broonie@...nel.org, 
	linux-spi@...r.kernel.org, linux-tegra@...r.kernel.org, linux-kernel@...r.kernel.org, 
	kernel-team@...a.com, puranjay@...nel.org, usamaarif642@...il.com
Subject: Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL
 pointer dereference and spurious IRQ

Hello Vishwaroop,

First of all, thanks for you answer and help,  

On Tue, Jan 20, 2026 at 11:22:42AM +0000, Vishwaroop A wrote:
> Thanks for posting this series. I've been working closely with our team at 
> NVIDIA on this exact issue after the reports from Meta's fleet.
> 
> Your analysis is correct, I have some technical feedback on the series:
> 
> **Patches 1-2 (IRQ_HANDLED + spinlock for curr_xfer):
> 
> These directly address the race we identified. The spinlock-protected read of 
> curr_xfer in Patch 2 mirrors the fix we developed internally and resolves the 
> TOCTOU race between the timeout handler and the delayed ISR thread.
> 
> **Patch 3 (Spinlock in tegra_qspi_setup_transfer_one): Improves formal correctness**
> 
> While our original position was that this lock isn't strictly required due to 
> call ordering (setup completes before HW start, so no IRQ can fire yet), I 
> agree that explicit locking here improves maintainability. It makes the 
> synchronization model clearer for future readers and removes any dependency on 
> implicit ordering guarantees.
> 
> **Patch 4 (device_reset_optional):
> 
> Your observation about ACPI platforms lacking _RST is accurate. On server 
> platforms, device_reset() fails and logs unnecessary errors. 
> device_reset_optional() is the right semantic here. I'd suggest coordinating 
> with the device core maintainers (Greg KH, Rafael Wysocki) to ensure this API 
> addition gets proper review—it's useful beyond just QSPI.

We are counting patch numbers differently. Patch 4 in this patchset is called
"[PATCH 4/6] spi: tegra210-quad: Protect curr_xfer in tegra_qspi_combined_seq_xfer",
 and basically protect the curr_xfer clearing at the exit path of
tegra_qspi_combined_seq_xfer() with the spinlock to prevent a race with the
interrupt handler that reads this field.

https://lore.kernel.org/all/20260116-tegra_xfer-v1-4-02d96c790619@debian.org/

The discussion about device_reset_optional() happened a while ago (March 2025),
but it never landed. Do you want me to include in this patchset?

Here is the discussion link:
https://lore.kernel.org/all/20250317-tegra-v1-1-78474efc0386@debian.org/

> **Patch 6 (Timeout error path): Logic issue—needs rework**

I understand you are talking about patch "[PATCH 1/6] spi:
tegra210-quad: Return IRQ_HANDLED when timeout already processed
transfer", right?

https://lore.kernel.org/all/20260116-tegra_xfer-v1-1-02d96c790619@debian.org/

> I see a problem here. If QSPI_RDY is not set (hardware actually timed out), 
> you return success (0) from tegra_qspi_handle_timeout(). This causes 
> __spi_transfer_message_noqueue() to call spi_finalize_current_message() with 
> status = 0, signaling success to the client when the transfer actually failed.
> 
> The correct behavior should be:
> - If QSPI_RDY is set: Hardware completed, recover the data, return success (0)
> - If QSPI_RDY is NOT set: True timeout, reset hardware, return error (-ETIMEDOUT)

Thanks for the heads-up.

> The current logic inverts this. I'd suggest:
> 
>   if (fifo_status & QSPI_RDY) {
>       /* HW completed, recover */
>       handle_cpu/dma_based_xfer();
>       return 0;
>   }
>   /* True timeout */
>   dev_err(..., "Transfer timeout");
>   tegra_qspi_handle_error(tqspi);
>   return -ETIMEDOUT;

I am not sure we can return -ETIMEDOUT here, given its return function
is irqreturn, which is:

  /**
   * enum irqreturn - irqreturn type values
   * @IRQ_NONE:           interrupt was not from this device or was not handled
   * @IRQ_HANDLED:        interrupt was handled by this device
   * @IRQ_WAKE_THREAD:    handler requests to wake the handler thread
   */
  enum irqreturn {
          IRQ_NONE                = (0 << 0),
          IRQ_HANDLED             = (1 << 0),
          IRQ_WAKE_THREAD         = (1 << 1),
  };


What about something as:

diff --git a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9b..028b0b0cfdb25 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -1552,15 +1552,30 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
 {
        struct tegra_qspi *tqspi = context_data;
+       u32 status;
+
+       /*
+        * Read transfer status to check if interrupt was triggered by transfer
+        * completion
+        */
+       status = tegra_qspi_readl(tqspi, QSPI_TRANS_STATUS);

        /*
         * Occasionally the IRQ thread takes a long time to wake up (usually
         * when the CPU that it's running on is excessively busy) and we have
         * already reached the timeout before and cleaned up the timed out
         * transfer. Avoid any processing in that case and bail out early.
+        *
+        * If no transfer is in progress, check if this was a real interrupt
+        * that the timeout handler already processed, or a spurious one.
         */
-       if (!tqspi->curr_xfer)
+       if (!tqspi->curr_xfer) {
+               /* Spurious interrupt - transfer not ready */
+               if (!(status & QSPI_RDY))
+                       return IRQ_HANDLED;
+               /* Real interrupt, already handled by timeout path */
                return IRQ_NONE;
+       }

        tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);


> I saw your tpm_torture_test.sh in the cover letter. We haven't been able to 
> reproduce the issue locally yet—it seems to require the specific TPM device 
> configuration and CPU load patterns present in Meta's fleet. 

You can try to simulate a timeout using something like:

t a/drivers/spi/spi-tegra210-quad.c b/drivers/spi/spi-tegra210-quad.c
index cdc3cb7c01f9..7116c876c62b 100644
--- a/drivers/spi/spi-tegra210-quad.c
+++ b/drivers/spi/spi-tegra210-quad.c
@@ -159,7 +159,8 @@
 #define DATA_DIR_TX                            BIT(0)
 #define DATA_DIR_RX                            BIT(1)

-#define QSPI_DMA_TIMEOUT                       (msecs_to_jiffies(1000))
+/* INSTRUMENTATION: Reduced from 1000ms to 20ms to trigger timeouts */
+#define QSPI_DMA_TIMEOUT                       (msecs_to_jiffies(20))
 #define DEFAULT_QSPI_DMA_BUF_LEN               SZ_64K

 enum tegra_qspi_transfer_type {
@@ -1552,6 +1553,10 @@ static irqreturn_t handle_dma_based_xfer(struct tegra_qspi *tqspi)
 static irqreturn_t tegra_qspi_isr_thread(int irq, void *context_data)
 {
        struct tegra_qspi *tqspi = context_data;
+       mdelay(10);

        /*
         * Occasionally the IRQ thread takes a long time to wake up (usually

I've also uploaded another stress test in
https://github.com/leitao/debug/tree/main/arm/tegra. Those are the  
stress test I am using to reproduce this issue.

> If you have 
> additional details on the reproduction environment (TPM vendor/model, specific 
> workload characteristics, CPU affinity settings), that would help us validate 
> the fix on our end.

I am not doing anything special, this is a bit more about my host:

  # dmesg | grep tegra
  [    8.903792] tegra-qspi NVDA1513:00: cannot use DMA: -19
  [    8.903806] tegra-qspi NVDA1513:00: falling back to PIO
  [    8.903811] tegra-qspi NVDA1513:00: device reset failed

  # udevadm info -a /dev/tpm0  | cat
  looking at device '/devices/platform/NVDA1513:00/spi_master/spi0/spi-PRP0001:01/tpm/tpm0':
    KERNEL=="tpm0"
    SUBSYSTEM=="tpm"
    DRIVER==""
    ATTR{null_name}=="000bb4c148f37daef5c917ebdd9267edad857263b872d9a9cd05f2af96c060212e6f"
    ATTR{pcr-sha256/0}=="9CC1646906FD362B5F6D182CBADE226057CD32A6F5D6C6197A49AA78B4241929"
    ATTR{pcr-sha256/1}=="BFB3D60EA045EE30E924F239C812E11D7D02D380D94B1A53CDF8E336F4BD5EFF"
    ATTR{pcr-sha256/10}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/11}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/12}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/13}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/14}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/15}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/16}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/17}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/18}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/19}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/2}=="ECE24999889A3DBA6BDC392ED64CA0B6A968DFCAF46D8DBDDAD57A7A0423AD2C"
    ATTR{pcr-sha256/20}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/21}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/22}=="FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"
    ATTR{pcr-sha256/23}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/3}=="3D458CFE55CC03EA1F443F1562BEEC8DF51C75E14A9FCF9A7234A13F198E7969"
    ATTR{pcr-sha256/4}=="7E7811C1F6A74895706FFFBED8180452AABA032209D708A7B08066B16F38524F"
    ATTR{pcr-sha256/5}=="D679AB040FA69F51A392DC467BA09AC0A5040FAAD386A9DA99A23C465DB0BCE1"
    ATTR{pcr-sha256/6}=="3D458CFE55CC03EA1F443F1562BEEC8DF51C75E14A9FCF9A7234A13F198E7969"
    ATTR{pcr-sha256/7}=="65CAF8DD1E0EA7A6347B635D2B379C93B9A1351EDC2AFC3ECDA700E534EB3068"
    ATTR{pcr-sha256/8}=="0000000000000000000000000000000000000000000000000000000000000000"
    ATTR{pcr-sha256/9}=="94BACCCA95B21E689C38511D4FE26D2DCB1E2C20CD5CECC5C4F2FC99C28452BF"
    ATTR{power/control}=="auto"
    ATTR{power/runtime_active_time}=="0"
    ATTR{power/runtime_status}=="unsupported"
    ATTR{power/runtime_suspended_time}=="0"
    ATTR{tpm_version_major}=="2"

  looking at parent device '/devices/platform/NVDA1513:00/spi_master/spi0/spi-PRP0001:01':
    KERNELS=="spi-PRP0001:01"
    SUBSYSTEMS=="spi"
    DRIVERS=="tpm_tis_spi"
    ATTRS{driver_override}==""
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="0"
    ATTRS{power/runtime_status}=="unsupported"
    ATTRS{power/runtime_suspended_time}=="0"
    ATTRS{statistics/bytes}=="0"
    ATTRS{statistics/bytes_rx}=="0"
    ATTRS{statistics/bytes_tx}=="0"
    ATTRS{statistics/errors}=="0"
    ATTRS{statistics/messages}=="0"
    ATTRS{statistics/spi_async}=="0"
    ATTRS{statistics/spi_sync}=="3542838"
    ATTRS{statistics/spi_sync_immediate}=="3542838"
    ATTRS{statistics/timedout}=="0"
    ATTRS{statistics/transfer_bytes_histo_0-1}=="0"
    ATTRS{statistics/transfer_bytes_histo_1024-2047}=="0"
    ATTRS{statistics/transfer_bytes_histo_128-255}=="0"
    ATTRS{statistics/transfer_bytes_histo_16-31}=="0"
    ATTRS{statistics/transfer_bytes_histo_16384-32767}=="0"
    ATTRS{statistics/transfer_bytes_histo_2-3}=="0"
    ATTRS{statistics/transfer_bytes_histo_2048-4095}=="0"
    ATTRS{statistics/transfer_bytes_histo_256-511}=="0"
    ATTRS{statistics/transfer_bytes_histo_32-63}=="0"
    ATTRS{statistics/transfer_bytes_histo_32768-65535}=="0"
    ATTRS{statistics/transfer_bytes_histo_4-7}=="0"
    ATTRS{statistics/transfer_bytes_histo_4096-8191}=="0"
    ATTRS{statistics/transfer_bytes_histo_512-1023}=="0"
    ATTRS{statistics/transfer_bytes_histo_64-127}=="0"
    ATTRS{statistics/transfer_bytes_histo_65536+}=="0"
    ATTRS{statistics/transfer_bytes_histo_8-15}=="0"
    ATTRS{statistics/transfer_bytes_histo_8192-16383}=="0"
    ATTRS{statistics/transfers}=="0"
    ATTRS{statistics/transfers_split_maxsize}=="0"

  looking at parent device '/devices/platform/NVDA1513:00/spi_master/spi0':
    KERNELS=="spi0"
    SUBSYSTEMS=="spi_master"
    DRIVERS==""
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="0"
    ATTRS{power/runtime_status}=="unsupported"
    ATTRS{power/runtime_suspended_time}=="0"
    ATTRS{statistics/bytes}=="0"
    ATTRS{statistics/bytes_rx}=="0"
    ATTRS{statistics/bytes_tx}=="0"
    ATTRS{statistics/errors}=="0"
    ATTRS{statistics/messages}=="0"
    ATTRS{statistics/spi_async}=="0"
    ATTRS{statistics/spi_sync}=="3542838"
    ATTRS{statistics/spi_sync_immediate}=="3542838"
    ATTRS{statistics/timedout}=="0"
    ATTRS{statistics/transfer_bytes_histo_0-1}=="0"
    ATTRS{statistics/transfer_bytes_histo_1024-2047}=="0"
    ATTRS{statistics/transfer_bytes_histo_128-255}=="0"
    ATTRS{statistics/transfer_bytes_histo_16-31}=="0"
    ATTRS{statistics/transfer_bytes_histo_16384-32767}=="0"
    ATTRS{statistics/transfer_bytes_histo_2-3}=="0"
    ATTRS{statistics/transfer_bytes_histo_2048-4095}=="0"
    ATTRS{statistics/transfer_bytes_histo_256-511}=="0"
    ATTRS{statistics/transfer_bytes_histo_32-63}=="0"
    ATTRS{statistics/transfer_bytes_histo_32768-65535}=="0"
    ATTRS{statistics/transfer_bytes_histo_4-7}=="0"
    ATTRS{statistics/transfer_bytes_histo_4096-8191}=="0"
    ATTRS{statistics/transfer_bytes_histo_512-1023}=="0"
    ATTRS{statistics/transfer_bytes_histo_64-127}=="0"
    ATTRS{statistics/transfer_bytes_histo_65536+}=="0"
    ATTRS{statistics/transfer_bytes_histo_8-15}=="0"
    ATTRS{statistics/transfer_bytes_histo_8192-16383}=="0"
    ATTRS{statistics/transfers}=="0"
    ATTRS{statistics/transfers_split_maxsize}=="0"

  looking at parent device '/devices/platform/NVDA1513:00':
    KERNELS=="NVDA1513:00"
    SUBSYSTEMS=="platform"
    DRIVERS=="tegra-qspi"
    ATTRS{driver_override}=="(null)"
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="86751"
    ATTRS{power/runtime_status}=="suspended"
    ATTRS{power/runtime_suspended_time}=="340179486"

  looking at parent device '/devices/platform':
    KERNELS=="platform"
    SUBSYSTEMS==""
    DRIVERS==""
    ATTRS{power/control}=="auto"
    ATTRS{power/runtime_active_time}=="0"
    ATTRS{power/runtime_status}=="unsupported"
    ATTRS{power/runtime_suspended_time}=="0"

> I'm happy to:
> - Test this series on our hardware platforms

Please test with the patch above (reduced QSPI_DMA_TIMEOUT and the mdelay). 

> - Collaborate on v2 with the fixes above

Thanks. I understand that the only concern for v1 is that QSPI_RDY if inverted,
and I can fix it and send a v2. is there anything else you want fixed in v2?

> - I will work on hard IRQ handler follow-up that Thierry suggested for 
>   long-term robustness

This is awesome, appreciate!

Thanks for you support,
--breno