linux-kernel - Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20260120112242.3766700-1-va@nvidia.com>
Date: Tue, 20 Jan 2026 11:22:42 +0000
From: Vishwaroop A <va@...dia.com>
To: <leitao@...ian.org>
CC: <thierry.reding@...il.com>, <treding@...dia.com>, <jonathanh@...dia.com>,
	<skomatineni@...dia.com>, <ldewangan@...dia.com>, <broonie@...nel.org>,
	<linux-spi@...r.kernel.org>, <linux-tegra@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <kernel-team@...a.com>,
	<puranjay@...nel.org>, <usamaarif642@...il.com>, Vishwaroop A <va@...dia.com>
Subject: Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ

Hi Breno,

Thanks for posting this series. I've been working closely with our team at 
NVIDIA on this exact issue after the reports from Meta's fleet.

Your analysis is correct, I have some technical feedback on the series:

**Patches 1-2 (IRQ_HANDLED + spinlock for curr_xfer):

These directly address the race we identified. The spinlock-protected read of 
curr_xfer in Patch 2 mirrors the fix we developed internally and resolves the 
TOCTOU race between the timeout handler and the delayed ISR thread.

**Patch 3 (Spinlock in tegra_qspi_setup_transfer_one): Improves formal correctness**

While our original position was that this lock isn't strictly required due to 
call ordering (setup completes before HW start, so no IRQ can fire yet), I 
agree that explicit locking here improves maintainability. It makes the 
synchronization model clearer for future readers and removes any dependency on 
implicit ordering guarantees.

**Patch 4 (device_reset_optional):

Your observation about ACPI platforms lacking _RST is accurate. On server 
platforms, device_reset() fails and logs unnecessary errors. 
device_reset_optional() is the right semantic here. I'd suggest coordinating 
with the device core maintainers (Greg KH, Rafael Wysocki) to ensure this API 
addition gets proper review—it's useful beyond just QSPI.

**Patch 5 (synchronize_irq):

This ensures any in-flight ISR thread completes before we proceed with 
recovery. It closes a potential gap where a delayed ISR could access state 
after a reset. Combined with the spinlock in Patch 2, this provides robust 
protection.

**Patch 6 (Timeout error path): Logic issue—needs rework**

I see a problem here. If QSPI_RDY is not set (hardware actually timed out), 
you return success (0) from tegra_qspi_handle_timeout(). This causes 
__spi_transfer_message_noqueue() to call spi_finalize_current_message() with 
status = 0, signaling success to the client when the transfer actually failed.

The correct behavior should be:
- If QSPI_RDY is set: Hardware completed, recover the data, return success (0)
- If QSPI_RDY is NOT set: True timeout, reset hardware, return error (-ETIMEDOUT)

The current logic inverts this. I'd suggest:

  if (fifo_status & QSPI_RDY) {
      /* HW completed, recover */
      handle_cpu/dma_based_xfer();
      return 0;
  }
  /* True timeout */
  dev_err(..., "Transfer timeout");
  tegra_qspi_handle_error(tqspi);
  return -ETIMEDOUT;

---

I saw your tpm_torture_test.sh in the cover letter. We haven't been able to 
reproduce the issue locally yet—it seems to require the specific TPM device 
configuration and CPU load patterns present in Meta's fleet. If you have 
additional details on the reproduction environment (TPM vendor/model, specific 
workload characteristics, CPU affinity settings), that would help us validate 
the fix on our end.

---

I'm happy to:
- Test this series on our hardware platforms
- Collaborate on v2 with the fixes above
- I will work on hard IRQ handler follow-up that Thierry suggested for 
  long-term robustness

Let me know how you'd like to coordinate. Thanks again for tackling this—it's 
been a high-priority issue for our server customers.

Best regards,
Vishwaroop