[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260120112242.3766700-1-va@nvidia.com>
Date: Tue, 20 Jan 2026 11:22:42 +0000
From: Vishwaroop A <va@...dia.com>
To: <leitao@...ian.org>
CC: <thierry.reding@...il.com>, <treding@...dia.com>, <jonathanh@...dia.com>,
<skomatineni@...dia.com>, <ldewangan@...dia.com>, <broonie@...nel.org>,
<linux-spi@...r.kernel.org>, <linux-tegra@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <kernel-team@...a.com>,
<puranjay@...nel.org>, <usamaarif642@...il.com>, Vishwaroop A <va@...dia.com>
Subject: Re: [PATCH 0/6] spi: tegra-qspi: Fix race condition causing NULL pointer dereference and spurious IRQ
Hi Breno,
Thanks for posting this series. I've been working closely with our team at
NVIDIA on this exact issue after the reports from Meta's fleet.
Your analysis is correct, I have some technical feedback on the series:
**Patches 1-2 (IRQ_HANDLED + spinlock for curr_xfer):
These directly address the race we identified. The spinlock-protected read of
curr_xfer in Patch 2 mirrors the fix we developed internally and resolves the
TOCTOU race between the timeout handler and the delayed ISR thread.
**Patch 3 (Spinlock in tegra_qspi_setup_transfer_one): Improves formal correctness**
While our original position was that this lock isn't strictly required due to
call ordering (setup completes before HW start, so no IRQ can fire yet), I
agree that explicit locking here improves maintainability. It makes the
synchronization model clearer for future readers and removes any dependency on
implicit ordering guarantees.
**Patch 4 (device_reset_optional):
Your observation about ACPI platforms lacking _RST is accurate. On server
platforms, device_reset() fails and logs unnecessary errors.
device_reset_optional() is the right semantic here. I'd suggest coordinating
with the device core maintainers (Greg KH, Rafael Wysocki) to ensure this API
addition gets proper review—it's useful beyond just QSPI.
**Patch 5 (synchronize_irq):
This ensures any in-flight ISR thread completes before we proceed with
recovery. It closes a potential gap where a delayed ISR could access state
after a reset. Combined with the spinlock in Patch 2, this provides robust
protection.
**Patch 6 (Timeout error path): Logic issue—needs rework**
I see a problem here. If QSPI_RDY is not set (hardware actually timed out),
you return success (0) from tegra_qspi_handle_timeout(). This causes
__spi_transfer_message_noqueue() to call spi_finalize_current_message() with
status = 0, signaling success to the client when the transfer actually failed.
The correct behavior should be:
- If QSPI_RDY is set: Hardware completed, recover the data, return success (0)
- If QSPI_RDY is NOT set: True timeout, reset hardware, return error (-ETIMEDOUT)
The current logic inverts this. I'd suggest:
if (fifo_status & QSPI_RDY) {
/* HW completed, recover */
handle_cpu/dma_based_xfer();
return 0;
}
/* True timeout */
dev_err(..., "Transfer timeout");
tegra_qspi_handle_error(tqspi);
return -ETIMEDOUT;
---
I saw your tpm_torture_test.sh in the cover letter. We haven't been able to
reproduce the issue locally yet—it seems to require the specific TPM device
configuration and CPU load patterns present in Meta's fleet. If you have
additional details on the reproduction environment (TPM vendor/model, specific
workload characteristics, CPU affinity settings), that would help us validate
the fix on our end.
---
I'm happy to:
- Test this series on our hardware platforms
- Collaborate on v2 with the fixes above
- I will work on hard IRQ handler follow-up that Thierry suggested for
long-term robustness
Let me know how you'd like to coordinate. Thanks again for tackling this—it's
been a high-priority issue for our server customers.
Best regards,
Vishwaroop
Powered by blists - more mailing lists