Message-ID: <20260122152903.GA1247682@bhelgaas>
Date: Thu, 22 Jan 2026 09:29:03 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Jon Hunter <jonathanh@...dia.com>
Cc: manivannan.sadhasivam@....qualcomm.com,
Bjorn Helgaas <bhelgaas@...gle.com>,
Manivannan Sadhasivam <mani@...nel.org>,
Lorenzo Pieralisi <lpieralisi@...nel.org>,
Krzysztof Wilczyński <kwilczynski@...nel.org>,
Rob Herring <robh@...nel.org>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-arm-msm@...r.kernel.org,
"David E. Box" <david.e.box@...ux.intel.com>,
Kai-Heng Feng <kai.heng.feng@...onical.com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Heiner Kallweit <hkallweit1@...il.com>,
Chia-Lin Kao <acelan.kao@...onical.com>,
"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>,
Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...nel.dk>,
Christoph Hellwig <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
linux-nvme@...ts.infradead.org
Subject: Re: [PATCH v2 1/2] PCI/ASPM: Override the ASPM and Clock PM states
set by BIOS for devicetree platforms
[+cc NVMe folks]
On Thu, Jan 22, 2026 at 12:12:42PM +0000, Jon Hunter wrote:
> ...
> Since this commit was added in Linux v6.18, I have been observing suspend
> test failures on some of our boards. The suspend test suspends the devices
> for 20 secs, and before this change the board would resume in about 27 secs
> (including the 20 sec sleep). After this change the board would take over 80
> secs to resume, and this triggered a failure.
>
> Looking at the logs, I can see it is the NVMe device on the board that is
> having an issue, and I see the reset failing ...
>
> [ 945.754939] r8169 0007:01:00.0 enP7p1s0: Link is Up - 1Gbps/Full -
> flow control rx/tx
> [ 1002.467432] nvme nvme0: I/O tag 12 (400c) opcode 0x9 (Admin Cmd) QID
> 0 timeout, reset controller
> [ 1002.493713] nvme nvme0: 12/0/0 default/read/poll queues
> [ 1003.050448] nvme nvme0: ctrl state 1 is not RESETTING
> [ 1003.050481] OOM killer enabled.
> [ 1003.054035] nvme nvme0: Disabling device after reset failure: -19
>
> From the above timestamps, the delay is coming from the NVMe. I see this
> issue on several boards with different NVMe devices, and I can work around
> it by disabling ASPM L0s/L1 for these devices ...
>
> DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5011, quirk_disable_aspm_l0s_l1);
> DECLARE_PCI_FIXUP_HEADER(0x15b7, 0x5036, quirk_disable_aspm_l0s_l1);
> DECLARE_PCI_FIXUP_HEADER(0x1b4b, 0x1322, quirk_disable_aspm_l0s_l1);
> DECLARE_PCI_FIXUP_HEADER(0xc0a9, 0x540a, quirk_disable_aspm_l0s_l1);
>
> I am curious if you have seen any similar issues?
>
> Other PCIe devices seem to be OK (like the Realtek r8169), but only
> the NVMe is having issues, so I am trying to figure out the best way
> to resolve this.
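(Aside for readers following along: quirk_disable_aspm_l0s_l1() above is
presumably the existing helper in drivers/pci/quirks.c, or a local copy
of it. From memory it is roughly the sketch below; treat this as
illustrative rather than a verbatim quote of the tree:)

static void quirk_disable_aspm_l0s_l1(struct pci_dev *dev)
{
	pci_info(dev, "Disabling ASPM L0s/L1\n");
	/* Ask the ASPM core never to enable L0s or L1 on this device's link */
	pci_disable_link_state(dev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1);
}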
For context, "this commit" refers to f3ac2ff14834, modified by
df5192d9bb0e:
f3ac2ff14834 ("PCI/ASPM: Enable all ClockPM and ASPM states for devicetree platforms")
df5192d9bb0e ("PCI/ASPM: Enable only L0s and L1 for devicetree platforms")
The fact that this suspend issue only affects NVMe reminds me of the
code in dw_pcie_suspend_noirq() [1] that bails out early if L1 is
enabled because of some NVMe expectation:
dw_pcie_suspend_noirq()
{
	...
	/*
	 * If L1SS is supported, then do not put the link into L2 as some
	 * devices such as NVMe expect low resume latency.
	 */
	if (dw_pcie_readw_dbi(pci, offset + PCI_EXP_LNKCTL) & PCI_EXP_LNKCTL_ASPM_L1)
		return 0;
	...
That suggests there's some NVMe/ASPM interaction that the PCI core
doesn't understand yet.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/dwc/pcie-designware-host.c?id=v6.18#n1146
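(If anyone wants to instrument this from the endpoint side: the check the
DWC code does through the dbi space is simply "is ASPM L1 enabled in Link
Control", which can also be read with the generic capability accessors.
A minimal sketch; the helper name is made up for illustration:)

/* Hypothetical helper: report whether ASPM L1 is currently enabled for a device */
static bool aspm_l1_enabled(struct pci_dev *dev)
{
	u16 lnkctl = 0;

	/* PCI_EXP_LNKCTL_ASPMC holds the ASPM Control bits; L1 is one of them */
	pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
	return lnkctl & PCI_EXP_LNKCTL_ASPM_L1;
}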