Message-ID: <58e343d9-adf3-4853-9dec-df7c1892d6b2@cixtech.com>
Date: Sat, 3 May 2025 00:20:52 +0800
From: Hans Zhang <hans.zhang@...tech.com>
To: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>, kbusch@...nel.org, axboe@...nel.dk,
hch@....de, sagi@...mberg.me, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH] nvme-pci: Fix system hang when ASPM L1 is enabled during
suspend

On 2025/5/3 00:07, Hans Zhang wrote:
>
>
> On 2025/5/2 23:58, Manivannan Sadhasivam wrote:
>> On Fri, May 02, 2025 at 11:49:07PM +0800, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/2 23:00, Bjorn Helgaas wrote:
>>>> On Fri, May 02, 2025 at 11:20:51AM +0800, hans.zhang@...tech.com wrote:
>>>>> From: Hans Zhang <hans.zhang@...tech.com>
>>>>>
>>>>> When PCIe ASPM L1 is enabled (CONFIG_PCIEASPM_POWERSAVE=y), certain
>>>>
>>>> CONFIG_PCIEASPM_POWERSAVE=y only sets the default. L1 can be enabled
>>>> dynamically regardless of the config.
>>>>
>>>
>>> Dear Bjorn,
>>>
>>> Thank you very much for your reply.
>>>
>>> Yes. To reduce the power consumption of the SOC system, we have
>>> enabled ASPM L1 by default.
>>>
>>>>> NVMe controllers fail to release LPI MSI-X interrupts during system
>>>>> suspend, leading to a system hang. This occurs because the driver's
>>>>> existing power management path does not fully disable the device
>>>>> when ASPM is active.
>>>>
>>>> I have no idea what this has to do with ASPM L1. I do see that
>>>> nvme_suspend() tests pcie_aspm_enabled(pdev) (which seems kind of
>>>> janky and racy). But this doesn't explain anything about what would
>>>> cause a system hang.
>>>
>>> [ 92.411265] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM: calling pci_pm_suspend_noirq+0x0/0x2c0 @ 322, parent: 0000:90:00.0
>>> [ 92.423028] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM: pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 1 usecs
>>> [ 92.433894] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM: calling pci_pm_suspend_noirq+0x0/0x2c0 @ 324, parent: pci0000:90
>>> [ 92.445880] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM: pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 39 usecs
>>> [ 92.457227] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM: calling sky1_pcie_suspend_noirq+0x0/0x174 @ 916, parent: soc@0
>>> [ 92.479315] [pid:916,cpu7,bash]cix-pcie-phy a080000.pcie_phy: pcie_phy_common_exit end
>>> [ 92.487389] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: sky1_pcie_suspend_noirq
>>> [ 92.494604] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM: sky1_pcie_suspend_noirq+0x0/0x174 returned 0 after 26379 usecs
>>> [ 92.505619] [pid:916,cpu7,bash]sky1-audss-clk 7110000.system-controller:clock-controller: PM: calling genpd_suspend_noirq+0x0/0x80 @ 916, parent: 7110000.system-controller
>>> [ 92.520919] [pid:916,cpu7,bash]sky1-audss-clk 7110000.system-controller:clock-controller: PM: genpd_suspend_noirq+0x0/0x80 returned 0 after 1 usecs
>>> [ 92.534214] [pid:916,cpu7,bash]Disabling non-boot CPUs ...
>>>
>>>
>>> Hans: Before I added the debugging printk calls, the system hung here.
>>>
>>>
>>> The log above was captured after I added the debugging printk calls.
>>>
>>> sky1_pcie_suspend_noirq is the suspend function of the Sky1 SOC Root
>>> Port driver. In STR (suspend to RAM), our hardware removes power from
>>> the PCIe controller and PHY.
>>>
>>> So sky1_pcie_suspend_noirq turns off the AXI and APB clocks, etc. of
>>> the PCIe controller, and sky1_pcie_resume_noirq reinitializes the PCIe
>>> controller and PHY. If suspend did not disable the AXI and APB clocks
>>> while resume enabled them again, the clock enable reference counts in
>>> the kernel would keep accumulating.
>>>
>>
>> So this is the actual issue (controller losing power during system
>> suspend), and everything else (ASPM, MSI-X write) is a side effect of it.
>>

Dear Mani,

There is something I don't understand here. Why doesn't the NVMe driver
release the MSI/MSI-X interrupts when ASPM is enabled, while it does
release them when ASPM is not enabled?
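
For reference, this is how I currently read the decision in nvme_suspend()
(drivers/nvme/host/pci.c), written as a small standalone model. It is only
a sketch under my assumptions, not the kernel code: struct suspend_inputs,
choose_suspend_path() and irqs_released() are names made up for
illustration, and the condition is my simplified reading of the driver, so
please correct me if it is wrong.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for what nvme_suspend() checks. */
struct suspend_inputs {
	bool suspend_via_firmware;  /* models pm_suspend_via_firmware() */
	unsigned int npss;          /* NVMe power states supported by the controller */
	bool aspm_enabled;          /* models pcie_aspm_enabled(pdev) */
	bool quirk_simple_suspend;  /* models NVME_QUIRK_SIMPLE_SUSPEND */
};

enum suspend_path {
	SUSPEND_FULL_DISABLE,     /* controller shut down, IRQ vectors released */
	SUSPEND_LOW_POWER_STATE   /* controller stays enabled, IRQ vectors kept */
};

/* Assumed decision: any one of these conditions selects the full disable. */
static enum suspend_path choose_suspend_path(const struct suspend_inputs *in)
{
	if (in->suspend_via_firmware || in->npss == 0 ||
	    !in->aspm_enabled || in->quirk_simple_suspend)
		return SUSPEND_FULL_DISABLE;

	return SUSPEND_LOW_POWER_STATE;
}

/* In this model, interrupts are released only on the full disable path. */
static bool irqs_released(const struct suspend_inputs *in)
{
	return choose_suspend_path(in) == SUSPEND_FULL_DISABLE;
}
```

If this reading is right, then with ASPM enabled (and npss non-zero, no
firmware-driven suspend, no quirk) the driver only moves the controller to
a low NVMe power state and keeps its MSI/MSI-X vectors, while with ASPM
disabled it takes the full disable path and releases them.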

Best regards,
Hans

>> Yes, this issue is common across several vendors, and we need to come
>> up with a generic solution instead of hacking up the client drivers.
>> I'm planning to work on it in the coming days. Will keep you in the loop.
>>
>
> Dear Mani,
>
> Thank you very much for your reply and for helping to solve this
> problem. If possible, I'd be very glad to help with testing.
>
> Best regards,
> Hans
>
>
>