Message-ID: <58e343d9-adf3-4853-9dec-df7c1892d6b2@cixtech.com>
Date: Sat, 3 May 2025 00:20:52 +0800
From: Hans Zhang <hans.zhang@...tech.com>
To: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>, kbusch@...nel.org, axboe@...nel.dk,
 hch@....de, sagi@...mberg.me, linux-nvme@...ts.infradead.org,
 linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH] nvme-pci: Fix system hang when ASPM L1 is enabled during
 suspend



On 2025/5/3 00:07, Hans Zhang wrote:
> 
> 
> On 2025/5/2 23:58, Manivannan Sadhasivam wrote:
>>
>> On Fri, May 02, 2025 at 11:49:07PM +0800, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/2 23:00, Bjorn Helgaas wrote:
>>>>
>>>> On Fri, May 02, 2025 at 11:20:51AM +0800, hans.zhang@...tech.com wrote:
>>>>> From: Hans Zhang <hans.zhang@...tech.com>
>>>>>
>>>>> When PCIe ASPM L1 is enabled (CONFIG_PCIEASPM_POWERSAVE=y), certain
>>>>
>>>> CONFIG_PCIEASPM_POWERSAVE=y only sets the default.  L1 can be enabled
>>>> dynamically regardless of the config.
>>>>
>>>
>>> Dear Bjorn,
>>>
>>> Thank you very much for your reply.
>>>
>>> Yes. To reduce the power consumption of the SoC system, we have enabled
>>> ASPM L1 by default.
>>>
>>>>> NVMe controllers fail to release LPI MSI-X interrupts during system
>>>>> suspend, leading to a system hang. This occurs because the driver's
>>>>> existing power management path does not fully disable the device
>>>>> when ASPM is active.
>>>>
>>>> I have no idea what this has to do with ASPM L1.  I do see that
>>>> nvme_suspend() tests pcie_aspm_enabled(pdev) (which seems kind of
>>>> janky and racy).  But this doesn't explain anything about what would
>>>> cause a system hang.
>>>
>>> [   92.411265] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM: calling pci_pm_suspend_noirq+0x0/0x2c0 @ 322, parent: 0000:90:00.0
>>> [   92.423028] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM: pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 1 usecs
>>> [   92.433894] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM: calling pci_pm_suspend_noirq+0x0/0x2c0 @ 324, parent: pci0000:90
>>> [   92.445880] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM: pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 39 usecs
>>> [   92.457227] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM: calling sky1_pcie_suspend_noirq+0x0/0x174 @ 916, parent: soc@0
>>> [   92.479315] [pid:916,cpu7,bash]cix-pcie-phy a080000.pcie_phy: pcie_phy_common_exit end
>>> [   92.487389] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: sky1_pcie_suspend_noirq
>>> [   92.494604] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM: sky1_pcie_suspend_noirq+0x0/0x174 returned 0 after 26379 usecs
>>> [   92.505619] [pid:916,cpu7,bash]sky1-audss-clk 7110000.system-controller:clock-controller: PM: calling genpd_suspend_noirq+0x0/0x80 @ 916, parent: 7110000.system-controller
>>> [   92.520919] [pid:916,cpu7,bash]sky1-audss-clk 7110000.system-controller:clock-controller: PM: genpd_suspend_noirq+0x0/0x80 returned 0 after 1 usecs
>>> [   92.534214] [pid:916,cpu7,bash]Disabling non-boot CPUs ...
>>>
>>>
>>> Hans: Before I added the debugging printk calls, the system hung here.
>>>
>>> The log above was captured after I added the debugging printk calls.
>>>
>>> The Sky1 SoC Root Port driver's suspend function is
>>> sky1_pcie_suspend_noirq. Our hardware uses STR (suspend to RAM), in which
>>> the PCIe controller and PHY lose power.
>>>
>>> So sky1_pcie_suspend_noirq turns off the AXI and APB clocks, etc., of the
>>> PCIe controller, and sky1_pcie_resume_noirq reinitializes the PCIe
>>> controller and PHY. If suspend did not disable the AXI and APB clocks while
>>> resume enabled them again, the clock reference counts kept by the kernel
>>> clock API would keep accumulating.
>>>
>>
>> So this is the actual issue (controller losing power during system
>> suspend), and everything else (ASPM, MSI-X write) is a side effect of it.
>>

Dear Mani,

There is something I don't understand here. Why doesn't the NVMe SSD
driver release the MSI/MSI-X interrupts when ASPM is enabled, whereas it
does release them when ASPM is not enabled?

Best regards,
Hans
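
For readers following the question above: as far as I can tell from mainline
drivers/nvme/host/pci.c (an assumption worth checking against your own tree),
nvme_suspend() only performs a full controller shutdown, which is the path
that ends up freeing the IRQ vectors, when one of several conditions holds,
and !pcie_aspm_enabled(pdev) is one of them. Otherwise it leaves the
controller enabled and asks it to enter its lowest NVMe power state. The
user-space model below is only a sketch of that decision; struct
suspend_inputs, enum suspend_path and pick_suspend_path() are hypothetical
names invented for illustration, not kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the real nvme_suspend() consults. */
struct suspend_inputs {
	bool suspend_via_firmware;  /* models pm_suspend_via_firmware() */
	unsigned int npss;          /* models ctrl->npss (0 = no extra power states) */
	bool aspm_enabled;          /* models pcie_aspm_enabled(pdev) */
	bool simple_suspend_quirk;  /* models NVME_QUIRK_SIMPLE_SUSPEND being set */
};

enum suspend_path {
	PATH_FULL_SHUTDOWN,    /* controller disabled; IRQ vectors get released */
	PATH_HOST_MANAGED_PS,  /* controller stays enabled; IRQ vectors are kept */
};

/* Mirrors the shape of the check, not the kernel code itself. */
static enum suspend_path pick_suspend_path(const struct suspend_inputs *in)
{
	if (in->suspend_via_firmware || in->npss == 0 ||
	    !in->aspm_enabled || in->simple_suspend_quirk)
		return PATH_FULL_SHUTDOWN;

	return PATH_HOST_MANAGED_PS;
}
```

Under this model, a controller with npss > 0 on a platform that does not
suspend via firmware takes the host-managed path whenever ASPM is enabled,
which would match the observation that the MSI/MSI-X interrupts are only
released when ASPM is off.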

>> Yes, this issue is common across several vendors, and we need to come up
>> with a generic solution instead of hacking up the client drivers. I'm
>> planning to work on it in the coming days. Will keep you in the loop.
>>
> 
> Dear Mani,
> 
> Thank you very much for your reply and for helping to solve this problem.
> If possible, I'd be very glad to help with testing.
> 
> Best regards,
> Hans
> 
> 
> 
