Open Source and information security mailing list archives
 
Message-ID: <onw47gzc6mda2unsew36b2cmp2et3ijrjqlmgpueeko5vucgph@wrkaiqlbo2fp>
Date: Fri, 2 May 2025 23:35:29 +0530
From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
To: Hans Zhang <hans.zhang@...tech.com>
Cc: Bjorn Helgaas <helgaas@...nel.org>, kbusch@...nel.org, axboe@...nel.dk, 
	hch@....de, sagi@...mberg.me, linux-nvme@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH] nvme-pci: Fix system hang when ASPM L1 is enabled during
 suspend

On Sat, May 03, 2025 at 12:20:52AM +0800, Hans Zhang wrote:
> 
> 
> On 2025/5/3 00:07, Hans Zhang wrote:
> > 
> > 
> > On 2025/5/2 23:58, Manivannan Sadhasivam wrote:
> > > 
> > > On Fri, May 02, 2025 at 11:49:07PM +0800, Hans Zhang wrote:
> > > > 
> > > > 
> > > > On 2025/5/2 23:00, Bjorn Helgaas wrote:
> > > > > 
> > > > > On Fri, May 02, 2025 at 11:20:51AM +0800, hans.zhang@...tech.com wrote:
> > > > > > From: Hans Zhang <hans.zhang@...tech.com>
> > > > > > 
> > > > > > When PCIe ASPM L1 is enabled (CONFIG_PCIEASPM_POWERSAVE=y), certain
> > > > > 
> > > > > CONFIG_PCIEASPM_POWERSAVE=y only sets the default.  L1 can be enabled
> > > > > dynamically regardless of the config.
> > > > > 
> > > > 
> > > > Dear Bjorn,
> > > > 
> > > > Thank you very much for your reply.
> > > > 
> > > > Yes. To reduce the power consumption of the SOC system, we have
> > > > enabled ASPM L1 by default.
> > > > 
> > > > > > NVMe controllers fail to release LPI MSI-X interrupts during system
> > > > > > suspend, leading to a system hang. This occurs because the driver's
> > > > > > existing power management path does not fully disable the device
> > > > > > when ASPM is active.
> > > > > 
> > > > > I have no idea what this has to do with ASPM L1.  I do see that
> > > > > nvme_suspend() tests pcie_aspm_enabled(pdev) (which seems kind of
> > > > > janky and racy).  But this doesn't explain anything about what would
> > > > > cause a system hang.
> > > > 
> > > > [   92.411265] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM: calling
> > > > pci_pm_suspend_noirq+0x0/0x2c0 @ 322, parent: 0000:90:00.0
> > > > [   92.423028] [pid:322,cpu11,kworker/u24:6]nvme 0000:91:00.0: PM:
> > > > pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 1 usecs
> > > > [   92.433894] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM:
> > > > calling pci_pm_suspend_noirq+0x0/0x2c0 @ 324, parent: pci0000:90
> > > > [   92.445880] [pid:324,cpu10,kworker/u24:7]pcieport 0000:90:00.0: PM:
> > > > pci_pm_suspend_noirq+0x0/0x2c0 returned 0 after 39 usecs
> > > > [   92.457227] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM: calling
> > > > sky1_pcie_suspend_noirq+0x0/0x174 @ 916, parent: soc@0
> > > > [   92.479315] [pid:916,cpu7,bash]cix-pcie-phy a080000.pcie_phy:
> > > > pcie_phy_common_exit end
> > > > [   92.487389] [pid:916,cpu7,bash]sky1-pcie a070000.pcie:
> > > > sky1_pcie_suspend_noirq
> > > > [   92.494604] [pid:916,cpu7,bash]sky1-pcie a070000.pcie: PM:
> > > > sky1_pcie_suspend_noirq+0x0/0x174 returned 0 after 26379 usecs
> > > > [   92.505619] [pid:916,cpu7,bash]sky1-audss-clk
> > > > 7110000.system-controller:clock-controller: PM: calling
> > > > genpd_suspend_noirq+0x0/0x80 @ 916, parent: 7110000.system-controller
> > > > [   92.520919] [pid:916,cpu7,bash]sky1-audss-clk
> > > > 7110000.system-controller:clock-controller: PM:
> > > > genpd_suspend_noirq+0x0/0x80 returned 0 after 1 usecs
> > > > [   92.534214] [pid:916,cpu7,bash]Disabling non-boot CPUs ...
> > > > 
> > > > 
> > > > Hans: Before I added the printk for debugging, it hung here.
> > > > 
> > > > 
> > > > I added the log output after debugging printk.
> > > > 
> > > > Sky1 SOC Root Port driver's suspend function: sky1_pcie_suspend_noirq
> > > > Our hardware is in STR (suspend to RAM), and the controller and PHY will
> > > > lose power.
> > > > 
> > > > So in sky1_pcie_suspend_noirq, the AXI, APB clock, etc. of the PCIe
> > > > controller will be turned off. In sky1_pcie_resume_noirq, the PCIe
> > > > controller and PHY will be reinitialized. If suspend does not close
> > > > the AXI and APB clock, and the AXI is reopened during the resume
> > > > process, the APB clock will cause the reference count of the kernel
> > > > API to accumulate continuously.
> > > > 
> > > 
> > > So this is the actual issue (controller losing power during system
> > > suspend) and everything else (ASPM, MSI-X write) are all side effects
> > > of it.
> > > 
> 
> Dear Mani,
> 
> There are some things I don't understand here. Why doesn't the NVMe SSD
> driver release the MSI/MSI-X interrupts when ASPM is enabled? If ASPM is
> not enabled, however, the MSI/MSI-X interrupts are released.
> 

You mean by calling pci_free_irq_vectors()? If so, the reason is that if ASPM is
unavailable, then the NVMe cannot be put into a low power APST state during
suspend. So shutting it down is the only sane option to save power, at the
cost of increased resume latency. But if ASPM is available, then the driver
doesn't shut down the NVMe, as it relies on APST to keep the NVMe
controller/memory in low power mode.
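
To make the two paths concrete, here is a minimal user-space sketch of the
decision described above. It is not the driver code: the enum, the function
name and the single bool input are hypothetical and only model the ASPM check
(the pcie_aspm_enabled(pdev) test in nvme_suspend() that Bjorn mentioned). The
real suspend path may well look at more conditions than this one.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not kernel identifiers. */
enum suspend_path {
	SUSPEND_FULL_SHUTDOWN,   /* controller shut down, IRQ vectors released */
	SUSPEND_LOW_POWER_STATE, /* controller kept alive, relies on APST */
};

/*
 * Model of the choice described above, keyed only on whether ASPM is
 * enabled for the device (assumption: no other condition is modelled).
 */
static enum suspend_path choose_suspend_path(bool aspm_enabled)
{
	if (!aspm_enabled)
		return SUSPEND_FULL_SHUTDOWN; /* saves power, slower resume */

	return SUSPEND_LOW_POWER_STATE;       /* vectors are not freed */
}
```

In this model, choose_suspend_path(true) is the case you are hitting: the
vectors stay allocated because the device is not shut down, and the trouble
only starts once the Root Port loses power underneath it.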

- Mani

-- 
மணிவண்ணன் சதாசிவம்
