Message-ID: <jkj7rk3eosohklyml5wicid4pamahbeqjroosfomherkd4sxdx@qyduw665jhzf>
Date: Mon, 1 Dec 2025 16:07:01 +0530
From: Manivannan Sadhasivam <mani@...nel.org>
To: Val Packett <val@...kett.cool>
Cc: Bjorn Helgaas <helgaas@...nel.org>, 
	Manivannan Sadhasivam <manivannan.sadhasivam@....qualcomm.com>, bhelgaas@...gle.com, linux-pci@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Konrad Dybcio <konrad.dybcio@....qualcomm.com>, 
	Alexey Bogoslavsky <Alexey.Bogoslavsky@...disk.com>, Jeffrey Lien <Jeff.Lien@...disk.com>, 
	Avinash M N <Avinash.M.N@...disk.com>
Subject: Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740
 NVMe SSDs

On Mon, Dec 01, 2025 at 03:48:13AM -0300, Val Packett wrote:
> 
> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
> > [..]
> > There are a couple of points that made me convince myself:
> > 
> > * Other X1E laptops are working fine with ASPM L1.
> > * This laptop has WCN785x WiFi/BT combo card connected to the other controller
> > instance and L1 is working fine for it.
> > * There is no known issue with ASPM L1 in X1E chipsets.
> > 
> > Because of these, I was so certain that the NVMe is the fault here.
> 
> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on
> #aarch64-laptops, that we discussed in another thread..
> 

I believe the other thread you are referring to is this one:
https://lore.kernel.org/linux-pci/21398de7-3dd9-4c43-97d9-7c3002c401e5@packett.cool/

From this, I cannot conclude that the controller was at fault. At least, not
so far.

> But it is a full system freeze, **not** a correctable AER message, and it
> definitely happens with a bunch of various SSDs on various laptops. I
> personally have had it happen both with the SN740 and an SK Hynix drive, on
> a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive,
> but keeping it on for the WiFi, was enough to get to month-long uptime) but
> not specific to any SSD model.
> 

Please confirm whether you disabled all ASPM states (L0s, L1 and L1ss) or just
L1ss for the controller instance where the SSD is connected. Starting from
v6.18-rc3, only L0s and L1 are enabled by default, without any
cmdline/Kconfig changes.
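
One way to answer this from a running system is to read the per-link ASPM
attributes that the kernel exposes in sysfs under
/sys/bus/pci/devices/<BDF>/link/. Below is a minimal sketch, assuming those
attributes are present on your kernel (they only appear when ASPM support is
built in and the link allows it). The device address in the example is
hypothetical and has to be replaced with the SSD's actual BDF from lspci; the
helper name read_link_aspm is also just an illustration.

```python
from pathlib import Path
from typing import Dict

# sysfs attribute names for the ASPM states discussed here:
# L0s, L1, and the L1 substates (L1.1 and L1.2, i.e. "L1ss").
ASPM_ATTRS = ("l0s_aspm", "l1_aspm", "l1_1_aspm", "l1_2_aspm")


def read_link_aspm(link_dir: Path) -> Dict[str, bool]:
    """Return {attribute: enabled} for each ASPM attribute found in link_dir.

    Attributes that do not exist are skipped rather than reported as
    disabled, because a missing file means the state is not exposed at all.
    """
    states: Dict[str, bool] = {}
    for name in ASPM_ATTRS:
        attr = link_dir / name
        if attr.is_file():
            states[name] = attr.read_text().strip() == "1"
    return states


if __name__ == "__main__":
    # Hypothetical BDF: replace with the address of the NVMe SSD.
    link = Path("/sys/bus/pci/devices/0000:01:00.0/link")
    if link.is_dir():
        for state, enabled in read_link_aspm(link).items():
            print(f"{state}: {'enabled' if enabled else 'disabled'}")
    else:
        print(f"{link} not found; ASPM link attributes are not exposed")
```

If l1_aspm reads as disabled while l0s_aspm is still enabled, then only L1 was
turned off for that link, which is a different test from disabling everything.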

> One bit of news I have about it is that I recently started using EL2
> (slbounce), and I did see something that looked like that hang.. but unlike
> in EL1, right before the reboot the panic LED did start blinking. So if that
> was indeed from the same issue, I should now be able to catch it into pstore
> (if pstore works.. trying blk with sdhc instead of efi now 0.o)

That would be helpful. I believe Abel did it on XPS13, but I need to check further.

> Maybe QHEE
> was eating the fault and itself crashing, since it "owns" the PCIe IOMMU
> when it's running.. (???)
> 

Yes, they are all captured by QHEE for post-mortem analysis, which can only be
performed using Qcom tools and on non-production devices. I don't know how to
capture those logs on production laptops.

Anyhow, to isolate this issue to ASPM L1 on the X1E PCIe controller, please
disable all ASPM states by selecting CONFIG_PCIEASPM_PERFORMANCE in Kconfig and
let it run. If you do not see the crash at all for some time (days, ideally),
then the crash was related to an ASPM issue in the controller (since you said
the crash was reproducible with other SSDs as well). If you still see it, there
is something else going wrong.
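
Before starting the long run, it is worth confirming that the performance
policy is really in effect. The kernel reports the policy in
/sys/module/pcie_aspm/parameters/policy, with the active one in square
brackets. Below is a minimal sketch that parses that file; the function name
active_aspm_policy is only an illustration, and the file is assumed to exist
only on kernels built with CONFIG_PCIEASPM.

```python
from pathlib import Path
from typing import Optional

POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text: str) -> Optional[str]:
    """Return the policy name shown in [brackets], or None if none is marked.

    The file lists all policies on one line, for example:
        default [performance] powersave powersupersave
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    if POLICY_PATH.exists():
        print("active ASPM policy:", active_aspm_policy(POLICY_PATH.read_text()))
    else:
        print(f"{POLICY_PATH} not found; kernel may lack CONFIG_PCIEASPM")
```

The same policy can also be selected without rebuilding, by passing
pcie_aspm.policy=performance on the kernel command line.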

- Mani

-- 
மணிவண்ணன் சதாசிவம்
