Message-ID: <23475d2b-3201-41f1-9a60-a951250a9d60@packett.cool>
Date: Thu, 4 Dec 2025 18:28:00 -0300
From: Val Packett <val@...kett.cool>
To: Konrad Dybcio <konrad.dybcio@....qualcomm.com>,
 Manivannan Sadhasivam <mani@...nel.org>, Bjorn Helgaas <helgaas@...nel.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@....qualcomm.com>,
 bhelgaas@...gle.com, linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740
 NVMe SSDs

On 12/4/25 9:51 AM, Konrad Dybcio wrote:

> On 12/1/25 7:48 AM, Val Packett wrote:
>> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
>>> [..]
>>> There are a couple of points that made me convince myself:
>>>
>>> * Other X1E laptops are working fine with ASPM L1.
>>> * This laptop has WCN785x WiFi/BT combo card connected to the other controller
>>> instance and L1 is working fine for it.
>>> * There is no known issue with ASPM L1 in X1E chipsets.
>>>
>>> Because of these, I was so certain that the NVMe is the fault here.
>> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on #aarch64-laptops, that we discussed in another thread..
>>
>> But it is a full system freeze, **not** a correctable AER message, and it definitely happens with a bunch of various SSDs on various laptops. I personally have had it happen both with the SN740 and an SK Hynix drive, on a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive, but keeping it on for the WiFi, was enough to get to month-long uptime) but not specific to any SSD model.
> Are the steps to reproduce roughly
>
> * boot without disabling ASPM
> * wait
> * system reboots on its own (or just freezes?)
>
> ?

Yeah.

The wait can be anywhere from minutes to days; it seems completely random 
and "luck based".

In EL1, the system freezes for a minute and gets rebooted by the watchdog.

In EL2, as I have just now discovered, some cores can still be running 
(presumably those that haven't tried accessing the drive) while others 
hang, so we can get a proper panic. I got this logged to efi_pstore:

<0>[ 1500.017790] watchdog: CPU3: Watchdog detected hard LOCKUP on cpu 4
<4>[ 1500.017801] Modules linked in: [..]
<6>[ 1500.017937] Sending NMI from CPU 3 to CPUs 4:
<0>[ 1510.017956] Kernel panic - not syncing: Hard LOCKUP
<4>[ 1510.017970] Call trace: [one with watchdog_hardlockup_check, from CPU3]
<2>[ 1510.018062] SMP: stopping secondary CPUs
<4>[ 1511.085450] SMP: failed to stop secondary CPUs 4-11

No traces from the frozen cores are logged as they don't respond to NMI. 
They are *completely* wedged.
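
For anyone comparing their own efi_pstore dumps against this one, here is a rough sketch of pulling the relevant CPU numbers out of such a log. It only assumes the two message formats visible in the excerpt above; the function names and the SAMPLE string are made up for illustration and are not part of any kernel tooling.

```python
import re

# Patterns taken from the message formats in the log excerpt above.
LOCKUP_RE = re.compile(
    r"watchdog: CPU(\d+): Watchdog detected hard LOCKUP on cpu (\d+)"
)
STUCK_RE = re.compile(r"SMP: failed to stop secondary CPUs ([\d,\-]+)")


def expand_cpu_list(spec):
    """Expand a CPU list such as '4-11' or '1,3-5' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def summarize_lockup(log_text):
    """Return which CPU detected the lockup, which CPU was locked up,
    and which CPUs could not be stopped during the panic."""
    summary = {"detector": None, "locked": None, "unstoppable": []}
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            summary["detector"] = int(m.group(1))
            summary["locked"] = int(m.group(2))
        m = STUCK_RE.search(line)
        if m:
            summary["unstoppable"] = expand_cpu_list(m.group(1))
    return summary


# Hypothetical sample: the relevant lines copied from the log above.
SAMPLE = """\
<0>[ 1500.017790] watchdog: CPU3: Watchdog detected hard LOCKUP on cpu 4
<6>[ 1500.017937] Sending NMI from CPU 3 to CPUs 4:
<0>[ 1510.017956] Kernel panic - not syncing: Hard LOCKUP
<4>[ 1511.085450] SMP: failed to stop secondary CPUs 4-11
"""

if __name__ == "__main__":
    print(summarize_lockup(SAMPLE))
```

On the excerpt above this reports CPU3 as the detector, CPU4 as the locked-up core, and CPUs 4 through 11 as the ones that never responded to the stop request.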

~val

