Message-ID: <23475d2b-3201-41f1-9a60-a951250a9d60@packett.cool>
Date: Thu, 4 Dec 2025 18:28:00 -0300
From: Val Packett <val@...kett.cool>
To: Konrad Dybcio <konrad.dybcio@....qualcomm.com>,
Manivannan Sadhasivam <mani@...nel.org>, Bjorn Helgaas <helgaas@...nel.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@....qualcomm.com>,
bhelgaas@...gle.com, linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] PCI: Add quirk to disable ASPM L1 for Sandisk SN740 NVMe SSDs

On 12/4/25 9:51 AM, Konrad Dybcio wrote:
> On 12/1/25 7:48 AM, Val Packett wrote:
>> On 11/25/25 2:21 AM, Manivannan Sadhasivam wrote:
>>> [..]
>>> There are a couple of points that made me convince myself:
>>>
>>> * Other X1E laptops are working fine with ASPM L1.
>>> * This laptop has WCN785x WiFi/BT combo card connected to the other controller
>>> instance and L1 is working fine for it.
>>> * There is no known issue with ASPM L1 in X1E chipsets.
>>>
>>> Because of these, I was so certain that the NVMe is the fault here.
>> There is *a* known issue with ASPM L1 on X1E, reported by maaaany users on #aarch64-laptops, that we discussed in another thread..
>>
>> But it is a full system freeze, **not** a correctable AER message, and it definitely happens with a bunch of various SSDs on various laptops. I personally have had it happen both with the SN740 and an SK Hynix drive, on a Latitude 7455. It's an SSD-only issue (disabling ASPM just for the drive, but keeping it on for the WiFi, was enough to get to month-long uptime) but not specific to any SSD model.
> Are the steps to reproduce roughly
>
> * boot without disabling ASPM
> * wait
> * system reboots on its own (or just freezes?)
>
> ?
Yeah.

The wait can be anywhere from minutes to days; it seems completely random
and "luck based".

In EL1, the system freezes for a minute and then gets rebooted by the watchdog.

In EL2, as I have just now discovered, some cores can still be running
(presumably those that haven't tried accessing the drive) while others
hang, so we can get a proper panic. I got this logged to efi_pstore:

<0>[ 1500.017790] watchdog: CPU3: Watchdog detected hard LOCKUP on cpu 4
<4>[ 1500.017801] Modules linked in: [..]
<6>[ 1500.017937] Sending NMI from CPU 3 to CPUs 4:
<0>[ 1510.017956] Kernel panic - not syncing: Hard LOCKUP
<4>[ 1510.017970] Call trace: [one with watchdog_hardlockup_check, from CPU3]
<2>[ 1510.018062] SMP: stopping secondary CPUs
<4>[ 1511.085450] SMP: failed to stop secondary CPUs 4-11
No traces from the frozen cores are logged as they don't respond to NMI.
They are *completely* wedged.
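
For anyone collecting similar dumps, here is a minimal sketch of how the lines above could be summarized to compare reports. The function names and regexes are illustrative assumptions derived only from the excerpt shown here; this is not an existing kernel tool, and other kernel versions may word these messages differently.

```python
import re

# Patterns are based only on the excerpt quoted above (assumption: the
# message wording is the same in the dump being parsed).
LOCKUP_RE = re.compile(
    r"watchdog: CPU(\d+): Watchdog detected hard LOCKUP on cpu (\d+)"
)
STUCK_RE = re.compile(r"SMP: failed to stop secondary CPUs ([\d,\-]+)")


def expand_cpu_list(spec):
    """Expand a CPU list such as '4-11' or '0,2-3' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def summarize_lockup(log_text):
    """Return (detecting_cpu, locked_cpu, unresponsive_cpus) from a log dump."""
    detecting = locked = None
    unresponsive = []
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            detecting, locked = int(m.group(1)), int(m.group(2))
        m = STUCK_RE.search(line)
        if m:
            unresponsive = expand_cpu_list(m.group(1))
    return detecting, locked, unresponsive


# Sample taken from the log lines in this message.
SAMPLE = """\
<0>[ 1500.017790] watchdog: CPU3: Watchdog detected hard LOCKUP on cpu 4
<6>[ 1500.017937] Sending NMI from CPU 3 to CPUs 4:
<0>[ 1510.017956] Kernel panic - not syncing: Hard LOCKUP
<4>[ 1511.085450] SMP: failed to stop secondary CPUs 4-11
"""

print(summarize_lockup(SAMPLE))
```

For the excerpt above this reports CPU 3 as the detecting core, CPU 4 as the locked-up core, and CPUs 4 through 11 as the ones that could not be stopped.
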
~val