Message-ID: <CACK8Z6HgywQVD5RH+Bzg3G7c8XfMjekuc7hGGaBRNF0qqP00Kg@mail.gmail.com>
Date: Thu, 11 Jan 2018 12:22:26 -0800
From: Rajat Jain <rajatja@...gle.com>
To: Keith Busch <keith.busch@...el.com>
Cc: Maik Broemme <mbroemme@...mpq.org>,
Bjorn Helgaas <helgaas@...nel.org>,
linux-pci <linux-pci@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0
and read-only
On Thu, Jan 11, 2018 at 9:59 AM, Keith Busch <keith.busch@...el.com> wrote:
> On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
>> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
>> patches from Keith:
>>
>> [PATCH 1/4] PCI/AER: Return appropriate value when AER is not supported
>> [PATCH 2/4] PCI/AER: Provide API for getting AER information
>> [PATCH 3/4] PCI/DPC: Enable DPC in conjunction with AER
>> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
>>
>> The issue is still the same. In addition to the earlier output, I now see:
>>
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: device [8086:19aa] error status/mask=00000020/00000000
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: [ 5] Surprise Down Error (First)
>> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0
>
> Okay, so that series wasn't going to fix anything, but at least it gets
> some visibility into what's happened. The DPC was triggered due to a
> Surprise Down uncorrectable error, so the power setting is causing the
> link to fail.
>
> The NVMe driver has quirks specifically for this vendor's devices to
> fence off NVMe specific automated power settings. Your observations
> appear to align with the same issues.
Agree.
    /*
     * Samsung SSD 960 EVO drops off the PCIe bus after system
     * suspend on a Ryzen board, ASUS PRIME B350M-A.
     */
    if (dmi_match(DMI_BOARD_VENDOR, "ASUSTeK COMPUTER INC.") &&
        dmi_match(DMI_BOARD_NAME, "PRIME B350M-A"))
            return NVME_QUIRK_NO_APST;
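As a cross-check on the log quoted above, the two status words can be decoded by hand. The following is only a rough user-space sketch: the macro and function names are made up for illustration, and the bit positions are my reading of the DPC Status and AER Uncorrectable Error Status register layouts, so please verify them against the spec before relying on it.

```c
#include <stdint.h>

/* Hypothetical local names; bit layout assumed from the PCIe spec. */
#define DPC_STATUS_TRIGGER      0x0001u      /* bit 0: DPC was triggered */
#define DPC_STATUS_TRIGGER_RSN  0x0006u      /* bits 2:1: trigger reason */
#define DPC_STATUS_INTERRUPT    0x0008u      /* bit 3: interrupt status */
#define AER_UNC_SURPRISE_DOWN   0x00000020u  /* bit 5: Surprise Down Error */

/* Returns non-zero if the DPC Status word reports a containment event. */
static int dpc_triggered(uint16_t status)
{
    return (status & DPC_STATUS_TRIGGER) != 0;
}

/* Extracts the 2-bit trigger reason field from the DPC Status word. */
static unsigned int dpc_trigger_reason_code(uint16_t status)
{
    return (status & DPC_STATUS_TRIGGER_RSN) >> 1;
}

/* Maps the trigger reason field to a short description. */
static const char *dpc_trigger_reason(uint16_t status)
{
    switch (dpc_trigger_reason_code(status)) {
    case 0:  return "unmasked uncorrectable error";
    case 1:  return "ERR_NONFATAL received";
    case 2:  return "ERR_FATAL received";
    default: return "see trigger reason extension";
    }
}

/* Returns non-zero if the AER uncorrectable status has Surprise Down set. */
static int aer_surprise_down(uint32_t uncor_status)
{
    return (uncor_status & AER_UNC_SURPRISE_DOWN) != 0;
}
```

With status:0x1f09 from the log, dpc_triggered() is true and the reason field is 0, which matches the "unmasked uncorrectable error" line. The AER status 00000020 has only bit 5 set, which matches "[ 5] Surprise Down Error".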
It seems that the attempt to save extra power using ASPM L1 substates
is causing the device to fall off the bus. Sorry, but I suspect this
may be difficult to debug without a PCIe analyzer. Some debugging
directions:
- Assuming this is a hot-pluggable device, try another NVMe device to
verify whether the issue is specific to this one.
- Can you please try switching the ASPM policy back from "powersupersave"
to "powersave", and possibly do a rescan (echo 1 >
/sys/bus/pci/rescan), to see whether the device comes back (and goes
away again when you switch back to "powersupersave")?
- Maybe put some debug prints in pcie_config_aspm_l1ss() to see which
register write causes the device to fall off (most likely it is the
last one, but just throwing out ideas).
- Maybe dump the timing parameters link->l1ss.ctl1 and
link->l1ss.ctl2 from aspm_calc_l1ss_info(), and try adjusting them
a little.
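If you do dump ctl1 and ctl2, something along these lines could turn the raw values into readable timings. This is only a user-space sketch and not kernel code: the macro and helper names are hypothetical, and the field positions and scale encodings are my understanding of the L1 PM Substates Control 1 and Control 2 registers, so double-check them against the spec.

```c
#include <stdint.h>

/* Hypothetical local names; field layout assumed from the PCIe spec. */
#define L1SS_CTL1_CM_RESTORE_TIME   0x0000ff00u  /* Common_Mode_Restore_Time (us) */
#define L1SS_CTL1_LTR_L12_TH_VALUE  0x03ff0000u  /* LTR_L1.2_THRESHOLD_Value */
#define L1SS_CTL1_LTR_L12_TH_SCALE  0xe0000000u  /* LTR_L1.2_THRESHOLD_Scale */
#define L1SS_CTL2_PWR_ON_SCALE      0x00000003u  /* T_POWER_ON Scale */
#define L1SS_CTL2_PWR_ON_VALUE      0x000000f8u  /* T_POWER_ON Value */

/* Common mode restore time in microseconds. */
static uint32_t l1ss_cm_restore_us(uint32_t ctl1)
{
    return (ctl1 & L1SS_CTL1_CM_RESTORE_TIME) >> 8;
}

/*
 * LTR L1.2 threshold in nanoseconds. Scale n means units of 32^n ns
 * for n = 0..5; larger scale values are reserved and reported as 0.
 */
static uint64_t l1ss_ltr_threshold_ns(uint32_t ctl1)
{
    uint32_t value = (ctl1 & L1SS_CTL1_LTR_L12_TH_VALUE) >> 16;
    uint32_t scale = (ctl1 & L1SS_CTL1_LTR_L12_TH_SCALE) >> 29;

    if (scale > 5)
        return 0;
    return (uint64_t)value << (5 * scale);
}

/*
 * T_POWER_ON in microseconds. Scale 0 = 2 us, 1 = 10 us, 2 = 100 us;
 * scale 3 is reserved and reported as 0.
 */
static uint32_t l1ss_t_power_on_us(uint32_t ctl2)
{
    static const uint32_t unit_us[4] = { 2, 10, 100, 0 };
    uint32_t value = (ctl2 & L1SS_CTL2_PWR_ON_VALUE) >> 3;

    return value * unit_us[ctl2 & L1SS_CTL2_PWR_ON_SCALE];
}
```

For example, a ctl2 value of 0x29 (value 5, scale 1) would decode to a T_POWER_ON of 50 us. Comparing the decoded numbers against what the SSD and the root port advertise in their L1SS capability registers may show whether the programmed timings are too aggressive for this device.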