Message-ID: <ZHZlS48EKigmNihh@google.com>
Date:   Tue, 30 May 2023 21:06:19 +0000
From:   Matthias Kaehlcke <mka@...omium.org>
To:     楊宗翰 <ecs.taipeikernel@...il.com>
Cc:     Bjorn Helgaas <helgaas@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Bob Moragues <moragues@...gle.com>,
        Abner Yen <abner.yen@....com.tw>,
        Doug Anderson <dianders@...omium.org>,
        Stephen Boyd <swboyd@...omium.org>, Harvey <hunge@...gle.com>,
        Gavin Lee <gavin.lee@....com.tw>,
        Bjorn Helgaas <bhelgaas@...gle.com>, linux-pci@...r.kernel.org
Subject: Re: [PATCH v1] drivers: pci: quirks: Add suspend fixup for SSD on
 sc7280

On Mon, May 29, 2023 at 02:24:53PM +0800, 楊宗翰 wrote:
> Hi Bjorn,
> 
> Thanks for your kind directions.
> 
>   - Subject line in style of the file (use "git log --oneline
>     drivers/pci/quirks.c").
> Done; I resent it under the subject "[PATCH v1] PCI: Add suspend fixup
> for SSD on sc7280", please review it.
> 
>   - Format commit log correctly (fill 75 columns, no leading spaces).
> Done.
> 
>   - Description of incorrect behavior.  What does the user see?  If
>     there's a bug report, include a link to it.
> This issue seems to have been discovered on ChromeOS only. The SSD
> randomly crashes after 100 to 250+ suspend/resume cycles. Phison and
> Qualcomm found that it's due to the NVMe entering D3cold instead of L1ss.

It should be noted that D3cold (or whatever condition causes the
issue) is not always entered, but only in the failure case (at least
that was the case for the Kioxia NVMe, which has a similar issue).

>   - Multi-line code comments in style of the file (look at existing
>     comments in the file).
> Done.
> 
>   - Details of "the correct ASPM state".  ASPM may be enabled or
>     disabled by the user, so you can't assume any particular ASPM
>     configuration.
> According to Qualcomm, this issue was found last year and they
> attempted to submit some patches to fix the PCI suspend behavior
> (ref: https://patchwork.kernel.org/project/linux-arm-msm/list/?series=665060&state=%2A&archive=both).
> But somehow these patches were rejected because of their complexity, and
> we got advice from Google that it would be more efficient for us to
> implement a quirk to fix this issue.

IIRC the primary goal of this series was to be able to turn off the
PCI clocks during suspend, to allow the SoC to enter a lower power
state. The part of the series that fixes the NVMe issue described
above is the retry loop of "PCI: qcom: Add retry logic for link to be
stable in L1ss" [1].

It is currently unclear why *some* NVMes *sometimes* need a longer
time to enter the L1 sub-state. That's something Qualcomm and the
vendors of the impacted NVMes should figure out.

[1] https://patchwork.kernel.org/project/linux-arm-msm/patch/1659526134-22978-4-git-send-email-quic_krichai@quicinc.com/
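To illustrate the general idea of such a retry loop, here is a minimal
user-space sketch. It is not the code from [1]: the function names, the
callback types, the fake link and the retry/delay values are all
hypothetical and only show the poll-and-wait pattern.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks; neither is a real kernel or Qualcomm API. */
typedef bool (*link_in_l1ss_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int usecs);

/*
 * Poll the link until it reports L1ss or the retry budget runs out.
 * Returns the number of polls that were needed (1..max_tries), or -1
 * if the link never settled within max_tries polls.
 */
int wait_for_l1ss(link_in_l1ss_fn in_l1ss, void *ctx, delay_fn delay,
                  unsigned int max_tries, unsigned int delay_us)
{
    for (unsigned int i = 0; i < max_tries; i++) {
        if (in_l1ss(ctx))
            return (int)(i + 1);
        if (delay)
            delay(delay_us);    /* give the device more time */
    }
    return -1;
}

/* Fake link for illustration: reports L1ss on the ready_after-th poll. */
struct fake_link {
    unsigned int polls;
    unsigned int ready_after;
};

bool fake_in_l1ss(void *ctx)
{
    struct fake_link *link = ctx;

    return ++link->polls >= link->ready_after;
}
```

With a fake link that becomes ready on the 4th poll and a budget of 5
tries, wait_for_l1ss() returns 4; a link that never settles yields -1,
which is the case where the caller would have to skip the clock
shutdown or report an error.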

>   - Details on the Qualcomm sc7280 connection.  This quirk would
>     affect Phison SSDs on *all* platforms, not just sc7280.  I don't
>     want to slow down suspend on all platforms just for a sc7280
>     issue.

As of now the issue has only been observed on QC SC7280; I don't
know whether ECS has tried this part on other platforms. The issue
may or may not be QC/SC7280-specific.

> The DECLARE_PCI_FIXUP_SUSPEND function already specifies the PCI device
> ID, and this SSD will only be used in our Chromebook devices.

It could be used in devices that are produced by other manufacturers.

One possibility is a dedicated Kconfig option for the Phison NVMe.
Another is a QC-specific #ifdef (ugh ...) with a comment explaining
that the issue has only been observed on QC SC7280 *so far*.
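As a rough illustration of limiting the quirk to the affected platform,
the check could look something like the user-space sketch below. The
vendor/device IDs are placeholders rather than the real Phison IDs, the
struct and function names are made up, and using "qcom,sc7280" as the
platform string is an assumption. In the kernel a runtime check of this
kind could probably be built on of_machine_is_compatible() instead of a
compile-time #ifdef.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder IDs for illustration only; NOT the real Phison IDs. */
#define EXAMPLE_VENDOR_ID 0x1234
#define EXAMPLE_DEVICE_ID 0x5678

/* Hypothetical, simplified stand-in for a PCI device. */
struct example_dev {
    unsigned short vendor;
    unsigned short device;
};

/*
 * Apply the suspend quirk only when both the device and the platform
 * match, since the issue has only been observed on SC7280 so far.
 */
bool quirk_applies(const struct example_dev *dev, const char *platform)
{
    if (dev->vendor != EXAMPLE_VENDOR_ID ||
        dev->device != EXAMPLE_DEVICE_ID)
        return false;

    return platform != NULL && strcmp(platform, "qcom,sc7280") == 0;
}
```

This way the same SSD on another platform would not pay the extra
suspend delay unless the issue is later confirmed there as well.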
