Message-ID: <CAAd53p5p=JJ9OOQd=XPzJgW7yib+hMJxZqj7PZFsd2uFtK94xg@mail.gmail.com>
Date: Tue, 26 Mar 2024 09:52:28 +0800
From: Kai-Heng Feng <kai.heng.feng@...onical.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: adrian.hunter@...el.com, ulf.hansson@...aro.org, 
	Victor Shih <victor.shih@...esyslogic.com.tw>, Ben Chuang <benchuanggli@...il.com>, 
	linux-mmc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay
 timer timeout during suspend

On Tue, Mar 26, 2024 at 3:02 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Mon, Mar 25, 2024 at 10:02:27AM +0800, Kai-Heng Feng wrote:
> > On Sat, Mar 23, 2024 at 12:43 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > On Thu, Mar 21, 2024 at 06:05:33PM +0800, Kai-Heng Feng wrote:
> > > > On Sat, Jan 20, 2024 at 6:41 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > On Thu, Jan 18, 2024 at 02:40:50PM +0800, Kai-Heng Feng wrote:
> > > > > > On Sat, Jan 13, 2024 at 1:37 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > > > On Fri, Jan 12, 2024 at 01:14:42PM +0800, Kai-Heng Feng wrote:
> > > > > > > > On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > > > > > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> > > > > > > > > > Spamming `lspci -vv` can still trigger the replay timer timeout error
> > > > > > > > > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> > > > > > > > > > replay timer timeout of AER"), albeit at a lower reproduction rate.
> > > > > > > > >
> > > > > > > > > I'm not sure what this is telling me.  By "spamming `lspci -vv`", do
> > > > > > > > > you mean that if you run lspci continually, you still see Replay Timer
> > > > > > > > > Timeout logged, e.g.,
> > > > > > > > >
> > > > > > > > >   CESta:        ... Timeout+
> > > > > > > >
> > > > > > > > Yes it's logged and the AER IRQ is raised.
> > > > > > >
> > > > > > > IIUC the AER IRQ is the important thing.
> > > > > > >
> > > > > > > Neither 015c9cbcf0ad nor this patch affects logging in
> > > > > > > PCI_ERR_COR_STATUS, so the lspci output won't change and mentioning it
> > > > > > > here doesn't add useful information.
> > > > > >
> > > > > > You are right. That's just a way to access config space to reproduce
> > > > > > the issue.
> > > > >
> > > > > Oh, I think I completely misunderstood you!  I thought you were saying
> > > > > that suspending the device caused the PCI_ERR_COR_REP_TIMER error, and
> > > > > you happened to see that it was logged when you ran lspci.
> > > >
> > > > The error can be observed both when running lspci and when suspending
> > > > the device, because both access the config space.
> > > >
> > > > > But I guess you mean that running lspci actually *causes* the error?
> > > > > I.e., lspci does a config access while we're suspending the device,
> > > > > and the config access itself causes the error, which
> > > > > causes the ERR_COR message and ultimately the AER interrupt, and that
> > > > > interrupt prevents the system suspend.
> > > >
> > > > My point was that any kind of PCI config access can cause the error.
> > > > Using lspci just makes the error easier to reproduce.
> > > >
> > > > > If that's the case, I wonder if this is a generic problem that could
> > > > > happen with *any* device, not just GL975x.
> > > >
> > > > For now, it's just GL975x.
> > > >
> > > > > What power state do we put the GL975x in during system suspend?
> > > > > D3hot?  D3cold?  Is there anything that prevents config access while
> > > > > we suspend it?
> > > >
> > > > The target device state is D3hot.
> > > > However, the issue happens when the device is in D0, when the PCI
> > > > core is saving the device's config space.
> > > >
> > > > So I think the issue isn't related to the device state.
> > > >
> > > > > We do have dev->block_cfg_access, and there's a comment that says
> > > > > "we're required to prevent config accesses during D-state
> > > > > transitions," but I don't see it being used during D-state
> > > > > transitions.
> > > >
> > > > Yes, no D-state change happens here.
> > >
> > > So the timeout happens sometimes on any config accesses to the device,
> > > no matter what the power state is?
> >
> > Yes.
> >
> > > If that's the case, why do the
> > > masking in the suspend/resume callbacks?
> >
> > Because there's no functional impact when the error happens, other
> > than suspend/resume.
>
> Oh, I think I see.  Is this accurate?
>
>   Due to a hardware defect in GL975x, config accesses when ASPM is
>   enabled frequently cause Replay Timer Timeouts in the Port leading
>   to the device.
>
>   These are Correctable Errors, so the Downstream Port logs them in its
>   PCI_ERR_COR_STATUS and, when the error is not masked, sends an
>   ERR_COR message upstream.  The message terminates at a Root Port,
>   which may generate an AER interrupt so the OS can log it.
>
>   The Correctable Error logging is an annoyance but normally not a
>   major issue.  But when the AER interrupt happens during suspend, it
>   can prevent the system from suspending.

That's totally the case here.
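
To make the logged-vs-signaled distinction concrete, here is a minimal
userspace model of that flow. It is only a sketch: struct model_port and
model_report_cor_error() are hypothetical names, not kernel code, and the
model ignores the Correctable Error Reporting Enable bit in Device Control.
Only the PCI_ERR_COR_REP_TIMER value is taken from
include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_REP_TIMER	0x00001000u	/* Replay Timer Timeout */

/* Hypothetical model of a Downstream Port's AER correctable registers */
struct model_port {
	uint32_t cor_status;	/* models PCI_ERR_COR_STATUS */
	uint32_t cor_mask;	/* models PCI_ERR_COR_MASK */
};

/*
 * Record a correctable error in the model.  The status bit is always
 * set, masked or not.  The return value says whether an ERR_COR
 * message would be sent upstream, i.e. whether the error is unmasked.
 */
static bool model_report_cor_error(struct model_port *port, uint32_t err)
{
	port->cor_status |= err;
	return (port->cor_mask & err) == 0;
}
```

With the mask bit set, the model still records the error in cor_status
but reports that no ERR_COR is sent, which is why lspci would keep
showing Timeout+ while the AER interrupt goes away.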

This brings up a different but related topic: should the port driver
disable the AER/DPC IRQs during suspend?
We've discussed this many times, and I still think that's the right
approach to quiesce unexpected errors during system state
transitions.


>
> > > If it's not related to a power state change, it sounds like something
> > > that should be a quirk or done at probe time.
> >
> > Sure, I'll change that to be done at probe time.
>
> In general, we want to log Correctable Errors because they give an
> indication of link integrity.  But I'm guessing that since this is
> related to a GL975x defect, the errors occur pretty frequently and on
> all systems, so they aren't an indication of poor link quality and
> they only break suspend, annoy users, and cause problem reports.
>
> Masking them at suspend time would avoid the suspend/resume problem,
> but we would still have the annoying logging and get the problem
> reports.
>
> If this is all true, I think masking via a quirk is probably the right
> thing.  That way we won't get reports that "lspci causes errors" even
> when the sdhci driver is not loaded.
>
> I think we should log a hint in dmesg that we're masking
> PCI_ERR_COR_REP_TIMER because the error will still be logged in the
> PCI_ERR_COR_STATUS register, and that will be visible via lspci, and a
> dmesg hint will save debugging time when people report that.

Sure. Where do you think is the better place to implement the quirk? I
assume a PCI quirk is a better place than the driver's probe routine?
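
Wherever it ends up, the core of the change would be a read-modify-write
of the port's PCI_ERR_COR_MASK. Below is a rough userspace sketch of just
that step, not the actual patch: struct fake_cfg, fake_read_dword(),
fake_write_dword() and mask_replay_timer() are hypothetical stand-ins (in
the kernel this would go through pci_read_config_dword() and
pci_write_config_dword() at the port's AER capability offset). The
register offset and bit value are the ones from
include/uapi/linux/pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_MASK	0x14		/* Correctable Error Mask */
#define PCI_ERR_COR_REP_TIMER	0x00001000u	/* Replay Timer Timeout */

#define FAKE_CFG_SIZE		4096

/* Hypothetical stand-in for a port's extended config space */
struct fake_cfg {
	uint8_t bytes[FAKE_CFG_SIZE];
};

static uint32_t fake_read_dword(const struct fake_cfg *cfg, unsigned int pos)
{
	uint32_t val;

	memcpy(&val, &cfg->bytes[pos], sizeof(val));
	return val;
}

static void fake_write_dword(struct fake_cfg *cfg, unsigned int pos,
			     uint32_t val)
{
	memcpy(&cfg->bytes[pos], &val, sizeof(val));
}

/*
 * Set the Replay Timer Timeout bit in the correctable error mask and
 * leave the other mask bits untouched.  Returns true if the mask was
 * changed, so the caller knows when to print the dmesg hint.
 */
static bool mask_replay_timer(struct fake_cfg *cfg, unsigned int aer_cap)
{
	uint32_t mask;

	if (!aer_cap)		/* no AER capability, nothing to do */
		return false;

	mask = fake_read_dword(cfg, aer_cap + PCI_ERR_COR_MASK);
	if (mask & PCI_ERR_COR_REP_TIMER)
		return false;	/* already masked */

	fake_write_dword(cfg, aer_cap + PCI_ERR_COR_MASK,
			 mask | PCI_ERR_COR_REP_TIMER);
	return true;
}
```

Returning whether the mask changed is just one way to make sure the hint
is logged once, at the point where the masking really takes effect.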

Kai-Heng

>
> Bjorn
