lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAd53p6qE76QjJmjr5ei0mU8xcSNE32hJMOE9Frwz-BuC3gDkA@mail.gmail.com>
Date: Thu, 18 Jan 2024 14:40:50 +0800
From: Kai-Heng Feng <kai.heng.feng@...onical.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: adrian.hunter@...el.com, ulf.hansson@...aro.org, 
	Victor Shih <victor.shih@...esyslogic.com.tw>, Ben Chuang <benchuanggli@...il.com>, 
	linux-mmc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay
 timer timeout during suspend

On Sat, Jan 13, 2024 at 1:37 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
>
> On Fri, Jan 12, 2024 at 01:14:42PM +0800, Kai-Heng Feng wrote:
> > On Sat, Jan 6, 2024 at 5:19 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > >
> > > On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> > > > Spamming `lspci -vv` can still observe the replay timer timeout error
> > > > even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> > > > replay timer timeout of AER"), albeit with a lower reproduce rate.
> > >
> > > I'm not sure what this is telling me.  By "spamming `lspci -vv`, do
> > > you mean that if you run lspci continually, you still see Replay Timer
> > > Timeout logged, e.g.,
> > >
> > >   CESta:        ... Timeout+
> >
> > Yes it's logged and the AER IRQ is raised.
>
> IIUC the AER IRQ is the important thing.
>
> Neither 015c9cbcf0ad nor this patch affects logging in
> PCI_ERR_COR_STATUS, so the lspci output won't change and mentioning it
> here doesn't add useful information.

You are right. That's just a way to access config space to reproduce the issue.

>
> I'd suggest more specific wording than "spamming `lspci -vv`", e.g.,
>
>   015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
>   timeout of AER") masks Replay Timer Timeout errors at the GL975x
>   Endpoint.  When the Endpoint detects these errors, it still logs
>   them in its PCI_ERR_COR_STATUS, but masking prevents it from sending
>   ERR_COR messages upstream.
>
>   The Downstream Port leading to a GL975x Endpoint is unaffected by
>   015c9cbcf0ad.  Previously, when that Port detected a Replay Timer
>   Timeout, it sent an ERR_COR message upstream, which eventually
>   caused an AER IRQ, which prevented the system from suspending.
>
>   Mask Replay Timer Timeout errors at the Downstream Port.  The errors
>   will still be logged in PCI_ERR_COR_STATUS, but no ERR_COR will be
>   sent.

That's phrased much better then mine :)

>
> > > 015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and
> > > PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like
> > > they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but
> > > without the lspci output I can't tell for sure.  If they are, it would
> > > be nice to use the generic macros instead of defining new ones so it's
> > > easier to analyze PCI_ERR_COR_MASK usage.
> > >
> > > If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should
> > > only prevent sending ERR_COR.  It should not affect the *logging* in
> > > PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't
> > > affect the lspci output.
> >
> > PCI_GLI_9750_CORRERR_MASK is specific to GLI 975x devices, so it
> > doesn't conform to generic PCI_ERR_COR_STATUS behavior.
>
> *Could* 015c9cbcf0ad have used the generic PCI_ERR_COR_MASK to
> accomplish the same effect?  Is there an advantage to using the
> device-specific PCI_GLI_9750_CORRERR_MASK?
>
> If masking via PCI_ERR_COR_MASK would work, that would be much better
> because the PCI core can see, manage, and make that visible, e.g., via
> sysfs.  The core doesn't do that today, but people are working on it.

I don't think so. Please see below.

>
> > > If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be
> > > nice to clean that up, too.  And maybe PCI_ERR_COR_REP_TIMER should be
> > > masked/restored at the same place for both the Downstream Port and the
> > > Endpoint?
> >
> > Since PCI_ERR_COR_REP_TIMER is already masked before 015c9cbcf0ad,
> > so I didn't think that's necessary.  Do you think it should still be
> > masked just to be safe?
>
> Did you mean "PCI_ERR_COR_REP_TIMER is already masked *by*
> 015c9cbcf0ad"?

No. The PCI_ERR_COR_REP_TIMER is masked with or without 015c9cbcf0ad.
That means before 015c9cbcf0ad, Reply Timeout error was reported with
PCI_ERR_COR_REP_TIMER masked.

So using PCI_GLI_9750_CORRERR_MASK is necessary for the endpoint.

>
> If masking PCI_ERR_COR_REP_TIMER using the generic PCI_ERR_COR_MASK in
> the GL975x would have the same effect as masking it with
> PCI_GLI_9750_CORRERR_MASK, then I think you should *only* use the
> generic PCI_ERR_COR_MASK.
>
> No need to do both if the generic one is sufficient.  And I think both
> should be done in the same place since they're basically solving the
> same problem, just at both ends of the link.

Do you mean only mask PCI_GLI_9750_CORRERR_MASK during suspend?
That will not be ideal because accessing its config space (e.g. `lspci
-vv`) will have many errors logged.

Kai-Heng

>
> Bjorn

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ