Message-ID: <20240326211954.GA1497572@bhelgaas>
Date: Tue, 26 Mar 2024 16:19:54 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Kai-Heng Feng <kai.heng.feng@...onical.com>
Cc: adrian.hunter@...el.com, ulf.hansson@...aro.org,
	Victor Shih <victor.shih@...esyslogic.com.tw>,
	Ben Chuang <benchuanggli@...il.com>, linux-mmc@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay
 timer timeout during suspend

On Tue, Mar 26, 2024 at 09:52:28AM +0800, Kai-Heng Feng wrote:
> On Tue, Mar 26, 2024 at 3:02 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > On Mon, Mar 25, 2024 at 10:02:27AM +0800, Kai-Heng Feng wrote:
> > > On Sat, Mar 23, 2024 at 12:43 AM Bjorn Helgaas <helgaas@...nel.org> wrote:
> ...

> > > > If that's the case, why do the
> > > > masking in the suspend/resume callbacks?
> > >
> > > Because there's no functional impact when the error happens, other
> > > than suspend/resume.
> >
> > Oh, I think I see.  Is this accurate?
> >
> >   Due to a hardware defect in GL975x, config accesses when ASPM is
> >   enabled frequently cause Replay Timer Timeouts in the Port leading
> >   to the device.
> >
> >   These are Correctable Errors, so the Downstream Port logs them in
> >   its PCI_ERR_COR_STATUS and, when the error is not masked, sends an
> >   ERR_COR message upstream.  The message terminates at a Root Port,
> >   which may generate an AER interrupt so the OS can log it.
> >
> >   The Correctable Error logging is an annoyance but normally not a
> >   major issue.  But when the AER interrupt happens during suspend, it
> >   can prevent the system from suspending.
> 
> That's totally the case here.
> 
> This brings up a different but related topic: should the port
> driver disable the AER/DPC IRQs during suspend?
> We've discussed this many times, and I still think that's the right
> approach to "quiesce" many unexpected errors during system state
> transitions.

Maybe so.  We can continue that in the context of that patch.  Maybe
it needs to be reposted; I can't remember where it's at right now.

> > I think we should log a hint in dmesg that we're masking
> > PCI_ERR_COR_REP_TIMER because the error will still be logged in the
> > PCI_ERR_COR_STATUS register, and that will be visible via lspci, and a
> > dmesg hint will save debugging time when people report that.
> 
> Sure. Where do you think is the better place to implement the quirk?
> I assume a PCI quirk is a better place than the driver's probe routine?

Yes, I think drivers/pci/quirks.c is a better place so we can mask it
even if the driver isn't loaded.  Users can still run lspci and see
these errors in that case.

Bjorn
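
The masking behavior discussed above can be illustrated with a small
user-space model.  This is only a sketch, not the proposed patch: struct
aer_model and the aer_model_*() helpers are hypothetical names invented
for the example, and the only value taken from the kernel is the
PCI_ERR_COR_REP_TIMER bit from include/uapi/linux/pci_regs.h.  It shows
that masking the bit suppresses the ERR_COR message while the error is
still logged in the status register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bit value as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */

/*
 * Hypothetical model of the two AER registers involved: Correctable
 * Error Status (PCI_ERR_COR_STATUS) and Correctable Error Mask
 * (PCI_ERR_COR_MASK) of the Port leading to the device.
 */
struct aer_model {
	uint32_t cor_status;
	uint32_t cor_mask;
};

/*
 * Model the Port detecting a correctable error.  The status bit is
 * always set; the return value says whether an ERR_COR message would
 * be sent upstream (only when the error is not masked).
 */
static bool aer_model_report(struct aer_model *aer, uint32_t err)
{
	assert(aer != NULL);

	aer->cor_status |= err;
	return !(aer->cor_mask & err);
}

/*
 * Model what a quirk would do: set the mask bit and leave a hint in
 * the log so that a set status bit seen via lspci is not a surprise.
 */
static void aer_model_mask_replay_timer(struct aer_model *aer)
{
	assert(aer != NULL);

	aer->cor_mask |= PCI_ERR_COR_REP_TIMER;
	printf("masking Replay Timer Timeout; it is still logged in the status register\n");
}
```

With a zero-initialized struct aer_model, aer_model_report() returns
true for PCI_ERR_COR_REP_TIMER before aer_model_mask_replay_timer() is
called and false afterwards, while cor_status has the bit set in both
cases.  In the kernel, the equivalent would presumably be a
read-modify-write of PCI_ERR_COR_MASK at the Port's AER capability
using pci_read_config_dword() and pci_write_config_dword(), plus a
pci_info() message; the exact form is up to the actual patch.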
