linux-kernel - Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay timer timeout during suspend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240105211911.GA1867400@bhelgaas>
Date: Fri, 5 Jan 2024 15:19:11 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Kai-Heng Feng <kai.heng.feng@...onical.com>
Cc: adrian.hunter@...el.com, ulf.hansson@...aro.org,
	Victor Shih <victor.shih@...esyslogic.com.tw>,
	Ben Chuang <benchuanggli@...il.com>, linux-mmc@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay
 timer timeout during suspend

On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> Spamming `lspci -vv` can still observe the replay timer timeout error
> even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> replay timer timeout of AER"), albeit with a lower reproduce rate.

I'm not sure what this is telling me.  By "spamming `lspci -vv`, do
you mean that if you run lspci continually, you still see Replay Timer
Timeout logged, e.g.,

  CESta:	... Timeout+

015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and
PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like
they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but
without the lspci output I can't tell for sure.  If they are, it would
be nice to use the generic macros instead of defining new ones so it's
easier to analyze PCI_ERR_COR_MASK usage.

If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should
only prevent sending ERR_COR.  It should not affect the *logging* in
PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't
affect the lspci output.

> Such AER interrupt can still prevent the system from suspending, so let
> root port mask and unmask replay timer timeout during suspend and
> resume, respectively.

015c9cbcf0ad looks like it masks PCI_ERR_COR_REP_TIMER in the gl975x
Endpoint, while this patch masks it in the upstream bridge (which
might be either a Root Port or a Switch Downstream Port, so the
subject and this sentence are not quite right).

015c9cbcf0ad says it is related to a hardware defect, and maybe this
patch is also (mention it if so).  Both patches can prevent ERR_COR
messages and the eventual AER interrupt, depending on whether the
Downstream Port or the Endpoint detects the Replay Timer Timeout.
Maybe this should have a Fixes: tag for 015c9cbcf0ad to try to keep
these together?

If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be
nice to clean that up, too.  And maybe PCI_ERR_COR_REP_TIMER should be
masked/restored at the same place for both the Downstream Port and the
Endpoint?

> +#ifdef CONFIG_PCIEAER
> +static void mask_replay_timer_timeout(struct pci_dev *pdev)
> +{
> +	struct pci_dev *parent = pci_upstream_bridge(pdev);
> +	u32 val;
> +
> +	if (!parent || !parent->aer_cap)
> +		return;
> +
> +	pci_read_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, &val);
> +	val |= PCI_ERR_COR_REP_TIMER;
> +	pci_write_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, val);
> +}
> +
> +static void unmask_replay_timer_timeout(struct pci_dev *pdev)
> +{
> +	struct pci_dev *parent = pci_upstream_bridge(pdev);
> +	u32 val;
> +
> +	if (!parent || !parent->aer_cap)
> +		return;
> +
> +	pci_read_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, &val);
> +	val &= ~PCI_ERR_COR_REP_TIMER;

I think I would save the previous PCI_ERR_COR_REP_TIMER value and
restore it here, so it is preserved if there is ever a generic
interface via sysfs or similar to manage correctable error masking.

> +	pci_write_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, val);
> +}