[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240105211911.GA1867400@bhelgaas>
Date: Fri, 5 Jan 2024 15:19:11 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Kai-Heng Feng <kai.heng.feng@...onical.com>
Cc: adrian.hunter@...el.com, ulf.hansson@...aro.org,
Victor Shih <victor.shih@...esyslogic.com.tw>,
Ben Chuang <benchuanggli@...il.com>, linux-mmc@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay
timer timeout during suspend
On Thu, Dec 21, 2023 at 11:21:47AM +0800, Kai-Heng Feng wrote:
> Spamming `lspci -vv` can still observe the replay timer timeout error
> even after commit 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the
> replay timer timeout of AER"), albeit with a lower reproduce rate.
I'm not sure what this is telling me. By "spamming `lspci -vv`, do
you mean that if you run lspci continually, you still see Replay Timer
Timeout logged, e.g.,
CESta: ... Timeout+
015c9cbcf0ad uses hard-coded PCI_GLI_9750_CORRERR_MASK offset and
PCI_GLI_9750_CORRERR_MASK_REPLAY_TIMER_TIMEOUT value, which look like
they *could* be PCI_ERR_COR_MASK and PCI_ERR_COR_REP_TIMER, but
without the lspci output I can't tell for sure. If they are, it would
be nice to use the generic macros instead of defining new ones so it's
easier to analyze PCI_ERR_COR_MASK usage.
If 015c9cbcf0ad is updating the generic PCI_ERR_COR_MASK, it should
only prevent sending ERR_COR. It should not affect the *logging* in
PCI_ERR_COR_STATUS (see PCIe r6.0, sec 6.2.3.2.2), so it shouldn't
affect the lspci output.
> Such AER interrupt can still prevent the system from suspending, so let
> root port mask and unmask replay timer timeout during suspend and
> resume, respectively.
015c9cbcf0ad looks like it masks PCI_ERR_COR_REP_TIMER in the gl975x
Endpoint, while this patch masks it in the upstream bridge (which
might be either a Root Port or a Switch Downstream Port, so the
subject and this sentence are not quite right).
015c9cbcf0ad says it is related to a hardware defect, and maybe this
patch is also (mention it if so). Both patches can prevent ERR_COR
messages and the eventual AER interrupt, depending on whether the
Downstream Port or the Endpoint detects the Replay Timer Timeout.
Maybe this should have a Fixes: tag for 015c9cbcf0ad to try to keep
these together?
If 015c9cbcf0ad is actually updating PCI_ERR_COR_MASK, it would be
nice to clean that up, too. And maybe PCI_ERR_COR_REP_TIMER should be
masked/restored at the same place for both the Downstream Port and the
Endpoint?
> +#ifdef CONFIG_PCIEAER
> +static void mask_replay_timer_timeout(struct pci_dev *pdev)
> +{
> + struct pci_dev *parent = pci_upstream_bridge(pdev);
> + u32 val;
> +
> + if (!parent || !parent->aer_cap)
> + return;
> +
> + pci_read_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, &val);
> + val |= PCI_ERR_COR_REP_TIMER;
> + pci_write_config_dword(parent, parent->aer_cap + PCI_ERR_COR_MASK, val);
> +}
> +
> +static void unmask_replay_timer_timeout(struct pci_dev *pdev)
> +{
> + struct pci_dev *parent = pci_upstream_bridge(pdev);
> + u32 val;
> +
> + if (!parent || !parent->aer_cap)
> + return;
> +
> + pci_read_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, &val);
> + val &= ~PCI_ERR_COR_REP_TIMER;
I think I would save the previous PCI_ERR_COR_REP_TIMER value and
restore it here, so it is preserved if there is ever a generic
interface via sysfs or similar to manage correctable error masking.
> + pci_write_config_dword(pdev, parent->aer_cap + PCI_ERR_COR_MASK, val);
> +}
Powered by blists - more mailing lists