netdev - Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

Message-ID: <20230504222048.GA887151@bhelgaas>
Date: Thu, 4 May 2023 17:20:48 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: "Maciej W. Rozycki" <macro@...am.me.uk>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
	Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
	Oliver O'Halloran <oohall@...il.com>,
	Michael Ellerman <mpe@...erman.id.au>,
	Nicholas Piggin <npiggin@...il.com>,
	Christophe Leroy <christophe.leroy@...roup.eu>,
	Saeed Mahameed <saeedm@...dia.com>,
	Leon Romanovsky <leon@...nel.org>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Alex Williamson <alex.williamson@...hat.com>,
	Lukas Wunner <lukas@...ner.de>,
	Mika Westerberg <mika.westerberg@...ux.intel.com>,
	Stefan Roese <sr@...x.de>, Jim Wilson <wilson@...iptree.org>,
	David Abdurachmanov <david.abdurachmanov@...il.com>,
	Pali Rohár <pali@...nel.org>,
	linux-pci@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
	linux-rdma@...r.kernel.org, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> Attempt to handle cases such as with a downstream port of the ASMedia 
> ASM2824 PCIe switch where link training never completes and the link 
> continues switching between speeds indefinitely with the data link layer 
> never reaching the active state.

We're going to land this series this cycle, come hell or high water.

We talked about reusing pcie_retrain_link() earlier.  IIRC that didn't
work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
because the erratum makes PCI_EXP_LNKSTA_LT flap.

What if we made pcie_retrain_link() reusable by making it:

  bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)

so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?

Maybe do it in two steps?

  1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
  it take link->pdev instead of link.

  2) Add the bit parameter.
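
To illustrate why the caller needs to pick the bit, here is a rough user-space sketch, not kernel code.  The fake_port type, fake_read_lnksta() and fake_retrain_link() are hypothetical names made up for this example; only the two PCI_EXP_LNKSTA_* values come from pci_regs.h.  It assumes the helper waits for LT to clear but for DLLLA to become set, since the two bits have opposite polarity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Link Status bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKSTA_LT	0x0800	/* Link Training in progress */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */

/* Hypothetical stand-in for a port: a scripted series of LNKSTA reads. */
struct fake_port {
	const uint16_t *lnksta_reads;	/* must hold at least one value */
	int nreads;
	int pos;
};

static uint16_t fake_read_lnksta(struct fake_port *p)
{
	/* Return the next scripted value; repeat the last one when exhausted. */
	int i = p->pos < p->nreads ? p->pos++ : p->nreads - 1;

	return p->lnksta_reads[i];
}

/*
 * Poll until the chosen bit indicates training is done, or give up after
 * max_polls reads.  LT is set *while* training, so wait for it to clear;
 * DLLLA is set once the link is up, so wait for it to become set.
 */
static bool fake_retrain_link(struct fake_port *p, uint16_t link_status_bit,
			      int max_polls)
{
	bool want_set = (link_status_bit == PCI_EXP_LNKSTA_DLLLA);

	for (int i = 0; i < max_polls; i++) {
		uint16_t lnksta = fake_read_lnksta(p);
		bool is_set = (lnksta & link_status_bit) != 0;

		if (is_set == want_set)
			return true;
	}
	return false;
}
```

With a scripted sequence where LT flaps and DLLLA never comes up, polling on LT reports success as soon as LT happens to read clear, while polling on DLLLA correctly reports failure.  That is the reason the erratum path can't use LT.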

I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
logic about restricting to 2.5GT/s, retraining, removing the
restriction, retraining again is stuff I'd rather have in quirks.c so
it doesn't clutter pci.c.

I think it'd be good if the pci_device_add() path made clear that this
is a workaround for a problem, e.g.,

  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
  {
    ...
    if (pcie_link_failed(dev))
      pcie_fix_link_train(dev);

where pcie_fix_link_train() could live in quirks.c (with a stub when
CONFIG_PCI_QUIRKS isn't enabled).  It *might* even be worth adding it
and the stub first because that's a trivial patch and wouldn't clutter
the probe.c git history with all the grotty details about ASM2824 and
this topology.
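
A minimal sketch of the stub arrangement, assuming the declaration goes in a shared header; which header is not decided here.  The struct is left opaque and CONFIG_PCI_QUIRKS is deliberately undefined so the snippet builds on its own and selects the stub.

```c
struct pci_dev;	/* opaque here; the real definition lives in <linux/pci.h> */

/* CONFIG_PCI_QUIRKS is left undefined in this sketch, so the stub is used. */
#ifdef CONFIG_PCI_QUIRKS
void pcie_fix_link_train(struct pci_dev *dev);	/* would be defined in quirks.c */
#else
static inline void pcie_fix_link_train(struct pci_dev *dev)
{
	(void)dev;	/* no quirks built in: nothing to fix */
}
#endif
```

The caller in pci_device_add() then needs no #ifdef of its own; without quirks the call compiles away to nothing.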

> +int pcie_downstream_link_retrain(struct pci_dev *dev)
> +{
> +	static const struct pci_device_id ids[] = {
> +		{ PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> +		{}
> +	};
> +	u16 lnksta, lnkctl2;
> +
> +	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> +	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> +		return -1;
> +
> +	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> +	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> +	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> +	    PCI_EXP_LNKSTA_LBMS) {

You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
can't remember what the reason is.  If you make a preparatory patch
like this, it would give a place for that background, e.g.,

  +bool pcie_link_failed(struct pci_dev *dev)
  +{
  +       u16 lnksta;
  +
  +       if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
  +           !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
  +               return false;
  +
  +       pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
  +       if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
  +                       PCI_EXP_LNKSTA_LBMS)
  +               return true;
  +
  +       return false;
  +}

If this is a generic thing and checking PCI_EXP_LNKSTA_LBMS makes
sense for everybody, it could go in pci.c; otherwise it could go in
quirks.c as well.  I guess it's not *truly* generic anyway because it
only detects link training failures for devices that have LNKCTL2 and
link_active_reporting.
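
The mask-and-compare in that check is easy to misread, so here is a small user-space sketch of just the bit test.  lnksta_looks_failed() is a hypothetical name; the bit values are the ones in pci_regs.h.  It is true only when LBMS is set *and* DLLLA is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Link Status bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */

/* True when LBMS is set but the data link layer is not active. */
static bool lnksta_looks_failed(uint16_t lnksta)
{
	uint16_t mask = PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA;

	return (lnksta & mask) == PCI_EXP_LNKSTA_LBMS;
}
```

A link that is simply down with neither bit set does not match, and neither does a working link that has LBMS and DLLLA both set.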

> +		unsigned long timeout;
> +		u16 lnkctl;
> +
> +		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
> +
> +		pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> +		lnkctl |= PCI_EXP_LNKCTL_RL;
> +		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> +		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> +		/*
> +		 * Due to an erratum in some devices the Retrain Link bit
> +		 * needs to be cleared again manually to allow the link
> +		 * training to succeed.
> +		 */
> +		lnkctl &= ~PCI_EXP_LNKCTL_RL;
> +		if (dev->clear_retrain_link)
> +			pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> +						   lnkctl);
> +
> +		timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> +		do {
> +			pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> +					     &lnksta);
> +			if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> +				break;
> +			usleep_range(10000, 20000);
> +		} while (time_before(jiffies, timeout));
> +
> +		if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> +			pci_info(dev, "retraining failed\n");
> +			return -1;
> +		}
> +	}

> +	if (IS_ENABLED(CONFIG_PCI_QUIRKS) && (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> +	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> +	    pci_match_id(ids, dev)) {
> +		u32 lnkcap;
> +		u16 lnkctl;
> +
> +		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> +		pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> +		pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> +		lnkctl |= PCI_EXP_LNKCTL_RL;
> +		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> +		lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);

This starts a retrain; should we wait for training to complete?

> +	}

If we put most of this into a pcie_fix_link_train() (separated from
detecting the *need* to fix something), could it be made to look
sort of like this?  (I suppose you'd want to return bool and rename
it to something that reads naturally, e.g.,
"pcie_link_forcibly_retrained()", "pcie_link_retrained()", etc.)

  +void pcie_fix_link_train(struct pci_dev *dev)
  +{
  +       u16 lnkctl2;
  +       u32 lnkcap;
  +       bool linkup;
  +
  +       pci_info(dev, "attempting link retrain at 2.5GT/s\n");
  +       pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
  +       lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
  +       lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
  +       pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
  +
  +       linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
  +       if (!linkup) {
  +               pci_info(dev, "retraining failed\n");
  +               return;
  +       }
  +
  +       pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
  +       if ((lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB)
  +               return;
  +
  +       if (!pci_match_id(ids, dev))
  +               return;

Your comment said "if we know this is *safe*"; I can't remember if
pci_match_id() is there to avoid a known problem?

  +
  +       pci_info(dev, "attempting link retrain at max supported rate\n");
  +       lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
  +       lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
  +       pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
  +
  +       linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
  +       if (!linkup)
  +               pci_info(dev, "retraining failed\n");
  +}
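
The Target Link Speed handling above can be sketched in user space as plain bit arithmetic.  The three helper names are hypothetical; the register field values are the ones in pci_regs.h.  It assumes, as the quoted patch does, that the low four bits of LNKCAP (Supported Link Speeds, holding the maximum speed) can be copied straight into the Target Link Speed field of LNKCTL2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKCAP_SLS		0x0000000f	/* Supported Link Speeds */
#define PCI_EXP_LNKCAP_SLS_2_5GB	0x00000001	/* 2.5GT/s only */
#define PCI_EXP_LNKCTL2_TLS		0x000f		/* Target Link Speed */
#define PCI_EXP_LNKCTL2_TLS_2_5GT	0x0001

/* Replace only the Target Link Speed field, leaving other bits alone. */
static uint16_t lnkctl2_set_tls(uint16_t lnkctl2, uint16_t tls)
{
	return (uint16_t)((lnkctl2 & ~PCI_EXP_LNKCTL2_TLS) |
			  (tls & PCI_EXP_LNKCTL2_TLS));
}

/* True when the port cannot go faster than 2.5GT/s anyway. */
static bool lnkcap_only_2_5gt(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_SLS) == PCI_EXP_LNKCAP_SLS_2_5GB;
}

/* Lift the restriction: target the maximum speed LNKCAP advertises. */
static uint16_t lnkctl2_unrestrict(uint16_t lnkctl2, uint32_t lnkcap)
{
	return lnkctl2_set_tls(lnkctl2,
			       (uint16_t)(lnkcap & PCI_EXP_LNKCAP_SLS));
}
```

For example, lnkctl2_set_tls(0x0043, PCI_EXP_LNKCTL2_TLS_2_5GT) clamps the target to 2.5GT/s while preserving bit 6, and lnkctl2_unrestrict() on the result restores the speed code from LNKCAP.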

> +
> +	return 0;
> +}
> +
> +/* Same as above, but called for a downstream device.  */
> +static int pcie_upstream_link_retrain(struct pci_dev *dev)
> +{
> +	struct pci_dev *bridge;
> +
> +	bridge = pci_upstream_bridge(dev);
> +	if (bridge)
> +		return pcie_downstream_link_retrain(bridge);
> +	else
> +		return -1;
> +}
> +
>  static int pci_acs_enable;
>  
>  /**
> @@ -1148,8 +1274,8 @@ void pci_resume_bus(struct pci_bus *bus)
>  
>  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  {
> +	int retrain = 0;
>  	int delay = 1;
> -	u32 id;
>  
>  	/*
>  	 * After reset, the device should not silently discard config
> @@ -1163,21 +1289,37 @@ static int pci_dev_wait(struct pci_dev *
>  	 * Command register instead of Vendor ID so we don't have to
>  	 * contend with the CRS SV value.
>  	 */
> -	pci_read_config_dword(dev, PCI_COMMAND, &id);
> -	while (PCI_POSSIBLE_ERROR(id)) {
> +	for (;;) {
> +		u32 id;
> +
> +		pci_read_config_dword(dev, PCI_COMMAND, &id);
> +		if (!PCI_POSSIBLE_ERROR(id)) {
> +			if (delay > PCI_RESET_WAIT)
> +				pci_info(dev, "ready %dms after %s\n",
> +					 delay - 1, reset_type);
> +			break;
> +		}
> +
>  		if (delay > timeout) {
>  			pci_warn(dev, "not ready %dms after %s; giving up\n",
>  				 delay - 1, reset_type);
>  			return -ENOTTY;
>  		}
>  
> -		if (delay > PCI_RESET_WAIT)
> +		if (delay > PCI_RESET_WAIT) {
> +			if (!retrain) {
> +				retrain = 1;
> +				if (pcie_upstream_link_retrain(dev) == 0) {
> +					delay = 1;
> +					continue;
> +				}
> +			}
>  			pci_info(dev, "not ready %dms after %s; waiting\n",
>  				 delay - 1, reset_type);
> +		}

Thanks for fixing this in the reset path, too.  Can we move this part
to a separate patch?  It's related to the rest of the patch, but it
looks different enough that I think it would be easier to understand
by itself.

I think I might try to fold the pcie_upstream_link_retrain() directly
in here because the "upstream link retrain" in the function name
doesn't really make sense in PCIe terms.
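
For reference, the control flow of the quoted pci_dev_wait() hunk boils down to "back off exponentially, and try the retrain workaround at most once".  Below is a user-space sketch of that flow only.  fake_dev, fake_config_read_ok(), fake_retrain() and fake_dev_wait() are hypothetical, RESET_WAIT_MS is an assumed stand-in for PCI_RESET_WAIT, and the doubling of the delay is assumed from the unquoted remainder of the loop.

```c
#include <assert.h>
#include <stdbool.h>

#define RESET_WAIT_MS	1000	/* assumed stand-in for PCI_RESET_WAIT */

/* Hypothetical device model: config reads work only once the link is up. */
struct fake_dev {
	bool link_up;
	bool retrain_works;
	int retrains;		/* how many times the workaround was tried */
};

static bool fake_config_read_ok(const struct fake_dev *d)
{
	return d->link_up;
}

static bool fake_retrain(struct fake_dev *d)
{
	d->retrains++;
	if (d->retrain_works)
		d->link_up = true;
	return d->retrain_works;
}

/* Returns 0 when the device responds, -1 on timeout. */
static int fake_dev_wait(struct fake_dev *d, int timeout_ms)
{
	bool retrained = false;
	int delay = 1;

	for (;;) {
		if (fake_config_read_ok(d))
			return 0;

		if (delay > timeout_ms)
			return -1;

		/* Try the workaround once, after the normal wait has elapsed. */
		if (delay > RESET_WAIT_MS && !retrained) {
			retrained = true;
			if (fake_retrain(d)) {
				delay = 1;	/* start the wait over */
				continue;
			}
		}

		/* The real loop would sleep for 'delay' ms here. */
		delay *= 2;
	}
}
```

A device that never comes up still times out as before, with exactly one retrain attempt along the way.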

Bjorn