Message-ID: <20230504222048.GA887151@bhelgaas>
Date: Thu, 4 May 2023 17:20:48 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: "Maciej W. Rozycki" <macro@...am.me.uk>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>,
Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Saeed Mahameed <saeedm@...dia.com>,
Leon Romanovsky <leon@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Alex Williamson <alex.williamson@...hat.com>,
Lukas Wunner <lukas@...ner.de>,
Mika Westerberg <mika.westerberg@...ux.intel.com>,
Stefan Roese <sr@...x.de>, Jim Wilson <wilson@...iptree.org>,
David Abdurachmanov <david.abdurachmanov@...il.com>,
Pali Rohár <pali@...nel.org>,
linux-pci@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
linux-rdma@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 7/7] PCI: Work around PCIe link training failures

On Thu, Apr 06, 2023 at 01:21:31AM +0100, Maciej W. Rozycki wrote:
> Attempt to handle cases such as with a downstream port of the ASMedia
> ASM2824 PCIe switch where link training never completes and the link
> continues switching between speeds indefinitely with the data link layer
> never reaching the active state.

We're going to land this series this cycle, come hell or high water.

We talked about reusing pcie_retrain_link() earlier. IIRC that didn't
work: ASPM needs to use PCI_EXP_LNKSTA_LT because not all devices
support PCI_EXP_LNKSTA_DLLLA, and you need PCI_EXP_LNKSTA_DLLLA
because the erratum makes PCI_EXP_LNKSTA_LT flap.

What if we made pcie_retrain_link() reusable by making it:

  bool pcie_retrain_link(struct pci_dev *pdev, u16 link_status_bit)

so ASPM could use pcie_retrain_link(link->pdev, PCI_EXP_LNKSTA_LT) and
you could use pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA)?

Maybe do it in two steps?

  1) Move pcie_retrain_link() just after pcie_wait_for_link() and make
     it take link->pdev instead of link.

  2) Add the bit parameter.
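
To illustrate why the choice of status bit matters, here is a minimal
userspace sketch (not kernel code) of the polling decision. The helper
names retrain_done() and poll_link_status() are hypothetical, the array
of samples stands in for repeated reads of the Link Status register, and
the polarity handling (wait for LT to clear, wait for any other bit to
be set) is an assumption about how a shared helper might treat the two
bits. Only the register bit values are taken from pci_regs.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Link Status bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKSTA_LT	0x0800	/* Link Training in progress */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */

/*
 * Hypothetical helper: does this Link Status snapshot mean retraining
 * has finished, judged by the bit the caller chose?
 * Assumption: LT must be clear, any other bit must be set.
 */
static bool retrain_done(uint16_t lnksta, uint16_t link_status_bit)
{
	if (link_status_bit == PCI_EXP_LNKSTA_LT)
		return !(lnksta & PCI_EXP_LNKSTA_LT);

	return (lnksta & link_status_bit) != 0;
}

/*
 * Hypothetical stand-in for the timeout loop: each array element plays
 * the role of one read of PCI_EXP_LNKSTA.
 */
static bool poll_link_status(const uint16_t *samples, size_t n,
			     uint16_t link_status_bit)
{
	for (size_t i = 0; i < n; i++) {
		if (retrain_done(samples[i], link_status_bit))
			return true;
	}

	return false;
}
```

With a flapping LT bit and DLLLA never set, polling on LT reports
success as soon as LT happens to read clear, while polling on DLLLA
keeps failing, which matches the reason given above for needing DLLLA
in the workaround.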

I'm OK with having pcie_retrain_link() in pci.c, but the surrounding
logic about restricting to 2.5GT/s, retraining, removing the
restriction, retraining again is stuff I'd rather have in quirks.c so
it doesn't clutter pci.c.

I think it'd be good if the pci_device_add() path made clear that this
is a workaround for a problem, e.g.,

  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
  {
    ...
    if (pcie_link_failed(dev))
      pcie_fix_link_train(dev);

where pcie_fix_link_train() could live in quirks.c (with a stub when
CONFIG_PCI_QUIRKS isn't enabled). It *might* even be worth adding it
and the stub first because that's a trivial patch and wouldn't clutter
the probe.c git history with all the grotty details about ASM2824 and
this topology.

> +int pcie_downstream_link_retrain(struct pci_dev *dev)
> +{
> +	static const struct pci_device_id ids[] = {
> +		{ PCI_VDEVICE(ASMEDIA, 0x2824) }, /* ASMedia ASM2824 */
> +		{}
> +	};
> +	u16 lnksta, lnkctl2;
> +
> +	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> +	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> +		return -1;
> +
> +	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
> +	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
> +	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
> +	    PCI_EXP_LNKSTA_LBMS) {

You go to some trouble to make sure PCI_EXP_LNKSTA_LBMS is set, and I
can't remember what the reason is. If you make a preparatory patch
like this, it would give a place for that background, e.g.,

+bool pcie_link_failed(struct pci_dev *dev)
+{
+	u16 lnksta;
+
+	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
+	    !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
+		return false;
+
+	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
+	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
+	    PCI_EXP_LNKSTA_LBMS)
+		return true;
+
+	return false;
+}

If this is a generic thing and checking PCI_EXP_LNKSTA_LBMS makes
sense for everybody, it could go in pci.c; otherwise it could go in
quirks.c as well. I guess it's not *truly* generic anyway because it
only detects link training failures for devices that have LNKCTL2 and
link_active_reporting.
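
For reference, the status test at the heart of pcie_link_failed() can be
exercised outside the kernel. The following is only a sketch: the
function name lnksta_indicates_failure() is hypothetical, and it covers
just the bit test on a Link Status value, not the capability checks
that precede it. The bit values are those from pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Status bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */

/*
 * Hypothetical helper: true only when LBMS is set and DLLLA is clear,
 * i.e. the same comparison the proposed pcie_link_failed() performs.
 */
static bool lnksta_indicates_failure(uint16_t lnksta)
{
	return (lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
	       PCI_EXP_LNKSTA_LBMS;
}
```

Masking with both bits and comparing against LBMS alone means a link
that is simply down (neither bit set) or up (DLLLA set, with or without
LBMS) is not treated as a training failure.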

> +		unsigned long timeout;
> +		u16 lnkctl;
> +
> +		pci_info(dev, "broken device, retraining non-functional downstream link at 2.5GT/s\n");
> +
> +		pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> +		lnkctl |= PCI_EXP_LNKCTL_RL;
> +		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> +		lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);
> +		/*
> +		 * Due to an erratum in some devices the Retrain Link bit
> +		 * needs to be cleared again manually to allow the link
> +		 * training to succeed.
> +		 */
> +		lnkctl &= ~PCI_EXP_LNKCTL_RL;
> +		if (dev->clear_retrain_link)
> +			pcie_capability_write_word(dev, PCI_EXP_LNKCTL,
> +						   lnkctl);
> +
> +		timeout = jiffies + PCIE_LINK_RETRAIN_TIMEOUT;
> +		do {
> +			pcie_capability_read_word(dev, PCI_EXP_LNKSTA,
> +						  &lnksta);
> +			if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> +				break;
> +			usleep_range(10000, 20000);
> +		} while (time_before(jiffies, timeout));
> +
> +		if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) {
> +			pci_info(dev, "retraining failed\n");
> +			return -1;
> +		}
> +	}
> +	if (IS_ENABLED(CONFIG_PCI_QUIRKS) && (lnksta & PCI_EXP_LNKSTA_DLLLA) &&
> +	    (lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT &&
> +	    pci_match_id(ids, dev)) {
> +		u32 lnkcap;
> +		u16 lnkctl;
> +
> +		pci_info(dev, "removing 2.5GT/s downstream link speed restriction\n");
> +		pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
> +		pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnkctl);
> +		lnkctl |= PCI_EXP_LNKCTL_RL;
> +		lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
> +		lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> +		pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnkctl);

This starts a retrain; should we wait for training to complete?

> +	}

If we put most of this into a pcie_fix_link_train() (separated from
detecting the *need* to fix something), could it be made to look
sort of like this? (I suppose you'd want to return bool and rename
it so that it reads naturally, e.g., "pcie_link_forcibly_retrained()",
"pcie_link_retrained()", etc.)

+void pcie_fix_link_train(struct pci_dev *dev)
+{
+	u16 lnkctl2;
+	u32 lnkcap;
+	bool linkup;
+
+	pci_info(dev, "attempting link retrain at 2.5GT/s\n");
+	pcie_capability_read_word(dev, PCI_EXP_LNKCTL2, &lnkctl2);
+	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+	lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+
+	linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
+	if (!linkup) {
+		pci_info(dev, "retraining failed\n");
+		return;
+	}
+
+	if (LNKCAP supports only 2.5GT/s)
+		return;
+
+	if (!pci_match_id(ids, dev))
+		return;

Your comment said "if we know this is *safe*"; I can't remember if
pci_match_id() is there to avoid a known problem?

+
+	pci_info(dev, "attempting link retrain at max supported rate\n");
+	pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
+	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
+	lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
+	pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
+
+	linkup = pcie_retrain_link(dev, PCI_EXP_LNKSTA_DLLLA);
+	if (!linkup)
+		pci_info(dev, "retraining failed\n");
+}
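
The two Target Link Speed updates in the sketch above are plain masked
read-modify-write operations on Link Control 2. As a rough userspace
illustration (the helper names tls_restrict_2_5gt() and
tls_restore_max() are hypothetical and no config space access is
performed; only the mask values come from pci_regs.h):

```c
#include <stdint.h>

/* Field masks and values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKCAP_SLS		0x0000000f /* Supported Link Speeds */
#define PCI_EXP_LNKCTL2_TLS		0x000f	   /* Target Link Speed */
#define PCI_EXP_LNKCTL2_TLS_2_5GT	0x0001	   /* 2.5GT/s target */

/* Hypothetical helper: clamp the Target Link Speed field to 2.5GT/s. */
static uint16_t tls_restrict_2_5gt(uint16_t lnkctl2)
{
	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
	lnkctl2 |= PCI_EXP_LNKCTL2_TLS_2_5GT;
	return lnkctl2;
}

/*
 * Hypothetical helper: set the Target Link Speed field to the value
 * found in the Supported Link Speeds field of Link Capabilities.
 */
static uint16_t tls_restore_max(uint16_t lnkctl2, uint32_t lnkcap)
{
	lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
	lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
	return lnkctl2;
}
```

Both helpers leave the bits outside the Target Link Speed field
untouched, which is what the clear-then-OR sequence in the patch is
meant to guarantee.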
> +
> +	return 0;
> +}
> +
> +/* Same as above, but called for a downstream device. */
> +static int pcie_upstream_link_retrain(struct pci_dev *dev)
> +{
> +	struct pci_dev *bridge;
> +
> +	bridge = pci_upstream_bridge(dev);
> +	if (bridge)
> +		return pcie_downstream_link_retrain(bridge);
> +	else
> +		return -1;
> +}
> +
>  static int pci_acs_enable;
>
>  /**
> @@ -1148,8 +1274,8 @@ void pci_resume_bus(struct pci_bus *bus)
>
>  static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  {
> +	int retrain = 0;
>  	int delay = 1;
> -	u32 id;
>
>  	/*
>  	 * After reset, the device should not silently discard config
> @@ -1163,21 +1289,37 @@ static int pci_dev_wait(struct pci_dev *
>  	 * Command register instead of Vendor ID so we don't have to
>  	 * contend with the CRS SV value.
>  	 */
> -	pci_read_config_dword(dev, PCI_COMMAND, &id);
> -	while (PCI_POSSIBLE_ERROR(id)) {
> +	for (;;) {
> +		u32 id;
> +
> +		pci_read_config_dword(dev, PCI_COMMAND, &id);
> +		if (!PCI_POSSIBLE_ERROR(id)) {
> +			if (delay > PCI_RESET_WAIT)
> +				pci_info(dev, "ready %dms after %s\n",
> +					 delay - 1, reset_type);
> +			break;
> +		}
> +
>  		if (delay > timeout) {
>  			pci_warn(dev, "not ready %dms after %s; giving up\n",
>  				 delay - 1, reset_type);
>  			return -ENOTTY;
>  		}
>
> -		if (delay > PCI_RESET_WAIT)
> +		if (delay > PCI_RESET_WAIT) {
> +			if (!retrain) {
> +				retrain = 1;
> +				if (pcie_upstream_link_retrain(dev) == 0) {
> +					delay = 1;
> +					continue;
> +				}
> +			}
>  			pci_info(dev, "not ready %dms after %s; waiting\n",
>  				 delay - 1, reset_type);
> +		}

Thanks for fixing this in the reset path, too. Can we move this part
to a separate patch? It's related to the rest of the patch, but it
looks so much different that I think it would be easier to understand
by itself.

I think I might try to fold the pcie_upstream_link_retrain() directly
in here because the "upstream link retrain" in the function name
doesn't really make sense in PCIe terms.

Bjorn