Message-ID: <20240617231841.GA1232294@bhelgaas>
Date: Mon, 17 Jun 2024 18:18:41 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Lukas Wunner <lukas@...ner.de>,
Yazen Ghannam <yazen.ghannam@....com>,
Ilpo Jarvinen <ilpo.jarvinen@...ux.intel.com>,
Bowman Terry <terry.bowman@....com>,
Hagan Billy <billy.hagan@....com>,
Simon Guinot <simon.guinot@...gate.com>,
"Maciej W . Rozycki" <macro@...am.me.uk>,
Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
Subject: Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link
speed reduction
On Mon, Jun 17, 2024 at 03:51:57PM -0700, Smita Koralahalli wrote:
> Hi Bjorn,
>
> On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
> > On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
> > > Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove
> > > event.
> > >
> > > The hot-remove event could result in target link speed reduction if LBMS
> > > is set, due to a delay in Presence Detect State Change (PDSC) happening
> > > after a Data Link Layer State Change event (DLLSC).
> > >
> > > In reality, PDSC and DLLSC events rarely arrive simultaneously. PDSC
> > > can sometimes come too late, after the slot has already been powered
> > > down by the DLLSC event alone. The delayed PDSC is then falsely
> > > interpreted as an interrupt raised to turn the slot on. This spurious
> > > power-on, with no link present, forces the kernel to retrain the link
> > > at a lower speed if LBMS is set, in order to re-establish the link,
> > > thereby bringing down the link speed [2].
> >
> > Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
> > has no meaning for an empty slot?
>
> I'm trying to also answer your below question here..
>
> > I guess the slot is empty, so retraining
> > is meaningless and will always fail. Maybe avoiding it avoids a
> > delay? Is the benefit that we get rid of the message and a delay?
>
> The benefit of this patch is to "avoid link speed drops" on a hot-remove
> event if LBMS is set and DLLLA is clear. But I'm not trying to solve delay
> issues here.
>
> I included the PDSC and DLLSC details as they are the cause for link speed
> drops on a remove event. On an empty slot, DLLLA is cleared and LBMS may or
> may not be set. And, we see cases of link speed drops here, if PDSC happens
> on an empty slot.
>
> We know for a fact that the slot becomes empty if either PDSC or DLLSC
> occurs. Also, neither of them waits for the other to bring down the
> device and mark the slot as "empty". That is the reason I was also
> thinking of waiting on both PDSC and DLLSC to bring down the device, as
> I mentioned in my comments on v1. We could eliminate link speed drops by
> taking that approach as well, but then we would have to address cases
> where PDSC is hardwired to zero.
>
> On our AMD systems, we expect to see both events on a hot-remove event,
> but mostly we see DLLSC happening first, which does the job of marking the
> slot empty. Now, if the PDSC event is delayed too long and occurs after
> the slot becomes empty, the kernel misinterprets PDSC as the signal to
> re-initialize the slot, and this is the sequence of steps the kernel takes:
>
> pciehp_handle_presence_or_link_change()
>   pciehp_enable_slot()
>     __pciehp_enable_slot()
>       board_added()
>         pciehp_check_link_status()
>           pcie_wait_for_link()
>             pcie_wait_for_link_delay()
>               pcie_failed_link_retrain()
>
> While doing so, it hits the case of DLLLA clear and LBMS set, and brings
> down the link speed.
So I guess LBMS currently remains set after a device has been removed,
so the slot is empty, and later when a device is hot-added, *that*
device sees a lower-than-expected link speed?
> The issue of PDSC and DLLSC not occurring simultaneously was a known thing
> from before, and it wasn't breaking anything functionally, as the kernel
> would just exit with the message "No link" from pciehp_check_link_status().
>
> However, commit a89c82249c37 ("PCI: Work around PCIe link training
> failures") introduced link speed downgrades to re-establish links when LBMS
> is set and DLLLA is clear. This caused devices that previously operated at
> higher speeds to come back at 2.5GT/s after being hot-removed and plugged
> in again.
>
> > > According to PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS
> > > cannot be set for an unconnected link and that, if set, it indicates
> > > that there is actually a device down an inactive link.
> >
> > I see that r6.2 added an implementation note about DLLSC, but I'm not
> > a hardware person and can't follow the implication about a device
> > present down an inactive link.
> >
> > I guess it must be related to the fact that LBMS indicates either
> > completion of link retraining or a change in link speed or width
> > (which would imply presence of a downstream device). But in both
> > cases I assume the link would be active.
> >
> > But IIUC LBMS is set by hardware but never cleared by hardware, so if
> > we remove a device and power off the slot, it doesn't seem like LBMS
> > could be telling us anything useful (what could we do in response to
> > LBMS when the slot is empty?), so it makes sense to me to clear it.
> >
> > It seems like pciehp_unconfigure_device() does sort of PCI core and
> > driver-related things and possibly could be something shared by all
> > hotplug drivers, while remove_board() does things more specific to the
> > hotplug model (pciehp, shpchp, etc).
> >
> > From that perspective, clearing LBMS might fit better in
> > remove_board(). In that case, I wonder whether it should be done
> > after turning off slot power? This patch clears is *before* turning
> > off the power, so I wonder if hardware could possibly set it again
> > before the poweroff?
>
> Yeah by talking to HW people I realized that HW could interfere possibly
> anytime to set LBMS when the slot power is on. Will change it to include in
> remove_board().
>
> > > However, hardware could have already set LBMS while the device was
> > > connected to the port, i.e. when the state was DL_Up or DL_Active. Some
> > > hardware may even have attempted a retrain by going into Recovery,
> > > just before transitioning to DL_Down.
> > >
> > > Thus LBMS, once set, is never cleared and might force software to
> > > drop the link speed when there is no link [2].
> > >
> > > Dmesg before:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> > > pcieport 0000:20:01.1: retraining failed
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No link
> > >
> > > Dmesg after:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No link
> >
> > I'm a little confused about the problem being solved here. Obviously
> > the message is extraneous. I guess the slot is empty, so retraining
> > is meaningless and will always fail. Maybe avoiding it avoids a
> > delay? Is the benefit that we get rid of the message and a delay?
> >
> > > [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
> > > https://members.pcisig.com/wg/PCI-SIG/document/20590
> > > [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
> > >
> > > Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> >
> > Lukas asked about this; did you confirm that it is related? Asking
> > because the Fixes tag may cause this to be backported along with
> > a89c82249c37.
>
> Yeah, it is related; without a89c82249c37 we wouldn't see the link speed
> drops.
>
> Thanks,
> Smita
>
> >
> > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
> > > ---
> > > Link to v1:
> > > https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/
> > >
> > > v2:
> > > Cleared LBMS unconditionally. (Ilpo)
> > > Added Fixes Tag. (Lukas)
> > > ---
> > > drivers/pci/hotplug/pciehp_pci.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> > > index ad12515a4a12..dae73a8932ef 100644
> > > --- a/drivers/pci/hotplug/pciehp_pci.c
> > > +++ b/drivers/pci/hotplug/pciehp_pci.c
> > > @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> > >  	}
> > >  	pci_unlock_rescan_remove();
> > > +
> > > +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> > > +				   PCI_EXP_LNKSTA_LBMS);
> > >  }
> > > --
> > > 2.17.1
> > >