Message-ID: <27be113e-3e33-b969-c1e3-c5e82d1b8b7f@amd.com>
Date: Tue, 18 Jun 2024 11:51:31 -0700
From: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>, Lukas Wunner <lukas@...ner.de>,
Yazen Ghannam <yazen.ghannam@....com>,
Ilpo Jarvinen <ilpo.jarvinen@...ux.intel.com>,
Bowman Terry <terry.bowman@....com>, Hagan Billy <billy.hagan@....com>,
Simon Guinot <simon.guinot@...gate.com>,
"Maciej W . Rozycki" <macro@...am.me.uk>,
Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>
Subject: Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link
speed reduction
On 6/17/2024 4:18 PM, Bjorn Helgaas wrote:
> On Mon, Jun 17, 2024 at 03:51:57PM -0700, Smita Koralahalli wrote:
>> Hi Bjorn,
>>
>> On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
>>> On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
>>>> Clear Link Bandwidth Management Status (LBMS) if set, on a hot-remove
>>>> event.
>>>>
>>>> The hot-remove event could result in target link speed reduction if LBMS
>>>> is set, due to a delay in Presence Detect State Change (PDSC) happening
>>>> after a Data Link Layer State Change event (DLLSC).
>>>>
>>>> In reality, PDSC and DLLSC events rarely arrive simultaneously. The PDSC
>>>> event can be delayed so long that the slot has already been powered down
>>>> by the DLLSC event alone, and the late PDSC is then falsely interpreted
>>>> as an interrupt raised to turn the slot on. This spurious attempt to
>>>> power the slot on without a link forces the kernel, if LBMS is set, to
>>>> retrain the link at a lower speed to re-establish it, thereby bringing
>>>> down the link speed [2].
>>>
>>> Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
>>> has no meaning for an empty slot?
>>
>> I'm trying to answer your question below here as well.
>>
>>> I guess the slot is empty, so retraining
>>> is meaningless and will always fail. Maybe avoiding it avoids a
>>> delay? Is the benefit that we get rid of the message and a delay?
>>
>> The benefit of this patch is to "avoid link speed drops" on a hot-remove
>> event if LBMS is set and DLLLA is clear. But I'm not trying to solve any
>> delay issues here.
>>
>> I included the PDSC and DLLSC details because they are the cause of the
>> link speed drops on a remove event. On an empty slot, DLLLA is clear and
>> LBMS may or may not be set, and we see link speed drops here if PDSC
>> happens on an empty slot.
>>
>> We know for a fact that the slot becomes empty if either of the PDSC or
>> DLLSC events occurs. Also, neither of them waits for the other before
>> bringing down the device and marking the slot as "empty". That is the
>> reason I was also thinking of waiting on both PDSC and DLLSC to bring
>> down the device, as I mentioned in my comments on v1. We could eliminate
>> the link speed drops by taking that approach as well, but then we would
>> have to address cases where PDSC is hardwired to zero.
>>
>> On our AMD systems, we expect to see both events on a hot-remove event,
>> but mostly we see DLLSC happening first, which does the job of marking
>> the slot empty. Now, if the PDSC event is delayed so much that it occurs
>> after the slot has become empty, the kernel misinterprets PDSC as the
>> signal to re-initialize the slot, and this is the sequence of steps the
>> kernel takes:
>>
>>   pciehp_handle_presence_or_link_change()
>>     pciehp_enable_slot()
>>       __pciehp_enable_slot()
>>         board_added()
>>           pciehp_check_link_status()
>>             pcie_wait_for_link()
>>               pcie_wait_for_link_delay()
>>                 pcie_failed_link_retrain()
>>
>> While doing so, it hits the case of DLLLA clear and LBMS set and brings
>> down the link speed.
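>>
>> To make it concrete, the check that triggers the downgrade looks roughly
>> like this (my paraphrase of pcie_failed_link_retrain() from commit
>> a89c82249c37, not the exact upstream code; "dev" is the downstream port):
>>
>>     u16 lnksta;
>>
>>     pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
>>     if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
>>         PCI_EXP_LNKSTA_LBMS) {
>>             /* LBMS set, DLLLA clear: clamp Target Link Speed to 2.5GT/s */
>>             pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL2,
>>                                                PCI_EXP_LNKCTL2_TLS,
>>                                                PCI_EXP_LNKCTL2_TLS_2_5GT);
>>             /* ... then retrain, which on an empty slot can only fail */
>>     }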
>
> So I guess LBMS currently remains set after a device has been removed,
> so the slot is empty, and later when a device is hot-added, *that*
> device sees a lower-than-expected link speed?
Yes, when PDSC is fired after the slot is empty (i.e., when DLLLA is clear).
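
If I understand the failure mode correctly, the Target Link Speed in the
port's Link Control 2 register is left clamped at 2.5GT/s, which is why the
re-added card trains at that rate. A hypothetical debug snippet (not part of
the patch) to confirm the lingering clamp on the port would be something
like:

    u16 lnkctl2;

    /* "port" is the hot-plug capable downstream port (ctrl->pcie->port) */
    pcie_capability_read_word(port, PCI_EXP_LNKCTL2, &lnkctl2);
    if ((lnkctl2 & PCI_EXP_LNKCTL2_TLS) == PCI_EXP_LNKCTL2_TLS_2_5GT)
            pci_info(port, "Target Link Speed still clamped to 2.5GT/s\n");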
Thanks
Smita
>
>> The issue of PDSC and DLLSC not occurring simultaneously was known before,
>> and it wasn't breaking anything functionally, as the kernel would just
>> exit with the "No link" message from pciehp_check_link_status().
>>
>> However, commit a89c82249c37 ("PCI: Work around PCIe link training
>> failures") introduced link speed downgrades to re-establish the link when
>> LBMS is set and DLLLA is clear. As a result, devices that were operating
>> at higher speeds before the hot-remove come back up at 2.5GT/s after they
>> are plugged in again.
>>
>>>> According to PCIe r6.2 sec 7.5.3.8 [1], it can be derived that LBMS
>>>> cannot be set for an unconnected link and, if set, it serves the purpose
>>>> of indicating that there is actually a device down an inactive link.
>>>
>>> I see that r6.2 added an implementation note about DLLSC, but I'm not
>>> a hardware person and can't follow the implication about a device
>>> present down an inactive link.
>>>
>>> I guess it must be related to the fact that LBMS indicates either
>>> completion of link retraining or a change in link speed or width
>>> (which would imply presence of a downstream device). But in both
>>> cases I assume the link would be active.
>>>
>>> But IIUC LBMS is set by hardware but never cleared by hardware, so if
>>> we remove a device and power off the slot, it doesn't seem like LBMS
>>> could be telling us anything useful (what could we do in response to
>>> LBMS when the slot is empty?), so it makes sense to me to clear it.
>>>
>>> It seems like pciehp_unconfigure_device() does sort of PCI core and
>>> driver-related things and possibly could be something shared by all
>>> hotplug drivers, while remove_board() does things more specific to the
>>> hotplug model (pciehp, shpchp, etc).
>>>
>>> From that perspective, clearing LBMS might fit better in
>>> remove_board(). In that case, I wonder whether it should be done
>>> after turning off slot power? This patch clears it *before* turning
>>> off the power, so I wonder if hardware could possibly set it again
>>> before the poweroff?
>>
>> Yeah, from talking to HW people I realized that HW could possibly set
>> LBMS at any time while slot power is on. Will change it so the clearing
>> is done in remove_board().
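>>
>> For v3 I'm thinking of something along these lines (untested sketch,
>> paraphrasing the tail of remove_board() in pciehp_ctrl.c rather than
>> quoting it verbatim):
>>
>>     if (POWER_CTRL(ctrl)) {
>>             pciehp_power_off_slot(ctrl);
>>
>>             /* Ignore link or presence changes caused by power off */
>>             atomic_and(~(PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDC),
>>                        &ctrl->pending_events);
>>     }
>>
>>     /* HW can no longer set LBMS once slot power is off, so clear it here */
>>     pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>>                                PCI_EXP_LNKSTA_LBMS);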
>>
>>>> However, hardware could have already set LBMS while the device was
>>>> connected to the port, i.e. when the state was DL_Up or DL_Active. Some
>>>> hardware may even have attempted a retrain, going into recovery mode,
>>>> just before transitioning to DL_Down.
>>>>
>>>> Thus the LBMS bit is never cleared and might force software to drop the
>>>> link speed when there is no link [2].
>>>>
>>>> Dmesg before:
>>>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>>>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>>>> pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>>>> pcieport 0000:20:01.1: retraining failed
>>>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>>>
>>>> Dmesg after:
>>>> pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>>>> pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>>>> pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>>
>>> I'm a little confused about the problem being solved here. Obviously
>>> the message is extraneous. I guess the slot is empty, so retraining
>>> is meaningless and will always fail. Maybe avoiding it avoids a
>>> delay? Is the benefit that we get rid of the message and a delay?
>>>
>>>> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>>>> https://members.pcisig.com/wg/PCI-SIG/document/20590
>>>> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>>>>
>>>> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
>>>
>>> Lukas asked about this; did you confirm that it is related? Asking
>>> because the Fixes tag may cause this to be backported along with
>>> a89c82249c37.
>>
>> Yeah, without a89c82249c37 we won't see the link speed drops.
>>
>> Thanks,
>> Smita
>>
>>>
>>>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
>>>> ---
>>>> Link to v1:
>>>> https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@amd.com/
>>>>
>>>> v2:
>>>> Cleared LBMS unconditionally. (Ilpo)
>>>> Added Fixes Tag. (Lukas)
>>>> ---
>>>> drivers/pci/hotplug/pciehp_pci.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
>>>> index ad12515a4a12..dae73a8932ef 100644
>>>> --- a/drivers/pci/hotplug/pciehp_pci.c
>>>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>>>> @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>>>>  	}
>>>>  
>>>>  	pci_unlock_rescan_remove();
>>>> +
>>>> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>>>> +				   PCI_EXP_LNKSTA_LBMS);
>>>>  }
>>>> --
>>>> 2.17.1
>>>>