Message-ID: <d4e5b6d8-c69f-4fbc-8da6-bc2c2fb2a550@oracle.com>
Date: Wed, 26 Nov 2025 00:53:18 +0530
From: ALOK TIWARI <alok.a.tiwari@...cle.com>
To: Lukas Wunner <lukas@...ner.de>,
Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
bhelgaas@...gle.com
Cc: Jiwei <jiwei.sun.bj@...com>, macro@...am.me.uk, linux-pci@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>, guojinhui.liam@...edance.com,
helgaas@...nel.org, ahuang12@...ovo.com, sunjw10@...ovo.com
Subject: Re: [External] : Re: [PATCH 2/2] PCI: Fix the PCIe bridge decreasing
to Gen 1 during hotplug testing
Hi,
On 1/15/2025 3:48 PM, Lukas Wunner wrote:
> On Tue, Jan 14, 2025 at 08:25:04PM +0200, Ilpo Järvinen wrote:
>> On Tue, 14 Jan 2025, Jiwei wrote:
>>> [ 539.362400] ==== pcie_bwnotif_irq 269(stop running),link_status:0x7841
>>> [ 539.395720] ==== pcie_bwnotif_irq 247(start running),link_status:0x1041
>>
>> DLLLA=0
>>
>> But LBMS did not get reset.
>>
>> So is this perhaps because hotplug cannot keep up with the rapid
>> remove/add going on, and thus will not always call the remove_board()
>> even if the device went away?
>>
>> Lukas, do you know if there's a good way to resolve this within hotplug
>> side?
>
> I believe the pciehp code is fine and suspect this is an issue
> in the quirk. We've been dealing with rapid add/remove in pciehp
> for years without issues.
>
> I don't understand the quirk well enough to guess what's going wrong,
> but I'm wondering if there could be a race accessing lbms_count?
>
> Maybe if lbms_count is replaced by a flag in pci_dev->priv_flags
> as we've discussed, with proper memory barriers where necessary,
> this problem will solve itself?
>
> Thanks,
>
> Lukas
>
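As an aside on the lbms_count point above: a priv_flags-based replacement
could look roughly like the sketch below. This is only an illustration of
the idea, not a tested patch; PCI_DEV_LBMS_SEEN is a made-up name and the
bit number would have to be chosen next to the existing priv_flags bits in
drivers/pci/pci.h.

/* Hypothetical bit in pci_dev->priv_flags, replacing lbms_count */
#define PCI_DEV_LBMS_SEEN	4	/* must not clash with existing priv_flags bits */

static inline void pcie_lbms_mark_seen(struct pci_dev *dev)
{
	set_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}

static inline bool pcie_lbms_test_and_clear(struct pci_dev *dev)
{
	/* test_and_clear_bit() is fully ordered, so no extra barriers needed here */
	return test_and_clear_bit(PCI_DEV_LBMS_SEEN, &dev->priv_flags);
}

The bandwidth-notification IRQ handler would set the bit when it sees LBMS,
and the link-down/remove path would clear it, so stale state could no longer
survive a rapid remove/add cycle.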
Separately, we are testing hot-add/hot-remove behavior and have observed the
same issue as described here: the PCIe bridge link speed drops from
32 GT/s to 2.5 GT/s.
My understanding is that pcie_failed_link_retrain should only apply to
devices matched by PCI_VDEVICE(ASMEDIA, 0x2824), but the current
implementation appears to affect any device that takes longer to establish
a link. We are unsure whether this is intentional, but it effectively
leaves such devices operating at a reduced speed.
If we extend PCIE_LINK_RETRAIN_TIMEOUT_MS to 3000 ms, these slower devices
are able to complete link training and the problem is no longer observed in
our testing.
Would it be acceptable to increase PCIE_LINK_RETRAIN_TIMEOUT_MS from
1000 ms to 3000 ms in this case, i.e. something along the lines of the
sketch below?
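For reference, the change we tested is essentially the following (assuming
the macro is still defined in drivers/pci/pci.h; shown only as a sketch of
what we ran, not as a formal patch):

--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
-#define PCIE_LINK_RETRAIN_TIMEOUT_MS	1000
+#define PCIE_LINK_RETRAIN_TIMEOUT_MS	3000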
Thanks,
Alok