linux-kernel - Re: [External] : Re: [PATCH 2/2] PCI: Fix the PCIe bridge decreasing to Gen 1 during hotplug testing

Message-ID: <alpine.DEB.2.21.2511290112150.36486@angie.orcam.me.uk>
Date: Mon, 1 Dec 2025 03:54:57 +0000 (GMT)
From: "Maciej W. Rozycki" <macro@...am.me.uk>
To: ALOK TIWARI <alok.a.tiwari@...cle.com>
cc: Lukas Wunner <lukas@...ner.de>, 
    Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>, 
    Bjorn Helgaas <bhelgaas@...gle.com>, Jiwei <jiwei.sun.bj@...com>, 
    linux-pci@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>, 
    guojinhui.liam@...edance.com, Bjorn Helgaas <helgaas@...nel.org>, 
    ahuang12@...ovo.com, sunjw10@...ovo.com
Subject: Re: [External] : Re: [PATCH 2/2] PCI: Fix the PCIe bridge decreasing
 to Gen 1 during hotplug testing

On Wed, 26 Nov 2025, ALOK TIWARI wrote:

> We are testing hot-add/hot-remove behavior and have observed the same issue
> as mentioned, where the PCIe bridge link speed drops from 32 GT/s to 2.5 GT/s.
> 
> My understanding is that pcie_failed_link_retrain should only apply to devices
> matched by PCI_VDEVICE(ASMEDIA, 0x2824),
> but the current implementation appears to affect all devices that take longer
> to establish a link.

 Thank you for your report.

 No, there seems to be nothing wrong with said device by itself; the problem 
is either with the downstream device (which obviously cannot be discovered 
until a link has actually been established), or with the particular device 
pair or setup.  I originally implemented matching for this particular device 
out of an abundance of caution, in case removing the speed restriction for 
other upstream devices (should the quirk trigger there) caused the link to 
go back into the infinite retraining loop.
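A minimal C sketch of the kind of per-device gate described here: the 2.5GT/s clamp is lifted only for upstream devices listed in an ID table, so for every other device the clamp stays in place. The names (`struct dev_id`, `speed_unclamp_ids`, `may_lift_clamp()`) are hypothetical and this is not the kernel's `pcie_failed_link_retrain()` code; 0x1b21 is ASMedia's PCI vendor ID and 0x2824 is the device ID from the PCI_VDEVICE(ASMEDIA, 0x2824) entry quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ID table entry; not the kernel's struct pci_device_id. */
struct dev_id {
	uint16_t vendor;
	uint16_t device;
};

/* 0x1b21 is ASMedia's PCI vendor ID; 0x2824 is the switch from the report. */
static const struct dev_id speed_unclamp_ids[] = {
	{ 0x1b21, 0x2824 },
};

/*
 * Return true if the 2.5GT/s clamp may be lifted for this upstream device.
 * Only explicitly listed devices qualify; any other device keeps the clamp
 * once the quirk has applied it.
 */
static bool may_lift_clamp(uint16_t vendor, uint16_t device)
{
	size_t n = sizeof(speed_unclamp_ids) / sizeof(speed_unclamp_ids[0]);

	for (size_t i = 0; i < n; i++)
		if (speed_unclamp_ids[i].vendor == vendor &&
		    speed_unclamp_ids[i].device == device)
			return true;
	return false;
}
```

The allow-list errs on the side of leaving links slow rather than risking the retraining loop, which is exactly why a bridge not in the table stays at 2.5 GT/s in the reported hot-plug case.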

> We are unsure if this is intentional, but it effectively allows such
> devices to continue operating at a reduced speed.

 It was intentional, but it did not take into account noisy hot-plug 
scenarios, which are not part of my lab setup.

> If we extend PCIE_LINK_RETRAIN_TIMEOUT_MS to 3000 ms, these slower devices are
> able to complete link training and the problem is no longer observed in our
> testing. Therefore, increasing PCIE_LINK_RETRAIN_TIMEOUT_MS to 3000 ms seems
> to resolve the issue for us.
> 
> Would it be acceptable to increase PCIE_LINK_RETRAIN_TIMEOUT_MS from 1000 to
> 3000 ms in this case?

 FWIW, my understanding is that this actually goes beyond the spec.
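The trade-off in the proposed timeout bump can be illustrated with a minimal C simulation of a polled wait: a link partner that needs longer than the timeout to train is reported as failed, and raising the timeout makes it succeed. The names (`struct sim_link`, `sim_link_active()`, `wait_for_link()`) and the 1500 ms training time are hypothetical and illustrative, not measurements from the report; real code would poll the Link Status register rather than a simulated clock.

```c
#include <stdbool.h>

/* Hypothetical simulated link that needs train_ms milliseconds to come up. */
struct sim_link {
	unsigned int train_ms;
};

/* Simulated "link is up" check at time now_ms since retraining started. */
static bool sim_link_active(const struct sim_link *link, unsigned int now_ms)
{
	return now_ms >= link->train_ms;
}

/*
 * Poll every poll_ms (must be nonzero) until the link is up or timeout_ms
 * has elapsed.  Returns true if the link came up within the timeout.
 */
static bool wait_for_link(const struct sim_link *link, unsigned int timeout_ms,
			  unsigned int poll_ms)
{
	for (unsigned int t = 0; t <= timeout_ms; t += poll_ms)
		if (sim_link_active(link, t))
			return true;
	return false;
}
```

With a 1500 ms training time, a 1000 ms timeout reports failure while a 3000 ms timeout succeeds; the cost is that every genuinely dead link also takes three times as long to give up on.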

 However, given other reports, I have given more thought to the idea I 
shared previously, which sadly received no feedback to motivate me further, 
and implemented an even simpler approach, where the 2.5GT/s speed clamp is 
always removed regardless of the link state and, if that fails, whatever 
clamp was in place at entry to the quirk is restored.  I hope this will 
prove robust enough not to cause further issues with hot-plug scenarios.
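A minimal C sketch of the lift-or-restore flow described in the paragraph above, using a simulated port. The names (`struct sim_port`, `sim_retrain()`, `lift_clamp_or_restore()`) and the speed encoding (GT/s times 10) are hypothetical; this is not the code from the linked patch, it does not model the earlier step where the quirk applies the 2.5GT/s clamp, and it does not model what the link does after the setting is restored.

```c
#include <stdbool.h>

/* Hypothetical speed encoding: GT/s * 10, so values stay integral. */
enum { SPEED_2_5GT = 25, SPEED_16GT = 160, SPEED_32GT = 320 };

/* Hypothetical port; real code would use the Link Control 2 register. */
struct sim_port {
	int target_speed;	/* current Target Link Speed setting */
	int max_speed;		/* highest speed the port supports */
	int partner_ok_speed;	/* highest speed at which retraining succeeds */
};

/* Simulated retrain: succeeds only if the target is one the link tolerates. */
static bool sim_retrain(const struct sim_port *port)
{
	return port->target_speed <= port->partner_ok_speed;
}

/*
 * Remember the target speed on entry, always lift the clamp regardless of
 * link state, and if retraining at full speed fails, put back whatever
 * setting was in place on entry.
 */
static bool lift_clamp_or_restore(struct sim_port *port)
{
	int saved = port->target_speed;

	port->target_speed = port->max_speed;
	if (sim_retrain(port))
		return true;

	port->target_speed = saved;
	return false;
}
```

Because the saved value is whatever was set on entry, a clamp the quirk did not apply itself (for example a 16 GT/s limit configured elsewhere) is preserved on failure rather than being replaced with 2.5 GT/s.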

 Please give it a try and let me know if it fixes your issue:

<https://lore.kernel.org/r/alpine.DEB.2.21.2511290245460.36486@angie.orcam.me.uk/>

  Maciej