lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aHCPTU03s-SkAsPs@wunner.de>
Date: Fri, 11 Jul 2025 06:13:01 +0200
From: Lukas Wunner <lukas@...ner.de>
To: Hongbo Yao <andy.xu@...micro.com>
Cc: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@...ux.intel.com>,
	bhelgaas@...gle.com, mahesh@...ux.ibm.com, oohall@...il.com,
	linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
	jemma.zhang@...micro.com, peter.du@...micro.com
Subject: Re: [PATCH] PCI/DPC: Extend DPC recovery timeout

On Fri, Jul 11, 2025 at 11:20:15AM +0800, Hongbo Yao wrote:
> 2025/7/8 1:04, Sathyanarayanan Kuppuswamy:
> > On 7/7/25 3:30 AM, Andy Xu wrote:
> > > Setting timeout to 7s covers both devices with safety margin.
> > 
> > Instead of updating the recovery time, can you check why your device
> > recovery takes
> > such a long time and how to fix it from the device end?
> 
> I fully agree that ideally the root cause should be addressed on the
> device side to reduce the DPC recovery latency, and that waiting longer
> in the kernel is not a perfect solution.
> 
> However, the current 4 seconds timeout in pci_dpc_recovered() is indeed
> an empirical value rather than a hard requirement from the PCIe
> specification. In real-world scenarios, like with Mellanox ConnectX-5/7
> adapters, we've observed that full DPC recovery can take more than 5-6
> seconds, which leads to premature hotplug processing and device removal.

I think Sathya's point was:  Have you made an effort to talk to the
vendor and ask them to root-cause and fix the issue e.g. with a firmware
update.

> To improve robustness and maintain flexibility, I???m considering
> introducing a module parameter to allow tuning the DPC recovery timeout
> dynamically. Would you like me to prepare and submit such a patch for
> review?

We try to avoid adding new module parameters.  Things should just work
out of the box without the user having to adjust the kernel command
line for their system.

So the solution is indeed to either adjust the delay for everyone
(as you've done) or introduce an unsigned int to struct pci_dev
which can be assigned the delay after reset for the device to be
responsive.

For comparison, we're allowing up to 60 sec for devices to become
available after a Fundamental Reset or Conventional Reset
(PCIE_RESET_READY_POLL_MS).  That's how long we're waiting in
dpc_reset_link() -> pci_bridge_wait_for_secondary_bus() and
we're not consistent with that when we wait only 4 sec in
pci_dpc_recovered().

I think the reason is that we weren't really sure whether this approach
to synchronize hotplug with DPC works well and how to choose delays.
But we've had this for a few years now and it seems to have worked nicely
for people.  I think this is the first report where it's not been
working out of the box.

Thanks,

Lukas

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ