lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20240729185631.26746-1-mattc@purestorage.com>
Date: Mon, 29 Jul 2024 12:56:31 -0600
From: Matthew W Carlis <mattc@...estorage.com>
To: macro@...am.me.uk
Cc: alex.williamson@...hat.com,
	bhelgaas@...gle.com,
	christophe.leroy@...roup.eu,
	davem@...emloft.net,
	david.abdurachmanov@...il.com,
	edumazet@...gle.com,
	ilpo.jarvinen@...ux.intel.com,
	kuba@...nel.org,
	leon@...nel.org,
	linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org,
	linux-rdma@...r.kernel.org,
	linuxppc-dev@...ts.ozlabs.org,
	lukas@...ner.de,
	mahesh@...ux.ibm.com,
	mattc@...estorage.com,
	mika.westerberg@...ux.intel.com,
	mpe@...erman.id.au,
	netdev@...r.kernel.org,
	npiggin@...il.com,
	oohall@...il.com,
	pabeni@...hat.com,
	pali@...nel.org,
	saeedm@...dia.com,
	sr@...x.de,
	wilson@...iptree.org
Subject: PCI: Work around PCIe link training failures

On Mon, 29 July 2024, Ilpo Järvinen wrote:

> The most obvious solution is to not leave the speed at Gen1 on failure in
> Target Speed quirk but to restore the original Target Speed value. The
> downside with that is if the current retraining interface (function) is
> used, it adds delay.

Tends to be that I care less about how long a device is gone & more about
how it will behave once it reappears. For our purposes we don't even tend
to notice a few seconds of wiggle in this area, but we do notice impact
if the kernel creates the nvme device & it is degraded in some way. Even
though we might have automation to recover the device we will have lost
more time already than by the purposed delay afaik.

Some of the time a human would have hot-insert'ed a new device, but much of
the time perhaps the device will be coming back from downstream port containment
where there won't be a person to ensure the correctness of link speed/width.
In the DPC case perhaps the endpoint itself will have reset/rebooted/crashed
where you already suffer a few hundred ms of delay from EP's boot time.

I would be interested to know what kind of maximum delay we would all be
willing to tolerate & what applications might care.

On Mon, 29 Jul 2024, Maciej W. Rozycki wrote:

> After these many years it took from the inception of this change until it
> landed upstream I'm not sure anymore what my original idea was behind
> leaving the link clamped

A familiar question I have been known to ask myself. - "Why did I do this again?"
The scary/funny thing is that there is almost always a reason.

I do think there might be some benefit to overall system stability to have some
kind of damping on link retraining rate because I have also seen device stuck
in an infinite cycle of many retrains per second, but each time we come through
the hot-insert code path kernel should let the link partners try to get to their
maximum speeds because it could in theory be a totally new EP. In the handful
I have seen there was some kind of defect with a particular device & replacement
resolved it.

- Matt

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ