Message-ID: <20250717172826.22120-1-mattc@purestorage.com>
Date: Thu, 17 Jul 2025 11:28:26 -0600
From: Matthew W Carlis <mattc@...estorage.com>
To: xueshuai@...ux.alibaba.com
Cc: anil.s.keshavamurthy@...el.com,
	bhelgaas@...gle.com,
	bp@...en8.de,
	davem@...emloft.net,
	helgaas@...nel.org,
	linux-edac@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org,
	lukas@...ner.de,
	mark.rutland@....com,
	mathieu.desnoyers@...icios.com,
	mhiramat@...nel.org,
	naveen@...nel.org,
	oleg@...hat.com,
	peterz@...radead.org,
	rostedt@...dmis.org,
	tianruidong@...ux.alibaba.com,
	tony.luck@...el.com
Subject: Re: [PATCH v8] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

A bit late to the discussion here. It looks like "too late", in fact, but I
wanted to make a few comments anyway.

On Tue, 12 May 2025, Shuai Xue wrote:
> Hotplug events are critical indicators for analyzing hardware health,

I'm not actually sure what a "hot plug" event means here, since the spec
leaves room for different implementations. Sometimes it means a presence
detect state change event, but a system is not required to implement a
presence pin (at least not for the Slot Status presence). Some vendors
support "inband" presence, where the LTSSM essentially asserts presence
while the link is active and deasserts it when the link is down.

Appendix I in the newer PCIe specs says to use the data link layer state
change event if presence is not implemented. It looks like this tracepoint
would still work, but it's something to keep in mind. At the risk of including
too much information, I could see it also being useful to put the device/vendor
IDs of the DSP and the EP into the trace event for link up, perhaps even the
link speed/width capabilities of the DSP/EP. The real challenge with tracking a
fleet is getting all the things you care about into one place.
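
To illustrate the presence vs. link-state point, here is a minimal userspace
sketch (not the patch's code) of picking the "hot plug" trigger from a Slot
Status value. The bit values match PCI_EXP_SLTSTA_PDC and PCI_EXP_SLTSTA_DLLSC
in include/uapi/linux/pci_regs.h; the enum, the function name and the
has_presence flag are hypothetical, and how a platform knows whether presence
is implemented is assumed to be decided elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Slot Status bits, same values as in include/uapi/linux/pci_regs.h. */
#define SLTSTA_PDC    0x0008  /* Presence Detect Changed */
#define SLTSTA_DLLSC  0x0100  /* Data Link Layer State Changed */

/* Hypothetical classification, for illustration only. */
enum hp_trigger {
	HP_NONE,
	HP_PRESENCE_CHANGED,
	HP_LINK_STATE_CHANGED,
};

/*
 * has_presence is an assumption: true if the platform implements presence
 * detect for this slot. Without it, only the data link layer state change
 * event is treated as meaningful.
 */
static enum hp_trigger hp_trigger_from_sltsta(uint16_t sltsta, bool has_presence)
{
	if (has_presence && (sltsta & SLTSTA_PDC))
		return HP_PRESENCE_CHANGED;
	if (sltsta & SLTSTA_DLLSC)
		return HP_LINK_STATE_CHANGED;
	return HP_NONE;
}
```

In this sketch presence takes priority when both bits are set; a real
implementation may well want to report both.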

On Tue, 20 May 2025, Lukas Wunner wrote:
> Link speed changes and device plug/unplug events are orthogonal

What I wanted to get at here was some of the discussion from Lukas & Ilpo. I
think it makes sense to separate presence events from link events, but I also
think it would make sense to have a "link tracepoint" which reports the
previous and new speed, with one of those speeds being DOWN/DISABLED etc.
Width could be in there as well. I have seen many times now an engineer become
confused when checking speed because "Current Link Speed" & "Negotiated Link
Width" are "undefined" when the "Data Link Layer Active" bit is unset. Ideally
a solution here would make the link state immediately clear to the user.
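
As a rough sketch of what "immediately clear" could look like, the decode
below reports the link as down, with speed and width zeroed, whenever the
Data Link Layer Link Active bit is unset, instead of exposing the undefined
fields. The masks match PCI_EXP_LNKSTA_CLS, PCI_EXP_LNKSTA_NLW and
PCI_EXP_LNKSTA_DLLACTIVE in include/uapi/linux/pci_regs.h; struct link_state
and decode_lnksta() are hypothetical names, and this is plain userspace C,
not a proposed tracepoint layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Status fields, same values as in include/uapi/linux/pci_regs.h. */
#define LNKSTA_CLS        0x000f  /* Current Link Speed */
#define LNKSTA_NLW        0x03f0  /* Negotiated Link Width */
#define LNKSTA_NLW_SHIFT  4
#define LNKSTA_DLLACTIVE  0x2000  /* Data Link Layer Link Active */

/* Hypothetical record; a previous/new pair of these could feed a tracepoint. */
struct link_state {
	bool up;
	unsigned int speed;  /* encoded Current Link Speed, 0 when down */
	unsigned int width;  /* negotiated lane count, 0 when down */
};

static struct link_state decode_lnksta(uint16_t lnksta)
{
	struct link_state st = { false, 0, 0 };

	/* Speed and width are only reported while the link is active. */
	if (!(lnksta & LNKSTA_DLLACTIVE))
		return st;

	st.up = true;
	st.speed = lnksta & LNKSTA_CLS;
	st.width = (lnksta & LNKSTA_NLW) >> LNKSTA_NLW_SHIFT;
	return st;
}
```

A consumer comparing the previous and new struct link_state would then see a
transition to or from "down" explicitly rather than a stale speed value.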

When it comes to tracking things across a "fleet", having the slot number of
the device is extremely useful. We have an internal specification for our
slot number assignments that allows us to track meaning across different
generations of hardware or different architectures. The BDF often changes
between generations, but the meaning of the slot does not.

Cheers!
- Matt
