Message-ID: <20250717172826.22120-1-mattc@purestorage.com>
Date: Thu, 17 Jul 2025 11:28:26 -0600
From: Matthew W Carlis <mattc@...estorage.com>
To: xueshuai@...ux.alibaba.com
Cc: anil.s.keshavamurthy@...el.com,
bhelgaas@...gle.com,
bp@...en8.de,
davem@...emloft.net,
helgaas@...nel.org,
linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org,
linux-trace-kernel@...r.kernel.org,
lukas@...ner.de,
mark.rutland@....com,
mathieu.desnoyers@...icios.com,
mhiramat@...nel.org,
naveen@...nel.org,
oleg@...hat.com,
peterz@...radead.org,
rostedt@...dmis.org,
tianruidong@...ux.alibaba.com,
tony.luck@...el.com
Subject: Re: [PATCH v8] PCI: hotplug: Add a generic RAS tracepoint for hotplug event
I'm a bit late to the discussion here. It looks like "too late", in fact, but I
wanted to make some comments anyway.
On Tue, 12 May 2025, Shuai Xue wrote:
> Hotplug events are critical indicators for analyzing hardware health,
I'm not actually sure what a "hot plug" event means here; the spec leaves room
for different implementations. Sometimes it means a Presence Detect State
change event, but a system is not required to implement a presence pin (at
least not for the Slot Status presence bit). Some vendors support "in-band"
presence, where the LTSSM essentially asserts presence when the link is active
and deasserts it when the link is down.

Appendix I in the newer PCIe specs says to use the Data Link Layer State
Changed event if presence is not implemented. It looks like this tracepoint
would still work, but it's just something to keep in mind. At the risk of
including too much information, I could see it also being useful to put the
device/vendor IDs of the DSP and the EP into the trace event for link up,
perhaps even the link speed/width capabilities of the DSP/EP. The real
challenge with tracking a fleet is getting all the things you care about into
one place.
On Tue, 20 May 2025, Lukas Wunner wrote:
> Link speed changes and device plug/unplug events are orthogonal
What I wanted to get at here was some of the discussion between Lukas and
Ilpo. I think it makes sense to separate presence events from link events, but
it would also make sense to have a "link tracepoint" which reports the previous
and new speed, with one of those speeds possibly being DOWN/DISABLED etc. Width
could be in there as well. I have seen many times now an engineer become
confused when checking speed, because "Current Link Speed" and "Negotiated Link
Width" are undefined when the "Data Link Layer Link Active" bit is unset.
Ideally a solution here would be immediately clear to the user.
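
To illustrate the idea, here is a minimal userspace sketch, not kernel code and
not the tracepoint proposed in the patch. It decodes a raw Link Status value
and reports speed and width as DOWN/0 whenever Data Link Layer Link Active is
clear, so the undefined fields are never shown. The masks mirror the
PCI_EXP_LNKSTA_* values in the kernel's pci_regs.h; struct link_state,
decode_lnksta(), speed_name() and format_link_change() are hypothetical names
made up for this example, and the speed strings assume the conventional
mapping of the Current Link Speed encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Link Status register fields (same values as PCI_EXP_LNKSTA_* in pci_regs.h). */
#define LNKSTA_CLS       0x000f  /* Current Link Speed */
#define LNKSTA_NLW       0x03f0  /* Negotiated Link Width */
#define LNKSTA_NLW_SHIFT 4
#define LNKSTA_DLLLA     0x2000  /* Data Link Layer Link Active */

/* Hypothetical event payload: speed 0 and width 0 mean "link down". */
struct link_state {
    unsigned int speed;  /* CLS encoding: 1 = 2.5 GT/s, 2 = 5.0 GT/s, ... */
    unsigned int width;  /* number of lanes */
};

/* Only trust CLS/NLW when DLLLA is set; otherwise they are undefined. */
static struct link_state decode_lnksta(uint16_t lnksta)
{
    struct link_state st = { 0, 0 };

    if (!(lnksta & LNKSTA_DLLLA))
        return st;

    st.speed = lnksta & LNKSTA_CLS;
    st.width = (lnksta & LNKSTA_NLW) >> LNKSTA_NLW_SHIFT;
    return st;
}

/* Map the speed encoding to a printable name; 0 is our "down" marker. */
static const char *speed_name(unsigned int speed)
{
    static const char *const names[] = {
        "DOWN", "2.5 GT/s", "5.0 GT/s", "8.0 GT/s",
        "16.0 GT/s", "32.0 GT/s", "64.0 GT/s",
    };
    const unsigned int count = sizeof(names) / sizeof(names[0]);

    return speed < count ? names[speed] : "unknown";
}

/* Render a "previous -> new" link event the way a trace line might show it. */
static int format_link_change(char *buf, size_t len,
                              uint16_t old_lnksta, uint16_t new_lnksta)
{
    struct link_state o = decode_lnksta(old_lnksta);
    struct link_state n = decode_lnksta(new_lnksta);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "link: %s x%u -> %s x%u",
                    speed_name(o.speed), o.width,
                    speed_name(n.speed), n.width);
}
```

With an old value of 0x0043 (DLLLA clear) and a new value of 0x2043, this
formats as "link: DOWN x0 -> 8.0 GT/s x4"; the stale speed/width bits in the
old value are ignored rather than reported.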
When it comes to tracking things across a "fleet", having the slot number of
the device is extremely useful. We have an internal specification for our
slot number assignments that allows us to track meaning across different
generations of hardware or different architectures. The BDF often changes
between generations, but the meaning of the slot does not.
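
As a rough sketch of how a slot-based key could be derived, the Physical Slot
Number lives in bits 31:19 of the Slot Capabilities register (the kernel calls
the mask PCI_EXP_SLTCAP_PSN). The helper names slot_number() and fleet_key(),
the host string, and the key format below are assumptions for illustration
only; they do not describe the internal specification mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Slot Capabilities: Physical Slot Number, bits 31:19. */
#define SLTCAP_PSN       0xfff80000u
#define SLTCAP_PSN_SHIFT 19

/* Extract the physical slot number from a raw Slot Capabilities value. */
static unsigned int slot_number(uint32_t sltcap)
{
    return (sltcap & SLTCAP_PSN) >> SLTCAP_PSN_SHIFT;
}

/*
 * Build a hypothetical fleet-wide key from a host name and the slot number,
 * so records can be joined on the slot rather than on the BDF.
 */
static int fleet_key(char *buf, size_t len, const char *host, uint32_t sltcap)
{
    assert(buf != NULL && len > 0 && host != NULL);
    return snprintf(buf, len, "%s/slot%u", host, slot_number(sltcap));
}
```

Keying on the slot number instead of the BDF keeps records comparable when bus
numbering shifts between hardware generations.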
Cheers!
- Matt