[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a036291f-2483-4bf0-9ac1-db310f8115aa@linux.alibaba.com>
Date: Thu, 22 May 2025 17:41:31 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
Cc: rostedt@...dmis.org, Lukas Wunner <lukas@...ner.de>,
linux-pci@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
linux-edac@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
helgaas@...nel.org, bhelgaas@...gle.com, tony.luck@...el.com, bp@...en8.de,
mhiramat@...nel.org, mathieu.desnoyers@...icios.com, oleg@...hat.com,
naveen@...nel.org, davem@...emloft.net, anil.s.keshavamurthy@...el.com,
mark.rutland@....com, peterz@...radead.org, tianruidong@...ux.alibaba.com
Subject: Re: [PATCH v8] PCI: hotplug: Add a generic RAS tracepoint for hotplug
event
在 2025/5/20 18:07, Ilpo Järvinen 写道:
> On Tue, 20 May 2025, Shuai Xue wrote:
>
>> Hi, Ilpo,
>>
>> 在 2025/5/20 01:10, Ilpo Järvinen 写道:
>>> On Mon, 12 May 2025, Shuai Xue wrote:
>>>
>>>> Hotplug events are critical indicators for analyzing hardware health,
>>>> particularly in AI supercomputers where surprise link downs can
>>>> significantly impact system performance and reliability.
>>>>
>>>> To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
>>>> tracepoint for hotplug event to help healthy check, and generate
>>>> tracepoints for pcie hotplug event. Add enum pci_hotplug_event in
>>>> include/uapi/linux/pci.h so applications like rasdaemon can register
>>>> tracepoint event handlers for it.
>>>>
>>>> The output like below:
>>>>
>>>> $ echo 1 > /sys/kernel/debug/tracing/events/pci/pci_hp_event/enable
>>>> $ cat /sys/kernel/debug/tracing/trace_pipe
>>>> <...>-206 [001] ..... 40.373870: pci_hp_event: 0000:00:02.0
>>>> slot:10, event:Link Down
>>>>
>>>> <...>-206 [001] ..... 40.374871: pci_hp_event: 0000:00:02.0
>>>> slot:10, event:Card not present
>>>>
>>>> Suggested-by: Lukas Wunner <lukas@...ner.de>
>>>> Suggested-by: Steven Rostedt <rostedt@...dmis.org>
>>>> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
>>>> Reviewed-by: Lukas Wunner <lukas@...ner.de>
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>>>> ---
>>>> changes since v7:
>>>> - replace the TRACE_INCLUDE_PATH to avoid macro conflict per Steven
>>>> - pick up Reviewed-by from Lukas Wunner
>>>> ---
>>>> drivers/pci/hotplug/Makefile | 3 ++
>>>> drivers/pci/hotplug/pciehp_ctrl.c | 33 ++++++++++++---
>>>> drivers/pci/hotplug/trace.h | 68 +++++++++++++++++++++++++++++++
>>>> include/uapi/linux/pci.h | 7 ++++
>>>> 4 files changed, 105 insertions(+), 6 deletions(-)
>>>> create mode 100644 drivers/pci/hotplug/trace.h
>>>>
>>>> diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
>>>> index 40aaf31fe338..a1a9d1e98962 100644
>>>> --- a/drivers/pci/hotplug/Makefile
>>>> +++ b/drivers/pci/hotplug/Makefile
>>>> @@ -3,6 +3,9 @@
>>>> # Makefile for the Linux kernel pci hotplug controller drivers.
>>>> #
>>>> +# define_trace.h needs to know how to find our header
>>>> +CFLAGS_pciehp_ctrl.o := -I$(src)
>>>> +
>>>> obj-$(CONFIG_HOTPLUG_PCI) += pci_hotplug.o
>>>> obj-$(CONFIG_HOTPLUG_PCI_COMPAQ) += cpqphp.o
>>>> obj-$(CONFIG_HOTPLUG_PCI_IBM) += ibmphp.o
>>>> diff --git a/drivers/pci/hotplug/pciehp_ctrl.c
>>>> b/drivers/pci/hotplug/pciehp_ctrl.c
>>>> index d603a7aa7483..f9beb4d3a9b8 100644
>>>> --- a/drivers/pci/hotplug/pciehp_ctrl.c
>>>> +++ b/drivers/pci/hotplug/pciehp_ctrl.c
>>>> @@ -23,6 +23,9 @@
>>>> #include "../pci.h"
>>>> #include "pciehp.h"
>>>> +#define CREATE_TRACE_POINTS
>>>> +#include "trace.h"
>>>> +
>>>> /* The following routines constitute the bulk of the
>>>> hotplug controller logic
>>>> */
>>>> @@ -244,12 +247,20 @@ void pciehp_handle_presence_or_link_change(struct
>>>> controller *ctrl, u32 events)
>>>> case ON_STATE:
>>>> ctrl->state = POWEROFF_STATE;
>>>> mutex_unlock(&ctrl->state_lock);
>>>> - if (events & PCI_EXP_SLTSTA_DLLSC)
>>>> + if (events & PCI_EXP_SLTSTA_DLLSC) {
>>>> ctrl_info(ctrl, "Slot(%s): Link Down\n",
>>>> slot_name(ctrl));
>>>> - if (events & PCI_EXP_SLTSTA_PDC)
>>>> + trace_pci_hp_event(pci_name(ctrl->pcie->port),
>>>> + slot_name(ctrl),
>>>> + PCI_HOTPLUG_LINK_DOWN);
>>>> + }
>>>> + if (events & PCI_EXP_SLTSTA_PDC) {
>>>> ctrl_info(ctrl, "Slot(%s): Card not present\n",
>>>> slot_name(ctrl));
>>>> + trace_pci_hp_event(pci_name(ctrl->pcie->port),
>>>> + slot_name(ctrl),
>>>> + PCI_HOTPLUG_CARD_NOT_PRESENT);
>>>> + }
>>>> pciehp_disable_slot(ctrl, SURPRISE_REMOVAL);
>>>> break;
>>>> default:
>>>> @@ -269,6 +280,9 @@ void pciehp_handle_presence_or_link_change(struct
>>>> controller *ctrl, u32 events)
>>>> INDICATOR_NOOP);
>>>> ctrl_info(ctrl, "Slot(%s): Card not present\n",
>>>> slot_name(ctrl));
>>>> + trace_pci_hp_event(pci_name(ctrl->pcie->port),
>>>> + slot_name(ctrl),
>>>> + PCI_HOTPLUG_CARD_NOT_PRESENT);
>>>> }
>>>> mutex_unlock(&ctrl->state_lock);
>>>> return;
>>>> @@ -281,12 +295,19 @@ void pciehp_handle_presence_or_link_change(struct
>>>> controller *ctrl, u32 events)
>>>> case OFF_STATE:
>>>> ctrl->state = POWERON_STATE;
>>>> mutex_unlock(&ctrl->state_lock);
>>>> - if (present)
>>>> + if (present) {
>>>> ctrl_info(ctrl, "Slot(%s): Card present\n",
>>>> slot_name(ctrl));
>>>> - if (link_active)
>>>> - ctrl_info(ctrl, "Slot(%s): Link Up\n",
>>>> - slot_name(ctrl));
>>>> + trace_pci_hp_event(pci_name(ctrl->pcie->port),
>>>> + slot_name(ctrl),
>>>> + PCI_HOTPLUG_CARD_PRESENT);
>>>> + }
>>>> + if (link_active) {
>>>> + ctrl_info(ctrl, "Slot(%s): Link Up\n",
>>>> slot_name(ctrl));
>>>> + trace_pci_hp_event(pci_name(ctrl->pcie->port),
>>>> + slot_name(ctrl),
>>>> + PCI_HOTPLUG_LINK_UP);
>>>> + }
>>>> ctrl->request_result = pciehp_enable_slot(ctrl);
>>>> break;
>>>> default:
>>>> diff --git a/drivers/pci/hotplug/trace.h b/drivers/pci/hotplug/trace.h
>>>> new file mode 100644
>>>> index 000000000000..21329c198019
>>>> --- /dev/null
>>>> +++ b/drivers/pci/hotplug/trace.h
>>>> @@ -0,0 +1,68 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#if !defined(_TRACE_HW_EVENT_PCI_HP_H) ||
>>>> defined(TRACE_HEADER_MULTI_READ)
>>>> +#define _TRACE_HW_EVENT_PCI_HP_H
>>>> +
>>>> +#include <linux/tracepoint.h>
>>>> +
>>>> +#undef TRACE_SYSTEM
>>>> +#define TRACE_SYSTEM pci
>>>> +
>>>> +#define PCI_HOTPLUG_EVENT \
>>>> + EM(PCI_HOTPLUG_LINK_UP, "Link Up") \
>>>> + EM(PCI_HOTPLUG_LINK_DOWN, "Link Down") \
>>>> + EM(PCI_HOTPLUG_CARD_PRESENT, "Card present") \
>>>> + EMe(PCI_HOTPLUG_CARD_NOT_PRESENT, "Card not present")
>>>
>>> Hi,
>>>
>>> While I was thinking of adding tracing into PCIe BW controller (bwctrl),
>>> I ended up thinking that perhaps it would make more sense to have PCIe
>>> Link related tracepoints which would cover both hotplug and bwctrl so that
>>> also Link Speed changes would be reported through the same trace event.
>>>
>>> Downgraded speed may indicate there's something wrong with the card and
>>> the Link Speed does have performance impact too for those who are pushing
>>> BW boundaries such as AI systems.
>>
>> Agreed!
>>
>>>
>>> So my suggestion is:
>>>
>>> - Add "Link Speed changed" to the event types.
>>> - Add Link Speed and Width into the event format (and probably also Flit
>>> mode as PCIe gen6 is coming).
>>
>>
>> How about bellow event format:
>>
>> + TP_STRUCT__entry(
>> + __string( port_name, port_name )
>> + __field( unsigned char, cur_bus_speed )
>> + __field( unsigned char, max_bus_speed )
>
> Add also the Link Width.
Got it.
>
>> + __field( unsigned char, flit_mode )
>> + ),
>>
>> And add the event to pcie_update_link_speed():
>>
>> @@ -796,6 +799,10 @@ void pcie_update_link_speed(struct pci_bus *bus)
>> pcie_capability_read_word(bridge, PCI_EXP_LNKSTA, &linksta);
>> pcie_capability_read_word(bridge, PCI_EXP_LNKSTA2, &linksta2);
>> __pcie_update_link_speed(bus, linksta, linksta2);
>> +
>> + trace_pci_link_event(pci_name(bridge), bus->cur_bus_speed,
>> + bus->max_bus_speed,
>
> I don't think outputting the internal values of enum pci_bus_speed is a
> good idea. Maybe these could be printed as a string (with
> pci_speed_string()) or encoded with trace interface specific values.
I see, a human readable string is better.
>
> Perhaps it would make sense to check if the speed really changed before
> sending that event, but there are good sides in both approaches as I
> know some platforms assert LBMS more than once during a single Link Speed
> change.
>
>> + PCI_HOTPLUG_LINK_SPEED_CHANGED);
>>
>> But I don't find link speed changed in hotplug driver
>
> pciehp_check_link_status() calls __pcie_update_link_speed().
Thanks.
>
>> , and the format of "Link Speed changed" is a bit different from
>> "pci_hp_event".
>
> The difference is only because when the Link is down, there's no Link
> Speed (obviously). Whenever a new device is hotplugged and it comes up,
> there's also Link Speed for it which can be included into the trace event.
>
> I think the trace event should have some special value for the fields that
> are N/A due to Link being off. While it would be possible to create
> separate events for speed changes and hotplug, I don't see any pros in
> that approach over just having the N/A fields marked as such when the Link
> is Down.
>
> Perhaps it would even make sense to add PCIE_SPEED_LINK_DOWN into
> bus->cur_bus_speed when hotplug finds the card is gone (I'm not entirely
> sure how bwctrl or pcie_cooling driver would cope with that though, they
> might need minor tweaking to support it, and there are a few other drivers
> that use that field).
>
>> Do we really need a PCI_HOTPLUG_EVENT? May PCI_LINK_EVENT is more
>> appropriate?
>
> Ah, right, I forgot to mention it would make sense to rename it to
> PCI_LINK_EVENT.
>
Got it.
Thanks.
Shuai
Powered by blists - more mailing lists