Message-ID: <16bf61ae-fc95-4c44-a2b4-81f467537349@linux.alibaba.com>
Date: Thu, 14 Nov 2024 09:39:52 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Lukas Wunner <lukas@...ner.de>
Cc: linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-edac@...r.kernel.org, bhelgaas@...gle.com, tony.luck@...el.com,
bp@...en8.de
Subject: Re: [PATCH v2 1/2] PCI: hotplug: Add a generic RAS tracepoint for
hotplug event
On 2024/11/14 00:27, Lukas Wunner wrote:
> On Tue, Nov 12, 2024 at 07:58:51PM +0800, Shuai Xue wrote:
>> Hotplug events are critical indicators for analyzing hardware health,
>> particularly in AI supercomputers where surprise link downs can
>> significantly impact system performance and reliability. The failure
>> characterization analysis illustrates the significance of failures
>> caused by the Infiniband link errors. Meta observes that 2% in a machine
>> learning cluster and 6% in a vision application cluster of Infiniband
>> failures co-occur with GPU failures, such as falling off the bus, which
>> may indicate a correlation with PCIe.[1]
>>
>> Add a generic RAS tracepoint for hotplug event to help healthy check.
>
> It would be good if you could mention in the commit message that
> you intend to use rasdaemon for monitoring these tracepoints
> and that you're hence adding the enum to a uapi header.
> So that if someone wonders later on why you chose a uapi header,
> they will find breadcrumbs in the commit message.
Got it, will add it.
>
> I think both patches can be squashed into one.
Will do it.
>
> I'm not an expert on tracepoints, so when respinning, perhaps you
> could cc linux-trace-kernel@...r.kernel.org and maybe also tracing
> maintainers so that subject matter experts can look the patch over
> and ack it.
Sorry, I forgot. Will add the mailing list.
>
>
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM hotplug
>
> Maybe "pci_hotplug" to differentiate this from, say, cpu hotplug?
Yes, will fix it.
>
>
>> +#define PCI_HP_TRANS_STATE \
>> + EM(PCI_HOTPLUG_LINK_UP, "Link Up") \
>> + EM(PCI_HOTPLUG_LINK_DOWN, "Link Down") \
>> + EM(PCI_HOTPLUG_CARD_PRESENT, "Card present") \
>> + EMe(PCI_HOTPLUG_CARD_NO_PRESENT, "Card not present")
>
> PCI_HOTPLUG_CARD_NOT_PRESENT would be neater I think.
Ok, will rename it.
>
>> +PCI_HP_TRANS_STATE
>
> Not sure what "trans state" stands for, maybe "state transition"?
> What if we add tracepoints going forward which aren't for state
> transitions but other types of events, such as "Power Failure"?
> Perhaps something more generic such as "PCI_HOTPLUG_EVENT" would
> be more apt?
>
I see, will rename it to PCI_HOTPLUG_EVENT.
>
>> +enum pci_hotplug_trans_type {
>> + PCI_HOTPLUG_LINK_UP,
>> + PCI_HOTPLUG_LINK_DOWN,
>> + PCI_HOTPLUG_CARD_PRESENT,
>> + PCI_HOTPLUG_CARD_NO_PRESENT,
>> +};
>
> I note that this is called "pci_hotplug_trans_type", perhaps for
> consistency use "pci_hotplug_trans_state" as everywhere else
> (or whatever you choose to substitute the name for, see above).
>
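For illustration only, here is a rough user-space sketch of how the definitions might look once the renames suggested above are applied. It assumes the macro becomes PCI_HOTPLUG_EVENT and the enum becomes pci_hotplug_event; neither name is taken from a posted revision. The EM()/EMe() expansions below are plain C stand-ins, not the kernel's TRACE_DEFINE_ENUM() machinery, and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed names after the review comments; not the final patch. */
#define PCI_HOTPLUG_EVENT \
	EM(PCI_HOTPLUG_LINK_UP, "Link Up") \
	EM(PCI_HOTPLUG_LINK_DOWN, "Link Down") \
	EM(PCI_HOTPLUG_CARD_PRESENT, "Card present") \
	EMe(PCI_HOTPLUG_CARD_NOT_PRESENT, "Card not present")

/* First expansion: generate the enum members. */
#define EM(a, b)  a,
#define EMe(a, b) a
enum pci_hotplug_event { PCI_HOTPLUG_EVENT };
#undef EM
#undef EMe

/* Second expansion: generate the value-to-string table. */
#define EM(a, b)  [a] = b,
#define EMe(a, b) [a] = b
static const char *const pci_hotplug_event_str[] = { PCI_HOTPLUG_EVENT };
#undef EM
#undef EMe

#define PCI_HOTPLUG_EVENT_COUNT \
	(sizeof(pci_hotplug_event_str) / sizeof(pci_hotplug_event_str[0]))

/* Hypothetical helper: map an event value to its display string. */
static const char *pci_hotplug_event_name(int ev)
{
	if (ev < 0 || (size_t)ev >= PCI_HOTPLUG_EVENT_COUNT)
		return "Unknown";
	return pci_hotplug_event_str[ev];
}

/* Hypothetical helper: map a display string back to its value, or -1. */
static int pci_hotplug_event_from_name(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < PCI_HOTPLUG_EVENT_COUNT; i++)
		if (strcmp(pci_hotplug_event_str[i], name) == 0)
			return (int)i;
	return -1;
}
```

Keeping a single list that is expanded twice means the enum and its strings cannot drift apart, so a later addition such as a power failure event would only need one new EM() line.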
> Other than these cosmetic things, the patches LGTM.
> Again, my knowledge on tracepoints is superficial, but broadly
> it looks reasonable.
>
> Thanks,
>
> Lukas
Thank you for the valuable comments.
Best Regards,
Shuai