Message-ID: <20250107231943.GA189869@bhelgaas>
Date: Tue, 7 Jan 2025 17:19:43 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: rostedt@...dmis.org, lukas@...ner.de, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
	linux-trace-kernel@...r.kernel.org, bhelgaas@...gle.com,
	tony.luck@...el.com, bp@...en8.de, mhiramat@...nel.org,
	mathieu.desnoyers@...icios.com, oleg@...hat.com, naveen@...nel.org,
	davem@...emloft.net, anil.s.keshavamurthy@...el.com,
	mark.rutland@....com, peterz@...radead.org
Subject: Re: [PATCH v4] PCI: hotplug: Add a generic RAS tracepoint for
 hotplug event

On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
> Hotplug events are critical indicators for analyzing hardware health,
> particularly in AI supercomputers where surprise link downs can
> significantly impact system performance and reliability. The failure
> characterization analysis illustrates the significance of failures
> caused by the Infiniband link errors. Meta observes that 2% in a machine
> learning cluster and 6% in a vision application cluster of Infiniband
> failures co-occur with GPU failures, such as falling off the bus, which
> may indicate a correlation with PCIe.[1]
> 
> To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
> tracepoint for hotplug event to help healthy check, and generate
> tracepoints for pcie hotplug event. To monitor these tracepoints in
> userspace, e.g. with rasdaemon, put `enum pci_hotplug_event` in uapi
> header.
> 
> The output like below:
> $ echo 1 > /sys/kernel/debug/tracing/events/pci/pci_hp_event/enable
> $ cat /sys/kernel/debug/tracing/trace_pipe
>            <...>-206     [001] .....    40.373870: pci_hp_event: 0000:00:02.0 slot:10, event:Link Down
> 
>            <...>-206     [001] .....    40.374871: pci_hp_event: 0000:00:02.0 slot:10, event:Card not present
> 
> [1]https://arxiv.org/abs/2410.21680

Doesn't apply on pci/main (v6.13-rc1); can you rebase it?

s/pcie/PCIe/ in English text.

Probably more detail than necessary about AI supercomputers,
Infiniband, vision applications, etc.  This is a very generic issue.

"Falling off the bus" doesn't really mean anything to me.  I suppose
it's another way to describe a "link down" event that leads to
Unsupported Request (UR) errors when trying to access the device?

I'm guessing that monitoring these via rasdaemon requires more than
just adding "enum pci_hotplug_event"?  Or does rasdaemon read
include/uapi/linux/pci.h and automagically incorporate new events?
Maybe there's at least a rebuild involved?
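
To illustrate the point that a userspace consumer needs event-specific
handling, here is a minimal Python sketch that parses the text form of the
event as it appears in the trace_pipe sample in the commit log.  The field
layout is assumed from those two sample lines only, the function and
pattern names are hypothetical, and this is not how rasdaemon itself
consumes events.

```python
import re

# Pattern assumed from the sample output quoted in the commit log:
#   pci_hp_event: 0000:00:02.0 slot:10, event:Link Down
# The group names (port, slot, event) are illustrative, not kernel field names.
PCI_HP_EVENT = re.compile(
    r"pci_hp_event:\s+(?P<port>\S+)\s+slot:(?P<slot>[^,\s]+),"
    r"\s+event:(?P<event>.+?)\s*$"
)


def parse_pci_hp_event(line):
    """Return a dict of port, slot, and event, or None for unrelated lines."""
    match = PCI_HP_EVENT.search(line)
    if match is None:
        return None
    return match.groupdict()


sample = (
    "           <...>-206     [001] .....    40.373870: "
    "pci_hp_event: 0000:00:02.0 slot:10, event:Link Down"
)
print(parse_pci_hp_event(sample))
```

Even this toy parser has to know the event name and its fields, so a new
tracepoint presumably needs matching support in whatever tool consumes it.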

Anything in the arXiv link that is specifically relevant to this patch
needs to be in the commit log itself.  But I think there's already
enough information here to motivate this change, and whatever is in
the arXiv link may be of general interest, but is probably not
required to justify, understand, or debug this useful functionality.

Bjorn
