Message-ID: <20241112115852.52980-1-xueshuai@linux.alibaba.com>
Date: Tue, 12 Nov 2024 19:58:50 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: lukas@...ner.de,
	linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	linux-edac@...r.kernel.org
Cc: bhelgaas@...gle.com,
	tony.luck@...el.com,
	bp@...en8.de,
	xueshuai@...ux.alibaba.com
Subject: [PATCH v2 0/2] PCI: hotplug: Add a generic RAS tracepoint for hotplug event

Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers where surprise link downs can
significantly impact system performance and reliability. Failure
characterization analysis illustrates the significance of failures
caused by InfiniBand link errors. Meta observes that 2% of InfiniBand
failures in a machine learning cluster and 6% in a vision application
cluster co-occur with GPU failures, such as falling off the bus, which
may indicate a correlation with PCIe.[1]

PATCH 1/2: Add a generic RAS tracepoint for hotplug events to help with health checks.
PATCH 2/2: Generate tracepoints for PCIe hotplug events

The output looks like this:

$ echo 1 > /sys/kernel/debug/tracing/events/hotplug/pci_hp_event/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
           <...>-206     [001] .....    40.373870: pci_hp_event: 0000:00:02.0 slot:10, trans_state:Link Down

           <...>-206     [001] .....    40.374871: pci_hp_event: 0000:00:02.0 slot:10, trans_state:Card not present
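
For health checking, a userspace consumer could turn these lines into
structured records. The following is only a minimal sketch in Python: the
regular expression is inferred from the two sample lines above rather than
from the tracepoint definition, and the names HotplugEvent and
parse_hotplug_event are hypothetical, so treat the field layout as an
assumption that may not match the final format.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample trace lines only (an assumption,
# not derived from the tracepoint's actual format string).
EVENT_RE = re.compile(
    r"(?P<timestamp>\d+\.\d+):\s+pci_hp_event:\s+"
    r"(?P<device>\S+)\s+slot:(?P<slot>[^,\s]+),\s*"
    r"trans_state:(?P<state>.+?)\s*$"
)


class HotplugEvent(NamedTuple):
    """Hypothetical record type for one pci_hp_event trace line."""
    timestamp: float  # seconds, as printed by the tracer
    device: str       # PCI address, e.g. 0000:00:02.0
    slot: str         # slot name as printed after "slot:"
    state: str        # text after "trans_state:"


def parse_hotplug_event(line: str) -> Optional[HotplugEvent]:
    """Return a HotplugEvent, or None if the line is not a pci_hp_event."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return HotplugEvent(float(m["timestamp"]), m["device"], m["slot"], m["state"])


sample = ("           <...>-206     [001] .....    40.373870: "
          "pci_hp_event: 0000:00:02.0 slot:10, trans_state:Link Down")
event = parse_hotplug_event(sample)
print(event)
```

A monitoring daemon could feed each line read from trace_pipe through such
a function and count, for example, "Link Down" transitions per device.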

[1] https://arxiv.org/abs/2410.21680


Shuai Xue (2):
  PCI: hotplug: Add a generic RAS tracepoint for hotplug event
  pci: pciehp: Generate tracepoints for hotplug event

 drivers/pci/hotplug/pciehp_ctrl.c | 33 ++++++++++++---
 drivers/pci/hotplug/trace.h       | 68 +++++++++++++++++++++++++++++++
 include/uapi/linux/pci.h          |  7 ++++
 3 files changed, 102 insertions(+), 6 deletions(-)
 create mode 100644 drivers/pci/hotplug/trace.h

-- 
2.39.3
