Message-Id: <20260129-vmcoreinfo_sysfs-v1-1-164c1fe1fe07@debian.org>
Date: Thu, 29 Jan 2026 05:34:10 -0800
From: Breno Leitao <leitao@...ian.org>
To: akpm@...ux-foundation.org, bhe@...hat.com
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com, 
 Breno Leitao <leitao@...ian.org>, kexec@...ts.infradead.org, 
 dyoung@...hat.com, tony.luck@...el.com, xueshuai@...ux.alibaba.com, 
 vgoyal@...hat.com, zhiquan1.li@...el.com, olja@...a.com
Subject: [PATCH] vmcore_info: expose hardware error recovery statistics via
 sysfs

Add a sysfs file, /sys/kernel/vmcore_stats, that exposes the hardware
error recovery statistics the kernel already tracks. Userspace
monitoring tools can then follow recovered hardware errors as a time
series, even when the host never crashes.

The file is meant to be generic. For now it carries a single section,
for hardware error recovery, which shows a count and a last-error
timestamp per subsystem:

  - cpu: CPU-related errors (MCE, ARM processor errors)
  - memory: Memory-related errors
  - pci: PCI/PCIe AER non-fatal errors
  - cxl: CXL errors
  - other: Other hardware errors

Example output:
  Recovered hardware errors:
    cpu: 0 (0)
    memory: 2 (1738148257)
    pci: 1 (1738147000)
    cxl: 0 (0)
    other: 0 (0)

The value in parentheses is the timestamp (seconds since epoch) of the
last error of that type, or 0 if no errors have occurred.
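
For illustration only, a monitoring tool could parse this format along
the following lines. This is a minimal userspace sketch, not part of the
patch: struct hwerr_stat and parse_vmcore_stats() are hypothetical names,
and the code assumes the "name: count (timestamp)" layout shown above
stays stable.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace-side record; not defined by the patch. */
struct hwerr_stat {
	char name[16];
	int count;
	long long last_seen;	/* seconds since epoch, 0 if never */
};

/*
 * Parse the text read from /sys/kernel/vmcore_stats.
 * Lines that do not match "name: count (timestamp)", such as the
 * section header, are skipped.
 * Returns the number of entries stored in out[].
 */
static int parse_vmcore_stats(const char *text, struct hwerr_stat *out, int max)
{
	int n = 0;

	while (*text && n < max) {
		struct hwerr_stat s;
		size_t len = strcspn(text, "\n");
		char line[128];

		/* Copy one line so sscanf() cannot run into the next one. */
		if (len < sizeof(line)) {
			memcpy(line, text, len);
			line[len] = '\0';
			if (sscanf(line, " %15[a-z]: %d (%lld)",
				   s.name, &s.count, &s.last_seen) == 3)
				out[n++] = s;
		}
		text += len;
		if (*text == '\n')
			text++;
	}
	return n;
}
```

A caller would read the whole file into a NUL-terminated buffer (for
example with fopen() and fread()) and pass that buffer to
parse_vmcore_stats().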

These statistics provide visibility into the health of the system's
hardware and can be used by system administrators to proactively detect
failing components before they cause system crashes.
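
As a rough sketch of how a collector might turn two successive readings
into a time series, the helpers below compute the number of new events
and the age of the last error. The names struct hwerr_sample,
hwerr_new_events() and hwerr_seconds_since_last() are hypothetical and
not part of this patch. The reset handling assumes the counters restart
at zero on reboot, since they are counted since boot.

```c
/* Hypothetical snapshot of one metric line, e.g. "memory: 2 (1738148257)". */
struct hwerr_sample {
	int count;		/* errors recovered since boot */
	long long last_seen;	/* seconds since epoch, 0 if none */
};

/*
 * Number of new recovered errors between two samples.
 * A smaller current count is treated as a counter reset (reboot),
 * in which case the current count is reported as-is.
 */
static int hwerr_new_events(const struct hwerr_sample *prev,
			    const struct hwerr_sample *cur)
{
	if (cur->count < prev->count)
		return cur->count;
	return cur->count - prev->count;
}

/* Seconds elapsed since the last error, or -1 if none was recorded. */
static long long hwerr_seconds_since_last(const struct hwerr_sample *cur,
					  long long now)
{
	if (cur->last_seen == 0)
		return -1;
	return now - cur->last_seen;
}
```

A monitoring agent could alert when hwerr_new_events() stays above zero
across several polling intervals for the same metric.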

Signed-off-by: Breno Leitao <leitao@...ian.org>
---
 .../ABI/testing/sysfs-kernel-vmcore_stats          | 23 ++++++++++++++++
 kernel/vmcore_info.c                               | 31 ++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-vmcore_stats b/Documentation/ABI/testing/sysfs-kernel-vmcore_stats
new file mode 100644
index 0000000000000..b42f18d24c00b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-vmcore_stats
@@ -0,0 +1,23 @@
+What:		/sys/kernel/vmcore_stats
+Date:		January 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@...ian.org>
+Description:
+		Shows statistics related to vmcore functionality. Currently
+		includes hardware error recovery statistics.
+
+		Format:
+		  Recovered hardware errors:
+		    metric: count (timestamp)
+
+		Statistics about recoverable hardware errors that the kernel
+		has handled since boot. Each metric shows the count and
+		timestamp (seconds since epoch) of the last error in
+		parentheses (0 if no errors have occurred).
+
+		Metrics:
+		    - cpu: CPU-related errors (MCE, ARM processor errors)
+		    - memory: Memory-related errors
+		    - pci: PCI/PCIe AER non-fatal errors
+		    - cxl: CXL (Compute Express Link) errors
+		    - other: Other hardware errors
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index fe9bf8db1922e..5974b4be08cbc 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -6,6 +6,8 @@
 
 #include <linux/buildid.h>
 #include <linux/init.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
 #include <linux/utsname.h>
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
@@ -135,6 +137,31 @@ void hwerr_log_error_type(enum hwerr_error_type src)
 }
 EXPORT_SYMBOL_GPL(hwerr_log_error_type);
 
+/* sysfs interface for hardware error recovery statistics */
+static ssize_t vmcore_stats_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf,
+			  "Recovered hardware errors:\n"
+			  "  cpu: %d (%lld)\n"
+			  "  memory: %d (%lld)\n"
+			  "  pci: %d (%lld)\n"
+			  "  cxl: %d (%lld)\n"
+			  "  other: %d (%lld)\n",
+			  atomic_read(&hwerr_data[HWERR_RECOV_CPU].count),
+			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_CPU].timestamp),
+			  atomic_read(&hwerr_data[HWERR_RECOV_MEMORY].count),
+			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_MEMORY].timestamp),
+			  atomic_read(&hwerr_data[HWERR_RECOV_PCI].count),
+			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_PCI].timestamp),
+			  atomic_read(&hwerr_data[HWERR_RECOV_CXL].count),
+			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_CXL].timestamp),
+			  atomic_read(&hwerr_data[HWERR_RECOV_OTHERS].count),
+			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_OTHERS].timestamp));
+}
+
+static struct kobj_attribute vmcore_stats_attr = __ATTR_RO(vmcore_stats);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
@@ -244,6 +271,10 @@ static int __init crash_save_vmcoreinfo_init(void)
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
 
+	/* Create /sys/kernel/vmcore_stats */
+	if (sysfs_create_file(kernel_kobj, &vmcore_stats_attr.attr))
+		pr_warn("Failed to create vmcore_stats sysfs file\n");
+
 	return 0;
 }
 

---
base-commit: 8dfce8991b95d8625d0a1d2896e42f93b9d7f68d
change-id: 20260129-vmcoreinfo_sysfs-ff4687979cd5

Best regards,
--  
Breno Leitao <leitao@...ian.org>

