Message-ID: <a4a34583-8f26-bb08-001f-a53715070c00@huawei.com>
Date: Fri, 21 Nov 2025 10:47:32 +0800
From: Hanjun Guo <guohanjun@...wei.com>
To: Breno Leitao <leitao@...ian.org>, "Rafael J. Wysocki" <rafael@...nel.org>,
	Len Brown <lenb@...nel.org>, James Morse <james.morse@....com>, Tony Luck
	<tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>, Robert Moore
	<robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar
	<mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
	<x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>, Mauro Carvalho Chehab
	<mchehab@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>, Oliver
 O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>
CC: <linux-acpi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<acpica-devel@...ts.linux.dev>, <osandov@...ndov.com>,
	<xueshuai@...ux.alibaba.com>, <konrad.wilk@...cle.com>,
	<linux-edac@...r.kernel.org>, <linuxppc-dev@...ts.ozlabs.org>,
	<linux-pci@...r.kernel.org>, <kernel-team@...a.com>
Subject: Re: [PATCH RESEND v5] vmcoreinfo: Track and log recoverable hardware
 errors

On 2025/10/10 18:36, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that are visible to the OS but do not cause a panic)
> and record them for vmcore consumption. This aids post-mortem crash
> analysis tools by preserving a count and timestamp for the last
> occurrence of such errors. On the other hand, correctable errors, which
> the OS typically remains unaware of because the underlying hardware
> handles them transparently, are less relevant for crash dump analysis
> and are therefore NOT tracked in this infrastructure.
> 
> Add centralized logging for recoverable hardware errors, grouped by
> the hardware subsystem that reported them.
> 
> hwerror_data is write-only at kernel runtime, and it is meant to be read
> from vmcore using tools like crash/drgn. For example, this is what it
> looks like when opening the crash dump in drgn:
> 
> 	>>> prog['hwerror_data']
> 	(struct hwerror_info[1]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
> 
> This helps fleet operators quickly triage whether a crash may be
> influenced by recoverable hardware errors (which execute an uncommon
> code path in the kernel), especially when recoverable errors occurred
> shortly before a panic, such as the bug fixed by
> commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
> when destroying the pool").
> 
> This is not intended to replace full hardware diagnostics, but it
> provides a fast way to correlate hardware events with kernel panics.
> 
> Rare machine check exceptions—like those indicated by mce_flags.p5 or
> mce_flags.winchip—are not accounted for in this method, as they fall
> outside the intended usage scope for this feature’s user base.
> 
> Suggested-by: Tony Luck <tony.luck@...el.com>
> Suggested-by: Shuai Xue <xueshuai@...ux.alibaba.com>
> Signed-off-by: Breno Leitao <leitao@...ian.org>
> Reviewed-by: Shuai Xue <xueshuai@...ux.alibaba.com>
> ---
> Changes in v5:
> - Move the headers to uapi file (Dave Hansen)
> - Use atomic operations in the tracking struct (Dave Hansen)
> - Drop the MCE enum type, and track MCE errors as "others"
> - Document this feature better
> - Link to v4: https://lore.kernel.org/r/20250801-vmcore_hw_error-v4-1-fa1fe65edb83@debian.org
> 
> Changes in v4:
> - Split the error by hardware subsystem instead of kernel
>    subsystem/driver (Shuai)
> - Do not count the corrected errors, only focusing on recoverable errors (Shuai)
> - Link to v3: https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17@debian.org
> 
> Changes in v3:
> - Add more information about this feature in the commit message
>    (Borislav Petkov)
> - Renamed the function to hwerr_log_error_type() and use hwerr as
>    prefix (Borislav Petkov)
> - Make the empty function static inline (kernel test robot)
> - Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org
> 
> Changes in v2:
> - Split the counter by recoverable error (Tony Luck)
> - Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
> ---
>   Documentation/driver-api/hw-recoverable-errors.rst | 60 ++++++++++++++++++++++
>   arch/x86/kernel/cpu/mce/core.c                     |  4 ++
>   drivers/acpi/apei/ghes.c                           | 36 +++++++++++++

For the APEI part,

Reviewed-by: Hanjun Guo <guohanjun@...wei.com>

Thanks
Hanjun
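
The triage described in the quoted commit message (checking whether a
recoverable error occurred shortly before a panic) could be sketched
roughly as below. This is only an illustration: describe_hwerror(), the
panic_time argument, and the 300-second window are hypothetical names
and values, not part of the patch. It assumes the count and timestamp
have already been copied out of the drgn output as plain integers, and
that the timestamp is in seconds since the Unix epoch.

```python
from datetime import datetime, timezone


def describe_hwerror(count, timestamp, panic_time, window_seconds=300):
    """Summarize one hwerror_info entry relative to a panic time.

    count and timestamp are plain integers copied from the vmcore
    (hypothetical inputs); panic_time is the crash time in seconds
    since the epoch; window_seconds is an arbitrary triage threshold.
    """
    if count == 0:
        # No recoverable error was ever recorded for this source.
        return {
            "count": 0,
            "last_error_utc": None,
            "seconds_before_panic": None,
            "recent": False,
        }

    last_error = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    delta = panic_time - timestamp
    return {
        "count": count,
        "last_error_utc": last_error.isoformat(),
        "seconds_before_panic": delta,
        # "recent" means the last error landed inside the window
        # immediately preceding the panic.
        "recent": 0 <= delta <= window_seconds,
    }


# Values taken from the drgn example in the commit message; the panic
# time (60 seconds later) is made up for illustration.
summary = describe_hwerror(844, 1752852018, 1752852018 + 60)
print(summary)
```

A result with "recent" set only suggests that the hardware event is
worth a closer look; it does not establish that it caused the panic.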
