Message-ID: <a4a34583-8f26-bb08-001f-a53715070c00@huawei.com>
Date: Fri, 21 Nov 2025 10:47:32 +0800
From: Hanjun Guo <guohanjun@...wei.com>
To: Breno Leitao <leitao@...ian.org>, "Rafael J. Wysocki" <rafael@...nel.org>,
Len Brown <lenb@...nel.org>, James Morse <james.morse@....com>, Tony Luck
<tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>, Robert Moore
<robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar
<mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>, Mauro Carvalho Chehab
<mchehab@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>, Oliver
O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>
CC: <linux-acpi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<acpica-devel@...ts.linux.dev>, <osandov@...ndov.com>,
<xueshuai@...ux.alibaba.com>, <konrad.wilk@...cle.com>,
<linux-edac@...r.kernel.org>, <linuxppc-dev@...ts.ozlabs.org>,
<linux-pci@...r.kernel.org>, <kernel-team@...a.com>
Subject: Re: [PATCH RESEND v5] vmcoreinfo: Track and log recoverable hardware
errors
On 2025/10/10 18:36, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that are visible to the OS but do not cause a panic)
> and record them for vmcore consumption. This aids post-mortem crash
> analysis tools by preserving a count of such errors and the timestamp
> of the last occurrence. Correctable errors, on the other hand, which
> the OS typically remains unaware of because the underlying hardware
> handles them transparently, are less relevant for crash dump analysis
> and are therefore NOT tracked in this infrastructure.
>
> Add centralized logging for sources of recoverable hardware
> errors, keyed by the subsystem that reported them.
>
> hwerror_data is write-only at kernel runtime, and it is meant to be read
> from vmcore using tools like crash/drgn. For example, this is what it
> looks like when opening the crash dump in drgn:
>
> 	>>> prog['hwerror_data']
> 	(struct hwerror_info[1]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
>
> This helps fleet operators quickly triage whether a crash may be
> influenced by recoverable hardware errors (which execute an uncommon
> code path in the kernel), especially when recoverable errors occurred
> shortly before a panic, such as the bug fixed by
> commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
> when destroying the pool").
>
> This is not intended to replace full hardware diagnostics but provides
> a fast way to correlate hardware events with kernel panics.
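
[Annotation, not part of the patch or of the original message] The triage
described above can be sketched as a small check on the two fields shown in
the drgn example. This is only an illustration under stated assumptions: the
function name, the 300-second window, and the crash timestamp are
hypothetical, and the caller is assumed to have already read .count and
.timestamp out of the vmcore.

```python
def recoverable_error_near_crash(count, last_error_ts, crash_ts, window_s=300):
    """Return True if a recoverable error was recorded shortly before a crash.

    count         -- value of .count read from a hwerror_data entry
    last_error_ts -- value of .timestamp (seconds since the epoch)
    crash_ts      -- time of the panic in seconds (assumed known to the caller)
    window_s      -- hypothetical threshold for "shortly before"
    """
    if count <= 0:
        # No recoverable error was ever recorded for this source.
        return False
    delta = crash_ts - last_error_ts
    # Only errors that happened at or before the crash, within the window, count.
    return 0 <= delta <= window_s


# Values from the drgn example above; the crash time is made up.
suspicious = recoverable_error_near_crash(844, 1752852018, 1752852100)
```

A hit only suggests that the crash deserves a closer look at the hardware
error path; it does not establish causation.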
>
> Rare machine check exceptions—like those indicated by mce_flags.p5 or
> mce_flags.winchip—are not accounted for in this method, as they fall
> outside the intended usage scope for this feature’s user base.
>
> Suggested-by: Tony Luck <tony.luck@...el.com>
> Suggested-by: Shuai Xue <xueshuai@...ux.alibaba.com>
> Signed-off-by: Breno Leitao <leitao@...ian.org>
> Reviewed-by: Shuai Xue <xueshuai@...ux.alibaba.com>
> ---
> Changes in v5:
> - Move the headers to uapi file (Dave Hansen)
> - Use atomic operations in the tracking struct (Dave Hansen)
> - Drop the MCE enum type, and track MCE errors as "others"
> - Document this feature better
> - Link to v4: https://lore.kernel.org/r/20250801-vmcore_hw_error-v4-1-fa1fe65edb83@debian.org
>
> Changes in v4:
> - Split the error by hardware subsystem instead of kernel
> subsystem/driver (Shuai)
> - Do not count the corrected errors, only focusing on recoverable errors (Shuai)
> - Link to v3: https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17@debian.org
>
> Changes in v3:
> - Add more information about this feature in the commit message
> (Borislav Petkov)
> - Rename the function to hwerr_log_error_type() and use hwerr as
> prefix (Borislav Petkov)
> - Make the empty function static inline (kernel test robot)
> - Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org
>
> Changes in v2:
> - Split the counter by recoverable error (Tony Luck)
> - Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
> ---
> Documentation/driver-api/hw-recoverable-errors.rst | 60 ++++++++++++++++++++++
> arch/x86/kernel/cpu/mce/core.c | 4 ++
> drivers/acpi/apei/ghes.c | 36 +++++++++++++
For the APEI part,
Reviewed-by: Hanjun Guo <guohanjun@...wei.com>
Thanks
Hanjun