Message-ID: <4qh2wbcbzdajh2tvki26qe4tqjazmyvbn7v7aqqhkxpitdrexo@ucch4ppo7i4e>
Date: Thu, 24 Jul 2025 06:34:31 -0700
From: Breno Leitao <leitao@...ian.org>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
James Morse <james.morse@....com>, Tony Luck <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>,
Robert Moore <robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Hanjun Guo <guohanjun@...wei.com>,
Mauro Carvalho Chehab <mchehab@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org,
linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev, osandov@...ndov.com,
konrad.wilk@...cle.com, linux-edac@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
linux-pci@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

Hello Shuai,

On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote:
> On 2025/7/23 00:56, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> >
> > Add centralized logging for three common sources of recoverable hardware
> > errors:
>
> The term "recoverable" is highly ambiguous. Even within the x86
> architecture, different vendors define errors differently. I'm not
> trying to be pedantic about classification. As far as I know, for 2-bit
> memory errors detected by scrub, AMD defines them as deferred errors
> (DE) and handles them with log_error_deferred, while Intel uses
> machine_check_poll. For 2-bit memory errors consumed by processes,
> both Intel and AMD use MCE handling via do_machine_check(). Does your
> HWERR_RECOV_MCE only focus on synchronous UE errors handled in
> do_machine_check? What makes it special?

I understand that deferred errors (DE) detected by memory scrubbing are
typically silent and may not significantly impact system stability. For
that reason, I'm not convinced that including DE metrics in crash dumps
would help correlate crashes with hardware issues; it might just add
noise.

Do you think it would be valuable to also log these events within
log_error_deferred()?

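To make the question concrete, here is a minimal userspace sketch of what
one such logging call records: a per-source count plus the time of the
last occurrence. It only models the idea and is not the patch's code. The
array layout, the HWERR_RECOV_MAX sentinel, and the hwerr_count() and
hwerr_last_seen() accessors are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <time.h>

/* Source names follow this thread; HWERR_RECOV_MAX is an assumed sentinel. */
enum hwerr_error_type {
	HWERR_RECOV_AER,
	HWERR_RECOV_MCE,
	HWERR_RECOV_GHES,
	HWERR_RECOV_MAX,
};

/* One slot per source: how many times it fired and when it last fired. */
struct hwerr_info {
	atomic_int count;
	atomic_llong timestamp;	/* seconds since the epoch, 0 = never seen */
};

static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

/* Record one recoverable error for the given source. */
void hwerr_log_error_type(enum hwerr_error_type src)
{
	if ((unsigned int)src >= HWERR_RECOV_MAX)
		return;

	atomic_fetch_add(&hwerr_data[src].count, 1);
	atomic_store(&hwerr_data[src].timestamp, (long long)time(NULL));
}

/* Hypothetical accessors, standing in for what a crash tool would read. */
int hwerr_count(enum hwerr_error_type src)
{
	return atomic_load(&hwerr_data[src].count);
}

long long hwerr_last_seen(enum hwerr_error_type src)
{
	return atomic_load(&hwerr_data[src].timestamp);
}
```

In this model, logging from a deferred-error path would be one more call
to hwerr_log_error_type(), so the cost is small. The open question is
whether the extra counts are signal or noise.
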
> > - if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
> > + sev = ghes_severity(estatus->error_severity);
> > + if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
> > + hwerr_log_error_type(HWERR_RECOV_GHES);
>
> APEI does not define an error type named GHES. GHES is just a kernel
> driver name. Many hardware error types can be handled in GHES (see
> ghes_do_proc), for example, AER is routed by GHES when firmware-first
> mode is used. As far as I know, firmware-first mode is commonly used in
> production. Should GHES errors be categorized into AER, memory, and CXL
> memory instead?

I initially considered slicing the data differently as well, but then
realized it would add more complexity than I need.

If you believe we should further subdivide the data, I'm happy to do so.
You're suggesting a structure like this, which would then map to the
corresponding CPER_SEC_ sections:

	enum hwerr_error_type {
		HWERR_RECOV_AER,	// maps to CPER_SEC_PCIE
		HWERR_RECOV_MCE,	// maps to default MCE + CPER_SEC_PCIE
		HWERR_RECOV_CXL,	// maps to CPER_SEC_CXL_*
		HWERR_RECOV_MEMORY,	// maps to CPER_SEC_PLATFORM_MEM
	};

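As a rough illustration of that mapping, the userspace sketch below
classifies a section kind into one of the types above. The enum
cper_sec_kind values are hypothetical stand-ins, since the real CPER
section types are GUIDs. HWERR_RECOV_OTHER and hwerr_type_from_section()
are also invented for this example and are not part of the patch.

```c
/* Types from the proposal above, plus an assumed catch-all bucket. */
enum hwerr_error_type {
	HWERR_RECOV_AER,
	HWERR_RECOV_MCE,	/* logged from the MCE path, not from a section */
	HWERR_RECOV_CXL,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_OTHER,	/* hypothetical: anything not classified */
};

/* Hypothetical stand-ins for CPER section types (real ones are GUIDs). */
enum cper_sec_kind {
	SEC_PCIE,
	SEC_CXL,
	SEC_PLATFORM_MEM,
	SEC_PROC,
	SEC_FW,
	SEC_DMAR,
};

/* Map a section kind to the error type that would be counted for it. */
enum hwerr_error_type hwerr_type_from_section(enum cper_sec_kind kind)
{
	switch (kind) {
	case SEC_PCIE:
		return HWERR_RECOV_AER;
	case SEC_CXL:
		return HWERR_RECOV_CXL;
	case SEC_PLATFORM_MEM:
		return HWERR_RECOV_MEMORY;
	default:
		/* CPU, firmware and DMA sections land here in this sketch. */
		return HWERR_RECOV_OTHER;
	}
}
```

Every section kind without its own type falls into the catch-all here,
which is one possible way to avoid growing the enum for each new section.
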
Additionally, what about events related to CPU, firmware, or DMA errors,
for example CPER_SEC_PROC, CPER_SEC_FW, and CPER_SEC_DMAR? Should we
include those in the classification as well?

Thanks for your review and for the ongoing discussion!

--breno