Message-ID: <20250721135718.GAaH5HPinaKvXjM-1g@renoirsky.local>
Date: Mon, 21 Jul 2025 15:57:18 +0200
From: Borislav Petkov <bp@...en8.de>
To: Breno Leitao <leitao@...ian.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
James Morse <james.morse@....com>, Tony Luck <tony.luck@...el.com>,
Robert Moore <robert.moore@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Hanjun Guo <guohanjun@...wei.com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>,
Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org,
linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev,
osandov@...ndov.com, xueshuai@...ux.alibaba.com,
konrad.wilk@...cle.com, linux-edac@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org,
kernel-team@...a.com
Subject: Re: [PATCH v2] vmcoreinfo: Track and log recoverable hardware errors
On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
>
> This patch adds centralized logging for three common sources of
"Add centralized..."
> recoverable hardware errors:
>
> - PCIe AER Correctable errors
> - x86 Machine Check Exceptions (MCE)
> - APEI/CPER GHES corrected or recoverable errors
>
> hwerror_tracking is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is how
> it looks like when opening the crashdump from drgn.
>
> >>> prog['hwerror_tracking']
> (struct hwerror_tracking_info [3]){
> {
> .count = (int)844,
> .timestamp = (time64_t)1752852018,
> },
> ...
>
I'm still missing the justification why rasdaemon can't be used here.
You did explain it already in past emails.
> +enum hwerror_tracking_source {
> +	HWE_RECOV_AER,
> +	HWE_RECOV_MCE,
> +	HWE_RECOV_GHES,
> +	HWE_RECOV_MAX,
> +};
Are we confident this separation will serve all cloud dudes?
> +
> +#ifdef CONFIG_VMCORE_INFO
> +void hwerror_tracking_log(enum hwerror_tracking_source src);
> +#else
> +void hwerror_tracking_log(enum hwerror_tracking_source src) {};
> +#endif
> +
> #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..23d7ddcd55cdd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
> /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
> static unsigned char *vmcoreinfo_data_safecopy;
>
> +struct hwerror_tracking_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];
> +
> Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> void *data, size_t data_len)
> {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
> }
> EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>
> +void hwerror_tracking_log(enum hwerror_tracking_source src)
A function should have a verb in its name explaining what it does:
hwerr_log_error_type()
or so.
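
To illustrate the naming suggestion, here is a rough user-space sketch of what
a verb-named helper could look like. It is not the patch's implementation: the
hwerr_* identifiers, the bounds check and the use of time() in place of a
kernel clock are all assumptions made only so the example compiles outside the
kernel.

```c
/* User-space model only; every hwerr_* name here is hypothetical. */
#include <time.h>

enum hwerr_error_type {
	HWERR_RECOV_AER,
	HWERR_RECOV_MCE,
	HWERR_RECOV_GHES,
	HWERR_RECOV_MAX,
};

struct hwerr_info {
	int count;		/* number of recoverable errors seen */
	time_t timestamp;	/* wall-clock time of the last one */
};

static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

/* The verb "log" in the name says what the function does. */
void hwerr_log_error_type(enum hwerr_error_type src)
{
	/* Assumed sanity check; ignore out-of-range sources. */
	if ((unsigned int)src >= HWERR_RECOV_MAX)
		return;

	hwerr_data[src].count++;
	hwerr_data[src].timestamp = time(NULL);
}
```

A caller in, say, the MCE path would then read as
hwerr_log_error_type(HWERR_RECOV_MCE), which states the action at the call
site.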
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette