linux-kernel - Re: [PATCH v2] vmcoreinfo: Track and log recoverable hardware errors

Message-ID: <20250721135718.GAaH5HPinaKvXjM-1g@renoirsky.local>
Date: Mon, 21 Jul 2025 15:57:18 +0200
From: Borislav Petkov <bp@...en8.de>
To: Breno Leitao <leitao@...ian.org>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
	James Morse <james.morse@....com>, Tony Luck <tony.luck@...el.com>,
	Robert Moore <robert.moore@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
	"H. Peter Anvin" <hpa@...or.com>, Hanjun Guo <guohanjun@...wei.com>,
	Mauro Carvalho Chehab <mchehab@...nel.org>,
	Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
	Oliver O'Halloran <oohall@...il.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org,
	linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev,
	osandov@...ndov.com, xueshuai@...ux.alibaba.com,
	konrad.wilk@...cle.com, linux-edac@...r.kernel.org,
	linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org,
	kernel-team@...a.com
Subject: Re: [PATCH v2] vmcoreinfo: Track and log recoverable hardware errors

On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
> 
> This patch adds centralized logging for three common sources of

"Add centralized... "

> recoverable hardware errors:
> 
>   - PCIe AER Correctable errors
>   - x86 Machine Check Exceptions (MCE)
>   - APEI/CPER GHES corrected or recoverable errors
> 
> hwerror_tracking is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is how
> it looks like when opening the crashdump from drgn.
> 
> 	>>> prog['hwerror_tracking']
> 	(struct hwerror_tracking_info [3]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
> 

I'm still missing the justification for why rasdaemon can't be used here.
You already explained it in past emails.

> +enum hwerror_tracking_source {
> +	HWE_RECOV_AER,
> +	HWE_RECOV_MCE,
> +	HWE_RECOV_GHES,
> +	HWE_RECOV_MAX,
> +};

Are we confident this separation will serve all cloud dudes?

> +
> +#ifdef CONFIG_VMCORE_INFO
> +void hwerror_tracking_log(enum hwerror_tracking_source src);
> +#else
> +void hwerror_tracking_log(enum hwerror_tracking_source src) {};
> +#endif
> +
>  #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..23d7ddcd55cdd 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>  /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>  static unsigned char *vmcoreinfo_data_safecopy;
>  
> +struct hwerror_tracking_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];
> +
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len)
>  {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>  
> +void hwerror_tracking_log(enum hwerror_tracking_source src)

A function should have a verb in its name explaining what it does:

hwerr_log_error_type()

or so.
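
To illustrate the naming, here is a rough user-space sketch. The function body
is not quoted in this reply, so the count-and-timestamp update below is only
assumed from the commit message ("a count and timestamp for the last
occurrence"). time() is a stand-in for the kernel's clock source, the
__data_racy annotations are dropped, the bounds check is an addition for the
sketch, and hwerr_log_error_type() is merely the name suggested above:

```c
#include <time.h>

/* Mirrors the enum from the quoted patch. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/* Same layout idea as the quoted struct; time_t stands in for time64_t. */
struct hwerror_tracking_info {
	int count;
	time_t timestamp;
};

static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Hypothetical body: bump the per-source counter and remember when the
 * last error of that type was seen. Out-of-range sources are ignored
 * (an assumption of this sketch, not something the patch shows).
 */
void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	if ((unsigned int)src >= HWE_RECOV_MAX)
		return;

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = time(NULL);
}
```

A caller in, say, the MCE path would then read as
hwerr_log_error_type(HWE_RECOV_MCE), which says what is being done instead of
naming the subsystem.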

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette