linux-kernel - Re: [PATCH v2] vmcoreinfo: Track and log recoverable hardware errors

Message-ID: <crxrexye2nmqebct6eisgkvpc7btrg6ckh5qr7tmhpkdnqys2h@6dpf2j6yhlxq>
Date: Mon, 21 Jul 2025 08:43:24 -0700
From: Breno Leitao <leitao@...ian.org>
To: Borislav Petkov <bp@...en8.de>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>, 
	James Morse <james.morse@....com>, Tony Luck <tony.luck@...el.com>, 
	Robert Moore <robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org, 
	"H. Peter Anvin" <hpa@...or.com>, Hanjun Guo <guohanjun@...wei.com>, 
	Mauro Carvalho Chehab <mchehab@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>, 
	Oliver O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org, 
	linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev, osandov@...ndov.com, 
	xueshuai@...ux.alibaba.com, konrad.wilk@...cle.com, linux-edac@...r.kernel.org, 
	linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH v2] vmcoreinfo: Track and log recoverable hardware errors

Hello Borislav,

On Mon, Jul 21, 2025 at 03:57:18PM +0200, Borislav Petkov wrote:
> On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> > 
> > This patch adds centralized logging for three common sources of
> 
> "Add centralized... "

Ack!

> > recoverable hardware errors:
> > 
> >   - PCIe AER Correctable errors
> >   - x86 Machine Check Exceptions (MCE)
> >   - APEI/CPER GHES corrected or recoverable errors
> > 
> > hwerror_tracking is write-only at kernel runtime, and it is meant to be
> > read from vmcore using tools like crash/drgn. For example, this is what
> > it looks like when opening the crash dump in drgn:
> > 
> > 	>>> prog['hwerror_tracking']
> > 	(struct hwerror_tracking_info [3]){
> > 		{
> > 			.count = (int)844,
> > 			.timestamp = (time64_t)1752852018,
> > 		},
> > 		...
> > 
> 
> I'm still missing the justification why rasdaemon can't be used here.
> You did explain it already in past emails.

Sorry, I will update it.

> > +enum hwerror_tracking_source {
> > +	HWE_RECOV_AER,
> > +	HWE_RECOV_MCE,
> > +	HWE_RECOV_GHES,
> > +	HWE_RECOV_MAX,
> > +};
> 
> Are we confident this separation will serve all cloud dudes?

I am not, but I've added them to the CC list of this patch, so they are
welcome to chime in.

> > +void hwerror_tracking_log(enum hwerror_tracking_source src)
> 
> A function should have a verb in its name explaining what it does:
> 
> hwerr_log_error_type()
> 
> or so.

Ack!
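
For reference, here is a minimal userspace sketch of how the renamed helper
might look. The enum, the struct field names, and the hwerror_tracking array
name come from the quoted patch and the drgn output above, and the function
name follows the suggestion above. Everything else is an assumption made for
illustration and is not the code from the patch: the function body, the
bounds check, and the use of int64_t and time() in place of time64_t and
a kernel time helper such as ktime_get_real_seconds().

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Sources of recoverable errors, as listed in the quoted patch. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/* Field names follow the drgn output; int64_t stands in for time64_t. */
struct hwerror_tracking_info {
	int count;
	int64_t timestamp;
};

/* One slot per source, zero-initialized, meant to be read from the vmcore. */
static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Assumed body: bump the per-source counter and record when the last
 * error of that type was seen.
 */
void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	/* Userspace-only guard against an out-of-range source. */
	assert((unsigned int)src < HWE_RECOV_MAX);

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = (int64_t)time(NULL);
}
```

Each error handler (AER, MCE, GHES) would call the helper with its own
source value, so the array holds only a count and the last timestamp per
source and nothing in the running kernel reads it back.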

I will wait a bit more and send an updated version.

Thanks for the review,
--breno