Message-ID: <crxrexye2nmqebct6eisgkvpc7btrg6ckh5qr7tmhpkdnqys2h@6dpf2j6yhlxq>
Date: Mon, 21 Jul 2025 08:43:24 -0700
From: Breno Leitao <leitao@...ian.org>
To: Borislav Petkov <bp@...en8.de>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
James Morse <james.morse@....com>, Tony Luck <tony.luck@...el.com>,
Robert Moore <robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Hanjun Guo <guohanjun@...wei.com>,
Mauro Carvalho Chehab <mchehab@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org,
linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev, osandov@...ndov.com,
xueshuai@...ux.alibaba.com, konrad.wilk@...cle.com, linux-edac@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH v2] vmcoreinfo: Track and log recoverable hardware errors
Hello Borislav,

On Mon, Jul 21, 2025 at 03:57:18PM +0200, Borislav Petkov wrote:
> On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> >
> > This patch adds centralized logging for three common sources of
>
> "Add centralized... "
Ack!
> > recoverable hardware errors:
> >
> > - PCIe AER Correctable errors
> > - x86 Machine Check Exceptions (MCE)
> > - APEI/CPER GHES corrected or recoverable errors
> >
> > hwerror_tracking is write-only at kernel runtime, and it is meant to be
> > read from vmcore using tools like crash/drgn. For example, this is how
> > it looks when opening the crashdump from drgn:
> >
> > 	>>> prog['hwerror_tracking']
> > 	(struct hwerror_tracking_info [3]){
> > 		{
> > 			.count = (int)844,
> > 			.timestamp = (time64_t)1752852018,
> > 		},
> > 		...
> >
>
> I'm still missing the justification why rasdaemon can't be used here.
> You did explain it already in past emails.
Sorry, I will update it.
> > +enum hwerror_tracking_source {
> > +	HWE_RECOV_AER,
> > +	HWE_RECOV_MCE,
> > +	HWE_RECOV_GHES,
> > +	HWE_RECOV_MAX,
> > +};
>
> Are we confident this separation will serve all cloud dudes?
I am not, but I've added them to the CC list of this patch, so they are
more than welcome to chime in.
> > +void hwerror_tracking_log(enum hwerror_tracking_source src)
>
> A function should have a verb in its name explaining what it does:
>
> hwerr_log_error_type()
>
> or so.
Ack!
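
For readers following the thread, here is a rough userspace sketch of the
tracking idea with the renamed function. It is not the code from the patch:
the function body, the bounds check, and the use of time() as a stand-in for
a kernel wall-clock helper are assumptions made for illustration. Only the
enum, the struct field names, and the suggested function name come from this
thread. Locking and concurrency are ignored.

```c
#include <stdint.h>
#include <time.h>

/* Error sources listed in the patch under review. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/* Field names follow the drgn output quoted above; int64_t stands in for time64_t. */
struct hwerror_tracking_info {
	int count;
	int64_t timestamp;
};

/* One slot per error source, meant to be read post-mortem from the vmcore. */
static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/* Name suggested in review; the body is an assumption, not the patch's code. */
void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	/* Ignore out-of-range sources (hypothetical guard). */
	if ((unsigned int)src >= HWE_RECOV_MAX)
		return;

	hwerror_tracking[src].count++;
	/* Userspace stand-in for reading the kernel's wall-clock time. */
	hwerror_tracking[src].timestamp = (int64_t)time(NULL);
}
```

A caller in an error-handling path would invoke something like
hwerr_log_error_type(HWE_RECOV_MCE) once per recovered event, so each slot
holds the running count and the time of the last occurrence.
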
I will wait a bit longer for more feedback and then send an updated version.

Thanks for the review,
--breno