[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <buhwuankenpnvmio6jeoxverixoyfpn2eh62ix7vzxw7xvlxcv@rpibcrufr2yg>
Date: Fri, 1 Aug 2025 10:00:33 -0700
From: Breno Leitao <leitao@...ian.org>
To: Dave Hansen <dave.hansen@...el.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
James Morse <james.morse@....com>, Tony Luck <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>,
Robert Moore <robert.moore@...el.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Hanjun Guo <guohanjun@...wei.com>,
Mauro Carvalho Chehab <mchehab@...nel.org>, Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>, linux-acpi@...r.kernel.org,
linux-kernel@...r.kernel.org, acpica-devel@...ts.linux.dev, osandov@...ndov.com,
xueshuai@...ux.alibaba.com, konrad.wilk@...cle.com, linux-edac@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org, kernel-team@...a.com, osandov@...com
Subject: Re: [PATCH v4] vmcoreinfo: Track and log recoverable hardware errors
hello Dave,
On Fri, Aug 01, 2025 at 09:24:43AM -0700, Dave Hansen wrote:
> On 8/1/25 08:13, Breno Leitao wrote:
> > On Fri, Aug 01, 2025 at 07:52:17AM -0700, Dave Hansen wrote:
> >> On 8/1/25 05:31, Breno Leitao wrote:
> >>> Introduce a generic infrastructure for tracking recoverable hardware
> >>> errors (HW errors that are visible to the OS but does not cause a panic)
> >>> and record them for vmcore consumption.
> >> ...
> >>
> >> Are there patches for the consumer side of this, too? Or do humans
> >> looking at crash dumps have to know what to go digging for?
> >>
> >> In either case, don't we need documentation for this new ABI?
> >
> > I have considered this, but the documentation for vmcoreinfo
> > (admin-guide/kdump/vmcoreinfo.rst) solely documents what is explicitly
> > exposed by vmcore, which differs from the nature of these counters.
> >
> > Where would be a good place to document it?
>
> I'm not picky. But you also didn't quite answer the question I was asking.
>
> Is this new data for humans or machines to read?
I would say that the main consumer for this are post-mortem tools that
collect information of the vmcore and do diagnostic and correlation.
This is a common tooling for cloud providers, AFAIK.
In my work environment, there is a script that runs `drgn` on every
vmcore to capture information that would help operator to address the
problem. The information that this patch is proposing adds another field
that would help to potentially correlate crashes with recoverable error.
> >> Does "MCE" mean anything outside of x86?
> >
> > AFAIK this is a MCE concept.
>
> I'm not really sure what that response means.
>
> There are two problems here. First is that HWERR_RECOV_MCE is defined in
> arch-generic code, but it may never get used by anything other than x86
> when CONFIG_X86_MCE.
>
> That also completely wastes space in your data structure when
> HWERR_RECOV_MCE=n. Not a huge deal as-is, but it's still a bit sloppy
> and wasteful.
Would a solution like this look better?
enum hwerr_error_type {
HWERR_RECOV_CPU,
HWERR_RECOV_MEMORY,
HWERR_RECOV_PCI,
HWERR_RECOV_CXL,
HWERR_RECOV_OTHERS,
#ifdef CONFIG_X86_MCE
HWERR_RECOV_MCE,
#endif
HWERR_RECOV_MAX,
};
Or, would you prefer to have HWERR_RECOV_ARCH and keep it always there?
> >> I'd also love to hear more about _actual_ users of this. Surely, someone
> >> hit a real world problem and thought this would be a nifty solution. Who
> >> was that? What problem did they hit? How does this help them?
> >
> > Yes, this has been extensively discussed in the very first version of
> > the patch. Borislav raised the same question, which was discussed in the
> > following link:
> >
> > https://lore.kernel.org/all/20250715125327.GGaHZPRz9QLNNO-7q8@fat_crate.local/
>
> When someone raises a concern, we usually try to alleviate the concern
> in a way that is self-contained in the next posting. A cover letter with
> a full explanation would be one place to put the reasoning, for example.
>
> But expecting future reviewers to plod through all the old threads isn't
> really feasible.
Sorry. I tried to improve the documentation and wrote the following
message to the commit message, which was clearly not enough.
This helps fleet operators quickly triage whether a crash may be
influenced by hardware recoverable errors (which executes a uncommon
code path in the kernel), especially when recoverable errors occurred
shortly before a panic, such as the bug fixed by
commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
when destroying the pool")
For next commit I will add a cover-letter, with the summary of the
details of that discussion.
Thanks for the review and suggestions,
--breno
Powered by blists - more mailing lists