Message-ID: <vs5x5qvw2veurxdljmdiumqpseze2myx6quw3rmt7li7d3dbin@duoky4z44zzz>
Date: Tue, 15 Jul 2025 05:02:39 -0700
From: Breno Leitao <leitao@...ian.org>
To: Borislav Petkov <bp@...en8.de>
Cc: "Luck, Tony" <tony.luck@...el.com>,
"Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>, James Morse <james.morse@....com>,
"Moore, Robert" <robert.moore@...el.com>, "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "acpica-devel@...ts.linux.dev" <acpica-devel@...ts.linux.dev>,
"kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [PATCH] ghes: Track number of recovered hardware errors
Hello Borislav,
On Tue, Jul 15, 2025 at 12:31:25PM +0200, Borislav Petkov wrote:
> On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
> > For instance, if every investigation (as you suggested above) takes just
> > a couple of minutes, there simply wouldn’t be enough hours in the day,
> > even working 24x7, to keep up with the volume.
>
> Well, first of all, it would help considerably if you put the use case in the
> commit message.
Sorry, my bad. I can do better if we decide that this is worth pursuing.
> Then, are you saying that when examining kernel crashes, you don't look at
> dmesg? I find that hard to believe.
We absolutely do examine kernel messages when investigating crashes, and
over time we've developed an extensive set of regular expressions to
identify relevant errors.
In practice, what you're describing is very similar to the workflow we
already use. For example, here are just a few of the regex patterns we
match in dmesg, grouped by category:
    (r"Machine check: Processor context corrupt", "cpu"),
    (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
    (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
    (r"Instruction fetch error in kernel", "memory"),
    (r"\[Hardware Error\]: +section_type: memory error", "memory"),
    (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
    (r"\[Hardware Error\]: section_type: general processor error", "cpu"),
    (r"UE memory read error on", "memory"),
And that’s just a partial list. We have 26 regexps for various issues,
and I wouldn’t be surprised if other large operators use a similar
approach.
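To make the workflow concrete, here is a minimal sketch of how such a table might be consumed. It is illustrative only and not our actual tooling: the function name classify_dmesg and the first-match-wins policy are assumptions made for this example, and the pattern list is the partial one quoted above.

```python
import re
from collections import Counter

# Partial pattern table, copied from the list above: (regex, category).
PATTERNS = [
    (r"Machine check: Processor context corrupt", "cpu"),
    (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
    (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
    (r"Instruction fetch error in kernel", "memory"),
    (r"\[Hardware Error\]: +section_type: memory error", "memory"),
    (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
    (r"\[Hardware Error\]: section_type: general processor error", "cpu"),
    (r"UE memory read error on", "memory"),
]

# Compile once so each dmesg line is only scanned, not re-parsed.
COMPILED = [(re.compile(pattern), category) for pattern, category in PATTERNS]


def classify_dmesg(lines):
    """Count matching dmesg lines per category (hypothetical helper).

    Each line is attributed to the first pattern that matches it;
    lines that match nothing are ignored.
    """
    counts = Counter()
    for line in lines:
        for regex, category in COMPILED:
            if regex.search(line):
                counts[category] += 1
                break
    return counts
```

Feeding it the lines of a captured dmesg (for example, open("dmesg.txt") where the file name is whatever your crash collector produces) yields a per-category tally. The weakness is visible right in the table: any change to a message string silently drops that error from the count.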
While this system mostly works, there are real advantages to
consolidating this logic in the kernel itself, as I’m proposing:
 * Reduces the risk of mistakes
   - Less chance of missing changes or edge cases.

 * Centralizes effort
   - Users don’t have to maintain their own lists; the logic lives
     closer to the source of truth.

 * Simplifies maintenance
   - Avoids the constant need to update regexps if message strings
     change.

 * Easier validation
   - It becomes straightforward to cross-check that all relevant
     messages are being captured.

 * Automatic accounting
   - Any new or updated messages are immediately reflected.

 * Lower postmortem overhead
   - Requires less supporting infrastructure for crash analysis.

 * Netconsole support
   - Makes this status data available via netconsole, which is
     helpful for those users.
> Because if you do look at dmesg and if you would grep it for hw errors - we do
> log those if desired, AFAIR - you don't need anything new.
Understood. If you don’t see additional value in kernel-side
counting, I can certainly keep relying on our current method. For
us, though, having this functionality built in feels more robust and
sustainable.
Thanks for the discussion,
--breno