Message-ID: <20250715125327.GGaHZPRz9QLNNO-7q8@fat_crate.local>
Date: Tue, 15 Jul 2025 14:53:27 +0200
From: Borislav Petkov <bp@...en8.de>
To: Breno Leitao <leitao@...ian.org>, Alexander Graf <graf@...zon.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
Peter Gonda <pgonda@...gle.com>
Cc: "Luck, Tony" <tony.luck@...el.com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Len Brown <lenb@...nel.org>, James Morse <james.morse@....com>,
"Moore, Robert" <robert.moore@...el.com>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"acpica-devel@...ts.linux.dev" <acpica-devel@...ts.linux.dev>,
"kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [PATCH] ghes: Track number of recovered hardware errors
On Tue, Jul 15, 2025 at 05:02:39AM -0700, Breno Leitao wrote:
> Hello Borislav,
>
> On Tue, Jul 15, 2025 at 12:31:25PM +0200, Borislav Petkov wrote:
> > On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
> > > For instance, if every investigation (as you suggested above) took just
> > > a couple of minutes, there simply wouldn’t be enough hours in the day,
> > > even working 24x7, to keep up with the volume.
> >
> > Well, first of all, it would help considerably if you put the use case in the
> > commit message.
>
> Sorry, my bad. I can do better if we decide that this is worth pursuing.
>
> > Then, are you saying that when examining kernel crashes, you don't look at
> > dmesg? I find that hard to believe.
>
> We absolutely do examine kernel messages when investigating crashes, and
> over time we've developed an extensive set of regular expressions to
> identify relevant errors.
>
> In practice, what you're describing is very similar to the workflow we
> already use. For example, here are just a few of the regex patterns we
> match in dmesg, grouped by category:
>
> (r"Machine check: Processor context corrupt", "cpu"),
> (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
> (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
> (r"Instruction fetch error in kernel", "memory"),
> (r"\[Hardware Error\]: +section_type: memory error", "memory"),
> (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
> (r"\[Hardware Error\]: section_type: general processor error", "cpu"),
> (r"UE memory read error on", "memory"),
>
> And that’s just a partial list. We have 26 regexps for various issues,
> and I wouldn’t be surprised if other large operators use a similar
> approach.
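
[Editor's note: a minimal sketch of the userspace classification workflow
described above, using a subset of the quoted patterns. The function name
and the per-category counting are illustrative, not Meta's actual tooling.]

```python
import re

# Subset of the regex patterns quoted in the mail, compiled once.
# Each pattern maps a dmesg line to a failure category.
PATTERNS = [
    (re.compile(r"Machine check: Processor context corrupt"), "cpu"),
    (re.compile(r"\[Hardware Error\]: +section_type: memory error"), "memory"),
    (re.compile(r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR"), "memory"),
    (re.compile(r"UE memory read error on"), "memory"),
]

def classify_dmesg(lines):
    """Return per-category counts of hardware-error lines.

    This is the kind of accounting the patch proposes moving into the
    kernel: each line is checked against known error signatures, and the
    first matching category is counted.
    """
    counts = {}
    for line in lines:
        for pattern, category in PATTERNS:
            if pattern.search(line):
                counts[category] = counts.get(category, 0) + 1
                break
    return counts
```

The fragility Breno describes is visible here: if a kernel message string
changes, the corresponding regex silently stops matching, which is why an
in-kernel counter is claimed to be more robust.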
>
> While this system mostly works, there are real advantages to
> consolidating this logic in the kernel itself, as I’m proposing:
>
> * Reduces the risk of mistakes
> - Less chance of missing changes or edge cases.
>
> * Centralizes effort
> - Users don’t have to maintain their own lists; the logic lives
> closer to the source of truth.
>
> * Simplifies maintenance
> - Avoids the constant need to update regexps if message strings
> change.
>
> * Easier validation
> - It becomes straightforward to cross-check that all relevant
> messages are being captured.
>
> * Automatic accounting
> - Any new or updated messages are immediately reflected.
>
> * Lower postmortem overhead
> - Requires less supporting infrastructure for crash analysis.
>
> * Netconsole support
> - Makes this status data available via netconsole, which is
> helpful for those users.
Yap, this is more like it. Those sound to me like good reasons to have this
additional logging.
It would be really good to sync with other cloud providers here so that we can
do this one solution which fits all. Lemme CC some other folks I know who do
cloud gunk and leave the whole mail for their pleasure.
Newly CCed folks, you know how to find the whole discussion. :-)
Thx.
> > Because if you do look at dmesg and if you would grep it for hw errors - we do
> > log those if desired, AFAIR - you don't need anything new.
>
> Understood. If you don’t see additional value in kernel-side
> counting, I can certainly keep relying on our current method. For
> us, though, having this functionality built in feels more robust and
> sustainable.
>
> Thanks for the discussion,
> --breno
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette