[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZeDDLLQWPyyZve_s@agluck-desk3>
Date: Thu, 29 Feb 2024 09:47:24 -0800
From: Tony Luck <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
Cc: "Naik, Avadhut" <avadnaik@....com>,
"Mehta, Sohil" <sohil.mehta@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"yazen.ghannam@....com" <yazen.ghannam@....com>,
Avadhut Naik <avadhut.naik@....com>
Subject: Re: [PATCH] x86/mce: Dynamically size space for machine check records
On Thu, Feb 29, 2024 at 09:39:51AM +0100, Borislav Petkov wrote:
> On Thu, Feb 29, 2024 at 12:42:38AM -0600, Naik, Avadhut wrote:
> > Somewhat confused here. Weren't we also exploring ways to avoid
> > duplicate records from being added to the genpool? Has something
> > changed in that regard?
>
> You can always send patches proposing how *you* think this duplicate
> elimination should look like and we can talk. :)
>
> I don't think anyone would mind it if done properly but first you'd need
> a real-life use case. As in, do we log sooo many duplicates such that
> we'd want to dedup?
There are definitly cases where dedup will not help. If a row fails in a
DIMM there will be a flood of correctable errors with different addresses
(depending on number of channels in the interleave schema for a system
this may be dozens or hundreds of distinct addresses).
Same for other failures in structures like column and rank.
As to "real-life" use cases. A search on Lore for "MCE records pool
full!" only finds threads about modifications to this code. So the
general population of Linux developers isn't seeing this.
But a search in my internal e-mail box kicks up a dozen or so distinct
hits from internal validation teams in just the past year. But those
folks are super-dedicated to finding corner cases. Just this morning I
got a triumphant e-mail from someone who reproduced an issue "after 1.6
million error injections".
I'd bet that large cloud providers with system numbered in the hundreds
of thousands see the MCE pool exhausted from time to time.
Open question is "How many error records do you need to diagnose the
cause of various floods of errors?"
I'm going to say that the current 32 (see earlier e-mail today to
Sohil https://lore.kernel.org/all/ZeC8_jzdFnkpPVPf@agluck-desk3/ )
isn't enough for big systems. It may be hard to distinguish between
the various bulk fail modes with just that many error logs.
-Tony
Powered by blists - more mailing lists