Message-ID: <1c598798-5b28-4a17-bf86-042781808021@amd.com>
Date: Mon, 16 Oct 2023 10:14:05 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Dave Hansen <dave.hansen@...el.com>,
"Sironi, Filippo" <sironi@...zon.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc: yazen.ghannam@....com, "tony.luck@...el.com" <tony.luck@...el.com>,
"bp@...en8.de" <bp@...en8.de>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8
pages

On 10/12/23 11:49 AM, Dave Hansen wrote:
> On 10/12/23 04:46, Sironi, Filippo wrote:
>> There's correlation across the errors that we're seeing; indeed,
>> we're looking at the same row being responsible for multiple CPUs
>> tripping and running into #MC. I still don't like the complete lack
>> of visibility; it's not uncommon in a large fleet to take a server
>> out of production, replace a DIMM, and shortly after take it out of
>> production again to replace another DIMM just because some of the
>> errors weren't properly logged.
>
> So you had two nearly simultaneous DIMM failures. The first failed,
> filled up the buffer and then the second failed, but there was no room.
> The second failed *SO* soon after the first that there was no
> opportunity to empty the buffer between.
>
> Right?
>
> How do you know that storing 8 pages of records will catch this case as
> opposed to storing 2?
>
>>> Is there any way that the size of the pool can be more automatically
>>> determined? Is the likelihood of a bunch of errors proportional to the
>>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>>
>>> Could the pool be emptied more aggressively so that it does not fill up?
>
> You didn't really address the additional questions I posed there.
>
> I'll add one more: how many of the messages are duplicates or
> *effectively* duplicates? Or is it hard to determine, at the time
> the entries are being made, that they are duplicates?
>
> It _should_ also be fairly easy to enlarge the buffer on demand, say, if
> it got half full. What's the time scale over which the buffer filled
> up? Did a single #MC fill it up?
>
> I really think we need to understand what the problem is and have _some_
> confidence that the proposed solution will fix that, even if we're just
> talking about a new config option.

I've seen a similar issue, and it's not just related to memory errors.
In my experience it was MCA errors from a variety of hardware blocks.
For example, a bad link internal to an SoC could spew MCA errors
regardless of the scale of RAM or CPUs. The same thing is possible for
a bad cache, etc.

These were during pre-production testing, and the easy workaround is to
increase the MCE genpool size at build time.
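
For reference, the build-time size is just a compile-time constant in
arch/x86/kernel/cpu/mce/genpool.c. Paraphrased from memory (so please
double-check against the tree), it looks something like this:

/* Simplified excerpt; the whole pool is carved out of a static buffer. */
#define MCE_POOLSZ	(2 * PAGE_SIZE)

static struct gen_pool *mce_evt_pool;
static char gen_pool_buf[MCE_POOLSZ];

Bumping MCE_POOLSZ is the build-time workaround I mean.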

I don't think this needs to be the default though.

How about this to start?

1) Keep the current config size for boot time.
2) Add a kernel parameter and/or sysfs file to allow users to request
additional genpool capacity.
3) Use gen_pool_add(), or whichever, to add the capacity based on user
input. A rough sketch of what I mean is below.
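
To make #3 concrete, here is a minimal sketch of what I'm picturing.
mce_gen_pool_extend() is a made-up name for illustration, not existing
code, and error handling is abbreviated:

/*
 * Illustrative sketch only, not existing kernel code.
 */
static int mce_gen_pool_extend(struct gen_pool *pool, size_t extra_bytes)
{
	int order = get_order(extra_bytes);
	unsigned long addr;
	int ret;

	/* Back the requested capacity with freshly allocated pages. */
	addr = __get_free_pages(GFP_KERNEL, order);
	if (!addr)
		return -ENOMEM;

	/* Append the new chunk to the existing pool. */
	ret = gen_pool_add(pool, addr, extra_bytes, -1);
	if (ret)
		free_pages(addr, order);

	return ret;
}

A sysfs store handler (or an early_param for the boot-time case) would
parse the requested size and call something like that. Growing the pool
from process context should be okay, since gen_pool_add() just appends
a new chunk under the pool's lock, but that's worth double-checking
against the #MC-time consumers.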

Maybe this can be expanded later to be automatic. But I think it's
simpler to start with explicit user input.

Thanks,
Yazen