Date:   Mon, 16 Oct 2023 10:14:05 -0400
From:   Yazen Ghannam <yazen.ghannam@....com>
To:     Dave Hansen <dave.hansen@...el.com>,
        "Sironi, Filippo" <sironi@...zon.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc:     yazen.ghannam@....com, "tony.luck@...el.com" <tony.luck@...el.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8
 pages

On 10/12/23 11:49 AM, Dave Hansen wrote:
> On 10/12/23 04:46, Sironi, Filippo wrote:
>> There's correlation across the errors that we're seeing, indeed,
>> we're looking at the same row being responsible for multiple CPUs
>> tripping and running into #MC. I still don't like the complete lack of
>> visibility; it's not uncommon in a large fleet to have to take a
>> server out of production, replace a DIMM, and shortly after take it
>> out of production again to replace another DIMM just because some of
>> the errors weren't properly logged.
> 
> So you had two nearly simultaneous DIMM failures.  The first failed,
> filled up the buffer and then the second failed, but there was no room.
> The second failed *SO* soon after the first that there was no
> opportunity to empty the buffer between.
> 
> Right?
> 
> How do you know that storing 8 pages of records will catch this case as
> opposed to storing 2?
> 
>>> Is there any way that the size of the pool can be more automatically
>>> determined? Is the likelihood of a bunch of errors proportional to the
>>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>>
>>> Could the pool be emptied more aggressively so that it does not fill up?
> 
> You didn't really address the additional questions I posed there.
> 
> I'll add one more: how many of the messages are duplicates or
> *effectively* duplicates?  Or is it hard to determine, at the time the
> entries are being made, whether they are duplicates?
> 
> It _should_ also be fairly easy to enlarge the buffer on demand, say, if
> it got half full.  What's the time scale over which the buffer filled
> up?  Did a single #MC fill it up?
> 
> I really think we need to understand what the problem is and have _some_
> confidence that the proposed solution will fix that, even if we're just
> talking about a new config option.

I've seen a similar issue, and it's not just related to memory errors.
In my experience it was MCA errors from a variety of hardware blocks.
For example, a bad link internal to an SoC could spew MCA errors
regardless of the amount of RAM or the number of CPUs. The same is
possible for a bad cache, etc.

These were during pre-production testing, and the easy workaround is to
increase the MCE genpool size at build time.
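
For reference, if memory serves the pool is backed by a static buffer
sized by a single build-time constant in arch/x86/kernel/cpu/mce/genpool.c,
so the pre-production workaround amounts to a one-line change (constant
name recalled from memory, not checked against the current tree):

-#define MCE_POOLSZ	(2 * PAGE_SIZE)
+#define MCE_POOLSZ	(8 * PAGE_SIZE)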

I don't think this needs to be the default though.

How about this to start?

1) Keep the current config size for boot time.
2) Add a kernel parameter and/or sysfs file to allow users to request
additional genpool capacity.
3) Use gen_pool_add(), or whichever, to add the capacity based on user
input.

Maybe this can be expanded later to be automatic. But I think it's simpler
to start with explicit user input.
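
To make (2) and (3) a bit more concrete, here's a rough, untested sketch
of what a boot parameter plus a late gen_pool_add() could look like. The
names mce_poolsz=, mce_poolsz_extra and mce_gen_pool_extend() are made up
for illustration, and whether dynamically allocated memory is acceptable
for the #MC path is exactly the kind of thing that would need review:

/*
 * Sketch only: extra MCE genpool capacity requested on the kernel
 * command line, e.g. mce_poolsz=64K, and handed to the allocator with
 * gen_pool_add() once the pool exists. mce_evt_pool is the existing
 * pool (name recalled from memory).
 */
static unsigned long mce_poolsz_extra;

static int __init mce_poolsz_setup(char *str)
{
	mce_poolsz_extra = memparse(str, NULL);
	return 1;
}
__setup("mce_poolsz=", mce_poolsz_setup);

static int mce_gen_pool_extend(void)
{
	unsigned long addr;
	int ret;

	if (!mce_poolsz_extra)
		return 0;

	addr = (unsigned long)kzalloc(mce_poolsz_extra, GFP_KERNEL);
	if (!addr)
		return -ENOMEM;

	ret = gen_pool_add(mce_evt_pool, addr, mce_poolsz_extra, -1);
	if (ret)
		kfree((void *)addr);

	return ret;
}

A sysfs knob would look much the same, just with the size coming from a
store handler instead of __setup().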

Thanks,
Yazen
