Message-ID: <6591377b-7911-444b-abf9-cfc978472d76@intel.com>
Date:   Thu, 12 Oct 2023 08:49:39 -0700
From:   Dave Hansen <dave.hansen@...el.com>
To:     "Sironi, Filippo" <sironi@...zon.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc:     "tony.luck@...el.com" <tony.luck@...el.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8
 pages
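
For context: assuming the pool is still sized by the MCE_POOLSZ
constant in arch/x86/kernel/cpu/mce/genpool.c, the change under
discussion would presumably amount to something like:

    /* arch/x86/kernel/cpu/mce/genpool.c */
    #define MCE_POOLSZ      (8 * PAGE_SIZE)   /* was (2 * PAGE_SIZE) */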

On 10/12/23 04:46, Sironi, Filippo wrote:
> There's correlation across the errors that we're seeing; indeed,
> we're looking at the same row being responsible for multiple CPUs
> tripping and running into #MC. I still don't like the complete lack
> of visibility: it's not uncommon in a large fleet to take a server
> out of production, replace a DIMM, and shortly afterwards take it
> out of production again to replace another DIMM, just because some
> of the errors weren't properly logged.

So you had two nearly simultaneous DIMM failures.  The first failed
and filled up the buffer, and then the second failed, but there was no
room left.  The second failure came *SO* soon after the first that
there was no opportunity to empty the buffer in between.

Right?

How do you know that storing 8 pages of records will catch this case as
opposed to storing 2?
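
For scale, a back-of-envelope sketch.  The ~128-byte record size is an
assumption (roughly sizeof(struct mce_evt_llist) rounded up to the
genpool's power-of-two granule), as is the 4 KiB page size:

    #include <stdio.h>

    /* Rough capacity of the MCE gen_pool: pages * page size divided
     * by an assumed ~128-byte per-record granule. */
    #define PAGE_SZ         4096
    #define RECORD_SZ       128

    int main(void)
    {
            printf("2 pages -> ~%d records\n", 2 * PAGE_SZ / RECORD_SZ);
            printf("8 pages -> ~%d records\n", 8 * PAGE_SZ / RECORD_SZ);
            return 0;
    }

That is roughly 64 versus 256 records, so the question is whether the
bursts you are seeing fit in either bound.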

>> Is there any way that the size of the pool can be more automatically
>> determined? Is the likelihood of a bunch of errors proportional to the
>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>
>> Could the pool be emptied more aggressively so that it does not fill up?

You didn't really address the additional questions I posed there.

I'll add one more: how many of the messages are duplicates or
*effectively* duplicates?  Or is it hard to tell, at the time the
entries are being made, whether they are duplicates?
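
Purely as a sketch (not mainline code; the helper name is made up), a
cheap duplicate check at insertion time might compare the fields that
identify the error source:

    #include <asm/mce.h>

    /* Sketch: treat two records as effective duplicates when the
     * bank, status, address and misc fields all match. */
    static bool mce_record_is_dup(const struct mce *a,
                                  const struct mce *b)
    {
            return a->bank == b->bank &&
                   a->status == b->status &&
                   a->addr == b->addr &&
                   a->misc == b->misc;
    }

Whether a scan like that is cheap enough to do from #MC context,
against however many entries are already queued, is part of the same
question.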

It _should_ also be fairly easy to enlarge the buffer on demand, say, if
it got half full.  What's the time scale over which the buffer filled
up?  Did a single #MC fill it up?
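
Again only a sketch, with made-up names: gen_pool_add() allocates chunk
metadata, so the growing would have to happen from process context
(e.g. the work that drains the pool) rather than from #MC context, with
the spare chunk preallocated:

    #include <linux/genalloc.h>

    static char mce_spare_buf[2 * PAGE_SIZE];
    static bool mce_pool_grown;

    /* Sketch: add a preallocated spare chunk once the pool is half
     * full.  Must run in process context because gen_pool_add()
     * kmalloc()s the chunk bookkeeping. */
    static void mce_gen_pool_maybe_grow(struct gen_pool *pool)
    {
            if (mce_pool_grown)
                    return;

            if (gen_pool_avail(pool) <= gen_pool_size(pool) / 2) {
                    gen_pool_add(pool, (unsigned long)mce_spare_buf,
                                 sizeof(mce_spare_buf), -1);
                    mce_pool_grown = true;
            }
    }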

I really think we need to understand what the problem actually is and
have _some_ confidence that the proposed solution will fix it, even if
we're just talking about a new config option.
