lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <SJ1PR11MB60831E048B544C83E845D907FC8DA@SJ1PR11MB6083.namprd11.prod.outlook.com>
Date:   Wed, 13 Dec 2023 19:54:25 +0000
From:   "Luck, Tony" <tony.luck@...el.com>
To:     "Mehta, Sohil" <sohil.mehta@...el.com>,
        "x86@...nel.org" <x86@...nel.org>, Borislav Petkov <bp@...en8.de>
CC:     Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H . Peter Anvin" <hpa@...or.com>,
        "Yazen Ghannam" <yazen.ghannam@....com>,
        Arnd Bergmann <arnd@...db.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: RE: x86/mce: Is mce_is_memory_error() incorrect for Intel?

> The code tries to identify the memory and cache errors by masking the
> status and then comparing based on the bit encodings below. But it seems
> to be missing the "Extended Memory Errors" encoding which may have been
> added after the original code was written.

That's right. The "Extended Memory Errors" were added for systems that use
some "fast" memory as a cache for "slower" memory. Initial use case was for
3D-Xpoint (using this to add capacity rather than making use of persistence).

> Type                          Form
> ----                          ----
> Generic Cache Hierarchy       000F 0000 0000 11LL
> TLB Errors                    000F 0000 0001 TTLL
> Memory Controller Errors      000F 0000 1MMM CCCC
> Cache Hierarchy Errors                000F 0001 RRRR TTLL
> Extended Memory Errors                000F 0010 1MMM CCCC
> Bus and Interconnect Errors   000F 1PPT RRRR IILL
>
> I am not sure what are the practical implications of getting
> mce_is_memory_error() wrong. (This issue is completely theoretical right
> now.) Any insights?

This function is used to check whether an address is OS addressable memory
(i.e. for a page that could be taken offline). That doesn't apply to the caching
use case (the only way to "offline" such a page would be to offline each of the
slow memory pages that it might be used for).

I'm not quite sure why bit 8 (cache hierarchy error) was added into this check,
It would seem to have the same issues as extended memory.

> A couple of other points:
>
> - The code seems ripe for a rewrite to be rid of the magic masks and bit
> comparisons. I am thinking of doing that in a separate patch along side
> of rewriting the comment. Would that be useful even if no issue exists?

Maybe the code would be prettier, or at least easier to read, with some well
chosen names for the bits and fields. Only way to tell would be to write the
code and see how it looks.

> - Relying on these bit encodings seems problematic in the long run with
> the possibility of more things that could always be added. Is there a
> better way to do it?

Computer h/w architecture is going to evolve so change is going to be
needed however you code it.

-Tony

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ