lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4ce89a5e-96f8-4939-b86e-f65c16f4bd4e@intel.com>
Date:   Thu, 14 Dec 2023 00:57:51 +0530
From:   Sohil Mehta <sohil.mehta@...el.com>
To:     <x86@...nel.org>, Borislav Petkov <bp@...en8.de>,
        Tony Luck <tony.luck@...el.com>
CC:     Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H . Peter Anvin" <hpa@...or.com>,
        "Yazen Ghannam" <yazen.ghannam@....com>,
        Arnd Bergmann <arnd@...db.de>, <linux-kernel@...r.kernel.org>,
        <linux-edac@...r.kernel.org>
Subject: Re: x86/mce: Is mce_is_memory_error() incorrect for Intel?

On 12/6/2023 7:08 AM, Sohil Mehta wrote:

In an effort to rewrite the below Intel specific comment in
mce_is_memory_error(), I think I may have found an issue. It would be
really helpful if someone can help check the analysis below.

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 7b397370b4d6..d42122b1afea 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -482,7 +482,8 @@ bool mce_is_memory_error(struct mce *m)
>  	case X86_VENDOR_INTEL:
>  	case X86_VENDOR_ZHAOXIN:
>  		/*
> -		 * Intel SDM Volume 3B - 15.9.2 Compound Error Codes
> +		 * Intel SDM: Machine-Check Architecture -> "Compound Error
> +		 * Codes"
>  		 *
>  		 * Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
>  		 * indicating a memory error. Bit 8 is used for indicating a


The full comment reads as follows:

* Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
* indicating a memory error. Bit 8 is used for indicating a
* cache hierarchy error. The combination of bit 2 and bit 3
* is used for indicating a `generic' cache hierarchy error
* But we can't just blindly check the above bits, because if
* bit 11 is set, then it is a bus/interconnect error - and
* either way the above bits just gives more detail on what
* bus/interconnect error happened. Note that bit 12 can be
* ignored, as it's the "filter" bit.
*/

return (m->status & 0xef80) == BIT(7) ||  ---> Memory Controller Errors
	(m->status & 0xef00) == BIT(8) || ---> Cache Hierarchy Errors
	(m->status & 0xeffc) == 0xc;	  ---> Generic Cache Hierarchy

The code tries to identify the memory and cache errors by masking the
status and then comparing based on the bit encodings below. But it seems
to be missing the "Extended Memory Errors" encoding which may have been
added after the original code was written.

Type 				Form
----				----
Generic Cache Hierarchy 	000F 0000 0000 11LL
TLB Errors			000F 0000 0001 TTLL
Memory Controller Errors	000F 0000 1MMM CCCC
Cache Hierarchy Errors		000F 0001 RRRR TTLL
Extended Memory Errors		000F 0010 1MMM CCCC
Bus and Interconnect Errors	000F 1PPT RRRR IILL

I am not sure what are the practical implications of getting
mce_is_memory_error() wrong. (This issue is completely theoretical right
now.) Any insights?

A couple of other points:

- The code seems ripe for a rewrite to be rid of the magic masks and bit
comparisons. I am thinking of doing that in a separate patch along side
of rewriting the comment. Would that be useful even if no issue exists?

- Relying on these bit encodings seems problematic in the long run with
the possibility of more things that could always be added. Is there a
better way to do it?

Sohil

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ