Message-ID: <314eedc5-c27e-4e63-b74a-7b06f64fdd86@intel.com>
Date: Sat, 16 Dec 2023 04:41:36 +0530
From: Sohil Mehta <sohil.mehta@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>, "x86@...nel.org" <x86@...nel.org>,
	Borislav Petkov <bp@...en8.de>
CC: Thomas Gleixner <tglx@...utronix.de>, Peter Zijlstra
	<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, Dave Hansen
	<dave.hansen@...ux.intel.com>, "H . Peter Anvin" <hpa@...or.com>, "Yazen
 Ghannam" <yazen.ghannam@....com>, Arnd Bergmann <arnd@...db.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: x86/mce: Is mce_is_memory_error() incorrect for Intel?

Thanks Tony for the explanation. It is very helpful.

>> Type                          Form
>> ----                          ----
>> Generic Cache Hierarchy       000F 0000 0000 11LL
>> TLB Errors                    000F 0000 0001 TTLL
>> Memory Controller Errors      000F 0000 1MMM CCCC
>> Cache Hierarchy Errors        000F 0001 RRRR TTLL
>> Extended Memory Errors        000F 0010 1MMM CCCC
>> Bus and Interconnect Errors   000F 1PPT RRRR IILL
>>
>> I am not sure what are the practical implications of getting
>> mce_is_memory_error() wrong. (This issue is completely theoretical right
>> now.) Any insights?
> 
> This function is used to check whether an address is OS addressable memory
> (i.e. for a page that could be taken offline). That doesn't apply to the caching
> use case (the only way to "offline" such a page would be to offline each of the
> slow memory pages that it might be used for).
> 

Makes sense. I am assuming these Extended Memory Errors will not be used
anymore (even for CXL.mem type configs) and we don't need to include
them in the mce_is_memory_error() check? I'll update the comment
accordingly.
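
For reference, here is a small userspace sketch of how the compound error
code forms in the table above can be told apart, next to the three mask
comparisons that, as far as I can tell, make up the Intel branch of
mce_is_memory_error(). The enum, the function names and the exact masks are
my own illustration and an assumption about the current code, not a copy of
the kernel source, so please check them against the tree. Bit 12 (the "F"
filter bit) is ignored throughout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum mcacod_type {
	MCACOD_OTHER,
	MCACOD_GENERIC_CACHE,	/* 000F 0000 0000 11LL */
	MCACOD_TLB,		/* 000F 0000 0001 TTLL */
	MCACOD_MEM_CTRL,	/* 000F 0000 1MMM CCCC */
	MCACOD_CACHE,		/* 000F 0001 RRRR TTLL */
	MCACOD_EXT_MEM,		/* 000F 0010 1MMM CCCC */
	MCACOD_BUS,		/* 000F 1PPT RRRR IILL */
};

/*
 * Classify the MCACOD field (low 16 bits of IA32_MCi_STATUS).
 * Every mask starts with 0xe so that bits 15:13 must be zero and
 * bit 12 (filter) is a don't-care.
 */
static enum mcacod_type mcacod_classify(uint64_t status)
{
	uint16_t code = (uint16_t)(status & 0xffff);

	if ((code & 0xe800) == 0x0800)
		return MCACOD_BUS;
	if ((code & 0xef80) == 0x0280)
		return MCACOD_EXT_MEM;
	if ((code & 0xef00) == 0x0100)
		return MCACOD_CACHE;
	if ((code & 0xef80) == 0x0080)
		return MCACOD_MEM_CTRL;
	if ((code & 0xeff0) == 0x0010)
		return MCACOD_TLB;
	if ((code & 0xeffc) == 0x000c)
		return MCACOD_GENERIC_CACHE;
	return MCACOD_OTHER;
}

/*
 * Assumed shape of the Intel check: memory controller errors,
 * cache hierarchy errors and generic cache hierarchy errors match;
 * extended memory, TLB and bus/interconnect errors do not.
 */
static bool is_memory_error_intel_sketch(uint64_t status)
{
	return (status & 0xef80) == 0x0080 ||
	       (status & 0xef00) == 0x0100 ||
	       (status & 0xeffc) == 0x000c;
}
```

With these masks, an Extended Memory Error such as 0x0290 has bit 9 set, so
the first comparison sees 0x0280 rather than 0x0080 and the sketch returns
false for it, while a Cache Hierarchy Error such as 0x0135 matches the
second comparison and returns true.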

> I'm not quite sure why bit 8 (cache hierarchy error) was added into this check.
> It would seem to have the same issues as extended memory.
> 

From a little bit of digging it seems the check for "cache hierarchy
errors" was always there. Commit fa92c5869426 ("x86, mce: Support memory
error recovery for both UCNA and Deferred error in machine_check_poll")
introduced the original checks but maybe the intention at that time was
different? I see that the CEC stuff was added later so maybe the
original memory related failures were handled differently?

Now, should we remove the cache error related check from
mce_is_memory_error()?