Message-ID: <314eedc5-c27e-4e63-b74a-7b06f64fdd86@intel.com>
Date: Sat, 16 Dec 2023 04:41:36 +0530
From: Sohil Mehta <sohil.mehta@...el.com>
To: "Luck, Tony" <tony.luck@...el.com>, "x86@...nel.org" <x86@...nel.org>,
Borislav Petkov <bp@...en8.de>
CC: Thomas Gleixner <tglx@...utronix.de>, Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, Dave Hansen
<dave.hansen@...ux.intel.com>, "H . Peter Anvin" <hpa@...or.com>, "Yazen
Ghannam" <yazen.ghannam@....com>, Arnd Bergmann <arnd@...db.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>
Subject: Re: x86/mce: Is mce_is_memory_error() incorrect for Intel?

Thanks for the explanation, Tony. It is very helpful.
>> Type                          Form
>> ----                          ----
>> Generic Cache Hierarchy       000F 0000 0000 11LL
>> TLB Errors                    000F 0000 0001 TTLL
>> Memory Controller Errors      000F 0000 1MMM CCCC
>> Cache Hierarchy Errors        000F 0001 RRRR TTLL
>> Extended Memory Errors        000F 0010 1MMM CCCC
>> Bus and Interconnect Errors   000F 1PPT RRRR IILL
>>
>> I am not sure what are the practical implications of getting
>> mce_is_memory_error() wrong. (This issue is completely theoretical right
>> now.) Any insights?
>
> This function is used to check whether an address is OS addressable memory
> (i.e. for a page that could be taken offline). That doesn't apply to the caching
> use case (the only way to "offline" such a page would be to offline each of the
> slow memory pages that it might be used for).
>
Makes sense. I am assuming these Extended Memory Errors will not be used
anymore (even for CXL.mem type configs) and we don't need to include
them in the mce_is_memory_error() check? I'll update the comment
accordingly.
> I'm not quite sure why bit 8 (cache hierarchy error) was added into this check.
> It would seem to have the same issues as extended memory.
>
From a little bit of digging, it seems the check for "cache hierarchy
errors" was always there. Commit fa92c5869426 ("x86, mce: Support memory
error recovery for both UCNA and Deferred error in machine_check_poll")
introduced the original checks, but maybe the intention at that time was
different? I see that the CEC stuff was added later, so maybe the
original memory-related failures were handled differently?

Now, should we remove the cache-error-related check from
mce_is_memory_error()?
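
To make the question concrete, here is a rough standalone sketch of the
encodings in the table quoted above. It is not the kernel's code: the enum,
classify_mcacod(), is_memory_error() and the count_cache_errors flag are
hypothetical names invented for illustration, and the masks are derived only
from the bit patterns in the table (with bit 12, "F", treated as a filter bit
that is ignored).

```c
#include <stdint.h>

/* Hypothetical names for illustration; these are not kernel identifiers. */
enum mcacod_type {
	MCACOD_UNKNOWN,
	MCACOD_GENERIC_CACHE,	/* 000F 0000 0000 11LL */
	MCACOD_TLB,		/* 000F 0000 0001 TTLL */
	MCACOD_MEM_CTRL,	/* 000F 0000 1MMM CCCC */
	MCACOD_CACHE_HIER,	/* 000F 0001 RRRR TTLL */
	MCACOD_EXT_MEM,		/* 000F 0010 1MMM CCCC */
	MCACOD_BUS,		/* 000F 1PPT RRRR IILL */
};

/* Classify a 16-bit MCACOD value according to the table above. */
static enum mcacod_type classify_mcacod(uint16_t mcacod)
{
	/* Assumption: bit 12 (F) only says "filtered", so drop it. */
	uint16_t c = mcacod & 0xefff;

	/* Bit 11 set means bus/interconnect; the low bits are only detail. */
	if ((c & 0xe800) == 0x0800)
		return MCACOD_BUS;
	if ((c & 0xff80) == 0x0280)
		return MCACOD_EXT_MEM;
	if ((c & 0xff00) == 0x0100)
		return MCACOD_CACHE_HIER;
	if ((c & 0xff80) == 0x0080)
		return MCACOD_MEM_CTRL;
	if ((c & 0xfff0) == 0x0010)
		return MCACOD_TLB;
	if ((c & 0xfffc) == 0x000c)
		return MCACOD_GENERIC_CACHE;
	return MCACOD_UNKNOWN;
}

/*
 * Sketch of the two policies being discussed. With count_cache_errors
 * set, cache hierarchy errors are treated as memory errors too; with it
 * clear, only memory controller errors are. Extended memory errors are
 * excluded in both cases, per the reasoning earlier in this thread.
 */
static int is_memory_error(uint16_t mcacod, int count_cache_errors)
{
	switch (classify_mcacod(mcacod)) {
	case MCACOD_MEM_CTRL:
		return 1;
	case MCACOD_CACHE_HIER:
	case MCACOD_GENERIC_CACHE:
		return count_cache_errors != 0;
	default:
		return 0;
	}
}
```

With this sketch, a code such as 0x0135 (cache hierarchy form) counts as a
memory error only when count_cache_errors is nonzero, while 0x0290 (extended
memory form) never does. Removing the cache check would correspond to always
passing 0 for that flag.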