linux-kernel - RE: [PATCH -next v4 2/3] x86/mce: rename MCE_IN_KERNEL_COPYIN to MCE_IN_KERNEL_COPY

Message-ID: <SJ1PR11MB6083BDC3A0596FA87BC25259FC422@SJ1PR11MB6083.namprd11.prod.outlook.com>
Date: Fri, 2 Feb 2024 21:36:27 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
CC: Tong Tiangen <tongtiangen@...wei.com>, Thomas Gleixner
	<tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
	"wangkefeng.wang@...wei.com" <wangkefeng.wang@...wei.com>, Dave Hansen
	<dave.hansen@...ux.intel.com>, "x86@...nel.org" <x86@...nel.org>, "H. Peter
 Anvin" <hpa@...or.com>, Andy Lutomirski <luto@...nel.org>, Peter Zijlstra
	<peterz@...radead.org>, Andrew Morton <akpm@...ux-foundation.org>, "Naoya
 Horiguchi" <naoya.horiguchi@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-edac@...r.kernel.org"
	<linux-edac@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	Guohanjun <guohanjun@...wei.com>
Subject: RE: [PATCH -next v4 2/3] x86/mce: rename MCE_IN_KERNEL_COPYIN to
 MCE_IN_KERNEL_COPY_MC

> > At least on Intel you can only get a machine check for operation on poison data LOAD.
> > Not for a STORE. I believe that is generally true - other arches to confirm.
>
> So what happens if you store to a poisoned cacheline on Intel? It'll
> raise a poison consumption error when that cacheline is loaded in the
> cache? Because you need to load that line into the cache for writing,
> I'd presume...

There are two places in the pipeline where poison is significant.

1) When the memory controller gets a request to fetch some data. If the ECC
check on the bits returned from the DIMMs fails, the memory controller will
log a "UCNA" signature error to a machine check bank for the memory channel
where the DIMMs live. If CMCI is enabled for that bank, then a CMCI is
sent to all logical CPUs that are in the scope of that bank (generally a
CPU socket). The data is marked with a POISON signature and passed
to the entity that requested it. Caches support this POISON signature
and preserve it as data is moved between caches, or written back to
memory. The original request may have been a prefetch or a speculative
read. In these cases there won't be a machine check. Linux
uc_decode_notifier() will try to offline pages when it sees UCNA signatures.

2) When a CPU core tries to retire an instruction that consumes poison
data, or needs to retire a poisoned instruction. These log an SRAR signature
into a core-scoped bank (on most Xeons to date, bank 0 for poisoned
instructions and bank 1 for poisoned data consumption). Then they signal a
machine check.
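
As a rough illustration of how the two signatures above differ, here is a
minimal sketch that sorts an MCi_STATUS value into UCNA/SRAO/SRAR using the
VAL, UC, PCC, S and AR bits as laid out in the Intel SDM. The enum and the
classify_status() name are made up for this example; this is not the
kernel's own severity grading code, which looks at more than these bits.

```c
#include <stdint.h>

/* MCi_STATUS bit positions as documented in the Intel SDM. */
#define STATUS_VAL (1ULL << 63) /* register contains a valid error */
#define STATUS_UC  (1ULL << 61) /* uncorrected error */
#define STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define STATUS_S   (1ULL << 56) /* signaled via machine check */
#define STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical names, used only in this sketch. */
enum poison_sig { SIG_NONE, SIG_UCNA, SIG_SRAO, SIG_SRAR, SIG_OTHER };

static enum poison_sig classify_status(uint64_t status)
{
	if (!(status & STATUS_VAL))
		return SIG_NONE;

	/* Corrected errors and PCC=1 (fatal) cases are out of scope here. */
	if (!(status & STATUS_UC) || (status & STATUS_PCC))
		return SIG_OTHER;

	switch (status & (STATUS_S | STATUS_AR)) {
	case 0:
		return SIG_UCNA; /* logged, no machine check (case 1 above) */
	case STATUS_S:
		return SIG_SRAO; /* signaled, action optional */
	case STATUS_S | STATUS_AR:
		return SIG_SRAR; /* signaled at consumption (case 2 above) */
	default:
		return SIG_OTHER; /* AR without S is not a defined combination */
	}
}
```

The point of the sketch is only that the memory controller log (UCNA) and
the core log (SRAR) are distinguished by the S and AR bits, which is why the
first arrives via CMCI and the second via #MC.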

> What happens if you have bits flipped in the cacheline you want to write
> to?
>
> That's fine because you're overwriting them anyway?
>
> I'd presume ECC check gets performed on cacheline load and then you'll
> have to raise an #MC...

Partial cache line stores to data marked as POISON in the cache maintain
the poison status. Full cache line writes (certainly with the MOVDIR64B
instruction, possibly with some AVX512 instructions) can clear the POISON
status (since you have all new data). A sequence of partial cache line
stores that overwrite all data in a cache line will NOT clear the POISON
status.

Nothing is logged or signaled when updating data in the cache.
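
To make the store rules concrete, here is a toy software model of a single
cache line carrying a poison flag. It is only a sketch of the behavior
described above: struct toy_line and both helpers are invented for this
example, and real hardware tracks poison in the cache itself, not in
anything software can see like this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

/* Hypothetical model of one cache line and its POISON marker. */
struct toy_line {
	uint8_t data[LINE_SIZE];
	bool poison;
};

/*
 * Models an ordinary store that is not a single full-line write.
 * The bytes are updated but the poison flag is left alone, and
 * nothing is logged or signaled.
 */
static void toy_partial_store(struct toy_line *line, size_t off,
			      const void *src, size_t len)
{
	if (off >= LINE_SIZE || len > LINE_SIZE - off)
		return; /* out of range for this line, ignore */
	memcpy(line->data + off, src, len);
}

/*
 * Models a single 64-byte write of all-new data (MOVDIR64B style).
 * Assumption for this model: such a write always clears poison.
 */
static void toy_full_line_store(struct toy_line *line,
				const uint8_t src[LINE_SIZE])
{
	memcpy(line->data, src, LINE_SIZE);
	line->poison = false;
}
```

In this model, eight consecutive 8-byte calls to toy_partial_store() replace
every byte of the line yet leave poison set, while one
toy_full_line_store() clears it.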

-Tony