Message-ID: <YgwnqTc8FGG3orcE@agluck-desk3.sc.intel.com>
Date: Tue, 15 Feb 2022 14:22:33 -0800
From: "Luck, Tony" <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
Cc: Jue Wang <juew@...gle.com>, x86@...nel.org,
linux-kernel@...r.kernel.org, patches@...ts.linux.dev
Subject: Re: [PATCH] x86/mce: Add workaround for SKX/CLX/CPX spurious machine
checks
On Tue, Feb 15, 2022 at 11:08:43PM +0100, Borislav Petkov wrote:
> > This is still better than the OS crashes on MCEs raised on an
> > irrelevant process due to 'rep movs*' accesses in a kernel context,
> > e.g., copy_page.
>
> Wait a minute: so the MCE will happen for a piece of buffer that REP;
> MOVS *wasn't* supposed to copy.
Yes. That's why this is a "spurious" MCE. The "REP; MOVS" does
a fetch beyond the source range. If there is poison there, BOOM,
MCE :-(
> So why are we even disabling fast strings operations? Why aren't we
> simply ignoring this MCE with a warn in dmesg since, reportedly, we can
> recover safely?
This early in do_machine_check() we don't know whether this was from
an overenthusiastic REP; MOVS fetch or a "normal" machine check.
I don't think there is an easy way to tell the difference.
Since that "extra fetch" is part of the fast string mode, the workaround
is to disable fast strings and return. That does mean fast strings
also get disabled for machine checks that had nothing to do with
this quirk. But this does provide a good-enough workaround.
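To make the "disable fast strings and return" step concrete, here is a minimal userspace sketch, not the patch itself. It assumes the fast-strings enable is bit 0 of IA32_MISC_ENABLE; `struct fake_cpu` and `disable_fast_strings()` are hypothetical names that stand in for the real MSR read/write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of IA32_MISC_ENABLE: fast-strings enable. */
#define MISC_ENABLE_FAST_STRING (1ULL << 0)

/* Hypothetical stand-in for the MSR; kernel code would use rdmsr/wrmsr. */
struct fake_cpu {
	uint64_t misc_enable;
};

/*
 * Clear the fast-strings bit if it is set.
 * Returns true when the bit was set and has now been cleared, i.e. the
 * workaround was applied and the handler could return early.
 * Returns false when fast strings were already off, so the machine
 * check cannot have come from the over-fetch and must be handled
 * normally.
 */
static bool disable_fast_strings(struct fake_cpu *cpu)
{
	if (!(cpu->misc_enable & MISC_ENABLE_FAST_STRING))
		return false;

	cpu->misc_enable &= ~MISC_ENABLE_FAST_STRING;
	return true;
}
```

Note that the sketch cannot tell a spurious machine check from a real one either. It only records that fast strings were on, which is the same over-approximation described above.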
> What about the MCE broadcasting synchronization? This is bypassing
> everything. There's mce_exception_count which counts stuff too.
The first check:
	if ((mcgstatus & MCG_STATUS_LMCES)
asks "is this a local machine check?" If so, no broadcast sync is
needed. But that needs a comment.
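As a rough illustration of that test, assuming LMCE_S is bit 3 of IA32_MCG_STATUS (`mce_is_local()` is a hypothetical helper, not a kernel function):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 3 of IA32_MCG_STATUS: local machine check exception signaled. */
#define MCG_STATUS_LMCES (1ULL << 3)

/*
 * A local machine check is delivered only to the logical CPU that hit
 * the error, so the other CPUs are not in the handler and there is
 * nothing to synchronize with.
 */
static bool mce_is_local(uint64_t mcgstatus)
{
	return (mcgstatus & MCG_STATUS_LMCES) != 0;
}
```

Only when this returns true would the quirk path be allowed to skip the broadcast rendezvous.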
-Tony