lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3908561D78D1C84285E8C5FCA982C28F7F57E302@ORSMSX115.amr.corp.intel.com>
Date:   Wed, 19 Feb 2020 22:33:37 +0000
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
CC:     Borislav Petkov <bp@...en8.de>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-arch <linux-arch@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...nel.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Greg KH <gregkh@...uxfoundation.org>,
        "gustavo@...eddedor.com" <gustavo@...eddedor.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        "paulmck@...nel.org" <paulmck@...nel.org>,
        "Josh Triplett" <josh@...htriplett.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Dan Carpenter <dan.carpenter@...cle.com>,
        Masami Hiramatsu <mhiramat@...nel.org>
Subject: RE: [PATCH v3 02/22] x86,mce: Delete ist_begin_non_atomic()

> One big question here: are memory failure #MC exceptions synchronous
> or can they be delayed?   If we get a memory failure, is it possible
> that the #MC hits some random context and not the actual context where
> the error occurred?

There are a few cases:
1) SRAO (Software recoverable action optional) [Patrol scrub or L3 cache eviction]
These aren't synchronous with any core execution. Using machine check to signal
was probably a mistake - compounded by it being broadcast :-(  Could pick any CPU
to handle (actually choose the first to arrive in do_machine_check()). That guy should
arrange to soft offline the affected page. Every CPU can return to what they were doing
before.

2) SRAR (Software recoverable action required)
These are synchronous. Starting with Skylake they may be signaled just to the thread
that hit the poison. Earlier generations broadcast.
	2a) Hit in ring3 code ... we want to offline the page and SIGBUS the task(s)
	2b) Memcpy_mcsafe() ... kernel has a recovery path. "Return" to the recovery code instead of to the original RIP.
	2c) copy_from_user ... not implemented yet. We are in kernel, but would like to treat this like case 2a

3) Fatal
Always broadcast. Some bank has MCi_STATUS.PCC==1. System must be shutdown.

-Tony

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ