linux-kernel - Re: [PATCH v3 02/22] x86,mce: Delete ist_begin_non

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <772ACE2A-FD8B-492F-960E-981ECC72E283@amacapital.net>
Date:   Wed, 19 Feb 2020 14:48:34 -0800
From:   Andy Lutomirski <luto@...capital.net>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Borislav Petkov <bp@...en8.de>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-arch <linux-arch@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...nel.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Greg KH <gregkh@...uxfoundation.org>,
        "gustavo@...eddedor.com" <gustavo@...eddedor.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        "paulmck@...nel.org" <paulmck@...nel.org>,
        Josh Triplett <josh@...htriplett.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Dan Carpenter <dan.carpenter@...cle.com>,
        Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: [PATCH v3 02/22] x86,mce: Delete ist_begin_non_atomic()


> On Feb 19, 2020, at 2:33 PM, Luck, Tony <tony.luck@...el.com> wrote:
> 
> 
>> 
>> One big question here: are memory failure #MC exceptions synchronous
>> or can they be delayed?   If we get a memory failure, is it possible
>> that the #MC hits some random context and not the actual context where
>> the error occurred?
> 
> There are a few cases:
> 1) SRAO (Software recoverable action optional) [Patrol scrub or L3 cache eviction]
> These aren't synchronous with any core execution. Using machine check to signal
> was probably a mistake - compounded by it being broadcast :-(  Could pick any CPU
> to handle (actually choose the first to arrive in do_machine_check()). That guy should
> arrange to soft offline the affected page. Every CPU can return to what they were doing
> before.

You could handle this by sending IPI-to-self and dealing with it in the interrupt handler. Or even wake a high-priority kthread or workqueue. irq_work may help. Relying on task_work or the non_atomic stuff seems silly - you can’t rely on anything about the interrupted context, and the context is more or less irrelevant anyway.

> 
> 2) SRAR (Software recoverable action required)
> These are synchronous. Starting with Skylake they may be signaled just to the thread
> that hit the poison. Earlier generations broadcast.

Here’s where dealing with one that came from kernel code is just nasty, right?

I would argue that, if IF=0, killing the machine is reasonable.  If IF=1, we should be okay.  Actually making this work sanely is gross, and arguably the goal should be minimizing grossness.

Perhaps, if we came from kernel mode, we should IPI-to-self and use a special vector that is idtentry, not apicinterrupt.  Or maybe even do this for entries from usermode just to keep everything consistent.

>    2a) Hit in ring3 code ... we want to offline the page and SIGBUS the task(s)
>    2b) Memcpy_mcsafe() ... kernel has a recovery path. "Return" to the recovery code instead of to the original RIP.
>    2c) copy_from_user ... not implemented yet. We are in kernel, but would like to treat this like case 2a
> 
> 3) Fatal
> Always broadcast. Some bank has MCi_STATUS.PCC==1. System must be shutdown.

Easy :)

It would be really, really nice if NMI was masked in MCE context.

> 
> -Tony