Date:   Mon, 25 Jan 2021 12:38:59 -0500
From:   Steven Rostedt <rostedt@...dmis.org>
To:     Lai Jiangshan <jiangshanlai@...il.com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Lai Jiangshan <laijs@...ux.alibaba.com>,
        Andy Lutomirski <luto@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further

On Mon, 25 Jan 2021 15:45:06 +0800
Lai Jiangshan <jiangshanlai@...il.com> wrote:

> From: Lai Jiangshan <laijs@...ux.alibaba.com>
> 
> Commit 929bacec21478 ("x86/entry/64: De-Xen-ify our NMI code") simplified
> the NMI code by changing paravirt code into native code and left a comment
> about "inspecting RIP instead".  But until now, "inspecting RIP instead"
> has not been implemented, and this patch completes it.
> 
> The comments in the code are from Andy Lutomirski.  Thanks!
> 
> Signed-off-by: Lai Jiangshan <laijs@...ux.alibaba.com>
> ---
>  arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>  1 file changed, 11 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cad08703c4ad..21f67ea62341 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>  	je	nested_nmi
>  
>  	/*
> -	 * Now test if the previous stack was an NMI stack.  This covers
> -	 * the case where we interrupt an outer NMI after it clears
> -	 * "NMI executing" but before IRET.  We need to be careful, though:
> -	 * there is one case in which RSP could point to the NMI stack
> -	 * despite there being no NMI active: naughty userspace controls
> -	 * RSP at the very beginning of the SYSCALL targets.  We can
> -	 * pull a fast one on naughty userspace, though: we program
> -	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
> -	 * if it controls the kernel's RSP.  We set DF before we clear
> -	 * "NMI executing".
> +	 * Now test if we interrupted an outer NMI that just cleared "NMI
> +	 * executing" and is about to IRET.  This is a single-instruction
> +	 * window.  This check does not handle the case in which we get a
> +	 * nested interrupt (#MC, #VE, #VC, etc.) after clearing
> +	 * "NMI executing" but before the outer NMI executes IRET.
>  	 */
> -	lea	6*8(%rsp), %rdx
> -	/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
> -	cmpq	%rdx, 4*8(%rsp)
> -	/* If the stack pointer is above the NMI stack, this is a normal NMI */
> -	ja	first_nmi
> -
> -	subq	$EXCEPTION_STKSZ, %rdx
> -	cmpq	%rdx, 4*8(%rsp)
> -	/* If it is below the NMI stack, it is a normal NMI */
> -	jb	first_nmi
> -
> -	/* Ah, it is within the NMI stack. */
> -
> -	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
> -	jz	first_nmi	/* RSP was user controlled. */

So we no longer check whether the current stack is on the NMI stack.
That makes sense, since this beginning of the NMI code cannot be
interrupted, as there are no breakpoints or faults that can occur while
it runs. The $nmi_executing variable is set at all locations except for:

  repeat_nmi - end_repeat_nmi
  and the iretq itself (.Lnmi_iret).

Thus, just by checking the $nmi_executing variable on the stack, along
with comparing the RIP against repeat_nmi-end_repeat_nmi and .Lnmi_iret,
we can safely tell whether the NMI is nested or not.
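
To make sure I'm reading it right, here is that combined test spelled
out as a plain C model (my reading only, not the kernel's code; the
label addresses and arguments are stand-ins for the stack slots the asm
reads, e.g. the saved RIP at 8(%rsp)):

  #include <stdbool.h>
  #include <stdint.h>

  /* Stand-ins for the asm labels; hypothetical names. */
  extern const uintptr_t repeat_nmi_addr, end_repeat_nmi_addr, nmi_iret_addr;

  /*
   * nmi_executing: the "NMI executing" variable read off the NMI stack
   * rip:           the interrupted RIP from the exception frame
   */
  static bool nmi_is_nested(uint64_t nmi_executing, uintptr_t rip)
  {
      if (nmi_executing)
          return true;    /* the asm's existing "je nested_nmi" */
      if (rip >= repeat_nmi_addr && rip < end_repeat_nmi_addr)
          return true;    /* inside repeat_nmi..end_repeat_nmi */
      if (rip == nmi_iret_addr)
          return true;    /* about to execute the iretq */
      return false;       /* otherwise this is a first NMI */
  }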

I was working on rewriting the beginning comments to reflect these updates,
and discovered a possible bug that exists (unrelated to this patch).

> +	cmpq	$.Lnmi_iret, 8(%rsp)
> +	jne	first_nmi
>  

On triggering an NMI from user space, I see that the switch to the
thread stack is done, and "exc_nmi" is called.

The problem I see with this is that exc_nmi is called on the thread
stack. If it were to take an exception there, the IRET returning from
that exception would re-enable NMIs, allowing a nested NMI to run. From
what I can tell, nothing stops that NMI from executing over the
currently running NMI. That is, NMI handlers are now re-entrant.

Yes, the stack issue is not a problem here, but NMI handlers are not
allowed to be re-entrant. For example, we have spin locks in NMI handlers
that are considered fine when they are taken only in NMI context. But if
there is a way to make NMI handlers re-entrant, those spin locks can
deadlock.
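
As a concrete illustration (a kernel-style sketch, not any actual
handler; the lock and function names are made up):

  #include <linux/spinlock.h>

  /* Taken only in NMI context, so no irqsave is needed -- which is
   * fine exactly as long as NMIs cannot nest on a CPU. */
  static DEFINE_SPINLOCK(nmi_only_lock);

  static void example_nmi_handler(void)
  {
      spin_lock(&nmi_only_lock);
      /*
       * A nested NMI arriving here (via the exception/IRET path
       * above) re-enters this handler on the same CPU and spins
       * forever on a lock that CPU already holds: deadlock.
       */
      spin_unlock(&nmi_only_lock);
  }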

I'm guessing that we need to add some tricks to the user-space path to
set and clear the "NMI executing" variable, but clearing it on the
return path without races may get a bit complex.
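
Roughly what I have in mind, as C-like pseudocode (all names here are
hypothetical, and the clear side is exactly where the race lives):

  /* NMI-from-user-space path, sketched as pseudocode only. */
  void nmi_from_userspace(struct pt_regs *regs)
  {
      switch_to_thread_stack();   /* as the entry code already does   */
      set_nmi_executing();        /* nested NMIs take the nested path */
      exc_nmi(regs);
      clear_nmi_executing();
      /*
       * An NMI landing between the clear above and the final iretq
       * would again look like a first NMI -- that is the race that
       * would need solving.
       */
      return_to_user();
  }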

-- Steve
