Message-Id: <1394221143-29713-1-git-send-email-dzickus@redhat.com>
Date:	Fri,  7 Mar 2014 14:39:03 -0500
From:	Don Zickus <dzickus@...hat.com>
To:	hpa@...or.com
Cc:	LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
	vgoyal@...hat.com, ebiederm@...ssion.com,
	Don Zickus <dzickus@...hat.com>
Subject: [PATCH] x86: Skip latched NMIs on early boot in kdump

A customer generated an external NMI using their iLO to test that kdump worked.
Unfortunately, the machine hung.  Disabling the nmi_watchdog made things work.

I speculated that the external NMI fired and caused the machine to panic (as
expected), and that the perf NMI from the watchdog then came in and was latched.
My guess was that this somehow caused the hang.

Debugging this with outb's and debug_putstr, I learned the following:

- the machine hung during the first memcpy in copy_bootdata (in
arch/x86/kernel/head64.c)
- early_make_pgtable was called during this memcpy
- after early_make_pgtable, an exception vector 2 (NMI) came in
- the IP of this vector was in copy_bootdata's range
- because there was no fixup associated with this IP, the machine
  is sitting in a 'hlt' instruction (in arch/x86/kernel/head_64.S)

(copy and paste from arch/x86/kernel/head_64.S)
/* This is global to keep gas from relaxing the jumps */
ENTRY(early_idt_handler)

<snip>

        cmpl $14,72(%rsp)       # Page fault?
        jnz 10f
        GET_CR2_INTO(%rdi)      # can clobber any volatile register if pv
        call early_make_pgtable
        andl %eax,%eax
        jz 20f                  # All good

10:
        leaq 88(%rsp),%rdi      # Pointer to %rip
        call early_fixup_exception
        andl %eax,%eax
        jnz 20f                 # Found an exception entry

11:
	<snip>
1:      hlt
	^^^^^^^^^^^^ sitting here

        jmp 1b

I added the hack below, which simply returns if the exception is an NMI, and
things seem to work.
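
For readers less familiar with the assembly, the decision flow of the excerpt
above, with and without the hack, can be modeled roughly in C.  This is only an
illustrative user-space sketch, not kernel code: early_dispatch, struct
early_env and its fields are hypothetical names made up here, the booleans stand
in for the return values of early_make_pgtable and early_fixup_exception, and
the snipped part before the 'hlt' loop is not modeled.

```c
#include <stdbool.h>

enum { VEC_NMI = 2, VEC_PAGE_FAULT = 14 };

/* What the modeled handler ends up doing. */
enum early_action { EARLY_RETURN, EARLY_HALT };

/* Hypothetical stand-ins for the state the real handler consults. */
struct early_env {
	bool pgtable_ok;   /* early_make_pgtable returned 0 */
	bool fixup_found;  /* early_fixup_exception returned nonzero */
	bool skip_nmi;     /* the two-instruction NMI check is applied */
};

static enum early_action early_dispatch(int vector, const struct early_env *env)
{
	/* cmpl $14 ... call early_make_pgtable ... jz 20f  (All good) */
	if (vector == VEC_PAGE_FAULT && env->pgtable_ok)
		return EARLY_RETURN;

	/* label 10: the proposed check, cmpl $2 ... jz 20f  (skip NMIs) */
	if (env->skip_nmi && vector == VEC_NMI)
		return EARLY_RETURN;

	/* call early_fixup_exception ... jnz 20f  (Found an exception entry) */
	if (env->fixup_found)
		return EARLY_RETURN;

	/* label 11 and beyond: falls through to the 'hlt' loop */
	return EARLY_HALT;
}
```

In this model, a vector 2 exception with no fixup entry ends in EARLY_HALT when
skip_nmi is false and in EARLY_RETURN when it is true, which matches the
behaviour observed on the test machine.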

Now, I don't expect this to be the correct solution.  Nor do I fully understand
what this early boot code is doing, so hopefully folks wiser than me can
provide me a better patch to test. :-)

I also do not fully understand why the latched NMI is not happening immediately
after the load idt call or why it comes after a page fault (the
early_make_pgtable).  Further adding to my confusion is why the early printk
magic didn't dump a stack, as I believe I had that set up on my command line.
But I figured I would just report what I have observed.

My testing and debugging were based off a 3.10 kernel (RHEL-7) but included
Seiji's tracepoint cleanups to arch/x86/kernel/head_64.S|head64.c.  Not much
has changed upstream here.  Also 3.14-rc4 still has the same hang.

Signed-off-by: Don Zickus <dzickus@...hat.com>
---
 arch/x86/kernel/head_64.S |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 77e6d3e..05306c8 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -368,6 +368,8 @@ ENTRY(early_idt_handler)
 	jz 20f			# All good
 
 10:
+	cmpl $2,72(%rsp)	# NMI?
+	jz 20f			# skip NMIs
 	leaq 88(%rsp),%rdi	# Pointer to %rip
 	call early_fixup_exception
 	andl %eax,%eax
-- 
1.7.1
