linux-kernel - Re: Dealing with the NMI mess

Message-ID: <20150724110639.GG19282@twins.programming.kicks-ass.net>
Date:	Fri, 24 Jul 2015 13:06:39 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Andy Lutomirski <luto@...capital.net>, X86 ML <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Willy Tarreau <w@....eu>, Borislav Petkov <bp@...en8.de>,
	Thomas Gleixner <tglx@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	Brian Gerst <brgerst@...il.com>
Subject: Re: Dealing with the NMI mess

On Thu, Jul 23, 2015 at 02:54:54PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@...capital.net> wrote:
> >
> > Or we just re-enable them on the way out of NMI (i.e. the very last
> > thing we do in the NMI handler).  I don't want to break regular
> > userspace gdb when perf is running.
> 
> I'd really prefer it if we don't touch NMI code in those kinds of
> ways. The NMI code is fragile as hell. All the problems we have with
> it is exactly due to "where is the boundary" issues.
> 
> That's why I *don't* want NMI code to do magic crap. Anything that
> says "disable this during this magic window" is broken. The problems
> we've had are exactly about atomicity of the entry/exit conditions,
> and there is no really good way to get them right.
> 
> I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
> because it's so *obvious* that it's not a boundary condition.
> 
> I dislike the "disable and re-enable dr7 in the NMI handler" exactly
> because it smells like "we can only handle faults in _this_ region".
> It may be true, but it's also what I want us to get away from. I'd
> much rather have the "big picture" be that we can take faults anywhere
> at all (*), and that none of the core code really cares. Then we "fix
> up" user space.

A wee bit something like so?

We need the intermediate self-IPI because the NMI/MCE etc. return paths
do not process TIF flags.

I also clear all of DR7, rather than individual breakpoints, to reduce
the amount of state tracked. The patch does not distinguish between
kernel and user watchpoints because the kernel can touch both with
interrupts disabled (!IF).
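
To make the deferral chain easier to follow, here is a rough user-space
model of it. This is only a sketch: struct sim_cpu and every sim_*
function are hypothetical stand-ins invented for illustration, not kernel
APIs. The flags merely imitate what irq_work_queue() and task_work_add()
arrange in the patch below, and 0x401 is an arbitrary example DR7 value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU; none of these names exist in the kernel. */
struct sim_cpu {
	unsigned long dr7;		/* simulated hardware DR7 */
	unsigned long saved_dr7;	/* copy stashed by the trap handler */
	bool irqs_disabled;		/* simulated !IF in the trapped context */
	bool irq_work_pending;		/* stands in for irq_work_queue() */
	bool task_work_pending;		/* stands in for task_work_add() */
};

/* Step 1: #DB arrives. Returns true if handling was deferred. */
bool sim_do_debug(struct sim_cpu *cpu)
{
	if (!cpu->irqs_disabled)
		return false;		/* normal path, nothing deferred */

	cpu->saved_dr7 = cpu->dr7;	/* like get_debugreg(dds->dr7, 7) */
	cpu->dr7 = 0;			/* like set_debugreg(0, 7) */
	cpu->irq_work_pending = true;	/* request the self-IPI */
	return true;
}

/* Step 2: the self-IPI can only run once interrupts are enabled again. */
void sim_run_irq_work(struct sim_cpu *cpu)
{
	if (cpu->irqs_disabled || !cpu->irq_work_pending)
		return;

	cpu->irq_work_pending = false;
	cpu->task_work_pending = true;	/* trampoline queues the task work */
}

/* Step 3: on the way back to user space, pending task work restores DR7. */
void sim_return_to_user(struct sim_cpu *cpu)
{
	if (!cpu->task_work_pending)
		return;

	cpu->task_work_pending = false;
	cpu->dr7 = cpu->saved_dr7;	/* like set_debugreg(dds->dr7, 7) */
}

/* Walk through the whole chain once and return the final DR7 value. */
unsigned long sim_demo(void)
{
	struct sim_cpu cpu = { .dr7 = 0x401UL, .irqs_disabled = true };

	bool deferred = sim_do_debug(&cpu);
	assert(deferred && cpu.dr7 == 0);	/* watchpoints are off */

	sim_run_irq_work(&cpu);			/* still !IF: no progress */
	assert(!cpu.task_work_pending);

	cpu.irqs_disabled = false;		/* interrupts come back on */
	sim_run_irq_work(&cpu);
	assert(cpu.task_work_pending);

	sim_return_to_user(&cpu);
	return cpu.dr7;
}
```

Calling sim_demo() walks the three steps in order. The model keeps DR7
at zero for the whole window between the trap and the return to user
space, which is the "fix up user space afterwards" behaviour asked for
in the quoted mail.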

---
 arch/x86/kernel/traps.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8e65d8a9b8db..e8308e9c2b1e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,6 +570,33 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 NOKPROBE_SYMBOL(fixup_bad_iret);
 #endif
 
+struct do_debug_state {
+	unsigned long dr7;
+	struct irq_work irq_work;
+	struct callback_head task_work;
+};
+
+static void __debug_irq_trampoline(struct irq_work *work)
+{
+	struct do_debug_state *dds =
+		container_of(work, struct do_debug_state, irq_work);
+
+	task_work_add(current, &dds->task_work, true);
+}
+
+static void __debug_restore_dr7(struct callback_head *work)
+{
+	struct do_debug_state *dds =
+		container_of(work, struct do_debug_state, task_work);
+
+	set_debugreg(dds->dr7, 7);
+}
+
+static DEFINE_PER_CPU(struct do_debug_state, do_debug_state) = {
+	.irq_work = { .func = __debug_irq_trampoline, },
+	.task_work = { .func = __debug_restore_dr7, },
+};
+
 /*
  * Our handling of the processor debug registers is non-trivial.
  * We do not clear them on entry and exit from the kernel. Therefore
@@ -603,6 +630,16 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 
 	ist_enter(regs);
 
+	if (arch_irqs_disabled_flags(regs->flags)) {
+		struct do_debug_state *dds = this_cpu_ptr(&do_debug_state);
+
+		get_debugreg(dds->dr7, 7);
+		set_debugreg(0, 7);
+		irq_work_queue(&dds->irq_work);
+
+		goto exit;
+	}
+
 	get_debugreg(dr6, 6);
 
 	/* Filter out all the reserved bits which are preset to 1 */