Date:	Mon, 15 Jun 2009 20:18:07 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
Cc:	mingo@...hat.com, hpa@...or.com, paulus@...ba.org, acme@...hat.com,
	linux-kernel@...r.kernel.org, a.p.zijlstra@...llo.nl,
	penberg@...helsinki.fi, torvalds@...ux-foundation.org,
	vegard.nossum@...il.com, efault@....de, jeremy@...p.org,
	npiggin@...e.de, tglx@...utronix.de,
	linux-tip-commits@...r.kernel.org
Subject: Re: [tip:perfcounters/core] perf_counter: x86: Fix call-chain
	support to use NMI-safe methods


* Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:

> * Ingo Molnar (mingo@...e.hu) wrote:
> > 
> > * Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:
> > 
> > > * tip-bot for Peter Zijlstra (a.p.zijlstra@...llo.nl) wrote:
> > > > Commit-ID:  74193ef0ecab92535c8517f082f1f50504526c9b
> > > > Gitweb:     http://git.kernel.org/tip/74193ef0ecab92535c8517f082f1f50504526c9b
> > > > Author:     Peter Zijlstra <a.p.zijlstra@...llo.nl>
> > > > AuthorDate: Mon, 15 Jun 2009 13:07:24 +0200
> > > > Committer:  Ingo Molnar <mingo@...e.hu>
> > > > CommitDate: Mon, 15 Jun 2009 15:57:53 +0200
> > > > 
> > > > perf_counter: x86: Fix call-chain support to use NMI-safe methods
> > > > 
> > > > __copy_from_user_inatomic() isn't NMI-safe, in that it can trigger
> > > > the page fault handler, which is another trap whose return path
> > > > invokes IRET, which will also close the NMI context.
> > > > 
> > > > Therefore use a GUP based approach to copy the stack frames over.
> > > > 
> > > > We tried an alternative solution as well: we used a forward ported
> > > > version of Mathieu Desnoyers's "NMI safe INT3 and Page Fault" patch
> > > > that modifies the exception return path to use an open-coded IRET with
> > > > explicit stack unrolling and TF checking.
> > > > 
> > > > This didn't work, as it interacted with faulting user-space instructions,
> > > > causing them not to restart properly, which corrupted user-space
> > > > registers.
> > > > 
> > > > Solving that would probably involve disassembling those instructions
> > > > and backtracing the RIP. But even without that, the code was deemed
> > > > to add too much complexity to the already non-trivial x86 entry
> > > > assembly code, so instead we went with this GUP-based method, which
> > > > does a software walk of the page tables.
> > > > 
> > > 
> > > Hrm, I'm probably missing something. Normally, you should test for 
> > > "in_nmi()" upon return from exception, and only in these cases go 
> > > for the open-coded IRET with stack unrolling and ret. I really 
> > > don't see how you end up messing up the page fault return to 
> > > userspace path, as it's impossible to have in_nmi() set.
> > 
> > here's the (heavily modified) version of your patch that i used.
> > 
> > 	Ingo
> > 
> > -------------------->
> > Subject: x86 NMI-safe INT3 and Page Fault
> > From: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
> > Date: Mon, 12 May 2008 21:21:07 +0200
> > 
> > Implements an alternative to iret, using popf and a return, so that trap and
> > exception handlers can return to the NMI handler without issuing iret; iret
> > would cause NMIs to be re-enabled prematurely. x86_32 uses popf and a far
> > return. x86_64 has to copy the return instruction pointer to the top of the
> > previous stack, issue a popf, load the previous rsp and issue a near return
> > (ret).
> > 
> > This makes it valid to place immediate values (and therefore optimized
> > trace_marks) in NMI code, since returning from a breakpoint becomes valid.
> > Accessing vmalloc'd memory, and therefore executing module code or touching
> > vmapped or vmalloc'd areas from NMI context, also becomes valid. This is very
> > useful to tracers like LTTng.
> > 
> > This patch makes all faults, traps and exceptions safe to be called from NMI
> > context *except* single-stepping, which requires iret to restore the TF (trap
> > flag) and jump to the return address in a single instruction. Sorry, no kprobes
> > support in NMI handlers because of this limitation: we cannot single-step an
> > NMI handler, because iret must set the TF flag and return to the instruction
> > to be single-stepped in a single instruction. This cannot be emulated with
> > popf/lret, because the lret itself would be single-stepped. It does not apply
> > to immediate values, because they do not use single-stepping. This code detects
> > whether the TF flag is set and uses the iret path for single-stepping, even
> > though it re-activates NMIs prematurely.
> > 
> > The test to detect nesting under an NMI handler is only done upon return from
> > a trap/exception to the kernel, which is not frequent. Other return paths
> > (return from trap/exception to userspace, return from interrupt) keep exactly
> > the same behavior (no slowdown).
> > 
> > Depends on :
> > change-alpha-active-count-bit.patch
> > change-avr32-active-count-bit.patch
> > 
> > TODO : test with lguest, xen, kvm.
> > 
> > ** This patch depends on the "Stringify support commas" patchset **
> > ** Also depends on fix-x86_64-page-fault-scheduler-race patch **
> > 
> > tested on x86_32 (tests implemented in a separate patch) :
> > - instrumented the return path to export the EIP, CS and EFLAGS values when
> >   taken so we know the return path code has been executed.
> > - trace_mark, using immediate values, with 10ms delay with the breakpoint
> >   activated. Runs well through the return path.
> > - tested vmalloc faults in the NMI handler by placing a non-optimized marker
> >   in the NMI handler (so no breakpoint is executed) and connecting a probe
> >   which touches every page of a 20MB vmalloc'd buffer. It executes through the
> >   return path without problems.
> > - Tested with and without preemption
> > 
> > tested on x86_64
> > - instrumented the return path to export the EIP, CS and EFLAGS values when
> >   taken so we know the return path code has been executed.
> > - trace_mark, using immediate values, with 10ms delay with the breakpoint
> >   activated. Runs well through the return path.
> > 
> > To test on x86_64 :
> > - Test without preemption
> > - Test vmalloc faults
> > - Test on Intel 64-bit CPUs. (AMD64 was fine)
> > 
> > Changelog since v1 :
> > - x86_64 fixes.
> > Changelog since v2 :
> > - fix paravirt build
> > Changelog since v3 :
> > - Include modifications suggested by Jeremy
> > Changelog since v4 :
> > - including hardirq.h in entry_32/64.S is a bad idea (non ifndef'd C code),
> >   define HARDNMI_MASK in the .S files directly.
> > Changelog since v5 :
> > - Add HARDNMI_MASK to irq_count() and make die() more verbose for NMIs.
> > Changelog since v7 :
> > - Implement paravirtualized nmi_return.
> > Changelog since v8 :
> > - refreshed the patch for asm-offsets. Those were left out of v8.
> > - now depends on "Stringify support commas" patch.
> > Changelog since v9 :
> > - Only test the nmi nested preempt count flag upon return from exceptions, not
> >   on return from interrupts. Only the kernel return path has this test.
> > - Add Xen, VMI, lguest support. Use their iret paravirt ops in lieu of
> >   nmi_return.
> > 
> > -- Ported to sched-devel.git
> > 
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
> > CC: akpm@...l.org
> > CC: mingo@...e.hu
> > CC: "H. Peter Anvin" <hpa@...or.com>
> > CC: Jeremy Fitzhardinge <jeremy@...p.org>
> > CC: Steven Rostedt <rostedt@...dmis.org>
> > CC: "Frank Ch. Eigler" <fche@...hat.com>
> > Signed-off-by: Ingo Molnar <mingo@...e.hu>
> > Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> > ---
> >  arch/x86/include/asm/irqflags.h |   56 +++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kernel/dumpstack.c     |    2 +
> >  arch/x86/kernel/entry_32.S      |   30 +++++++++++++++++++++
> >  arch/x86/kernel/entry_64.S      |   57 +++++++++++++++++++++++++++++++---------
> >  include/linux/hardirq.h         |   16 +++++++----
> >  5 files changed, 144 insertions(+), 17 deletions(-)
> > 
> > Index: linux/arch/x86/include/asm/irqflags.h
> > ===================================================================
> > --- linux.orig/arch/x86/include/asm/irqflags.h
> > +++ linux/arch/x86/include/asm/irqflags.h
> > @@ -51,6 +51,61 @@ static inline void native_halt(void)
> >  
> >  #endif
> >  
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * Only returns from a trap or exception to a NMI context (intra-privilege
> > + * level near return) to the same SS and CS segments. Should be used
> > + * upon trap or exception return when nested over a NMI context so no iret is
> > + * issued. It takes care of modifying the eflags, rsp and returning to the
> > + * previous function.
> > + *
> > + * The stack, at that point, looks like :
> > + *
> > + * 0(rsp)  RIP
> > + * 8(rsp)  CS
> > + * 16(rsp) EFLAGS
> > + * 24(rsp) RSP
> > + * 32(rsp) SS
> > + *
> > + * Upon execution :
> > + * Copy EIP to the top of the return stack
> > + * Update top of return stack address
> > + * Pop eflags into the eflags register
> > + * Make the return stack current
> > + * Near return (popping the return address from the return stack)
> > + */
> > +#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushq %rax;		\
> > +						movq %rsp, %rax;	\
> > +						movq 24+8(%rax), %rsp;	\
> > +						pushq 0+8(%rax);	\
> > +						pushq 16+8(%rax);	\
> > +						movq (%rax), %rax;	\
> > +						popfq;			\
> > +						ret
> > +#else
> > +/*
> > + * Protected mode only, no V8086. Implies that protected mode must
> > + * be entered before NMIs or MCEs are enabled. Only returns from a trap or
> > + * exception to a NMI context (intra-privilege level far return). Should be used
> > + * upon trap or exception return when nested over a NMI context so no iret is
> > + * issued.
> > + *
> > + * The stack, at that point, looks like :
> > + *
> > + * 0(esp) EIP
> > + * 4(esp) CS
> > + * 8(esp) EFLAGS
> > + *
> > + * Upon execution :
> > + * Copy the stack eflags to top of stack
> > + * Pop eflags into the eflags register
> > + * Far return: pop EIP and CS into their register, and additionally pop EFLAGS.
> > + */
> > +#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushl 8(%esp);	\
> > +						popfl;		\
> > +						lret $4
> > +#endif
> > +
> >  #ifdef CONFIG_PARAVIRT
> >  #include <asm/paravirt.h>
> >  #else
> > @@ -109,6 +164,7 @@ static inline unsigned long __raw_local_
> >  
> >  #define ENABLE_INTERRUPTS(x)	sti
> >  #define DISABLE_INTERRUPTS(x)	cli
> > +#define INTERRUPT_RETURN_NMI_SAFE	NATIVE_INTERRUPT_RETURN_NMI_SAFE
> >  
> >  #ifdef CONFIG_X86_64
> >  #define SWAPGS	swapgs
> > Index: linux/arch/x86/kernel/dumpstack.c
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/dumpstack.c
> > +++ linux/arch/x86/kernel/dumpstack.c
> > @@ -237,6 +237,8 @@ void __kprobes oops_end(unsigned long fl
> >  
> >  	if (!signr)
> >  		return;
> > +	if (in_nmi())
> > +		panic("Fatal exception in non-maskable interrupt");
> >  	if (in_interrupt())
> >  		panic("Fatal exception in interrupt");
> >  	if (panic_on_oops)
> > Index: linux/arch/x86/kernel/entry_32.S
> > ===================================================================
> > --- linux.orig/arch/x86/kernel/entry_32.S
> > +++ linux/arch/x86/kernel/entry_32.S
> > @@ -80,6 +80,8 @@
> >  
> >  #define nr_syscalls ((syscall_table_size)/4)
> >  
> > +#define HARDNMI_MASK 0x40000000
> > +
> 
> This is called "NMI_MASK" in 2.6.30. Did you test the x86_64 or
> x86_32 portion of this patch? The 64-bit side seems OK, but the
> 32-bit side does not.

I only tested the 64-bit side, and fixed up NMI_MASK only on that 
side (as you can see from the patch). This was a partial port of 
your patch.

> It's all I can spot for now, but if you have the popf/ret path firing
> when returning to faulting user-space instructions, there is clearly
> something fishy there.

I think Linus's observation about cr2 corruption explains all the 
symptoms I saw.

And it all stabilized and started behaving under load once we 
switched to Peter's fast-GUP based soft-pte-lookup method.

	Ingo