[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMzpN2hWhcYBT0noFSc24e_fH-sVwK+Da97Byvge4tsF3_vRTQ@mail.gmail.com>
Date: Tue, 22 May 2012 21:26:13 -0400
From: Brian Gerst <brgerst@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Avi Kivity <avi@...hat.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...e.hu>, "H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Paul Turner <pjt@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Subject: Re: NMI vs #PF clash
On Tue, May 22, 2012 at 8:39 PM, Steven Rostedt <rostedt@...dmis.org> wrote:
> On Tue, 2012-05-22 at 08:33 -0700, Linus Torvalds wrote:
>> On Tue, May 22, 2012 at 7:27 AM, Steven Rostedt <rostedt@...dmis.org> wrote:
>> >
>> > Is reading it fast? Then we could do a two reads and only write when
>> > needed.
>>
>> Even better: we could do nothing at all.
>>
>> We could just say: let's make sure that any #PF case that can happen
>> in #NMI can also be re-done with arbitrary 'error_code' and 'struct
>> regs' contents.
>>
>> At that point, what could happen is
>> - #PF
>> - NMI
>> - #PF
>> - read cr2 for NMI fault
>> - handle the NMI #PF
>> - return from #PF
>> - return from #NMI
>> - read cr2 for original #PF fault - but get the NMI cr2 again
>> - hande the #PF again (this should be a no-op now)
>
> I tested this by forcing all NMIs to have a page fault, and then running
> perf against hackbench to see what breaks. This doesn't work because the
> cr2 contains where the fault happened.
>
> If the NMI faulting address was in kernel space, the page fault it
> interrupted will think the user task is trying to access kernel memory
> and fault.
>
> Basically, the address will not be valid for the user task and this will
> make the task take a SEGFAULT.
>
> I had the NMI fault on the address 0x12348, but had a exception table to
> handle it.
>
> hackbench[2236]: segfault at 12348 ip 0000003054a091c4 sp 00007fff22c6e840 error 4 in ld-2.12.2.so[3054a00000+1e000]
>
> With the below patch, the faulting NMIs don't seem to be harming anything.
>
>
>> - return from #PF
>> - instruction restart causes new #PF
>> - now we do the original page fault
>>
>> So one option is to just make sure that the few cases (just the
>> vmalloc area?) that NMI can trigger are all ok to be re-done with
>> other state.
>>
>> I note that right now we have
>>
>> if (unlikely(fault_in_kernel_space(address))) {
>> if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
>> if (vmalloc_fault(address) >= 0)
>> return;
>>
>> and that the error_code check means that the retried NMI #PF would not
>> go through that. But maybe we don't even need that check?
>>
>> That error_code thing seems to literally be the only thing that keeps
>> us from just re-doing the vmalloc_fault() silently.
>
> Avi, you want to test this?
>
> -- Steve
>
> Index: linux-trace.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-trace.git.orig/arch/x86/kernel/entry_64.S
> +++ linux-trace.git/arch/x86/kernel/entry_64.S
> @@ -1586,7 +1586,7 @@ ENTRY(nmi)
> * Check the special variable on the stack to see if NMIs are
> * executing.
> */
> - cmpl $1, -8(%rsp)
> + cmpl $1, -16(%rsp)
> je nested_nmi
>
> /*
> @@ -1615,7 +1615,7 @@ nested_nmi:
>
> 1:
> /* Set up the interrupted NMIs stack to jump to repeat_nmi */
> - leaq -6*8(%rsp), %rdx
> + leaq -7*8(%rsp), %rdx
> movq %rdx, %rsp
> CFI_ADJUST_CFA_OFFSET 6*8
> pushq_cfi $__KERNEL_DS
> @@ -1625,7 +1625,7 @@ nested_nmi:
> pushq_cfi $repeat_nmi
>
> /* Put stack back */
> - addq $(11*8), %rsp
> + addq $(12*8), %rsp
> CFI_ADJUST_CFA_OFFSET -11*8
>
> nested_nmi_out:
> @@ -1650,6 +1650,8 @@ first_nmi:
> * +-------------------------+
> * | temp storage for rdx |
> * +-------------------------+
> + * | saved cr2 |
> + * +-------------------------+
> * | NMI executing variable |
> * +-------------------------+
> * | Saved SS |
> @@ -1672,8 +1674,20 @@ first_nmi:
> * to the repeat_nmi. The original stack frame and the temp storage
> * is also used by nested NMIs and can not be trusted on exit.
> */
> +
> + /*
> + * Save off the CR2 register. If we take a page fault in the NMI then
> + * it could corrupt the CR2 value. If the NMI preempts a page fault
> + * handler before it was able to read the CR2 register, and then the
> + * NMI itself takes a page fault, the page fault that was preempted
> + * will read the information from the NMI page fault and not the
> + * origin fault. Save it off and restore it if it changes.
> + */
> + movq %cr2, %rdx
> + pushq_cfi %rdx
> +
> /* Do not pop rdx, nested NMIs will corrupt that part of the stack */
> - movq (%rsp), %rdx
> + movq 8(%rsp), %rdx
> CFI_RESTORE rdx
>
> /* Leave room for the "in NMI" variable. */
> @@ -1682,7 +1696,7 @@ first_nmi:
>
> /* Copy the stack frame to the Saved frame */
> .rept 5
> - pushq_cfi 6*8(%rsp)
> + pushq_cfi 7*8(%rsp)
> .endr
> CFI_DEF_CFA_OFFSET SS+8-RIP
>
> @@ -1734,6 +1748,13 @@ end_repeat_nmi:
> nmi_swapgs:
> SWAPGS_UNSAFE_STACK
> nmi_restore:
> + /* Test if the cr2 reg changed */
> + movq ORIG_RAX-R15+(12*8)(%rsp), %rdx
> + movq %cr2, %rcx
> + cmp %rdx, %rcx
> + je 1f
> + movq %rdx, %cr2
> +1:
> RESTORE_ALL 8
> /* Clear the NMI executing stack variable */
> movq $0, 10*8(%rsp)
You could save cr2 in a callee-saved register (like r12) instead of
putting it on the stack.
--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists