linux-kernel - Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <s5h619reiog.wl-tiwai@suse.de>
Date:	Mon, 23 Mar 2015 18:18:39 +0100
From:	Takashi Iwai <tiwai@...e.de>
To:	Denys Vlasenko <dvlasenk@...hat.com>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Denys Vlasenko <vda.linux@...glemail.com>,
	Jiri Kosina <jkosina@...e.cz>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Stefan Seyfried <stefan.seyfried@...glemail.com>,
	X86 ML <x86@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	Tejun Heo <tj@...nel.org>
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

At Mon, 23 Mar 2015 17:07:15 +0100,
Denys Vlasenko wrote:
> 
> On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> > At Mon, 23 Mar 2015 10:35:41 +0100,
> > Takashi Iwai wrote:
> >>
> >> At Mon, 23 Mar 2015 10:02:52 +0100,
> >> Takashi Iwai wrote:
> >>>
> >>> At Fri, 20 Mar 2015 19:16:53 +0100,
> >>> Denys Vlasenko wrote:
> >>>> Takashi, are you willing to reproduce the panic one more time,
> >>>> with this patch? I would like to see whether oops messages
> >>>> are more informative with it.
> >>>
> >>> It can't be applied to 4.0-rc5, unfortunately.
> >>>
> >>> arch/x86/kernel/entry_64.S: Assembler messages:
> >>> arch/x86/kernel/entry_64.S:1725: Error: no such instruction: `alloc_pt_gpregs_on_stack'
> >>> arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* sections) for `+'
> >>> scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' failed
> >>
> >> I pulled tip tree on top of 4.0-rc5, built with your patch and now
> >> succeeded to get a better message:
> >>
> >>  kvm: zapping shadow pages for mmio generation wraparound
> >>  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> >>  Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28  EFLAGS: 00010006
> >>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
> >>  PANIC: double fault, error_code: 0x0
> >>  CPU: 1 PID: 10819 Comm: cc1 Tainted: G        W       4.0.0-rc5-debug1+ #2
> >>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> >>  task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000
> >>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
> >>  RSP: 0018:00007ffd22c23f28  EFLAGS: 00010006
> >>  RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101
> >>  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0
> >>  RBP: 0000000000000ea7 R08: 0000000000001ea7 R09: ffffffffffffffff
> >>  R10: 000000000309dbf8 R11: 0000000000000246 R12: 0000000000000001
> >>  R13: 0000000000000000 R14: 0000000003026e40 R15: 000000000309cd50
> >>  FS:  00007f89c83c2800(0000) GS:ffff88021d240000(0000) knlGS:0000000000000000
> >>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  CR2: 000000000000016d CR3: 00000000d90a0000 CR4: 00000000001427e0
> >>  Stack:
> >>   0000000000000ea7 0000000000000000 0000000003099c10 0000000000000ea7
> >>   0000000000000ea7 0000000000000001 0000000003099c10 0000000000000ea7
> >>   0000000000c84696 0000000003099c88 00007f0122c23fb8 000000000302f610
> >>  Call Trace:
> >>   <UNK> 
> >>  Code: 
> >>  10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
> >>  Kernel panic - not syncing: Machine halted.
> >>  CPU: 1 PID: 10819 Comm: cc1 Tainted: G        W       4.0.0-rc5-debug1+ #2
> >>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> >>   0000000000000000 ffff8800d1b33e28 ffffffff816f80d2 0000000000000000
> >>   ffffffff81a22f81 ffff8800d1b33ea8 ffffffff816f2358 00000000000058d7
> >>   0000000000000008 ffff8800d1b33eb8 ffff8800d1b33e58 ffff8800d1b33ea8
> >>  Call Trace:
> >>   [<ffffffff816f80d2>] dump_stack+0x4c/0x6e
> >>   [<ffffffff816f2358>] panic+0xc0/0x1f3
> >>   [<ffffffff81046e65>] df_debug+0x35/0x40
> >>   [<ffffffff81003fe7>] do_double_fault+0x87/0x100
> >>   [<ffffffff81004167>] do_userpsace_rsp_in_kernel+0x107/0x140
> >>   [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0
> >>   [<ffffffff81703ca6>] userpsace_rsp_in_kernel+0x36/0x40
> >>   [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0
> >>
> >>
> >> So, it seems hitting in netlink_attachskb().
> >> I'd need to check whether this consistently hits there or just at
> >> random.
> > 
> > I managed to reproduce the bug two more times, and all three show the
> > very same stack trace like the above.  So, it's well reproducible.
> 
> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
> 
>    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>    5:   55                      push   %rbp
>    6:   48 89 e5                mov    %rsp,%rbp
>    9:   41 56                   push   %r14
>    b:   41 55                   push   %r13
>    d:   49 89 d5                mov    %rdx,%r13
>   10:   41 54                   push   %r12
>   12:   49 89 f4                mov    %rsi,%r12
>   15:   53                      push   %rbx
>   16:   48 89 fb                mov    %rdi,%rbx
>   19:   48 83 ec 30             sub    $0x30,%rsp
>   1d:   8b 87 68 01 00 00       mov    0x168(%rdi),%eax
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   23:   39 87 9c 01 00 00       cmp    %eax,0x19c(%rdi)
>   29:   7c 25                   jl     50 <_start+0x50>
>   2b:   48 8b 87 88 04 00 00    mov    0x488(%rdi),%rax
> 
> The ^^^^^ instruction is the one which faults. Since you said it
> consistently happens here, this should be a page fault, not an external
> hardware interrupt.
> 
> The code corresponds to the comparison in if():
> 
> int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
>                       long *timeo, struct sock *ssk)
> {
>         struct netlink_sock *nlk;
> 
>         nlk = nlk_sk(sk);
> 
>         if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> 
> %rdi (which is 1st param, "struct sock *sk") is 00007ffd22c23ef0
> (userspace address), but it's just because my patch clobbers %rdi,   :(
> we don't know which value it had at that moment.
> 
> > I'm really puzzled now.  We have a few pieces of information:
> > 
> > - git bisection pointed the commit 96b6352c1271:
> >     x86_64, entry: Remove the syscall exit audit and schedule optimizations
> >   and reverting this "fixes" the problem indeed.  Even just moving two
> >   lines
> >     LOCKDEP_SYS_EXIT
> >     DISABLE_INTERRUPTS(CLBR_NONE) 
> >   at the beginning of ret_from_sys_call already fixes.  (Of course I
> >   can't prove the fix but it stabilizes for a day without crash while
> >   usually I hit the bug in 10 minutes in full test running.)
> 
> The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
> interrupt-disabled region to interrupt-enabled:
> 
>         cmpq $__NR_syscall_max,%rax
>         ja ret_from_sys_call
>         movq %r10,%rcx
>         call *sys_call_table(,%rax,8)  # XXX:    rip relative
>         movq %rax,RAX-ARGOFFSET(%rsp)
> ret_from_sys_call:
> 	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
>  	LOCKDEP_SYS_EXIT
>  	DISABLE_INTERRUPTS(CLBR_NONE)
>  	TRACE_IRQS_OFF
> ...
> ...
> int_ret_from_sys_call_fixup:
>         FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>         jmp int_ret_from_sys_call
> ...
> ...
> GLOBAL(int_ret_from_sys_call)
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
> 
> You reverted that by moving this insn to be after first DISABLE_INTERRUPTS(CLBR_NONE).

Oh yes.  I forgot to mention that I tested also only moving
DISABLE_INTERRUPTS(CLBR_NONE) at the beginning.  But this didn't
help by some reason.

And, I tested also without all kernel debug options (both
CONFIG_DEBUG_LOCK_ALLOC and CONFIG_TRACE_IRQFLAGS are off), but this
kernel also showed the same crash.

> I also don't see how moving that check (even if it is wrong in a more
> benign way) can have such a drastic effect.
> 
> 
> Shot-in-the-dark idea. At this code revision we did not yet
> store user's %rsp in pt_regs->sp, we used a fixup to populate it:
> 
>         .macro FIXUP_TOP_OF_STACK tmp offset=0
>         movq PER_CPU_VAR(old_rsp),\tmp
>         movq \tmp,RSP+\offset(%rsp)
> 
> (There are pending patches to fix this mess).
> 
> If an interrupt interrupting *kernel code* would go into a code path
> which does FIXUP_TOP_OF_STACK, it'd overwrite the correct saved %rsp
> with a user's one. The iret from interrupt would work,
> but the resulting CPU state would be inconsistent. But I don't see
> such a code path from interrupts to FIXUP_TOP_OF_STACK...
> 
> 
> > - Another piece is that the bug happens only when a KVM is running.
> >   The kernel ran without problem over days with similar tasks
> >   (compiling kernel, etc) when no KVM was used.
> 
> Conceivably virtualization support in CPUs can have nasty erratas.
> However, you and other reporter have different CPUs - yours
> is Ivy Bridge, his CPU is a Penryn.
> 
> I don't see the path how KVM helps to trigger this.
> 
> > - And now I get the trace as above, pointing netlink_attachskb().
> > 
> > I have a difficulty to imagine how all these pieces fit into a single
> > picture.  Is something already screwed up before that?
> 
> Well, a tiny bit more info will be seen if you'd change %rdi
> to, say, %r15 in these two lines in my patch:
> 
>        /* Save bogus RSP value */
>        movq    %rsp,%rdi
> ...
>        push    %rdi            /* pt_regs->sp */
> 
> Then original %rdi will be visible in the crash message.

OK, here we go.

 kvm: zapping shadow pages for mmio generation wraparound
 kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
 Exception on user stack 00007fff1d7e5ec0: RSP: 0018:00007fff1d7e5ef8  EFLAGS: 00010002
 RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
 PANIC: double fault, error_code: 0x0
 CPU: 5 PID: 14285 Comm: fixdep Tainted: G        W       4.0.0-rc5-debug1+ #3
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
 task: ffff88020ba1c690 ti: ffff880206ba4000 task.ti: ffff880206ba4000
 RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
 RSP: 0018:00007fff1d7e5ef8  EFLAGS: 00010002
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000c0000101
 RDX: 0000000000000000 RSI: 0000000000001ebb RDI: 0000000000000000
 RBP: 0000000000000022 R08: 0000000000000004 R09: 0000000000000000
 R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000001ebb
 R13: 00007fb642fcc6e4 R14: 00007fb642fcdc18 R15: 00007fff1d7e5ec0
 FS:  00007fb642fa9700(0000) GS:ffff88021d340000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000168 CR3: 00000000ce1b4000 CR4: 00000000001427e0
 Stack:
  0000000000000005 0000000000401582 00007fb642fcb180 0000000000000053
  0000000000400d8a 0000000000000000 0000000000000000 000000005d152a17
  0000000000400f8c 0000000000000000 0000000100000004 00007fb642fcb000
 Call Trace:
  <UNK> 
 Code: 
 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
 Kernel panic - not syncing: Machine halted.
 CPU: 5 PID: 14285 Comm: fixdep Tainted: G        W       4.0.0-rc5-debug1+ #3
 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  0000000000000000 ffff880206ba7e28 ffffffff816f80d2 0000000000000000
  ffffffff81a22f81 ffff880206ba7ea8 ffffffff816f2358 00000000000050da
  0000000000000008 ffff880206ba7eb8 ffff880206ba7e58 ffff880206ba7ea8
 Call Trace:
  [<ffffffff816f80d2>] dump_stack+0x4c/0x6e
  [<ffffffff816f2358>] panic+0xc0/0x1f3
  [<ffffffff81046e65>] df_debug+0x35/0x40
  [<ffffffff81003fe7>] do_double_fault+0x87/0x100
  [<ffffffff81004167>] do_userpsace_rsp_in_kernel+0x107/0x140
  [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0
  [<ffffffff81703ca7>] userpsace_rsp_in_kernel+0x37/0x40
  [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0


I have to leave my office now.  If you need any further tests, let me
know; I'll do it tomorrow.  In anyway I'll need to double-check
whether I tested properly.


thanks,

Takashi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/