linux-kernel - Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?


Message-ID: <CALCETrXZvSiT41+AYAPizSsGZ_=O=7wmb+Lwo_ChEZySxUnH-A@mail.gmail.com>
Date:	Wed, 18 Mar 2015 14:55:56 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Denys Vlasenko <dvlasenk@...hat.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Stefan Seyfried <stefan.seyfried@...glemail.com>,
	Takashi Iwai <tiwai@...e.de>, X86 ML <x86@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>, Tejun Heo <tj@...nel.org>
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko <dvlasenk@...hat.com> wrote:
> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>
>>>> crash> disassemble page_fault
>>>> Dump of assembler code for function page_fault:
>>>>    0xffffffff816834a0 <+0>:     data32 xchg %ax,%ax
>>>>    0xffffffff816834a3 <+3>:     data32 xchg %ax,%ax
>>>>    0xffffffff816834a6 <+6>:     data32 xchg %ax,%ax
>>>>    0xffffffff816834a9 <+9>:     sub    $0x78,%rsp
>>>>    0xffffffff816834ad <+13>:    callq  0xffffffff81683620 <error_entry>
>>>
>>> The callq was the double-faulting instruction, and it is indeed the
>>> first instruction in here that would have accessed the stack.  (The sub
>>> *changes* rsp but isn't a memory access.)  So, since RSP is bogus, we
>>> page fault, and the page fault is promoted to a double fault.  The
>>> surprising thing is that the page fault itself seems to have been
>>> delivered okay, and RSP wasn't on a page boundary.
>>
>> Not at all surprising, and sure it was on a page boundary.
>>
>> Look closer.
>>
>> %rsp is 00007fffa55eafb8.
>>
>> But that's *after* page_fault has done that
>>
>>     sub    $0x78,%rsp
>>
>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>> different page.

Ah, I forgot to add 0x78.  You're right, of course.
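
For reference, the arithmetic can be checked with a short sketch. It assumes 4 KiB pages; `page_base` is a helper name made up for this illustration, and the two addresses are the ones quoted in this thread.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_base(addr: int) -> int:
    """Return the address of the first byte of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)


# RSP as reported in the double-fault dump, i.e. after "sub $0x78,%rsp".
rsp_reported = 0x00007FFFA55EAFB8

# Undo the sub to get RSP as it was on entry to page_fault.
rsp_at_entry = rsp_reported + 0x78

print(hex(rsp_at_entry))             # RSP on entry to page_fault
print(hex(page_base(rsp_reported)))  # page holding the reported RSP
print(hex(page_base(rsp_at_entry)))  # page holding the entry RSP
```

The two page bases differ, so the sub moved RSP across a page boundary.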

>>
>> And that page happened to be mapped.
>>
>> So what happened is:
>>
>>  - we somehow entered kernel mode without switching stacks
>>
>>    (ie presumably syscall)
>>
>>  - the user stack was still fine
>>
>>  - we took a page fault, which once again didn't switch stacks,
>> because we were already in kernel mode. And this page fault worked,
>> because it just pushed the error code onto the user stack which was
>> mapped.
>>
>>  - we now took a second page fault within the page fault handler,
>> because now the stack pointer has been decremented and points one user
>> page down that is *not* mapped, so now that page fault cannot push the
>> error code and return information.
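
The sequence above can be replayed as a toy model. This is only a sketch under stated assumptions: it takes the upper stack page to be the only mapped one, the names (`MAPPED_PAGES`, `push`, `replay`) are invented for the illustration, and it models nothing beyond "can the CPU write below RSP". The 48-byte frame is the six 8-byte words pushed for a page fault in 64-bit mode (SS, RSP, RFLAGS, CS, RIP, error code), after the hardware aligns RSP down to 16 bytes.

```python
PAGE_SIZE = 4096                 # assumption: 4 KiB pages
MAPPED_PAGES = {0x7FFFA55EB000}  # assumption: only the upper user stack page is mapped
FRAME_BYTES = 6 * 8              # SS, RSP, RFLAGS, CS, RIP, error code


class Unmapped(Exception):
    """Raised when a simulated stack write touches an unmapped page."""


def is_mapped(addr: int) -> bool:
    return (addr & ~(PAGE_SIZE - 1)) in MAPPED_PAGES


def push(rsp: int, nbytes: int) -> int:
    """Simulate writing nbytes just below rsp and return the new rsp."""
    new_rsp = rsp - nbytes
    # Checking the lowest and highest byte is enough for writes under one page.
    if not (is_mapped(new_rsp) and is_mapped(rsp - 1)):
        raise Unmapped(hex(new_rsp))
    return new_rsp


def replay(rsp_at_entry: int = 0x7FFFA55EB030):
    """Replay page_fault's first instructions while still on the user stack."""
    events = []
    rsp = rsp_at_entry - 0x78    # sub $0x78,%rsp: changes rsp, no memory access
    events.append(("sub", rsp))
    try:
        rsp = push(rsp, 8)       # callq error_entry pushes the return address
        events.append(("call ok", rsp))
    except Unmapped:
        events.append(("page fault on call", rsp))
        try:
            # Already in kernel mode, so no stack switch: the exception frame
            # goes onto the same stack, after aligning rsp down to 16 bytes.
            push(rsp & ~0xF, FRAME_BYTES)
            events.append(("fault delivered", rsp))
        except Unmapped:
            events.append(("double fault", rsp))
    return events


for name, rsp in replay():
    print(f"{name:20s} rsp={rsp:#x}")
```

In this model the sub succeeds, the callq faults, and the fault frame cannot be pushed either, which matches the RSP value shown in the double-fault report.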
>>
>> Now, how we took that original page fault is sadly not very clear at
>> all.  I agree that it's something about system-call (how could we not
>> change stacks otherwise), but why it should have started now, I don't
>> know. I don't think "system_call" has changed at all.
>>
>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>> and that "use sysret to return to user mode" thing. Because this code
>> sequence:
>>
>> +       movq (RSP-RIP)(%rsp),%rsp
>> +       USERGS_SYSRET64
>>
>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>> takes some fault.
>
> Yes, so far we happily thought that SYSRET never fails...
>
> This merits adding some code which would at least BUG_ON
> if the faulting address is seen to match SYSRET64.

sysret64 can only fail with #GP, and we're totally screwed if that
happens, although I agree about the BUG_ON in principle.  Where would
we add it that would help in this case, though?  We never even made it
to C code.

In any event, this was a page fault.  sysret64 doesn't access memory.

>
> Now we only check for faulting IRETQ:
>
> error_kernelspace:
>         CFI_REL_OFFSET rcx, RCX+8
>         incl %ebx
>         leaq native_irq_return_iret(%rip),%rcx
>         cmpq %rcx,RIP+8(%rsp)
>         je error_bad_iret
>
>>
>> Is PARAVIRT enabled? The three nops at the beginning of 'page_fault'
>> make me suspect it is, and that this is some paravirt rewriting
>> area. What does paravirt do for that USERGS_SYSRET64 (or for
>> SWAPGS_UNSAFE_STACK, for that matter)?

On Xen, it goes to xen_sysret64, which touches the same percpu
variables that we touch on entry.  So I still like my percpu vmap
fault hypothesis, even though I don't understand what would trigger
it.

At the risk of asking awful questions, what happens if we deliver an
IST interrupt in vmx_handle_external_intr?  Can that happen?  It can't
be a good thing if it happens.

--Andy