Message-ID: <CALCETrWmz2JoWJptt-YxVszJX0J8m+OhQPqiXRJsE460tXbYNg@mail.gmail.com>
Date:	Mon, 23 Mar 2015 11:38:30 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Denys Vlasenko <dvlasenk@...hat.com>
Cc:	Takashi Iwai <tiwai@...e.de>,
	Denys Vlasenko <vda.linux@...glemail.com>,
	Jiri Kosina <jkosina@...e.cz>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Stefan Seyfried <stefan.seyfried@...glemail.com>,
	X86 ML <x86@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	Tejun Heo <tj@...nel.org>
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko <dvlasenk@...hat.com> wrote:
> On 03/23/2015 02:22 PM, Takashi Iwai wrote:
>> At Mon, 23 Mar 2015 10:35:41 +0100,
>> Takashi Iwai wrote:
>>>
>>> At Mon, 23 Mar 2015 10:02:52 +0100,
>>> Takashi Iwai wrote:
>>>>
>>>> At Fri, 20 Mar 2015 19:16:53 +0100,
>>>> Denys Vlasenko wrote:

>> I'm really puzzled now.  We have a few pieces of information:
>>
>> - git bisection pointed the commit 96b6352c1271:
>>     x86_64, entry: Remove the syscall exit audit and schedule optimizations
>>   and reverting this "fixes" the problem indeed.  Even just moving two
>>   lines
>>     LOCKDEP_SYS_EXIT
>>     DISABLE_INTERRUPTS(CLBR_NONE)
>>   to the beginning of ret_from_sys_call already fixes it.  (Of course I
>>   can't prove the fix, but it has been stable for a day without a crash,
>>   while I usually hit the bug within 10 minutes of a full test run.)
>
> Commit 96b6352c1271 moved the TIF_ALLWORK_MASK check from an
> interrupt-disabled region to an interrupt-enabled one:
>
>         cmpq $__NR_syscall_max,%rax
>         ja ret_from_sys_call
>         movq %r10,%rcx
>         call *sys_call_table(,%rax,8)  # XXX:    rip relative
>         movq %rax,RAX-ARGOFFSET(%rsp)
> ret_from_sys_call:
>         testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>         jnz int_ret_from_sys_call_fixup /* Go to the slow path */
>         LOCKDEP_SYS_EXIT
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
> ...
> ...
> int_ret_from_sys_call_fixup:
>         FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>         jmp int_ret_from_sys_call
> ...
> ...
> GLOBAL(int_ret_from_sys_call)
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
>
> You reverted that by moving this insn to after the first DISABLE_INTERRUPTS(CLBR_NONE).
>
> I also don't see how moving that check (even if it is wrong in a more
> benign way) can have such a drastic effect.

I bet I see it.  I have the advantage of having stared at KVM code and
cursed at it more recently than you, I suspect.  KVM does awful, awful
things to CPU state, and, as an optimization, it allows kernel code to
run with CPU state that would be totally invalid in user mode.  This
happens through a bunch of hooks, including this bit in __switch_to:

    /*
     * Now maybe reload the debug registers and handle I/O bitmaps
     */
    if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
             task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
        __switch_to_xtra(prev_p, next_p, tss);

IOW, we *change* TIF flags during context switches.
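
For reference, the piece that actually moves the flag is (roughly, from
memory -- a sketch, not verbatim kernel source) the user-return-notifier
propagation that __switch_to_xtra() ends up doing:

    /*
     * Sketch: if the outgoing task has TIF_USER_RETURN_NOTIFY set,
     * hand the bit to the incoming task.  So a thread can gain a TIF
     * work bit *after* its syscall-exit path already tested TI_flags.
     */
    static inline void propagate_user_return_notify(struct task_struct *prev,
                                                    struct task_struct *next)
    {
        if (test_and_clear_tsk_thread_flag(prev, TIF_USER_RETURN_NOTIFY))
            set_tsk_thread_flag(next, TIF_USER_RETURN_NOTIFY);
    }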


The race looks like this:

    testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
    jnz int_ret_from_sys_call_fixup    /* Go to the slow path */

--- preempted here, switch to KVM guest ---

KVM guest enters and screws up, say, MSR_SYSCALL_MASK.  This wouldn't
happen to be a *32-bit* KVM guest, perhaps?

Now KVM schedules, calling __switch_to.  __switch_to sets
_TIF_USER_RETURN_NOTIFY.  We IRET back to the syscall exit code, turn
off interrupts, and do sysret.  We are now screwed.

I don't know why this manifests in this particular failure, but any
number of terrible things could happen now.
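
In case it helps to see the ordering spelled out (just a sketch of what
reverting / Takashi's two-line move amounts to, not a tested patch): the
TI_flags test has to happen with interrupts already off, so nothing can
set a work bit between the test and the sysret:

    ret_from_sys_call:
        LOCKDEP_SYS_EXIT
        DISABLE_INTERRUPTS(CLBR_NONE)   # window closed: no IRQ/preemption
                                        # can set a TIF bit past this point
        TRACE_IRQS_OFF
        testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
        jnz int_ret_from_sys_call_fixup # any work flag set so far is handled
        # ... restore registers, USERGS_SYSRET64 ...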

FWIW, this will affect things other than KVM.  For example, SIGKILL
sent while a process is sleeping in that two-instruction window won't
work.

Takashi, can you re-send your patch so we can review it for real in
light of this race?

>
>
> Shot-in-the-dark idea. At this code revision we did not yet
> store the user's %rsp in pt_regs->sp; we used a fixup to populate it:
>
>         .macro FIXUP_TOP_OF_STACK tmp offset=0
>         movq PER_CPU_VAR(old_rsp),\tmp
>         movq \tmp,RSP+\offset(%rsp)
>
> (There are pending patches to fix this mess).
>
> If an interrupt hitting *kernel code* went into a code path
> that does FIXUP_TOP_OF_STACK, it would overwrite the correctly saved %rsp
> with the user's one. The iret from the interrupt would work,
> but the resulting CPU state would be inconsistent. But I don't see
> such a code path from interrupts to FIXUP_TOP_OF_STACK...

I don't buy it.  Anything that does that is so completely broken that
I'd hope we'd have found it long ago.

--Andy