Date:	Thu, 19 Mar 2015 08:41:57 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Takashi Iwai <tiwai@...e.de>
Cc:	Denys Vlasenko <vda.linux@...glemail.com>,
	Denys Vlasenko <dvlasenk@...hat.com>,
	Jiri Kosina <jkosina@...e.cz>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Stefan Seyfried <stefan.seyfried@...glemail.com>,
	X86 ML <x86@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	Tejun Heo <tj@...nel.org>
Subject: Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai <tiwai@...e.de> wrote:
> At Thu, 19 Mar 2015 15:55:26 +0100,
> Takashi Iwai wrote:
>>
>> At Thu, 19 Mar 2015 14:47:12 +0100,
>> Takashi Iwai wrote:
>> >
>> > At Thu, 19 Mar 2015 13:48:56 +0100,
>> > Denys Vlasenko wrote:
>> > >
>> > > Having no more ideas at the moment, here is a tarball of 13 patches
>> > > of commits touching entry_64.S up to 4.0.0-rc1.
>> > >
>> > > x0001.patch is the latest, x0015.patch is the oldest.
>> > >
>> > > Patches 0003 and 0008 are not there since 0003 is empty merge patch
>> > > and 0008 does some PCI fixup.
>> > >
>> > > If this breakage is recent, it ought to be one of these.
>> > > Most of them do some non-trivial surgery.
>> > >
>> > > Even though I did not spot anything suspicious in them,
>> > > entry.S is notorious for subtle breakage.
>> > >
>> > > Try reverting them in sequence starting from x0001.patch
>> > > and see reverting which one makes crash disappear.
>> >
>> > OK, I'm going to check these git series.
>>
>> Reverting the commit
>> 96b6352c12711d5c0bb7157f49c92580248e8146
>>     x86_64, entry: Remove the syscall exit audit and schedule optimizations
>>
>> seems enough.  After reverting this one, the machine runs stable with
>> the kvm stress test.
>>
>> (I'll keep test running for a while; at the previous bisection, I hit
>>  the bug right after posting the mail ;)
>
> It survived long enough, so this looks like the spot.
>
> Also, I checked the patch below instead of reverting the commit, and
> this seems working, too.
>
>
> Takashi
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1d74d161687c..5340ac7f88a9 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -364,12 +364,12 @@ system_call_fastpath:
>   * Has incomplete stack frame and undefined top of stack.
>   */
>  ret_from_sys_call:
> -       testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> -       jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> -
>         LOCKDEP_SYS_EXIT
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
> +       testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> +       jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> +
>         CFI_REMEMBER_STATE
>         /*
>          * sysretq will re-enable interrupts:

The crash you're seeing could certainly be caused by an IRQ at the
wrong time.  However:

int_ret_from_sys_call_fixup:
        FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
        jmp int_ret_from_sys_call

and

GLOBAL(int_ret_from_sys_call)
        DISABLE_INTERRUPTS(CLBR_NONE)
        TRACE_IRQS_OFF

so with or without your little patch, we're turning off IRQs very
quickly.  retint_swapgs also turns off interrupts before doing
anything.  So I don't see how your patch would have any effect.

I'm starting to wonder if the problem has something to do with running
fire_user_return_notifiers with IRQs on.  We appear to do that, and it
seems rather questionable to me that it's safe, given the sneaky
things that KVM does in there.

If we end up in user mode with a bad MSR_SYSCALL_MASK, we could see
your crash, although I don't see how that would happen either.
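
For reference, a rough model of why a bad MSR_SYSCALL_MASK matters: on
SYSCALL the CPU saves RFLAGS in R11 and then clears every RFLAGS bit that
is set in the mask MSR, so a mask without the IF bit means the kernel is
entered with interrupts still enabled.  This is only a sketch, not kernel
code; the function names and both mask values are made up for
illustration, and the "sane" mask is not claimed to be the exact value
the kernel programs.

```python
# Illustrative model only; not kernel code.
X86_EFLAGS_FIXED = 1 << 1   # bit 1 of RFLAGS always reads as 1
X86_EFLAGS_TF = 1 << 8      # trap flag
X86_EFLAGS_IF = 1 << 9      # interrupt enable flag
X86_EFLAGS_DF = 1 << 10     # direction flag


def rflags_after_syscall(user_rflags, syscall_mask):
    """Return RFLAGS at kernel entry: SYSCALL clears each bit set in the mask."""
    return user_rflags & ~syscall_mask


def irqs_enabled(rflags):
    """True if the interrupt enable flag is set in the given RFLAGS value."""
    return bool(rflags & X86_EFLAGS_IF)


# User mode normally runs with interrupts enabled.
user_rflags = X86_EFLAGS_FIXED | X86_EFLAGS_IF

# Assumed example values: a mask that includes IF, and a hypothetical
# bad mask that clears nothing.
sane_mask = X86_EFLAGS_TF | X86_EFLAGS_IF | X86_EFLAGS_DF
bad_mask = 0

for name, mask in (("sane", sane_mask), ("bad", bad_mask)):
    entry_rflags = rflags_after_syscall(user_rflags, mask)
    print(f"{name} mask {mask:#x}: IRQs on at entry = {irqs_enabled(entry_rflags)}")
```

With the bad mask, an interrupt could in principle arrive at the very
start of the syscall entry path, before the kernel has set up its own
state, which is the kind of window that could plausibly produce a crash
like the one reported.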

I'll try to write a diagnostic patch later this morning.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC