Message-Id: <7A0A9B37-20FF-4B17-B4F5-D8B999269FC4@amacapital.net>
Date:   Fri, 29 Dec 2017 17:10:35 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Toralf Förster <toralf.foerster@....de>,
        Alexander Tsoy <alexander@...y.me>,
        stable <stable@...r.kernel.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>,
        the arch/x86 maintainers <x86@...nel.org>,
        jpoimboe@...hat.com
Subject: Re: 4.14.9 doesn't boot (regression)



> On Dec 29, 2017, at 3:53 PM, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> 
>> On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster <toralf.foerster@....de> wrote:
>> 
>> The bad news - the issue is not solved with the changed cflags.
>> The good news - I could eventually compile a working config for my desktop (works fine with 4.14.10 with generic CPU) having a higher screen resolution during boot.
>> 
>> So I ran "make distclean", followed by "sudo zcat /proc/config.gz > .config", changed the .config to use MCORE2 instead of GENERIC and defined the string "-local" to ensure that the modules directory is really unique.
>> Then I ran "time make -j4 && sudo make modules_install && sudo cp arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o /boot/grub/grub.cfg", booted and took 3 photos, which were uploaded to [1]; look for IMG_*
> 
> Ok, so what does seem to be consistent for everybody is that
> double-fault in the NMI backtrace.
> 
> So the fact that the NMI always hits on a double-fault does make me
> suspect that it's an infinite stream of double-faults, and that is
> presumably also what causes the RCU timeout.
> 
> And as I pointed out elsewhere (damn two threads), I think that it
> would help to simply catch the *first* double-fault.
> 
> And I *think* that the only thing that can make a double-fault
> silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
> build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
> arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

Double faults use IST, so a double fault that double faults will effectively just start over rather than eventually running out of stack and triple faulting.
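To illustrate the "start over" behaviour, here is a toy model, not kernel or CPU code. It assumes a hypothetical IST stack top of 0x29000 and a 48-byte hardware frame (SS, RSP, RFLAGS, CS, RIP plus the error code that #DF pushes); both numbers are illustrative only. With IST, every delivery reloads RSP from the same fixed slot, so a nested double fault lands on top of the previous frame. Without IST, each delivery would push below the previous one and eventually walk off the stack.

```python
# Toy model only: the stack top and frame size below are illustrative
# assumptions, not values taken from the report.
HYPOTHETICAL_IST_TOP = 0x29000
FRAME_SIZE = 6 * 8  # SS, RSP, RFLAGS, CS, RIP, error code


def deliver_with_ist(ist_top, frame_size, depth):
    """RSP after each of `depth` nested deliveries when the vector uses IST.

    Every delivery reloads RSP from the fixed IST slot, so the frame is
    always written at the same place.
    """
    return [ist_top - frame_size for _ in range(depth)]


def deliver_without_ist(start_rsp, frame_size, depth):
    """RSP after each of `depth` nested deliveries on the current stack.

    Each delivery pushes below the previous frame, so RSP keeps dropping.
    """
    rsps = []
    rsp = start_rsp
    for _ in range(depth):
        rsp -= frame_size
        rsps.append(rsp)
    return rsps


if __name__ == "__main__":
    with_ist = deliver_with_ist(HYPOTHETICAL_IST_TOP, FRAME_SIZE, 3)
    without_ist = deliver_without_ist(HYPOTHETICAL_IST_TOP, FRAME_SIZE, 3)
    print("with IST:   ", [hex(r) for r in with_ist])
    print("without IST:", [hex(r) for r in without_ist])
```

In the IST case the list never changes, which is why a recursing double fault can loop forever instead of ending in a triple fault.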

But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08. IOW the double fault stack is ...28000 - ...28fff and we're somehow getting a failed page fault a couple hundred bytes below the bottom of the IST stack. I think we're just stuck in a never-ending loop of stack overflows.
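The arithmetic behind that reading can be checked with a short sketch. The high bits of the addresses are elided in the report, so this assumes RSP, CR2 and the stack bounds all share the same elided prefix and works only with the low-order digits quoted above. The function and key names are made up for illustration.

```python
# Low-order hex digits copied from the message above. The high bits are
# elided ("...") in the report; this sketch ASSUMES they are identical for
# all four values, so only the low digits are compared.
RSP_LOW = 0x28FD8
CR2_LOW = 0x27F08
STACK_BOTTOM_LOW = 0x28000
STACK_TOP_LOW = 0x28FFF


def describe_fault(rsp, cr2, bottom, top):
    """Relate the faulting address (CR2) to the stack range [bottom, top]."""
    return {
        "rsp_inside_stack": bottom <= rsp <= top,
        "cr2_inside_stack": bottom <= cr2 <= top,
        "bytes_below_bottom": bottom - cr2,  # positive means CR2 is below the stack
        "rsp_minus_cr2": rsp - cr2,
    }


info = describe_fault(RSP_LOW, CR2_LOW, STACK_BOTTOM_LOW, STACK_TOP_LOW)

if __name__ == "__main__":
    for key, value in info.items():
        print(f"{key}: {value}")
```

Under that assumption RSP is inside the double fault stack, while CR2 is 0xf8 (248) bytes below its lowest address, and the access that faulted is roughly one page (0x10d0 bytes) below RSP.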

(Also, Josh, the oops code should have printed the contents of the struct pt_regs at the top of the DF stack.  Any idea why it didn't?)

Toralf, can you send the complete output of:

objdump -dr arch/x86/kernel/traps.o

from the build tree of a nonworking kernel?

Also, you wouldn't happen to be using Gentoo perchance?  I already have two reports of a Gentoo system miscompiling the vDSO due to Gentoo enabling -fstack-check and GCC generating stack check code that is highly suboptimal, actively incorrect, and doesn't even manage to check the stack in a particularly helpful way.
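As a rough way to scan the objdump output requested above for stack-check code, something like the following could be used. It is only a heuristic sketch: it assumes the probes show up in AT&T-syntax disassembly as an "or" of $0x0 into an %rsp-relative address, which may not cover every form GCC emits, and it may flag unrelated instructions. The script name and the function name are hypothetical.

```python
import re
import sys

# ASSUMPTION: a stack probe appears as "or $0x0,<optional offset>(%rsp)" in
# AT&T-syntax objdump output. This is a heuristic, not an exhaustive pattern.
PROBE_RE = re.compile(r"\bor[bwlq]?\s+\$0x0,\s*(?:-?0x[0-9a-f]+)?\(%rsp\)")


def find_stack_probes(disassembly):
    """Return (line_number, stripped_line) for each line that looks like a probe."""
    hits = []
    for lineno, line in enumerate(disassembly.splitlines(), start=1):
        if PROBE_RE.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects a file holding saved objdump output as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for lineno, line in find_stack_probes(handle.read()):
            print(f"{lineno}: {line}")
```

For example, save the disassembly with "objdump -dr arch/x86/kernel/traps.o > traps.dis" and run "python3 find_probes.py traps.dis". Any hits still need to be read in context to decide whether they really are stack-check probes.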

If this is indeed what's going on, I'm going to try to come up with a patch to outright fail the build on these buggy systems. We could probably fudge the build options to avoid the problem, but Gentoo really just needs to fix its toolchain.
