lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 9 Mar 2015 08:57:05 -0400
From:	Brian Gerst <brgerst@...il.com>
To:	Denys Vlasenko <vda.linux@...glemail.com>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Fengguang Wu <fengguang.wu@...el.com>, X86 ML <x86@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...nel.org>, LKP <lkp@...org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [x86/asm/entry] BUG: unable to handle kernel paging request

On Sun, Mar 8, 2015 at 3:13 PM, Denys Vlasenko <vda.linux@...glemail.com> wrote:
> On Sat, Mar 7, 2015 at 1:33 AM, Andy Lutomirski <luto@...capital.net> wrote:
>> On Fri, Mar 6, 2015 at 3:30 PM, Fengguang Wu <fengguang.wu@...el.com> wrote:
>>> Greetings,
>>>
>>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm
>>>
>>> commit 75182b1632a89f12540baa1806a7c5c180db620c
>>> Author:     Andy Lutomirski <luto@...capital.net>
>>> AuthorDate: Thu Mar 5 19:19:03 2015 -0800
>>> Commit:     Ingo Molnar <mingo@...nel.org>
>>> CommitDate: Fri Mar 6 08:32:57 2015 +0100
>>>
>>>     x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0()
>>
>> This problem only happens on 32-bit kernels, and the culprit is, sort of:
>>
>> /*
>>  * The below -8 is to reserve 8 bytes on top of the ring0 stack.
>>  * This is necessary to guarantee that the entire "struct pt_regs"
>>  * is accessible even if the CPU haven't stored the SS/ESP registers
>>  * on the stack (interrupt gate does not save these registers
>>  * when switching to the same priv ring).
>>  * Therefore beware: accessing the ss/esp fields of the
>>  * "struct pt_regs" is possible, but they may contain the
>>  * completely wrong values.
>>  */
>> #define task_pt_regs(task)                                             \
>> ({                                                                     \
>>        struct pt_regs *__regs__;                                       \
>>        __regs__ = (struct pt_regs *)(KSTK_TOP(task_stack_page(task))-8); \
>>        __regs__ - 1;                                                   \
>> })
>>
>> I'm confused about multiple things:
>>
>> 1. I don't understand this comment.
>
> Comment says that in 32-bit x86, interrupts and exceptions
> in ring 0 do not push SS,ESP - they only save EFLAGS,CS,EIP
> in iret frame. (This happens because CPL doesn't
> change, not beacuse ot is zero).
>
> IRET insn likewise does not restore SS,ESP if it detects
> that RPL(stack_CS) = RPL(CS).
>
> This was changed for 64-bit more for two reasons:
> (1) it's simpler, and
> (2) it makes it possible for CPU to automatically align stack
> to 16-byte boundary (you need to undo that on iretq,
> therefore you need to always save old RSP).
>
> From AMD docs:
>
> "8.9.3 Interrupt Stack Frame
> In long mode, the return-program stack pointer (SS:RSP) is always
> pushed onto the interrupt-handler
> stack, regardless of whether or not a privilege change occurs.
> Although the SS register is not used in
> 64-bit mode, SS is pushed to allow returns into compatibility mode.
> Pushing SS:RSP unconditionally
> presents operating systems with a consistent interrupt-stack-frame
> size for all interrupts, except for
> error codes. Interrupt service-routine entry points that handle
> interrupts generated by non-error-code
> interrupts can push an error code on the stack for consistency.
> In long mode, when a control transfer to an interrupt handler occurs,
> the processor performs the
> following:
> 1. Aligns the new interrupt-stack frame by masking RSP with
> FFFF_FFFF_FFFF_FFF0h.
> ..."

I believe kernel threads used to run with the stack pointer starting
right at the top of the stack.  So if a kernel thread was interrupted
then regs->ss and regs->esp might be above the stack in the next page
(which could possibly fault).  However, any code that looks at those
two values on 32-bit without checking regs->cs first is buggy.  Even
if they can be read they are not valid if there was no CPL change.

It appears that kernel threads now have an empty pt_regs struct on the
top of the stack, which is used for execve.  That would mean there is
sufficient buffer for accessing regs->esp and regs->ss without
crossing the top of the stack.  The 8 byte offset is not needed
anymore.

>> 2. copy_thread thinks that sp0 should be task_pt_regs + 1, which is 8
>> bytes below the top of the stack.  cpu_tss, formerly INIT_TSS, thinks
>> that sp0 should be the top of the stack.  This is inconsistent.  What
>> gives?
>
> I ran a small test: made "umask" syscall print where  pt_regs is,
> and where it's end is:
>
> SYSCALL_DEFINE1(umask, int, mask)
> {
>         struct pt_regs *regs = task_pt_regs(current);
>         unsigned long ptregs = (unsigned long)regs;
>
>         pr_err("[%d] current:%lx pt_regs:%lx pt_regs_end:%lx\n",
>                 current->pid, (long)current, ptregs, ptregs + sizeof(*regs));
>         ....
>
> Result:
>
> / # umask
> [    8.393450] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8
> [    8.396601] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8
>
> It seems to be true that 32-bit x86 always has two unused words
> beyond "struct pt_regs" on stack during syscalls.
>
> IOW: tss.sp0 is *not* set at the very end of the stack on 32-bit.
>
>> 3. vm86 does something terrible to sp0, which might mean that we can't
>> use sp0 to find the top of the stack in general on 32-bit kernels,
>> which is sad.  We might need a partial revert or to introduce a
>> per-cpu variable "real_sp0" on 32-bit.  Ugh.  Could someone who
>> understands vm86 mode better than me give me a hint?

vm86 mode alters sp0 in order to save the 32-bit pt_regs on the top of
the stack.  The iret frame for vm86 mode includes ds/es/fs/gs on the
stack before the normal iret frame.  This is so that the segments can
be set for real mode semantics.  So the normal pt_regs would not be
large enough to include those extra segments in the ire frame.

I had started work on a patch to clean this up, by saving the 32-bit
regs and other data off the stack, but never finished it.  I may look
at it again soon.

>> 4. Not directly related, but the 32-bit tss_struct contains this gem:
>>
>>     /*
>>      * .. and then another 0x100 bytes for the emergency kernel stack:
>>      */
>>     unsigned long        stack[64];
>>
>> Last I checked, 0x100 != 64.  Also, wow, this is kind of disgusting. :)
>
>
> Seems to be unused: I commented it out on "defconfig" build
> and got no build errors.

This is only for the sysenter code on a 32-bit kernel.  It should be
commented out for a 64-bit kernel.

--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ