[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK1hOcMrNb=RSqB92LzXRBRVZBkOdwggERSS3FLKak1EZi9gkA@mail.gmail.com>
Date: Sun, 8 Mar 2015 20:13:28 +0100
From: Denys Vlasenko <vda.linux@...glemail.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Fengguang Wu <fengguang.wu@...el.com>, X86 ML <x86@...nel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Ingo Molnar <mingo@...nel.org>, LKP <lkp@...org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [x86/asm/entry] BUG: unable to handle kernel paging request
On Sat, Mar 7, 2015 at 1:33 AM, Andy Lutomirski <luto@...capital.net> wrote:
> On Fri, Mar 6, 2015 at 3:30 PM, Fengguang Wu <fengguang.wu@...el.com> wrote:
>> Greetings,
>>
>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm
>>
>> commit 75182b1632a89f12540baa1806a7c5c180db620c
>> Author: Andy Lutomirski <luto@...capital.net>
>> AuthorDate: Thu Mar 5 19:19:03 2015 -0800
>> Commit: Ingo Molnar <mingo@...nel.org>
>> CommitDate: Fri Mar 6 08:32:57 2015 +0100
>>
>> x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0()
>
> This problem only happens on 32-bit kernels, and the culprit is, sort of:
>
> /*
> * The below -8 is to reserve 8 bytes on top of the ring0 stack.
> * This is necessary to guarantee that the entire "struct pt_regs"
> * is accessible even if the CPU haven't stored the SS/ESP registers
> * on the stack (interrupt gate does not save these registers
> * when switching to the same priv ring).
> * Therefore beware: accessing the ss/esp fields of the
> * "struct pt_regs" is possible, but they may contain the
> * completely wrong values.
> */
> #define task_pt_regs(task) \
> ({ \
> struct pt_regs *__regs__; \
> __regs__ = (struct pt_regs *)(KSTK_TOP(task_stack_page(task))-8); \
> __regs__ - 1; \
> })
>
> I'm confused about multiple things:
>
> 1. I don't understand this comment.
Comment says that in 32-bit x86, interrupts and exceptions
in ring 0 do not push SS,ESP - they only save EFLAGS,CS,EIP
in iret frame. (This happens because CPL doesn't
change, not beacuse ot is zero).
IRET insn likewise does not restore SS,ESP if it detects
that RPL(stack_CS) = RPL(CS).
This was changed for 64-bit more for two reasons:
(1) it's simpler, and
(2) it makes it possible for CPU to automatically align stack
to 16-byte boundary (you need to undo that on iretq,
therefore you need to always save old RSP).
>From AMD docs:
"8.9.3 Interrupt Stack Frame
In long mode, the return-program stack pointer (SS:RSP) is always
pushed onto the interrupt-handler
stack, regardless of whether or not a privilege change occurs.
Although the SS register is not used in
64-bit mode, SS is pushed to allow returns into compatibility mode.
Pushing SS:RSP unconditionally
presents operating systems with a consistent interrupt-stack-frame
size for all interrupts, except for
error codes. Interrupt service-routine entry points that handle
interrupts generated by non-error-code
interrupts can push an error code on the stack for consistency.
In long mode, when a control transfer to an interrupt handler occurs,
the processor performs the
following:
1. Aligns the new interrupt-stack frame by masking RSP with
FFFF_FFFF_FFFF_FFF0h.
..."
> 2. copy_thread thinks that sp0 should be task_pt_regs + 1, which is 8
> bytes below the top of the stack. cpu_tss, formerly INIT_TSS, thinks
> that sp0 should be the top of the stack. This is inconsistent. What
> gives?
I ran a small test: made "umask" syscall print where pt_regs is,
and where it's end is:
SYSCALL_DEFINE1(umask, int, mask)
{
struct pt_regs *regs = task_pt_regs(current);
unsigned long ptregs = (unsigned long)regs;
pr_err("[%d] current:%lx pt_regs:%lx pt_regs_end:%lx\n",
current->pid, (long)current, ptregs, ptregs + sizeof(*regs));
....
Result:
/ # umask
[ 8.393450] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8
[ 8.396601] [910] current:c777ce00 pt_regs:c7567fb4 pt_regs_end:c7567ff8
It seems to be true that 32-bit x86 always has two unused words
beyond "struct pt_regs" on stack during syscalls.
IOW: tss.sp0 is *not* set at the very end of the stack on 32-bit.
> 3. vm86 does something terrible to sp0, which might mean that we can't
> use sp0 to find the top of the stack in general on 32-bit kernels,
> which is sad. We might need a partial revert or to introduce a
> per-cpu variable "real_sp0" on 32-bit. Ugh. Could someone who
> understands vm86 mode better than me give me a hint?
>
> 4. Not directly related, but the 32-bit tss_struct contains this gem:
>
> /*
> * .. and then another 0x100 bytes for the emergency kernel stack:
> */
> unsigned long stack[64];
>
> Last I checked, 0x100 != 64. Also, wow, this is kind of disgusting. :)
Seems to be unused: I commented it out on "defconfig" build
and got no build errors.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists