Message-ID: <CAMzpN2hGNHzLNT=HoYGmthwT4BRx+BuM-TaNPQdPvMUXHK7LNw@mail.gmail.com>
Date: Fri, 27 Mar 2015 16:53:18 -0400
From: Brian Gerst <brgerst@...il.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Denys Vlasenko <dvlasenk@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Borislav Petkov <bp@...en8.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
X86 ML <x86@...nel.org>, Ingo Molnar <mingo@...nel.org>
Subject: Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 2:37 PM, Andy Lutomirski <luto@...capital.net> wrote:
> On Mar 27, 2015 7:26 AM, "Denys Vlasenko" <dvlasenk@...hat.com> wrote:
>>
>> Hi,
>>
>> While running some tests I noticed that EFLAGS
>> is not saved across syscalls if I use 32-bit
>> userspace, use SYSENTER, and paravirt is active.
>>
>> Looking at the code, it's actually clear why that happens.
>>
>> /*
>> * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>> * IF and VM in rflags are cleared (IOW: interrupts are off).
>> * SYSENTER does not save anything on the stack,
>> * and does not save old rip (!!!) and rflags.
>> */
>> ENTRY(ia32_sysenter_target)
>> SWAPGS_UNSAFE_STACK <============================
>> movq PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
>> ENABLE_INTERRUPTS(CLBR_NONE)
>>
>> movl %ebp, %ebp
>> movl %eax, %eax
>> movl ASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
>>
>> /* Construct struct pt_regs on stack */
>> pushq_cfi $__USER32_DS /* pt_regs->ss */
>> pushq_cfi %rbp /* pt_regs->sp */
>> CFI_REL_OFFSET rsp,0
>> pushfq_cfi /* pt_regs->flags */
>>
>> SWAPGS_UNSAFE_STACK, if it involves paravirt callbacks,
>> will change EFLAGS, and it *can't* save/restore them -
>> there is no place to save them, since neither the stack nor
>> PER_CPU() is usable at that point.
>>
>> Interestingly, *no one ever complained*!
>>
>> Apparently, users *don't* depend on arithmetic flags
>> surviving across a syscall. They are also okay with the DF flag
>> being cleared.
>>
>> Let's go flag-by-flag.
>>
>> ID - probably no one depends on it
>> VIP,VIF,VM - v86 stuff, not supported in 64bit
>> AC - someone probably does use this
>> RF - should be cleared to 0
>> NT - iret via task gate, not supported in 64bit
>> IOPL - usually 00, sys_iopl() can change it
>> DF - according to C ABI, should be 0
>> IF - should be preserved (but almost always 1)
>> TF - should be preserved
>> arith flags - probably no one cares
>>
>> IOW, the only bits to be preserved are AC, IOPL, TF, and _maybe_
>> IF.
>>
>> AC and IOPL are preserved even with this paravirt quirk
>> because paravirt hooks do not mangle them.
>>
>> TF preservation and proper restoration is handled by
>> do_debug + syscall_trace_enter_phase2 + iret
>> combo.
>>
>> We unconditionally set IF. This is only a problem for applications
>> which use sys_iopl(3), disable IRQs in userspace, and then perform
>> syscalls. The set of such apps is probably empty.
>> (This "bug" exists even in the non-paravirt case.)
>>
>> So, formally, we have a bug: we do not preserve IF,
>> DF and arith flags.
>>
>> I'm proposing to use this opportunity to amend syscall ABI
>> to say that arith flags are not preserved across syscalls,
>> and DF can be cleared to 0 by syscalls (but can't be set to 1).
>> Evidently, it has been broken for some time on some virtualized
>> setups and users are okay.
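
The flag-by-flag breakdown quoted above can be sketched as C bitmasks.
The bit values are the architectural EFLAGS layout; the EFL_* names and
the helper flags_ok_under_proposed_abi() are hypothetical, made up for
this illustration, and are not taken from the kernel. The check encodes
only the proposed ABI wording (arith flags may change, DF may be cleared
but not set) and treats IF as "maybe", so it is not checked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFLAGS bits (names are illustrative, not kernel macros). */
#define EFL_CF   0x00000001u  /* carry */
#define EFL_PF   0x00000004u  /* parity */
#define EFL_AF   0x00000010u  /* auxiliary carry */
#define EFL_ZF   0x00000040u  /* zero */
#define EFL_SF   0x00000080u  /* sign */
#define EFL_TF   0x00000100u  /* trap (single-step) */
#define EFL_IF   0x00000200u  /* interrupt enable */
#define EFL_DF   0x00000400u  /* direction */
#define EFL_OF   0x00000800u  /* overflow */
#define EFL_IOPL 0x00003000u  /* I/O privilege level (2 bits) */
#define EFL_AC   0x00040000u  /* alignment check */

/* "arith flags - probably no one cares" */
#define EFL_ARITH (EFL_CF | EFL_PF | EFL_AF | EFL_ZF | EFL_SF | EFL_OF)

/* Bits the quoted message says must be preserved; IF is only "maybe". */
#define EFL_MUST_PRESERVE  (EFL_AC | EFL_IOPL | EFL_TF)
#define EFL_MAYBE_PRESERVE (EFL_IF)

/*
 * Returns true if the change from 'before' to 'after' is allowed by the
 * proposed ABI amendment: AC, IOPL and TF unchanged, arith flags free to
 * change, DF allowed to go 1 -> 0 but never 0 -> 1.
 */
static bool flags_ok_under_proposed_abi(uint32_t before, uint32_t after)
{
    if ((before ^ after) & EFL_MUST_PRESERVE)
        return false;
    if ((after & EFL_DF) && !(before & EFL_DF))
        return false;
    return true;
}
```

With these definitions a syscall that clears DF and scrambles ZF/PF
passes the check, while one that drops AC or newly sets DF does not.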
>
> I think I'd rather fix it. I want to give x86_64 a sysenter stack
> like x86_32's, since AFAICT the only reason that #DF needs to use IST
> is because sysenter with TF set is the only way I can see that #DF
> could happen with an invalid stack.

What if RSP gets corrupted in the kernel? That would cause a fault
that gets promoted to #DF, since the iret frame can't be pushed. You
would at least get an oops out instead of a triple fault reset.

> Also, Houston, we have a bug, probably rootable, and probably damn
> near impossible to exploit without crashing your system:
>
> User does sysenter. We end up in native_irq_enable_sysexit. We do:
>
> swapgs
> sti
>
> <-- NMI here can happen on some (all?) cpus, returns successfully
> *with interrupts unmasked*
>
> <-- IRQ. Boom

The sti will delay interrupts for one instruction, and that should
include NMIs. The Intel SDM states for STI:

"The IF flag and the STI and CLI instructions do not prohibit the
generation of exceptions and NMI interrupts. NMI interrupts (and SMIs)
may be blocked for one macroinstruction following an STI."

> My preferred fix would be to use sysretl instead of sysexit. As far
> as I know, there are no 64-bit CPUs at all that don't support sysretl.

It would be nice to have one less return path. I'm curious to know
why Intel doesn't support syscall from compatibility mode but does
support sysret to compatibility mode.

--
Brian Gerst