linux-kernel - Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly disable irqs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150308085508.GA13164@gmail.com>
Date:	Sun, 8 Mar 2015 09:55:08 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Oleg Nesterov <oleg@...hat.com>,
	Dave Hansen <dave.hansen@...el.com>,
	Borislav Petkov <bp@...e.de>, Pekka Riikonen <priikone@....fi>,
	Rik van Riel <riel@...hat.com>,
	Suresh Siddha <sbsiddha@...il.com>,
	LKML <linux-kernel@...r.kernel.org>,
	"Yu, Fenghua" <fenghua.yu@...el.com>,
	Quentin Casasnovas <quentin.casasnovas@...cle.com>,
	Denys Vlasenko <dvlasenk@...hat.com>
Subject: Re: [PATCH 1/1] x86/fpu: math_state_restore() should not blindly
 disable irqs


* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Sat, Mar 7, 2015 at 2:36 AM, Ingo Molnar <mingo@...nel.org> wrote:
> >
> > We could save the same 10 cycles from page fault overhead as well,
> > AFAICS.
> 
> Are trap gates actually noticeably faster? Or is it just he
> "conditional_sti()" you're worried about?

( I'll talk about the CR2 complication later, please ignore that 
  problem for a moment. )

So I base my thinking on the following hierarchy of fast paths. In a 
typical x86 system there are 4 main types of 'context switches', in 
order of decreasing frequency:

   - syscall    'context switch': entry/exit from syscall
   - trap/fault 'context switch': entry/exit from trap/fault
   - irq        'context switch': entry/exit from irqs
   - task       'context switch': scheduler context-switch

Where each successive level is about an order of magnitude less 
frequently executed on a typical system than the previous one, so to 
optimize a level we are willing to push overhead to the next one(s).

So the primary payoff in executing much of the entry code with irqs 
enabled would be to allow 64-bit *syscalls* to be made without irqs 
disabled: the first, most important level of context entries.

Doing that would give us four (theoretical) performance advantages:

  - No implicit irq disabling overhead when the syscall instruction is
    executed: we could change MSR_SYSCALL_MASK from 0xc0000084 to
    0xc0000284, which removes the implicit CLI on syscall entry.

  - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
    system_call.

  - No explicit irq disabling overhead in the ret_from_sys_call fast 
    path, i.e. no DISABLE_INTERRUPTS() [CLI].

  - No implicit irq enabling overhead in ret_from_sys_call's 
    USERGS_SYSRET64: the SYSRETQ instruction would not have to 
    re-enable irqs as the user-space IF in R11 would match that of the 
    current IF.

whether that's an actual performance win in practice as well needs to 
be measured, but I'd be (very!) shocked if it wasn't in the 20+ cycles 
range: which is absolutely huge in terms of system_call optimizations.

Now do I think this could be done realistically? I think yes (by 
re-using the NMI code's paranoid entry codepaths for the irq code as 
well, essentially fixing up the effects of any partial entries in an 
irq entry slow path), but I could be wrong about that.

My main worry isn't even feasibility but maintainability and general 
robustness: all these asm cleanups we are doing now could enable such 
a trick to be pulled off robustly.

But I think it could be done technically, because the NMI code already 
has to be careful about 'preempting' partial entries, so we have the 
logic.

Complications:

 - We now enable double partial entries: partial syscall interrupted
   by an irq interrupted by an NMI context. I think it should all work
   out fine but it's a new scenario.

 - We'd enable interruptible return from system call, which caused
   trouble in the past. Solvable IMHO, by being careful in the irq 
   entry code.

 - We'd now have to be extra careful about the possibility of 
   exceptions triggered in the entry/exit code itself, triggered by 
   any sort of unusual register content or MMU fault.

Simplifications:

 - I'd ruthlessly disable IRQs for any sort of non fast path: for 
   example 32-bit compat entries, ptrace or any other slowpath - at 
   least initially until we map out the long term effects of this 
   optimization.

Does this scare me? Yes, I think it should scare any sane person, but 
I don't think it's all that bad: all the NMI paranoidentry work has 
already the trail blazed, and most of the races will be readily 
triggerable via regular irq loads, so it's not like we'll leave subtle 
bugs in there.

Being able to do the same with certain traps, because irq entry is 
careful about partial entry state, would just be a secondary bonus.

Regarding the CR2 value on page faults:

> Anyway, for page faulting, we traditionally actually wanted an 
> interrupt gate, because of how we wanted to avoid interrupts coming 
> in and possibly messing up %cr2 due to vmalloc faults, but more 
> importantly for preemption. [...]

Here too I think we could take a page from the NMI code: save cr2 in 
the page fault asm code, recognize from the irq code when we are 
interrupting that and dropping into a slowpath that saves cr2 right 
there. Potentially task-context-switching will be safe after that.

Saving cr2 in the early page fault code should be much less of an 
overhead than what the IRQ disabling/enabling costs, so this should be 
a win. The potential win could be similar to that of system calls:

  - Removal of an implicit 'CLI' in irq gates

  - Removal of the explicit 'STI' in conditional_sti in your proposed 
    code

  - Removal of an explicit 'CLI' (DISABLE_INTERRUPTS) in 
    error_exit/retint_swapgs.

  - Removal of an implicit 'STI' when SYSRET enables interrupts from R11

and the same savings would apply to FPU traps as well. I'd leave all 
other low frequency traps as interrupt gates: #GP, debug, int3, etc.

> [...] vmalloc faults are "harmless" because we'll notice that it's 
> already done, return, and then re-take the real fault. But a 
> preemption event before we read %cr2 can cause bad things to happen:
> 
>  - page fault pushes error code on stack, address in %cr2
> 
>  - we don't have interrupts disabled, and some interrupt comes in and
> causes preemption
> 
>  - some other process runs, take another page fault. %cr2 now is the
> wrong address
> 
>  - we go back to the original thread (perhaps on another cpu), which
> now reads %cr2 for the wrong address
> 
>  - we send the process a SIGSEGV because we think it's accessing
> memory that it has no place touching

I think none of this corruption happens if an interrupting context is 
aware of this and 'fixes up' the entry state accordingly. Am I missing 
anything subtle perhaps?

This would be arguably new (and tricky) code, as today the NMI code 
solves this problem by trying to never fault and thus never corrupt 
CR2. But ... unlike NMIs, this triggers so often via a mix of regular 
traps and regular irqs that I'm pretty sure it can be pulled off 
robustly.

> So the page fault code actually *needs* interrupts disabled until we 
> read %cr2. Stupid x86 trap semantics where the error code is on the 
> thread-safe stack, but %cr2 is not.
> 
> Maybe there is some trick I'm missing, but on the whole I think 
> "interrupt gate + conditional_sti()" does have things going for it. 
> Yes, it still leaves NMI as being special, but NMI really *is* 
> special.

So I think it's all doable, the payoffs are significant in terms of 
entry speedups.

But only on a clean code base, as the x86 entry assembly code was way 
beyond any sane maintainability threshold IMHO - it's fortunately 
improving rapidly these days due to the nice work from Andy and Denys!

What do you think?

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/