Message-ID: <20171126142637.fajrnxsudgs4tqt2@gmail.com>
Date:   Sun, 26 Nov 2017 15:26:37 +0100
From:   Ingo Molnar <mingo@...nel.org>
To:     Andy Lutomirski <luto@...capital.net>
Cc:     Borislav Petkov <bp@...en8.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        "H . Peter Anvin" <hpa@...or.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: [PATCH] x86/entry/64: Create a percpu SYSCALL entry trampoline


* Andy Lutomirski <luto@...capital.net> wrote:

> On Sat, Nov 25, 2017 at 3:40 AM, Borislav Petkov <bp@...en8.de> wrote:
> > On Fri, Nov 24, 2017 at 06:23:43PM +0100, Ingo Molnar wrote:
> >> From: Andy Lutomirski <luto@...nel.org>
> >>
> >> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
> >> single register (except FLAGS), including RSP, live.  It somehow needs
> >> to set RSP to point to a valid stack, which means it needs to save the
> >> user RSP somewhere and find its own stack pointer.  The canonical way
> >> to do this is with SWAPGS, which lets us access percpu data using the
> >> %gs prefix.
> >>
> >> With KAISER-like pagetable switching, this is problematic.  Without a
> >> scratch register, switching CR3 is impossible, so %gs-based percpu
> >> memory would need to be mapped in the user pagetables.  Doing that
> >> without information leaks is difficult or impossible.
> >>
> >> Instead, use a different sneaky trick.  Map a copy of the first part
> >> of the SYSCALL asm at a different address for each CPU.  Now RIP
> >> varies depending on the CPU, so we can use RIP-relative memory access
> >> to access percpu memory.  By putting the relevant information (one
> >> scratch slot and the stack address) at a constant offset relative to
> >> RIP, we can make SYSCALL work without relying on %gs.
> >>
> >> A nice thing about this approach is that we can easily switch it on
> >> and off if we want pagetable switching to be configurable.
> >>
> >> The compat variant of SYSCALL doesn't have this problem in the first
> >> place -- there are plenty of scratch registers, since we don't care
> >> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
> >> at all.
> >>
> >> XXX: Whenever we settle how KAISER gets turned on and off, we should do
> >> the same to this.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@...nel.org>
> >> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> >> Cc: Borislav Petkov <bpetkov@...e.de>
> >> Cc: Brian Gerst <brgerst@...il.com>
> >> Cc: Dave Hansen <dave.hansen@...el.com>
> >> Cc: Josh Poimboeuf <jpoimboe@...hat.com>
> >> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> >> Cc: Peter Zijlstra <peterz@...radead.org>
> >> Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
> >> Signed-off-by: Ingo Molnar <mingo@...nel.org>
> >> ---
> >>  arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
> >>  arch/x86/include/asm/fixmap.h |  2 ++
> >>  arch/x86/kernel/asm-offsets.c |  1 +
> >>  arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
> >>  arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
> >>  5 files changed, 72 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> >> index 426b8c669d6a..0cde243b7542 100644
> >> --- a/arch/x86/entry/entry_64.S
> >> +++ b/arch/x86/entry/entry_64.S
> >> @@ -140,6 +140,54 @@ END(native_usergs_sysret64)
> >>   * with them due to bugs in both AMD and Intel CPUs.
> >>   */
> >>
> >> +     .pushsection .entry_trampoline, "ax"
> >> +
> >> +/*
> >> + * The code in here gets remapped into cpu_entry_area's trampoline.  This means
> >> + * that the assembler and linker have the wrong idea as to where this code
> >> + * lives (and, in fact, it's mapped more than once, so it's not even at a
> >> + * fixed address).  So we can't reference any symbols outside the entry
> >> + * trampoline and expect it to work.
> >> + *
> >> + * Instead, we carefully abuse %rip-relative addressing.
> >> + * .Lentry_trampoline(%rip) refers to the start of the remapped) entry
> >
> > _entry_trampoline(%rip) I'd guess.
> 
> Indeed.  It used to be .L, but then I put it in the linker script the
> obvious way and it wasn't .L any more.
> 
> >
> >> + * trampoline.  We can thus find cpu_entry_area with this macro:
> >
> > Uuh, fun. :)
> 
> That's what I thought!
> 
> >
> >> + */
> >> +
> >> +#define CPU_ENTRY_AREA \
> >> +     _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
> >
> > So this generates
> >
> >         _entry_trampoline - 16384(%rip)
> >
> > here. IINM, the layout looks like this
> >
> > [ GDT | TSS | TSS | TSS | trampoline ]
> >
> > where each section is a page, so the trampoline sits 4 pages (16384
> > bytes) into the per-CPU area. Just for my own understanding...
> 
> Indeed.
> 
> >
> >> +
> >> +/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
> >> +#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
> >> +     SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
> >
> > I'm wondering if it would be easier to make SYSENTER_stack part of
> > struct cpu_entry_area and thus simplify that wild offset math here :)
> 
> It's like that with the last patch that I haven't sent out applied, actually :)
> 
> >
> > Also, pls align:
> >
> > #define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
> >                     SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
> 
> Done.
> 
> >
> >> +
> >> +ENTRY(entry_SYSCALL_64_trampoline)
> >> +     UNWIND_HINT_EMPTY
> >> +     swapgs
> >> +
> >> +     /* Stash the user RSP. */
> >> +     movq    %rsp, RSP_SCRATCH
> >> +
> >> +     /* Load the top of the task stack into RSP */
> >> +     movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
> >
> > Yeah, let's put CPU_ENTRY_AREA first because then it reads easier:
> >
> > movq    CPU_ENTRY_AREA + CPU_ENTRY_AREA_tss + TSS_sp1, %rsp
> 
> Nope, it won't build.  That would expand like 0x1(%rip) + 0x2, which
> isn't valid.
> 
> >
> > i.e., pointer to cpu_entry_area, offset to tss within said
> > cpu_entry_area and then inside that, sp1.
> >
> > Ditto for above.
> >
> > ...
> >
> >> @@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
> >>  /* May not be marked __init: used by software suspend */
> >>  void syscall_init(void)
> >>  {
> >> +     extern char _entry_trampoline[];
> >> +     extern char entry_SYSCALL_64_trampoline[];
> >> +
> >>       int cpu = smp_processor_id();
> >>
> >>       wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
> >> -     wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
> >> +     wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
> >
> > Definitely a local var.
> 
> Done.
> 
> >
> >>  #ifdef CONFIG_IA32_EMULATION
> >>       wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
> >> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> >> index a4009fb9be87..2738cfb6c8c8 100644
> >> --- a/arch/x86/kernel/vmlinux.lds.S
> >> +++ b/arch/x86/kernel/vmlinux.lds.S
> >> @@ -107,6 +107,16 @@ SECTIONS
> >>               SOFTIRQENTRY_TEXT
> >>               *(.fixup)
> >>               *(.gnu.warning)
> >> +
> >> +#ifdef CONFIG_X86_64
> >> +             /* Entry trampoline */
> >
> > No need for that comment - variable and section names are enough. :)

Ok, I've picked up this version and added Boris's Reviewed-by tag - see the latest 
below.

Thanks,

	Ingo
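
( Side note, since the operand ordering came up above: with the
  asm-offsets values Boris quoted, the stack load:

	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp

  expands to a single 'symbol + constant(%rip)' operand, roughly:

	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + _entry_trampoline - 16384(%rip), %rsp

  Putting CPU_ENTRY_AREA first would terminate the (%rip) part early
  and leave a trailing '+ constant', which GAS rejects - that is the
  '0x1(%rip) + 0x2' failure mode Andy mentioned. Illustrative expansion
  only, the real constants come from asm-offsets. )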

==================>
Subject: x86/entry/64: Create a percpu SYSCALL entry trampoline
From: Andy Lutomirski <luto@...nel.org>
Date: Sat, 11 Nov 2017 19:52:54 -0800

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@...nel.org>
Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
Reviewed-by: Thomas Gleixner <tglx@...utronix.de>
Reviewed-by: Borislav Petkov <bpetkov@...e.de>
Cc: Brian Gerst <brgerst@...il.com>
Cc: Dave Hansen <dave.hansen@...el.com>
Cc: Josh Poimboeuf <jpoimboe@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>
Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@...nel.org>
---
 arch/x86/entry/entry_64.S     |   48 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |    2 +
 arch/x86/kernel/asm-offsets.c |    1 
 arch/x86/kernel/cpu/common.c  |   15 ++++++++++++-
 arch/x86/kernel/vmlinux.lds.S |    9 +++++++
 5 files changed, 74 insertions(+), 1 deletion(-)

Index: tip/arch/x86/entry/entry_64.S
===================================================================
--- tip.orig/arch/x86/entry/entry_64.S
+++ tip/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * _entry_trampoline(%rip) refers to the start of the remapped entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
Index: tip/arch/x86/include/asm/fixmap.h
===================================================================
--- tip.orig/arch/x86/include/asm/fixmap.h
+++ tip/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
Index: tip/arch/x86/kernel/asm-offsets.c
===================================================================
--- tip.orig/arch/x86/kernel/asm-offsets.c
+++ tip/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
Index: tip/arch/x86/kernel/cpu/common.c
===================================================================
--- tip.orig/arch/x86/kernel/cpu/common.c
+++ tip/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *,
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,16 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
+	unsigned long SYSCALL64_entry_trampoline =
+		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
+		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
Index: tip/arch/x86/kernel/vmlinux.lds.S
===================================================================
--- tip.orig/arch/x86/kernel/vmlinux.lds.S
+++ tip/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,15 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
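
For readers following the address math: below is a minimal user-space
model of what the CPU_ENTRY_AREA trick computes. The struct layout and
helper are illustrative stand-ins built from the
[ GDT | TSS | TSS | TSS | trampoline ] layout discussed above - they
are not the kernel's types:

	#include <stddef.h>
	#include <stdio.h>

	/*
	 * Mock of the per-CPU cpu_entry_area layout: one GDT page, a
	 * three-page TSS, then the trampoline page. Illustrative sizes;
	 * the real struct lives in asm/fixmap.h.
	 */
	struct cpu_entry_area_model {
		char gdt[4096];
		char tss[3 * 4096];
		char entry_trampoline[4096];
	};

	/*
	 * The runtime effect of the CPU_ENTRY_AREA asm macro: the CPU
	 * executes inside its own copy of the trampoline page, so a
	 * %rip-relative reference to _entry_trampoline resolves to
	 * area->entry_trampoline, and subtracting the trampoline's
	 * offset within the struct recovers the area base.
	 */
	static struct cpu_entry_area_model *
	area_from_trampoline(void *trampoline_va)
	{
		return (struct cpu_entry_area_model *)
			((char *)trampoline_va -
			 offsetof(struct cpu_entry_area_model, entry_trampoline));
	}

	int main(void)
	{
		static struct cpu_entry_area_model area;

		/* The trampoline sits 4 pages (16384 bytes) into the area: */
		printf("trampoline offset: %zu\n",
		       offsetof(struct cpu_entry_area_model, entry_trampoline));

		/* Round-trip: trampoline address -> enclosing area. */
		printf("round-trip ok: %d\n",
		       area_from_trampoline(area.entry_trampoline) == &area);
		return 0;
	}

The syscall_init() hunk above is the inverse direction of the same
math: it starts from the per-CPU trampoline page and adds
(entry_SYSCALL_64_trampoline - _entry_trampoline) to compute the
per-CPU MSR_LSTAR value.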
