linux-kernel - Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on

Date:   Mon, 11 Dec 2017 09:49:50 -0800
From:   Dave Hansen <dave.hansen@...el.com>
To:     Andy Lutomirski <luto@...nel.org>, x86@...nel.org
Cc:     linux-kernel@...r.kernel.org, Borislav Petkov <bp@...en8.de>,
        Brian Gerst <brgerst@...il.com>,
        David Laight <David.Laight@...lab.com>,
        Kees Cook <keescook@...omium.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on

So, before this,

On 12/10/2017 10:47 PM, Andy Lutomirski wrote:
...
> +	if (unlikely(ldt)) {
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
> +			if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
> +				clear_LDT();
> +				return;
> +			}

I'm missing the purpose of the slots.  Are you hoping to use those
eventually for randomization, but just punting on implementing it for now?

> +
> +			set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
> +		} else {
> +			set_ldt(ldt->entries, ldt->nr_entries);
> +		}

This seems like a much better place to point out why the aliasing exists
and what it is doing than the place it is actually commented.

Maybe:

			/*
			 * ldt->entries is not mapped into the user page
			 * tables when page table isolation is enabled.
			 * Point the hardware to the alias we created.
			 */
			set_ldt(ldt_slot_va(ldt->slot), ...
		} else {
			/*
			 * Point the hardware at the normal kernel
			 * mapping when not isolated.
			 */
			set_ldt(ldt->entries, ldt->nr_entries);
		}

>  /*
> - * User space process size. 47bits minus one guard page.  The guard
> - * page is necessary on Intel CPUs: if a SYSCALL instruction is at
> - * the highest possible canonical userspace address, then that
> - * syscall will enter the kernel with a non-canonical return
> - * address, and SYSRET will explode dangerously.  We avoid this
> - * particular problem by preventing anything from being mapped
> - * at the maximum canonical address.
> + * User space process size.  This is the first address outside the user range.
> + * There are a few constraints that determine this:
> + *
> + * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
> + * address, then that syscall will enter the kernel with a
> + * non-canonical return address, and SYSRET will explode dangerously.
> + * We avoid this particular problem by preventing anything executable
> + * from being mapped at the maximum canonical address.
> + *
> + * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
> + * CPUs malfunction if they execute code from the highest canonical page.
> + * They'll speculate right off the end of the canonical space, and
> + * bad things happen.  This is worked around in the same way as the
> + * Intel problem.
> + *
> + * With page table isolation enabled, we map the LDT in ... [stay tuned]
>   */
>  #define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>  
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index ae5615b03def..46ad333ed797 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -19,6 +19,7 @@
>  #include <linux/uaccess.h>
>  
>  #include <asm/ldt.h>
> +#include <asm/tlb.h>
>  #include <asm/desc.h>
>  #include <asm/mmu_context.h>
>  #include <asm/syscalls.h>
> @@ -46,13 +47,12 @@ static void refresh_ldt_segments(void)
>  static void flush_ldt(void *__mm)
>  {
>  	struct mm_struct *mm = __mm;
> -	mm_context_t *pc;
>  
>  	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
>  		return;
>  
> -	pc = &mm->context;
> -	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
> +	__flush_tlb_all();
> +	load_mm_ldt(mm);

Why the new TLB flush?

Shouldn't this be done in the more obvious place closer to the page
table manipulation?

> @@ -90,9 +90,112 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>  	}
>  
>  	new_ldt->nr_entries = num_entries;
> +	new_ldt->slot = -1;
>  	return new_ldt;
>  }

This seems a bit silly to do given that 'slot' is an int and this patch
introduces warnings looking for positive values:

	if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {

Seems like a good idea to just have a single warning in there looking
for non-zero, probably covering the PTI and non-PTI cases (at least for
now until the slots get used).
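
To make the interaction between the -1 initializer and the cast concrete,
here is a minimal user-space sketch.  slot_out_of_range() is a hypothetical
helper name, not something in the patch; it only mirrors the expression
inside the WARN_ON_ONCE() quoted above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical user-space stand-in for the check in the patch:
 *     WARN_ON_ONCE((unsigned long)ldt->slot > 1)
 * Converting a negative int to unsigned long wraps around, so the
 * -1 "not mapped yet" value becomes ULONG_MAX and fails the test
 * along with every value above 1.
 */
static bool slot_out_of_range(int slot)
{
	return (unsigned long)slot > 1;
}

static void slot_check_examples(void)
{
	assert((unsigned long)-1 == ULONG_MAX);
	assert(!slot_out_of_range(0));	/* valid slot */
	assert(!slot_out_of_range(1));	/* valid slot */
	assert(slot_out_of_range(2));	/* too large */
	assert(slot_out_of_range(-1));	/* unmapped sentinel */
}
```

So the one unsigned comparison rejects both the sentinel and anything
above 1, which is easy to miss when reading the check on its own.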

> +/*
> + * If PTI is enabled, this maps the LDT into the kernelmode and
> + * usermode tables for the given mm.
> + *
> + * There is no corresponding unmap function.  Even if the LDT is freed, we
> + * leave the PTEs around until the slot is reused or the mm is destroyed.
> + * This is harmless: the LDT is always in ordinary memory, and no one will
> + * access the freed slot.
> + *
> + * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
> + * it useful, and the flush would slow down modify_ldt().
> + */
> +static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
> +{
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +	spinlock_t *ptl;
> +	bool is_vmalloc;
> +	bool had_top_level_entry;
> +	pgd_t *pgd;
> +	int i;
> +
> +	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +		return 0;
> +
> +	WARN_ON(ldt->slot != -1);

Only allow mapping newly-allocated LDTs?

> +	/*
> +	 * Did we already have the top level entry allocated?  We can't
> +	 * use pgd_none() for this because it doesn't do anything on
> +	 * 4-level page table kernels.
> +	 */
> +	pgd = pgd_offset(mm, LDT_BASE_ADDR);
> +	had_top_level_entry = (pgd->pgd != 0);
> +
> +	is_vmalloc = is_vmalloc_addr(ldt->entries);
> +
> +	for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
> +		unsigned long offset = i << PAGE_SHIFT;
> +		unsigned long va = (unsigned long)ldt_slot_va(slot) + offset;
> +		const void *src = (char *)ldt->entries + offset;
> +		unsigned long pfn = is_vmalloc ? vmalloc_to_pfn(src) :
> +			page_to_pfn(virt_to_page(src));
> +		pte_t pte, *ptep;
> +
> +		ptep = get_locked_pte(mm, va, &ptl);

It's *probably* worth calling out that all the page table allocation
happens in there.  I went looking for it in this patch and it took me a
few minutes to find it.

> +		if (!ptep)
> +			return -ENOMEM;
> +		pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));

This ~_PAGE_GLOBAL is for the same reason as all the other KPTI code,
right?  BTW, does this function belong in the LDT code or in the KPTI code?

> +			      set_pte_at(mm, va, ptep, pte);
> +		pte_unmap_unlock(ptep, ptl);
> +	}

Might want to fix up the set_pte_at() whitespace damage.

> +	if (mm->context.ldt) {
> +		/*
> +		 * We already had an LDT.  The top-level entry should already
> +		 * have been allocated and synchronized with the usermode
> +		 * tables.
> +		 */
> +		WARN_ON(!had_top_level_entry);
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
> +	} else {
> +		/*
> +		 * This is the first time we're mapping an LDT for this process.
> +		 * Sync the pgd to the usermode tables.
> +		 */
> +		WARN_ON(had_top_level_entry);
> +		if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
> +			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
> +			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> +		}
> +	}
> +
> +	flush_tlb_mm_range(mm,
> +			   (unsigned long)ldt_slot_va(slot),
> +			   (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE,
> +			   0);

Why wait until here to flush?  Isn't this primarily for the case where
set_pte_at() overwrote something?

> +	ldt->slot = slot;
> +
> +	return 0;
> +#else
> +	return -EINVAL;
> +#endif
> +}
> +
> +static void free_ldt_pgtables(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +	struct mmu_gather tlb;
> +	unsigned long start = LDT_BASE_ADDR;
> +	unsigned long end = start + (1UL << PGDIR_SHIFT);
> +
> +	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
> +		return;
> +
> +	tlb_gather_mmu(&tlb, mm, start, end);
> +	free_pgd_range(&tlb, start, end, start, end);
> +	tlb_finish_mmu(&tlb, start, end);
> +#endif
> +}

Isn't this primarily called at exit()?  Isn't it a bit of a shame we
can't combine this with the other exit()-time TLB flushing?

>  /* After calling this, the LDT is immutable. */
>  static void finalize_ldt_struct(struct ldt_struct *ldt)
>  {
> @@ -134,17 +237,15 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
>  	int retval = 0;
>  
>  	mutex_init(&mm->context.lock);
> +	mm->context.ldt = NULL;
> +
>  	old_mm = current->mm;
> -	if (!old_mm) {
> -		mm->context.ldt = NULL;
> +	if (!old_mm)
>  		return 0;
> -	}
>  
>  	mutex_lock(&old_mm->context.lock);
> -	if (!old_mm->context.ldt) {
> -		mm->context.ldt = NULL;
> +	if (!old_mm->context.ldt)
>  		goto out_unlock;
> -	}
>  
>  	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
>  	if (!new_ldt) {
> @@ -155,8 +256,17 @@ int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
>  	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
>  	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
>  	finalize_ldt_struct(new_ldt);
> +	retval = map_ldt_struct(mm, new_ldt, 0);
> +	if (retval)
> +		goto out_free;
>  
>  	mm->context.ldt = new_ldt;
> +	goto out_unlock;
> +
> +out_free:
> +	free_ldt_pgtables(mm);
> +	free_ldt_struct(new_ldt);
> +	return retval;
>  
>  out_unlock:
>  	mutex_unlock(&old_mm->context.lock);
> @@ -174,6 +284,11 @@ void destroy_context_ldt(struct mm_struct *mm)
>  	mm->context.ldt = NULL;
>  }
>  
> +void ldt_arch_exit_mmap(struct mm_struct *mm)
> +{
> +	free_ldt_pgtables(mm);
> +}
> +
>  static int read_ldt(void __user *ptr, unsigned long bytecount)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -285,6 +400,11 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>  
>  	new_ldt->entries[ldt_info.entry_number] = ldt;
>  	finalize_ldt_struct(new_ldt);
> +	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
> +	if (error) {
> +		free_ldt_struct(old_ldt);
> +		goto out_unlock;
> +	}

Ahh, and the slots finally get used in the last hunk!  That was a long
wait! :)

OK, so slots can be 0 or 1, and we need that so we *have* an LDT to use
while we're setting up the new one.  Sounds sane, but it was pretty
non-obvious from everything up to this point and it's still pretty hard
to spot with the !old_ldt->slot in there.

Might be worth commenting where 'slot' is defined.
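
For anyone else reading along, the slot selection in that hunk boils down
to a two-entry ping-pong.  A minimal user-space sketch of it follows;
struct ldt_model and next_ldt_slot() are made-up names for illustration
and model only the 'slot' field, not the real struct ldt_struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: only the 'slot' field of the LDT is represented. */
struct ldt_model {
	int slot;
};

/*
 * Mirrors the expression passed to map_ldt_struct() in write_ldt():
 *     old_ldt ? !old_ldt->slot : 0
 * With no previous LDT, use slot 0.  Otherwise take the slot the old
 * LDT is not using, so the old mapping stays usable while the new
 * one is being set up.
 */
static int next_ldt_slot(const struct ldt_model *old_ldt)
{
	return old_ldt ? !old_ldt->slot : 0;
}

static void next_slot_examples(void)
{
	struct ldt_model in_slot0 = { .slot = 0 };
	struct ldt_model in_slot1 = { .slot = 1 };

	assert(next_ldt_slot(NULL) == 0);	/* first LDT for this mm */
	assert(next_ldt_slot(&in_slot0) == 1);	/* 0 -> 1 */
	assert(next_ldt_slot(&in_slot1) == 0);	/* 1 -> 0 */
}
```

A comment along these lines next to the field would have saved me the
detour through the whole patch.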

Also, from a high level, this does increase the overhead of KPTI in a
non-trivial way, right?  It costs us three more page table pages per
process, allocated at fork() and freed at exit(), plus a new TLB flush.