linux-kernel - Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on

Message-ID: <CALCETrU6V0LJwTTJFbgmq-BvG5MpDXnXTGhB8O8_dJfU=4kFSw@mail.gmail.com>
Date:   Mon, 11 Dec 2017 10:40:13 -0800
From:   Andy Lutomirski <luto@...nel.org>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     Andy Lutomirski <luto@...nel.org>, X86 ML <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Borislav Petkov <bp@...en8.de>,
        Brian Gerst <brgerst@...il.com>,
        David Laight <David.Laight@...lab.com>,
        Kees Cook <keescook@...omium.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH PTI v2 6/6] x86/pti: Put the LDT in its own PGD if PTI is on

On Mon, Dec 11, 2017 at 9:49 AM, Dave Hansen <dave.hansen@...el.com> wrote:
> So, before this,
>
> On 12/10/2017 10:47 PM, Andy Lutomirski wrote:
> ...
>> +     if (unlikely(ldt)) {
>> +             if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
>> +                     if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
>> +                             clear_LDT();
>> +                             return;
>> +                     }
>
> I'm missing the purpose of the slots.  Are you hoping to use those
> eventually for randomization, but just punting on implementing it for now?
>
>> +
>> +                     set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
>> +             } else {
>> +                     set_ldt(ldt->entries, ldt->nr_entries);
>> +             }
>
> This seems like a much better place to point out why the aliasing exists
> and what it is doing than the place it is actually commented.
>
> Maybe:
>
>                         /*
>                          * ldt->entries is not mapped into the user page
>                          * tables when page table isolation is enabled.
>                          * Point the hardware to the alias we created.
>                          */
>                         set_ldt(ldt_slot_va(ldt->slot), ...
>                 } else {
>                         /*
>                          * Point the hardware at the normal kernel
>                          * mapping when not isolated.
>                          */
>                         set_ldt(ldt->entries, ldt->nr_entries);
>                 }
>

Good call.
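
To make the aliasing concrete, here is a minimal userspace sketch of how the address handed to set_ldt() could be chosen. The SKETCH_* constants and the sketch_* helpers are hypothetical names for illustration; in particular the base address is a placeholder and the stride is only assumed to be one maximum-size LDT (8192 entries of 8 bytes). The real definitions of LDT_BASE_ADDR, LDT_SLOT_STRIDE and ldt_slot_va() are in the patch and may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder base address for the LDT alias region (assumption). */
#define SKETCH_LDT_BASE_ADDR   0xffffff0000000000ULL
/* Assumed stride: room for one maximum-size LDT, 8192 entries * 8 bytes. */
#define SKETCH_LDT_SLOT_STRIDE (8192ULL * 8ULL)

/* Virtual address of the alias for a given slot (0 or 1). */
static uint64_t sketch_ldt_slot_va(int slot)
{
	return SKETCH_LDT_BASE_ADDR + (uint64_t)slot * SKETCH_LDT_SLOT_STRIDE;
}

/*
 * Pick the address the hardware should be pointed at.  With PTI on, the
 * normal kernel mapping is not present in the user page tables, so the
 * alias is used; otherwise the normal kernel mapping is used directly.
 */
static uint64_t sketch_ldt_hw_base(uint64_t kernel_entries, int slot, bool pti)
{
	return pti ? sketch_ldt_slot_va(slot) : kernel_entries;
}
```

With these assumed values the two slots sit exactly one stride apart, so a new LDT can be mapped into the unused slot while the old one is still loaded.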

>>  /*
>> - * User space process size. 47bits minus one guard page.  The guard
>> - * page is necessary on Intel CPUs: if a SYSCALL instruction is at
>> - * the highest possible canonical userspace address, then that
>> - * syscall will enter the kernel with a non-canonical return
>> - * address, and SYSRET will explode dangerously.  We avoid this
>> - * particular problem by preventing anything from being mapped
>> - * at the maximum canonical address.
>> + * User space process size.  This is the first address outside the user range.
>> + * There are a few constraints that determine this:
>> + *
>> + * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
>> + * address, then that syscall will enter the kernel with a
>> + * non-canonical return address, and SYSRET will explode dangerously.
>> + * We avoid this particular problem by preventing anything executable
>> + * from being mapped at the maximum canonical address.
>> + *
>> + * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
>> + * CPUs malfunction if they execute code from the highest canonical page.
>> + * They'll speculate right off the end of the canonical space, and
>> + * bad things happen.  This is worked around in the same way as the
>> + * Intel problem.
>> + *
>> + * With page table isolation enabled, we map the LDT in ... [stay tuned]
>>   */
>>  #define TASK_SIZE_MAX        ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
>>
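
For reference, the arithmetic behind TASK_SIZE_MAX can be checked in userspace. This is only a sketch assuming 4-level paging (a 47-bit user address space, as the removed comment says) and 4 KiB pages; the SKETCH_* names and sketch_* helpers are hypothetical stand-ins for the kernel macros.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_VIRTUAL_MASK_SHIFT 47      /* assumed: 4-level paging */
#define SKETCH_PAGE_SIZE          4096ULL /* assumed: 4 KiB pages */

/* First address outside the user range: 2^47 minus one guard page. */
static uint64_t sketch_task_size_max(void)
{
	return (1ULL << SKETCH_VIRTUAL_MASK_SHIFT) - SKETCH_PAGE_SIZE;
}

/* True if [addr, addr + len) lies entirely below the user limit. */
static bool sketch_range_is_user(uint64_t addr, uint64_t len)
{
	uint64_t limit = sketch_task_size_max();

	return addr <= limit && len <= limit - addr;
}
```

Under these assumptions the limit is 0x7ffffffff000, so the highest canonical page of the lower half can never hold user code.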
>> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
>> index ae5615b03def..46ad333ed797 100644
>> --- a/arch/x86/kernel/ldt.c
>> +++ b/arch/x86/kernel/ldt.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/uaccess.h>
>>
>>  #include <asm/ldt.h>
>> +#include <asm/tlb.h>
>>  #include <asm/desc.h>
>>  #include <asm/mmu_context.h>
>>  #include <asm/syscalls.h>
>> @@ -46,13 +47,12 @@ static void refresh_ldt_segments(void)
>>  static void flush_ldt(void *__mm)
>>  {
>>       struct mm_struct *mm = __mm;
>> -     mm_context_t *pc;
>>
>>       if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
>>               return;
>>
>> -     pc = &mm->context;
>> -     set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
>> +     __flush_tlb_all();
>> +     load_mm_ldt(mm);
>
> Why the new TLB flush?

It was an attempt to debug a bug and I forgot to delete it.

>> @@ -90,9 +90,112 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>>       }
>>
>>       new_ldt->nr_entries = num_entries;
>> +     new_ldt->slot = -1;
>>       return new_ldt;
>>  }
>
> This seems a bit silly to do given that 'slot' is an int and this patch
> introduces warnings looking for positive values:
>
>         if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
>
> Seems like a good idea to just have a single warning in there looking
> for non-zero, probably covering the PTI and non-PTI cases (at least for
> now until the slots get used).

The idea is to warn if we haven't mapped it yet.
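
A small userspace sketch of why the -1 initializer and the unsigned comparison work together: casting a negative int to unsigned long yields a huge value, so a single "> 1" test rejects both the not-yet-mapped marker and any out-of-range slot. The helper name is hypothetical.

```c
#include <stdbool.h>

/*
 * Mirrors the check "(unsigned long)ldt->slot > 1".
 * slot == -1 converts to ULONG_MAX, so it is caught along with
 * any slot number above 1; only 0 and 1 pass.
 */
static bool sketch_slot_is_bad(int slot)
{
	return (unsigned long)slot > 1;
}
```

So the warning fires for an LDT that was allocated but never passed through map_ldt_struct(), which is the case the check is meant to catch.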

>
>> +/*
>> + * If PTI is enabled, this maps the LDT into the kernelmode and
>> + * usermode tables for the given mm.
>> + *
>> + * There is no corresponding unmap function.  Even if the LDT is freed, we
>> + * leave the PTEs around until the slot is reused or the mm is destroyed.
>> + * This is harmless: the LDT is always in ordinary memory, and no one will
>> + * access the freed slot.
>> + *
>> + * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
>> + * it useful, and the flush would slow down modify_ldt().
>> + */
>> +static int map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
>> +{
>> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
>> +     spinlock_t *ptl;
>> +     bool is_vmalloc;
>> +     bool had_top_level_entry;
>> +     pgd_t *pgd;
>> +     int i;
>> +
>> +     if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
>> +             return 0;
>> +
>> +     WARN_ON(ldt->slot != -1);
>
> Only allow mapping newly-allocated LDTs?
>
>> +     /*
>> +      * Did we already have the top level entry allocated?  We can't
>> +      * use pgd_none() for this because it doesn't do anything on
>> +      * 4-level page table kernels.
>> +      */
>> +     pgd = pgd_offset(mm, LDT_BASE_ADDR);
>> +     had_top_level_entry = (pgd->pgd != 0);
>> +
>> +     is_vmalloc = is_vmalloc_addr(ldt->entries);
>> +
>> +     for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
>> +             unsigned long offset = i << PAGE_SHIFT;
>> +             unsigned long va = (unsigned long)ldt_slot_va(slot) + offset;
>> +             const void *src = (char *)ldt->entries + offset;
>> +             unsigned long pfn = is_vmalloc ? vmalloc_to_pfn(src) :
>> +                     page_to_pfn(virt_to_page(src));
>> +             pte_t pte, *ptep;
>> +
>> +             ptep = get_locked_pte(mm, va, &ptl);
>
> It's *probably* worth calling out that all the page table allocation
> happens in there.  I went looking for it in this patch and it took me a
> few minutes to find it.
>
>> +             if (!ptep)
>> +                     return -ENOMEM;
>> +             pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));
>
> This ~_PAGE_GLOBAL is for the same reason as all the other KPTI code,
> right?  BTW, does this function deserve to be in the LDT code or kpti?
>
>> +                           set_pte_at(mm, va, ptep, pte);
>> +             pte_unmap_unlock(ptep, ptl);
>> +     }
>
> Might want to fix up the set_pte_at() whitespace damage.

It's not damaged -- it lines up nicely if you look at the file instead
of the patch :)
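
On the mapping loop quoted above: its bound, "i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE", installs one PTE per page the LDT occupies. A userspace sketch of that count, assuming 4 KiB pages and 8-byte descriptors (the SKETCH_* names and the helper are hypothetical):

```c
#define SKETCH_PAGE_SIZE      4096UL /* assumed: 4 KiB pages */
#define SKETCH_LDT_ENTRY_SIZE 8UL    /* assumed: 8-byte descriptors */

/* Number of loop iterations, i.e. PTEs installed, for a given LDT size. */
static unsigned long sketch_ldt_pages(unsigned long nr_entries)
{
	unsigned long i;

	/* Same bound as the loop in map_ldt_struct(). */
	for (i = 0; i * SKETCH_PAGE_SIZE < nr_entries * SKETCH_LDT_ENTRY_SIZE; i++)
		;

	return i;
}
```

Under these assumptions 512 entries fit in one page, and a full 8192-entry LDT needs 16.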

>
>> +     if (mm->context.ldt) {
>> +             /*
>> +              * We already had an LDT.  The top-level entry should already
>> +              * have been allocated and synchronized with the usermode
>> +              * tables.
>> +              */
>> +             WARN_ON(!had_top_level_entry);
>> +             if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
>> +                     WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
>> +     } else {
>> +             /*
>> +              * This is the first time we're mapping an LDT for this process.
>> +              * Sync the pgd to the usermode tables.
>> +              */
>> +             WARN_ON(had_top_level_entry);
>> +             if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI)) {
>> +                     WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
>> +                     set_pgd(kernel_to_user_pgdp(pgd), *pgd);
>> +             }
>> +     }
>> +
>> +     flush_tlb_mm_range(mm,
>> +                        (unsigned long)ldt_slot_va(slot),
>> +                        (unsigned long)ldt_slot_va(slot) + LDT_SLOT_STRIDE,
>> +                        0);
>
> Why wait until here to flush?  Isn't this primarily for the case where
> set_pte_at() overwrote something?

I think it would be okay if we did it sooner, but CPUs are allowed to
cache intermediate mappings, and we're changing the userspace tables
above it.

>> +
>> +static void free_ldt_pgtables(struct mm_struct *mm)
>> +{
>> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
>> +     struct mmu_gather tlb;
>> +     unsigned long start = LDT_BASE_ADDR;
>> +     unsigned long end = start + (1UL << PGDIR_SHIFT);
>> +
>> +     if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_PTI))
>> +             return;
>> +
>> +     tlb_gather_mmu(&tlb, mm, start, end);
>> +     free_pgd_range(&tlb, start, end, start, end);
>> +     tlb_finish_mmu(&tlb, start, end);
>> +#endif
>> +}
>
> Isn't this primarily called at exit()?  Isn't it a bit of a shame we
> can't combine this with the other exit()-time TLB flushing?

Yes.  In fact, we don't really need that flush at all since the mm is
totally dead.  But the free_pgd_range() API is too dumb.  And yes, we
have the same issue in the normal mm/memory.c code.  In general, exit
handling is seriously overengineered.

> Also, from a high level, this does increase the overhead of KPTI in a
> non-trivial way, right?  It costs us three more page table pages per
> process allocated at fork() and freed at exit() and a new TLB flush.

Yeah, but no one will care.  modify_ldt() is used for DOSEMU, Wine,
and really old 32-bit programs.