linux-kernel - Re: [RFC 2/2] x86/pti/64: Remove the SYSCALL64 entry trampoline

Message-Id: <422DF5AC-6B45-406F-B3FC-DD1AA9BC18F6@amacapital.net>
Date:   Sun, 22 Jul 2018 13:59:21 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andrew Lutomirski <luto@...nel.org>,
        the arch/x86 maintainers <x86@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>
Subject: Re: [RFC 2/2] x86/pti/64: Remove the SYSCALL64 entry trampoline

> On Jul 22, 2018, at 11:27 AM, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> 
>> On Sun, Jul 22, 2018 at 10:45 AM Andy Lutomirski <luto@...nel.org> wrote:
>> 
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
> 
> Me likey.
> 
> However:
> 
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless.
> 
> Afaik, it does now potentially expose through meltdown the per-thread
> entry stack info, which is new.

It’s always been exposed through the RO alias. The only new exposure is the *address* of the RW alias, I think.

> 
> But I don't think that's a show-stopper.
> 
>> static void __init pti_clone_user_shared(void)
>> {
>> +       for_each_possible_cpu(cpu) {
> 
> But this code is pretty disgusting and seems wrong.
> 
> Do you really want to do all the _possible_ cpu's, not just the
> online ones? I'd rather expose less (think MAXCPU) and then have the
> CPU hotplug code expose the page as the CPU comes up?

We already have exactly the same issue for cpu_entry_area. If we change it, I think we should do cpu_entry_area at the same time.  But that’s awkward because cpu_entry_area is mapped one PMD at a time right now.

It’s also awkward to expose a percpu page dynamically, because (I think) percpu data isn’t guaranteed to all be in the same PGD-sized area. A vmalloc fault in the early SYSCALL64 path is fatal.
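
To make the trade-off concrete, here is a toy userspace model of the two policies: mapping every possible CPU's TSS page at boot (what the patch does) versus mapping a page only when its CPU comes online (what the review suggests). The model_* names and the 64-CPU bitmask are made up for illustration; none of this is kernel API, and it ignores the page table allocation and vmalloc-fault problems described above.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64   /* hypothetical limit so one uint64_t mask suffices */

struct model_user_map {
    uint64_t mapped;       /* bit n set: CPU n's TSS page is in the user tables */
};

/* Boot-time policy: expose the page of every possible CPU up front. */
static void model_map_possible(struct model_user_map *m, uint64_t possible_mask)
{
    m->mapped |= possible_mask;
}

/* Hotplug policy: expose a CPU's page only when that CPU comes online. */
static void model_cpu_online(struct model_user_map *m, unsigned int cpu)
{
    assert(cpu < MODEL_NR_CPUS);
    m->mapped |= 1ULL << cpu;
}

/* Number of per-CPU pages currently visible in the user page tables. */
static unsigned int model_count_mapped(const struct model_user_map *m)
{
    unsigned int n = 0;

    for (uint64_t v = m->mapped; v; v &= v - 1)  /* clear lowest set bit */
        n++;
    return n;
}
```

With 8 possible CPUs and 2 online, the first policy exposes 8 pages and the second exposes 2; the difference grows with the gap between possible and online CPUs.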

> 
>> +               unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
>> +               phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
>> +               pte_t *target_pte;
>> +
>> +               target_pte = pti_user_pagetable_walk_pte(va);
> 
> This function only exists if CONFIG_X86_VSYSCALL_EMULATION, so it
> won't even compile under (very unusual) configurations.

Oops.

> 
> The "disgusting" part is that I think it could/should share more code
> with the vsyscall case, and the whole target-pte checking and setting
> should be shared too.

I tried that. It was uglier. The percpu code wants to make up a new PTE because the real kernel mapping uses large pages. The vsyscall code wants to copy a PTE because it’s really a PTE and it has unusual permissions.
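
A rough userspace sketch of why the two cases differ, using a simplified model of x86-64 page table entries. The model_* names and MODEL_* flags are hypothetical stand-ins, not the kernel's pte_t, pfn_pte() or PAGE_KERNEL; only the bit positions follow the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of x86-64 page table entries; not kernel code. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)
#define MODEL_PRESENT    (1ULL << 0)
#define MODEL_RW         (1ULL << 1)
#define MODEL_USER       (1ULL << 2)
#define MODEL_NX         (1ULL << 63)
#define MODEL_PFN_MASK   0x000ffffffffff000ULL

typedef uint64_t model_pte_t;

/*
 * Percpu case: the kernel maps this memory with a large page, so there is
 * no 4K kernel PTE to copy. Build a fresh 4K entry from the physical
 * address of the one page we want, plus explicitly chosen protections.
 */
static model_pte_t model_make_pte(uint64_t pa, uint64_t prot)
{
    assert((pa & (MODEL_PAGE_SIZE - 1)) == 0);   /* must be page aligned */

    uint64_t pfn = pa >> MODEL_PAGE_SHIFT;

    return ((pfn << MODEL_PAGE_SHIFT) & MODEL_PFN_MASK) | prot;
}

/*
 * vsyscall case: the kernel side already is a 4K PTE with unusual
 * permissions, so the user copy takes it verbatim to preserve them.
 */
static model_pte_t model_copy_pte(const model_pte_t *kernel_pte)
{
    return *kernel_pte;
}
```

The first helper has to pick the protection bits itself, while the second inherits whatever the kernel entry carries, which is why folding both into one shared path gets awkward.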

> 
> Because not being shared, I react to this:
> 
>> +               set_pte(target_pte, pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL));
> 
> Hmm. The vsyscall code just does
> 
>        *target_pte = ..
> 
> without any set_pte() stuff. Do we want/need the PVOP cases, and if
> so, why doesn't the vsyscall case need it?

It doesn’t need it. I could use plain assignment.

> 
> Anyway, I love the approach, and how this gets rid of the nasty
> trampoline, so no real complaints, just "this needs some fixups".
> 
> 

I’ll do the fixups. I think that, if we want to unmap the pages for CPUs that aren’t present, that should be a separate patch. I’m also not convinced it adds much value.

In general, PTI is fairly crappy, and it leaks all kinds of information. I suspect the worst leak is the NMI stack for local and remote CPUs. Fixing *that* is going to be fugly, but may actually be important, because I can easily imagine malicious user code that causes arbitrary kernel memory to get read and spilled on the NMI stack.

What we *should* do IMO is defer allocation of percpu space for not-present CPUs to save a bunch of memory.  But that’s a major change and will probably break things.