Message-Id: <422DF5AC-6B45-406F-B3FC-DD1AA9BC18F6@amacapital.net>
Date: Sun, 22 Jul 2018 13:59:21 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andrew Lutomirski <luto@...nel.org>,
the arch/x86 maintainers <x86@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>
Subject: Re: [RFC 2/2] x86/pti/64: Remove the SYSCALL64 entry trampoline
> On Jul 22, 2018, at 11:27 AM, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
>
>> On Sun, Jul 22, 2018 at 10:45 AM Andy Lutomirski <luto@...nel.org> wrote:
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>
> Me likey.
>
> However:
>
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless.
>
> Afaik, it does now potentially expose through meltdown the per-thread
> entry stack info, which is new.
It’s always been exposed through the RO alias. The only new exposure is the *address* of the RW alias, I think.
>
> But I don't think that's a show-stopper.
>
>> static void __init pti_clone_user_shared(void)
>> {
>> + for_each_possible_cpu(cpu) {
>
> But this code is pretty disgusting and seems wrong.
>
> Do you really want to do all the _possible_ cpu's, not just the
> online ones? I'd rather expose less (think MAXCPU) and then have the
> CPU hotplug code expose the page as the CPU comes up?
We already have exactly the same issue for cpu_entry_area. If we change it, I think we should do cpu_entry_area at the same time. But that’s awkward because cpu_entry_area is mapped one PMD at a time right now.
It’s also awkward to expose a percpu page dynamically, because (I think) percpu data isn’t guaranteed to all be in the same PGD-sized area. A vmalloc fault in the early SYSCALL64 path is fatal.
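To illustrate the PGD concern, here is a minimal userspace sketch, not kernel code. It assumes x86-64 with 4-level paging, where each top-level (PGD) entry covers 2^39 bytes; the `sketch_` names are hypothetical and exist only for this example. Two addresses can share a pre-populated top-level entry only if their PGD indices match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, so one PGD entry maps 2^39 bytes (512 GiB). */
#define SKETCH_PGDIR_SHIFT  39
#define SKETCH_PTRS_PER_PGD 512

/* Index of the top-level page-table entry that covers a virtual address. */
static unsigned int sketch_pgd_index(uint64_t va)
{
	return (unsigned int)((va >> SKETCH_PGDIR_SHIFT) &
			      (SKETCH_PTRS_PER_PGD - 1));
}

/* True if both addresses are reached through the same PGD entry. */
static bool sketch_same_pgd_entry(uint64_t a, uint64_t b)
{
	return sketch_pgd_index(a) == sketch_pgd_index(b);
}
```

If two percpu areas fail this check, mapping one of them later means touching a different top-level entry, which is where the awkwardness comes from.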
>
>> + unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
>> + phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
>> + pte_t *target_pte;
>> +
>> + target_pte = pti_user_pagetable_walk_pte(va);
>
> This function only exists if CONFIG_X86_VSYSCALL_EMULATION, so it
> won't even compile under (very unusual) configurations.
Oops.
>
> The "disgusting" part is that I think it could/should share more code
> with the vsyscall case, and the whole target-pte checking and setting
> should be shared too.
I tried that. It was uglier. The percpu code wants to make up a new PTE because the real kernel mapping uses large pages. The vsyscall code wants to copy a PTE because it’s really a PTE and it has unusual permissions.
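The difference between the two cases can be sketched in userspace C. This is a simplified model, not the kernel's implementation: the `sketch_` names are hypothetical, and only the present and writable bits are modeled, whereas the real PAGE_KERNEL carries more flags.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12
#define SKETCH_PAGE_PRESENT 0x001ULL	/* x86 PTE bit 0 */
#define SKETCH_PAGE_RW      0x002ULL	/* x86 PTE bit 1 */

typedef uint64_t sketch_pte_t;

/*
 * percpu case: the kernel mapping may be a large page, so there is no
 * 4K PTE to copy. Build a fresh one from the physical address instead.
 */
static sketch_pte_t sketch_make_pte(uint64_t pa, uint64_t prot)
{
	uint64_t pfn = pa >> SKETCH_PAGE_SHIFT;

	return (pfn << SKETCH_PAGE_SHIFT) | prot;
}

/*
 * vsyscall case: a real 4K PTE with unusual permissions already exists,
 * so copy it verbatim to preserve those permissions.
 */
static void sketch_copy_pte(sketch_pte_t *target, const sketch_pte_t *src)
{
	*target = *src;
}
```

Sharing one helper would have to take either a physical address plus protections or an existing PTE, which is roughly why the combined version came out uglier.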
>
> Because not being shared, I react to this:
>
>> + set_pte(target_pte, pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL));
>
> Hmm. The vsyscall code just does
>
> *target_pte = ..
>
> without any set_pte() stuff. Do we want/need the PVOP cases, and if
> so, why doesn't the vsyscall case need it?
It doesn’t need it. I could use plain assignment.
>
> Anyway, I love the approach, and how this gets rid of the nasty
> trampoline, so no real complaints, just "this needs some fixups".
>
>
I’ll do the fixups. I think that, if we want to unmap the pages for CPUs that aren’t present, that should be a separate patch. I’m also not convinced it adds much value.
In general, PTI is fairly crappy, and it leaks all kinds of information. I suspect the worst leak is the NMI stack for local and remote CPUs. Fixing *that* is going to be fugly, but may actually be important, because I can easily imagine malicious user code that causes arbitrary kernel memory to get read and spilled on the NMI stack.
What we *should* do IMO is defer allocation of percpu space for not-present CPUs to save a bunch of memory. But that’s a major change and will probably break things.