linux-kernel - Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALCETrVeb7g+wAP8hox9TEBBWYAFXGzOv5t2aQVrtuW4D6NQQQ@mail.gmail.com>
Date:   Thu, 25 Jan 2018 09:04:21 -0800
From:   Andy Lutomirski <luto@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Arjan van de Ven <arjan@...ux.intel.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        LKML <linux-kernel@...r.kernel.org>,
        KarimAllah Ahmed <karahmed@...zon.de>,
        Andi Kleen <ak@...ux.intel.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Andy Lutomirski <luto@...nel.org>,
        Ashok Raj <ashok.raj@...el.com>,
        Asit Mallick <asit.k.mallick@...el.com>,
        Borislav Petkov <bp@...e.de>,
        Dan Williams <dan.j.williams@...el.com>,
        Dave Hansen <dave.hansen@...el.com>,
        David Woodhouse <dwmw@...zon.co.uk>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "H . Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
        Janakarajan Natarajan <Janakarajan.Natarajan@....com>,
        Joerg Roedel <joro@...tes.org>,
        Jun Nakajima <jun.nakajima@...el.com>,
        Laura Abbott <labbott@...hat.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Radim Krcmar <rkrcmar@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Tom Lendacky <thomas.lendacky@....com>, X86 ML <x86@...nel.org>
Subject: Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same
 user process

On Thu, Jan 25, 2018 at 8:41 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Thu, Jan 25, 2018 at 06:07:07AM -0800, Arjan van de Ven wrote:
>> On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
>> > On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:
>> > > >
>> > > > This means that 'A -> idle -> A' should never pass through switch_mm to
>> > > > begin with.
>> > > >
>> > > > Please clarify how you think it does.
>> > > >
>> > >
>> > > the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
>> > > for a tlb flush.
>> >
>> > The intel_idle code does, not the idle code. This is squirreled away in
>> > some driver :/
>>
>> afaik (but haven't looked in a while) acpi drivers did too
>
> Only makes it worse.. drivers shouldn't be frobbing with things like
> this.
>
>> > > (trust me, that you really want, sequentially IPI's a pile of cores in a deep sleep
>> > > state to just flush a tlb that's empty, the performance of that is horrific)
>> >
>> > Hurmph. I'd rather fix that some other way than leave_mm(), this is
>> > piling special on special.
>> >
>> the problem was tricky. but of course if something better is possible lets figure this out
>
> How about something like the below? It boots with "nopcid" appended to
> the cmdline.
>
> Andy, could you pretty please have a look at this? This is fickle code
> at best and I'm sure I messed _something_ up.
>
> The idea is simple, do what we do for virt. Don't send IPI's to CPUs
> that don't need them (in virt's case because the vCPU isn't running, in
> our case because we're not in fact running a user process), but mark the
> CPU as having needed a TLB flush.

I haven't tried to fully decipher the patch, but I think the idea is
wrong.  (I think it's the same wrong idea that Rik and I both had and
that I got into Linus' tree for a while...)  The problem is that it's
not actually correct to run indefinitely in kernel mode using stale
cached page table data.  The stale PTEs themselves are fine, but the
stale intermediate translations can cause the CPU to speculatively
load complete garbage into the TLB, and that's bad (and causes MCEs on
AMD CPUs).

I think we only really have two choices: tlb_defer_switch_to_init_mm()
== true and tlb_defer_switch_to_init_mm() == false.  The current
heuristic is to not defer if we have PCID, because loading CR3 is
reasonably fast.



>  void native_flush_tlb_others(const struct cpumask *cpumask,
>                              const struct flush_tlb_info *info)
>  {
> +       struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__tlb_mask);
> +
>         count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
>         if (info->end == TLB_FLUSH_ALL)
>                 trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> @@ -531,6 +543,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>                                                (void *)info, 1);
>                 return;
>         }
> +
> +       if (tlb_defer_switch_to_init_mm() && flushmask) {
> +               int cpu;
> +
> +               cpumask_copy(flushmask, cpumask);
> +               for_each_cpu(cpu, flushmask) {
> +                       if (cmpxchg(per_cpu_ptr(&cpu_tlbstate.is_lazy, cpu), 1, 2) >= 1)
> +                               __cpumask_clear_cpu(cpu, flushmask);

If this code path here executes and we're flushing because we just
removed a reference to a page table and we're about to free the page
table, then the CPU that we didn't notify by IPI can start using
whatever gets written to the pagetable after it's freed, and that's
bad :(

> +               }
> +
> +               cpumask = flushmask;
> +       }
> +
>         smp_call_function_many(cpumask, flush_tlb_func_remote,
>                                (void *)info, 1);
>  }