Message-ID: <CALCETrXY+fKsqE9ZKSJaFuw0EUh9zmt342VdG6fxDjx96LswWw@mail.gmail.com>
Date: Mon, 9 Oct 2017 10:50:34 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Borislav Petkov <bp@...en8.de>
Cc: Andy Lutomirski <luto@...nel.org>, X86 ML <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Markus Trippelsdorf <markus@...ppelsdorf.de>,
Adam Borowski <kilobyte@...band.pl>,
Brian Gerst <brgerst@...il.com>,
Johannes Hirte <johannes.hirte@...enkhaos.de>
Subject: Re: [RFC PATCH] x86/mm: Flush more aggressively in lazy TLB mode
On Mon, Oct 9, 2017 at 10:36 AM, Borislav Petkov <bp@...en8.de> wrote:
> On Mon, Oct 09, 2017 at 07:02:31PM +0200, Borislav Petkov wrote:
>> From 29a6426c20f25b8df767356a04727cb113e8e784 Mon Sep 17 00:00:00 2001
>> From: Andy Lutomirski <luto@...nel.org>
>> Date: Mon, 9 Oct 2017 09:50:49 -0700
>> Subject: [PATCH] x86/mm: Flush more aggressively in lazy TLB mode
>>
>> Since commit 94b1b03b519b, x86's lazy TLB mode has been all the way
>
> Write it out:
>
> Since commit
>
> 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Will do.
>
> x86's lazy...
>
>> lazy: when running a kernel thread (including the idle thread), the
>> kernel keeps using the last user mm's page tables without attempting
>> to maintain user TLB coherence at all. From a pure semantic
>> perspective, this is fine -- kernel threads won't attempt to access
>> user pages, so having stale TLB entries doesn't matter.
>>
>> Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
>> we also allow any paging-structure caches that may exist on the CPU
>> to become incoherent. This means that we can have a
>> paging-structure cache entry that references a freed page table, and
>> the CPU is within its rights to do a speculative page walk starting
>> at the freed page table.
>>
>> I can imagine this causing two different problems:
>>
>> - A speculative page walk starting from a bogus page table could read
>> IO addresses. I haven't seen any reports of this causing problems.
>>
>> - A speculative page walk that involves a bogus page table can install
>> garbage in the TLB. Such garbage would always be at a user VA, but
>> some AMD CPUs have logic that triggers a machine check when it notices
>> these bogus entries. I've seen a couple reports of this.
>
> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB-DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.
Makes sense. It's even possible that the address is entirely bogus
and doesn't refer to anything at all.
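
FWIW, it's easy to check whether a given box has that optimization on:
TlbCacheDis lives in MSRC001_0015, which we already have as
MSR_K7_HWCR.  Something like this (just a sketch -- the function name
is made up, and I'm going from memory that TlbCacheDis is bit 3, so
double-check the BKDG before trusting it):

	static void __init check_tlbcachedis(void)
	{
		u64 hwcr;

		/* MSR_K7_HWCR is AMD-only; don't poke it elsewhere. */
		if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
			return;

		rdmsrl(MSR_K7_HWCR, hwcr);

		/* TlbCacheDis set => the WB-DRAM optimization is disabled. */
		pr_info("TlbCacheDis: %s\n",
			(hwcr & BIT_ULL(3)) ? "set" : "clear");
	}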
>
>> Reinstate TLB coherence in lazy mode. With this patch applied, we
>> do it in one of two ways. If we have PCID, we simply switch back to
>> init_mm's page tables when we enter a kernel thread -- this seems to
>> be quite cheap except for the cost of serializing the CPU. If we
>> don't have PCID, then we set a flag and switch to init_mm the first
>> time we would otherwise need to flush the TLB.
>>
>> /sys/kernel/debug/x86/tlb_use_lazy_mode can be changed to override
>> the default mode for benchmarking.
>
> So this should be tlb_full_lazy_mode, no?
>
> Because we are lazy before but not "all the way" as you say above...
The two choices are "somewhat lazy" and "not lazy at all".
>
> And frankly, I'm not really crazy about this benchmarking aspect. Why
> would you have this in every kernel? I mean, it could be a patch ontop
> for people to measure on boxes but it is not really worth it. I mean,
> the PCID fun only showed some small improvement in some microbenchmark
> and nothing earth-shattering so having it everywhere to get some meh
> numbers is not really worth the trouble.
>
> Besides, this TLB code is already complex as hell.
The degree of simplification I would get by removing it is basically
nil. The debugfs code itself goes away, and a
static_branch_unlikely() turns into a static_cpu_has(), and that's it.
The real reason I added it is that Chris Mason volunteered to
benchmark it, and I'll send it to him once it survives a bit of
review.
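
To make the knob concrete, the new enter_lazy_tlb() boils down to
roughly this (paraphrased, not the exact hunk from the patch):

	void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
	{
		if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
			return;

		if (static_branch_unlikely(&tlb_use_lazy_mode)) {
			/* Keep the user page tables; defer the CR3 switch. */
			this_cpu_write(cpu_tlbstate.is_lazy, true);
		} else {
			/* Not lazy at all: go to init_mm right away. */
			switch_mm(NULL, &init_mm, NULL);
		}
	}

Dropping the knob would just turn that static_branch_unlikely() into
the PCID check I mentioned above.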
>
> IOW, you could simplify this by removing the static key and adding it
> with a patch ontop for people who wanna play with this. For the majority
> of the systems and use cases it doesn't really matter, IMO.
>
>> In theory, we could optimize this better by only flushing the TLB in
>> lazy CPUs when a page table is freed. Doing that would require
>> auditing the mm code to make sure that all page table freeing goes
>> through tlb_remove_page() as well as reworking some data structures
>> to implement the improved flush logic.
>>
>> Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
>> Reported-by: Markus Trippelsdorf <markus@...ppelsdorf.de>
>> Reported-by: Adam Borowski <kilobyte@...band.pl>
>> Cc: Brian Gerst <brgerst@...il.com>
>> Cc: Borislav Petkov <bp@...en8.de>
>> Signed-off-by: Andy Lutomirski <luto@...nel.org>
>> Cc: "H. Peter Anvin" <hpa@...or.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
>> Cc: Adam Borowski <kilobyte@...band.pl>
>> Cc: Andy Lutomirski <luto@...nel.org>
>> Cc: Borislav Petkov <bp@...en8.de>
>> Cc: Borislav Petkov <bp@...e.de>
>> Cc: Brian Gerst <brgerst@...il.com>
>> Cc: Daniel Borkmann <daniel@...earbox.net>
>> Cc: Eric Biggers <ebiggers@...gle.com>
>> Cc: Ingo Molnar <mingo@...hat.com>
>> Cc: Kees Cook <keescook@...omium.org>
>> Cc: Markus Trippelsdorf <markus@...ppelsdorf.de>
>> Cc: Nadav Amit <nadav.amit@...il.com>
>> Cc: Rik van Riel <riel@...hat.com>
>> Cc: Roman Kagan <rkagan@...tuozzo.com>
>> Cc: Thomas Gleixner <tglx@...utronix.de>
>> Cc: lkml <linux-kernel@...r.kernel.org>
>> Cc: x86-ml <x86@...nel.org>
>> Link: http://lkml.kernel.org/r/8eccc9240041ea7cb94624cab8d07e2a6e911ba7.1507567665.git.luto@kernel.org
>> Signed-off-by: Borislav Petkov <bp@...e.de>
>> ---
>> arch/x86/include/asm/mmu_context.h | 8 +-
>> arch/x86/include/asm/tlbflush.h | 24 ++++++
>> arch/x86/mm/tlb.c | 153 +++++++++++++++++++++++++++----------
>> 3 files changed, 136 insertions(+), 49 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
>> index c120b5db178a..3c856a15b98e 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
>> DEBUG_LOCKS_WARN_ON(preemptible());
>> }
>>
>> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>> -{
>> - int cpu = smp_processor_id();
>> -
>> - if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> - cpumask_clear_cpu(cpu, mm_cpumask(mm));
>> -}
>> +void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>>
>> static inline int init_new_context(struct task_struct *tsk,
>> struct mm_struct *mm)
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 4893abf7f74f..d362161d3291 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -83,6 +83,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>> #endif
>>
>> /*
>> + * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
>> + * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
>> + * it's false, then we immediately switch CR3 when entering a kernel thread.
>> + */
>> +DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
>> +
>> +/*
>> * 6 because 6 should be plenty and struct tlb_state will fit in
>> * two cache lines.
>> */
>> @@ -105,6 +112,23 @@ struct tlb_state {
>> u16 next_asid;
>>
>> /*
>> + * We can be in one of several states:
>> + *
>> + * - Actively using an mm. Our CPU's bit will be set in
>> + * mm_cpumask(loaded_mm) and is_lazy == false;
>> + *
>> + * - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
>> + * will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>
> And this is the old lazy mode, right?
>
> I think we should state what lazy we mean ...
This is non-lazy.  It's roughly what our state was in old kernels
when we went lazy and then called leave_mm().
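
With this patch, the way we fall back into that init_mm state is via
the flush path: if a flush shows up while is_lazy is set, we don't
flush the stale user mm at all, we just leave_mm() instead.  Roughly
(again a sketch of the idea, not the exact hunk):

	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		/*
		 * We were lazily using a real mm.  Rather than flush it,
		 * switch to init_mm -- that's the "not using a real mm"
		 * state described above.
		 */
		leave_mm(smp_processor_id());
		return;
	}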
>
>> + * - Lazily using a real mm. loaded_mm != &init_mm, our bit
>> + * is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> + * We're heuristically guessing that the CR3 load we
>> + * skipped more than makes up for the overhead added by
>> + * lazy mode.
>> + */
>> + bool is_lazy;
>
>
>> +
>> + /*
>> * Access to this CR4 shadow and to H/W CR4 is protected by
>> * disabling interrupts when modifying either one.
>> */
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 49d9778376d7..658bf0090565 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -30,6 +30,8 @@
>>
>> atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>>
>> +DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
>> +
>> static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>> u16 *new_asid, bool *need_flush)
>> {
>> @@ -80,7 +82,7 @@ void leave_mm(int cpu)
>> return;
>>
>> /* Warn if we're not lazy. */
>> - WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
>> + WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
>>
>> switch_mm(NULL, &init_mm, NULL);
>> }
>> @@ -142,45 +144,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>> __flush_tlb_all();
>> }
>> #endif
>> + this_cpu_write(cpu_tlbstate.is_lazy, false);
>>
>> if (real_prev == next) {
>> VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
>> next->context.ctx_id);
>>
>> - if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
>> - /*
>> - * There's nothing to do: we weren't lazy, and we
>> - * aren't changing our mm. We don't need to flush
>> - * anything, nor do we need to update CR3, CR4, or
>> - * LDTR.
>> - */
>> - return;
>> - }
>> -
>> - /* Resume remote flushes and then read tlb_gen. */
>> - cpumask_set_cpu(cpu, mm_cpumask(next));
>> - next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>> -
>> - if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
>> - next_tlb_gen) {
>> - /*
>> - * Ideally, we'd have a flush_tlb() variant that
>> - * takes the known CR3 value as input. This would
>> - * be faster on Xen PV and on hypothetical CPUs
>> - * on which INVPCID is fast.
>> - */
>> - this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
>> - next_tlb_gen);
>> - write_cr3(build_cr3(next, prev_asid));
>> - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
>> - TLB_FLUSH_ALL);
>> - }
>> -
>> /*
>> - * We just exited lazy mode, which means that CR4 and/or LDTR
>> - * may be stale. (Changes to the required CR4 and LDTR states
>> - * are not reflected in tlb_gen.)
>> + * We don't currently support having a real mm loaded without
>> + * our cpu set in mm_cpumask(). We have all the bookkeeping
>
> s/cpu/CPU/
>
>> + * in place to figure out whether we would need to flush
>> + * if our cpu were cleared in mm_cpumask(), but we don't
>
> ditto.
>
>> + * currently use it.
>> */
>> + if (WARN_ON_ONCE(real_prev != &init_mm &&
>> + !cpumask_test_cpu(cpu, mm_cpumask(next))))
>> + cpumask_set_cpu(cpu, mm_cpumask(next));
>> +
>> + return;
>> } else {
>> u16 new_asid;
>> bool need_flush;
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.