Message-ID: <20160111182548.GF6344@twins.programming.kicks-ass.net>
Date: Mon, 11 Jan 2016 19:25:48 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: linux-kernel@...r.kernel.org, dave.hansen@...ux.intel.com,
riel@...hat.com, brgerst@...il.com, akpm@...ux-foundation.org,
luto@...capital.net, mingo@...nel.org, dvlasenk@...hat.com,
hpa@...or.com, tglx@...utronix.de, bp@...en8.de, luto@...nel.org,
torvalds@...ux-foundation.org
Cc: linux-tip-commits@...r.kernel.org
Subject: Re: [tip:x86/urgent] x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
On Mon, Jan 11, 2016 at 03:42:40AM -0800, tip-bot for Andy Lutomirski wrote:
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> #endif
> cpumask_set_cpu(cpu, mm_cpumask(next));
>
> - /* Re-load page tables */
> + /*
> + * Re-load page tables.
> + *
> + * This logic has an ordering constraint:
> + *
> + * CPU 0: Write to a PTE for 'next'
> + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
> + * CPU 1: set bit 1 in next's mm_cpumask
> + * CPU 1: load from the PTE that CPU 0 writes (implicit)
> + *
> + * We need to prevent an outcome in which CPU 1 observes
> + * the new PTE value and CPU 0 observes bit 1 clear in
> + * mm_cpumask. (If that occurs, then the IPI will never
> + * be sent, and CPU 0's TLB will contain a stale entry.)
> + *
> + * The bad outcome can occur if either CPU's load is
> + * reordered before that CPU's store, so both CPUs much
s/much/must/ ?
> + * execute full barriers to prevent this from happening.
> + *
> + * Thus, switch_mm needs a full barrier between the
> + * store to mm_cpumask and any operation that could load
> + * from next->pgd. This barrier synchronizes with
> + * remote TLB flushers. Fortunately, load_cr3 is
> + * serializing and thus acts as a full barrier.
> + *
> + */
> load_cr3(next->pgd);
> +
> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>
> /* Stop flush ipis for the previous mm */
> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> * schedule, protecting us from simultaneous changes.
> */
> cpumask_set_cpu(cpu, mm_cpumask(next));
> +
> /*
> * We were in lazy tlb mode and leave_mm disabled
> * tlb flush IPI delivery. We must reload CR3
> * to make sure to use no freed page tables.
> + *
> + * As above, this is a barrier that forces
> + * TLB repopulation to be ordered after the
> + * store to mm_cpumask.
I'm somewhat confused by this comment: cpumask_set_cpu() is a LOCK BTS,
which is already fully ordered.
> */
> load_cr3(next->pgd);
> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 8ddb5d0..8f4cc3d 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> if (!current->mm) {
> leave_mm(smp_processor_id());
> +
> + /* Synchronize with switch_mm. */
> + smp_mb();
> +
> goto out;
> }
> + } else {
> leave_mm(smp_processor_id());
> +
> + /* Synchronize with switch_mm. */
> + smp_mb();
> + }
> }
The alternative is making leave_mm() unconditionally imply a full
barrier. I haven't looked at its other call sites, though.
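For reference, the ordering argument in the quoted comment has the classic
store-buffering shape: each CPU stores to one location and then loads the
other. The toy model below is only a sketch, not kernel code. It enumerates
every interleaving of the four operations and reports whether both loads can
miss the other CPU's store (no IPI sent, and the PTE load did not see the
update), which is my reading of the outcome the barriers are meant to rule
out. The names race_possible(), both_loads_can_miss() and enum op are
hypothetical, and "reordered" simply means a CPU's load is modelled as
running before its own store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: two shared locations, pte and mask, both start at 0. */
enum op { STORE_PTE, LOAD_MASK, STORE_MASK, LOAD_PTE };

/*
 * a[] is the flusher's program (CPU 0), b[] is the switch_mm() side (CPU 1).
 * Returns true if some interleaving lets both loads read the old value 0.
 */
static bool both_loads_can_miss(const enum op a[2], const enum op b[2])
{
    /* All orders of four steps in which each CPU runs exactly two of them. */
    static const char *const schedules[] = {
        "0011", "0101", "0110", "1001", "1010", "1100"
    };

    for (size_t s = 0; s < sizeof(schedules) / sizeof(schedules[0]); s++) {
        int pte = 0, mask = 0;
        int saw_mask = -1, saw_pte = -1;
        int ia = 0, ib = 0;

        for (int step = 0; step < 4; step++) {
            enum op o = (schedules[s][step] == '0') ? a[ia++] : b[ib++];

            switch (o) {
            case STORE_PTE:  pte = 1;         break;
            case STORE_MASK: mask = 1;        break;
            case LOAD_MASK:  saw_mask = mask; break;
            case LOAD_PTE:   saw_pte = pte;   break;
            }
        }
        assert(ia == 2 && ib == 2);

        if (saw_mask == 0 && saw_pte == 0)
            return true;
    }
    return false;
}

/*
 * cpuN_reordered == true models "this CPU's load is reordered before its
 * store", i.e. the full barrier between the two is missing on that CPU.
 */
bool race_possible(bool cpu0_reordered, bool cpu1_reordered)
{
    enum op cpu0[2] = { STORE_PTE, LOAD_MASK };
    enum op cpu1[2] = { STORE_MASK, LOAD_PTE };

    if (cpu0_reordered) {
        cpu0[0] = LOAD_MASK;
        cpu0[1] = STORE_PTE;
    }
    if (cpu1_reordered) {
        cpu1[0] = LOAD_PTE;
        cpu1[1] = STORE_MASK;
    }
    return both_loads_can_miss(cpu0, cpu1);
}
```

In this model race_possible(false, false) is false, while reordering on
either side alone makes it true, which matches the claim that both CPUs need
a full barrier. It says nothing about which instructions actually provide
that barrier on real hardware.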