Date:   Tue, 30 Aug 2016 17:09:43 -0400
From:   Rik van Riel <riel@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Ingo Molnar <mingo@...nel.org>, "H. Peter Anvin" <hpa@...or.com>,
        serebrin@...gle.com, Peter Zijlstra <peterz@...radead.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Lutomirski <luto@...nel.org>,
        Borislav Petkov <bp@...e.de>, Mel Gorman <mgorman@...e.de>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: [PATCH RFC v5] x86,mm,sched: make lazy TLB mode even lazier

On Tue, 30 Aug 2016 15:53:32 -0400
Rik van Riel <riel@...hat.com> wrote:

> On Sat, 27 Aug 2016 16:02:25 -0700
> Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> 
> > The only remaining comment is that I'd make that
> > lazy_tlb_can_skip_flush() function just use a switch table for the
> > tlbstate comparisons rather than the repeated conditionals.  
> 
> After staring at the code for an hour or so yesterday, I found a race
> condition. It took me a few minutes to realize we can fix it with a
> cmpxchg at context switch time, and then most of a day to realize that
> we only need that cmpxchg in the context switch code if the old tlb
> state is TLBSTATE_LAZY.
> 
> Context switch times when the tlb state is something else should be
> unaffected.
> 
> The 4th version of the patch (below) closes that race condition, and
> includes the improvements suggested by Ingo and you.
>  
> > I'd love to see the results from Benjamin - maybe it helps a lot, and
> > maybe it doesn't. But regardless, the patch makes sense to me.  
> 
> I would love to see test results from Ben, as well.  Ben? :)

The kbuild test robot helpfully reminded me that I forgot to enable
CONFIG_PARAVIRT in my build. v5 adds a one-line change in paravirt_types.h,
to match the changed prototype for flush_tlb_others.

Sorry about the noise.

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush its TLB (by
reloading %CR3) at the next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C-states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes the TLB to be flushed (by reloading
%CR3) at the next context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

Signed-off-by: Rik van Riel <riel@...hat.com>
Reported-by: Benjamin Serebrin <serebrin@...gle.com>
---
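
Not part of the patch: a minimal user-space sketch of the state
transitions described in the changelog, assuming C11 atomics as a
stand-in for the kernel's cmpxchg(). The model_* names are hypothetical
and exist only for illustration; the real code operates on the per-CPU
cpu_tlbstate.state.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2, TLBSTATE_FLUSH = 3 };

/*
 * Flusher side (hypothetical model of lazy_tlb_can_skip_flush).
 * Returns true if the TLB flush IPI to this CPU can be skipped.
 */
static bool model_can_skip_flush(_Atomic int *state)
{
	int expected;

	switch (atomic_load(state)) {
	case TLBSTATE_FLUSH:
		/* A flush is already pending for the next context switch. */
		return true;
	case TLBSTATE_LAZY:
		expected = TLBSTATE_LAZY;
		if (atomic_compare_exchange_strong(state, &expected,
						   TLBSTATE_FLUSH))
			return true;
		/*
		 * The exchange failed; 'expected' now holds the value that
		 * was actually seen. Only a race to TLBSTATE_OK needs an IPI.
		 */
		return expected != TLBSTATE_OK;
	default:
		/* The CPU is actively using the mm. Send the IPI. */
		return false;
	}
}

/*
 * Context switch side (hypothetical model of the switch_mm_irqs_off hunk).
 * Returns true if a deferred TLB flush must be performed now.
 */
static bool model_switch_needs_flush(_Atomic int *state)
{
	int old = atomic_load(state);

	if (old == TLBSTATE_LAZY) {
		int expected = TLBSTATE_LAZY;

		/* Catch a concurrent LAZY -> FLUSH transition by a flusher. */
		if (!atomic_compare_exchange_strong(state, &expected,
						    TLBSTATE_OK))
			old = expected;
	}
	atomic_store(state, TLBSTATE_OK);

	return old == TLBSTATE_FLUSH;
}
```

Starting from TLBSTATE_LAZY, a call to model_can_skip_flush() moves the
state to TLBSTATE_FLUSH and reports that the IPI can be skipped. The
next model_switch_needs_flush() then reports that a flush is due and
leaves the state at TLBSTATE_OK.
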
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/tlbflush.h       |  3 +-
 arch/x86/include/asm/uv/uv.h          |  6 ++--
 arch/x86/mm/tlb.c                     | 64 ++++++++++++++++++++++++++++++++---
 arch/x86/platform/uv/tlb_uv.c         |  2 +-
 5 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e7740ba3..b7e695c90c43 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -225,7 +225,7 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_single)(unsigned long addr);
-	void (*flush_tlb_others)(const struct cpumask *cpus,
+	void (*flush_tlb_others)(struct cpumask *cpus,
 				 struct mm_struct *mm,
 				 unsigned long start,
 				 unsigned long end);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..c3dbacbc49be 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -304,12 +304,13 @@ extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 #define flush_tlb()	flush_tlb_current_task()
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_others(struct cpumask *cpumask,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 062921ef34e9..7e83cc633ba1 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -13,7 +13,7 @@ extern int is_uv_system(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						 struct mm_struct *mm,
 						 unsigned long start,
 						 unsigned long end,
@@ -25,8 +25,8 @@ static inline enum uv_system_type get_uv_system_type(void) { return UV_NONE; }
 static inline int is_uv_system(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask, struct mm_struct *mm,
 		    unsigned long start, unsigned long end, unsigned int cpu)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..634248b38db9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
+		int oldstate = *tlbstate;
+
+		if (unlikely(oldstate == TLBSTATE_LAZY)) {
+			/*
+			 * The TLB flush code (lazy_tlb_can_skip_flush) can
+			 * move the TLB state to TLBSTATE_FLUSH concurrently
+			 * with a context switch. Using cmpxchg here will catch
+			 * that transition, causing a TLB flush below.
+			 */
+			oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK);
+		}
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +256,44 @@ static void flush_tlb_func(void *info)
 
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu)
+{
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	switch (*tlbstate) {
+	case TLBSTATE_FLUSH:
+		/* The TLB will be flushed on the next context switch. */
+		return true;
+	case TLBSTATE_LAZY:
+		/*
+		 * The CPU is in TLBSTATE_LAZY, which could context switch back
+		 * to TLBSTATE_OK, re-using the old TLB state without a flush.
+		 * If that happened, send a TLB flush IPI.
+		 *
+		 * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
+		 * be flushed at the next context switch. Skip the IPI.
+		 */
+		old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+		return old != TLBSTATE_OK;
+	case TLBSTATE_OK:
+	default:
+		/* A task on the CPU is actively using the mm. Flush the TLB. */
+		return false;
+	}
+}
+
+void native_flush_tlb_others(struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +309,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +316,17 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask) {
+		if (lazy_tlb_can_skip_flush(cpu))
+			cpumask_clear_cpu(cpu, cpumask);
+	}
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
 
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index fdb4d42b4ce5..7a2221a81e77 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1090,7 +1090,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * Returns pointer to cpumask if some remote flushing remains to be
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						struct mm_struct *mm,
 						unsigned long start,
 						unsigned long end,
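
Not part of the patch: a rough user-space sketch of why the cpumask
argument loses its const qualifier. native_flush_tlb_others() now clears
bits in the mask it was handed before sending IPIs, so the mask must be
writable. The sketch assumes a 64-bit integer in place of struct cpumask
and a plain array in place of the per-CPU state; it is single-threaded
and leaves out the cmpxchg. All model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2, TLBSTATE_FLUSH = 3 };

/* Hypothetical stand-in for the per-CPU cpu_tlbstate.state values. */
static int model_state[MODEL_NR_CPUS];

/* Single-threaded simplification of lazy_tlb_can_skip_flush(). */
static bool model_can_skip(unsigned int cpu)
{
	if (model_state[cpu] == TLBSTATE_LAZY)
		model_state[cpu] = TLBSTATE_FLUSH;

	return model_state[cpu] == TLBSTATE_FLUSH;
}

/*
 * Drop every CPU whose flush can be deferred from the IPI target mask.
 * The mask is modified in place, which is why it cannot be const.
 */
static void model_filter_ipi_targets(uint64_t *mask)
{
	unsigned int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(*mask & (UINT64_C(1) << cpu)))
			continue;
		if (model_can_skip(cpu))
			*mask &= ~(UINT64_C(1) << cpu);
	}
}
```

With CPUs 0 and 3 in TLBSTATE_OK, CPU 1 in TLBSTATE_LAZY and CPU 2 in
TLBSTATE_FLUSH, filtering a mask covering CPUs 0-3 leaves only CPUs 0
and 3 as IPI targets, and CPU 1 ends up in TLBSTATE_FLUSH.
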
