linux-kernel - [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier

Message-ID: <20160825170133.0a783ae8@riellap.home.surriel.com>
Date:   Thu, 25 Aug 2016 17:01:33 -0400
From:   Rik van Riel <riel@...hat.com>
To:     "H. Peter Anvin" <hpa@...or.com>
Cc:     serebrin@...gle.com, mingo@...nel.org, peterz@...radead.org,
        linux-kernel@...r.kernel.org, luto@...nel.org, bp@...e.de,
        mgorman@...e.de, tglx@...utronix.de
Subject: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier

On Thu, 25 Aug 2016 12:42:15 -0700
"H. Peter Anvin" <hpa@...or.com> wrote:

> Why grabbing a lock instead of cmpxchg?

... and some more cleanups later, this might actually be
good to merge, assuming it works for Benjamin :)

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was reload %cr3 at the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a
power-saving point of view, and waking a CPU from a deeper C state
can make TLB flushes take a little longer than desired.

In virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes the TLB to be flushed (by reloading
%cr3) at the next context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
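
To make the three cases above concrete, here is a minimal userspace
sketch of the remote-side decision. It is not the kernel code: it uses
C11 atomics in place of the kernel's cmpxchg(), and the name
model_can_skip_flush() and the plain atomic_int standing in for
cpu_tlbstate.state are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Same numeric values as the TLBSTATE_* defines in this patch. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2, TLBSTATE_FLUSH = 3 };

/*
 * Hypothetical userspace model of the remote-side decision.
 * `state` stands in for the target CPU's cpu_tlbstate.state.
 * Returns true when the flush IPI to that CPU can be skipped.
 */
static bool model_can_skip_flush(atomic_int *state)
{
	int expected = TLBSTATE_LAZY;

	/*
	 * Attempt LAZY -> FLUSH. On failure, `expected` is overwritten
	 * with the value actually found, which plays the role of the
	 * value returned by the kernel's cmpxchg().
	 */
	if (atomic_compare_exchange_strong(state, &expected, TLBSTATE_FLUSH))
		return true;	/* was lazy: flush deferred to context switch */

	assert(expected == TLBSTATE_OK || expected == TLBSTATE_FLUSH);

	/*
	 * FLUSH: a flush is already pending, nothing to do.
	 * OK: the CPU is actively using the mm, so the IPI is needed.
	 */
	return expected == TLBSTATE_FLUSH;
}
```

In this model a CPU found in TLBSTATE_OK keeps its state and reports
"send the IPI", a CPU found in TLBSTATE_LAZY is moved to TLBSTATE_FLUSH,
and a CPU already in TLBSTATE_FLUSH is left alone.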

This patch is totally untested, because I am at a conference right
now, and Benjamin has the test case :)

Benjamin, does this help your issue?

Signed-off-by: Rik van Riel <riel@...hat.com>
Reported-by: Benjamin Serebrin <serebrin@...gle.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 47 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..5ae8e4b174f8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..4352db65a129 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	}
 #ifdef CONFIG_SMP
 	  else {
+		int oldstate = this_cpu_read(cpu_tlbstate.state);
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
 
-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+		if (oldstate == TLBSTATE_FLUSH ||
+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 			/*
 			 * On established mms, the mm_cpumask is only changed
 			 * from irq context, from ptep_clear_flush() while in
@@ -242,11 +244,42 @@ static void flush_tlb_func(void *info)
 
 }
 
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu) {
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	/* A task on the CPU is actively using the mm. Flush the TLB. */
+	if (*tlbstate == TLBSTATE_OK)
+		return false;
+
+	/* The TLB will be flushed on the next context switch. */
+	if (*tlbstate == TLBSTATE_FLUSH)
+		return true;
+
+	/*
+	 * The CPU is in TLBSTATE_LAZY, which could context switch back
+	 * to TLBSTATE_OK, re-using the TLB state without a TLB flush.
+	 * In that case, a TLB flush IPI needs to be sent.
+	 *
+	 * Otherwise, the TLB state is now TLBSTATE_FLUSH, and the
+	 * TLB flush IPI can be skipped.
+	 */
+	old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+
+	return old != TLBSTATE_OK;
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 				 struct mm_struct *mm, unsigned long start,
 				 unsigned long end)
 {
 	struct flush_tlb_info info;
+	unsigned int cpu;
 
 	if (end == 0)
 		end = start + PAGE_SIZE;
@@ -262,8 +295,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 				(end - start) >> PAGE_SHIFT);
 
 	if (is_uv_system()) {
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
 		if (cpumask)
@@ -271,6 +302,16 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 								&info, 1);
 		return;
 	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch.
+	 */
+	for_each_cpu(cpu, cpumask)
+		if (lazy_tlb_can_skip_flush(cpu))
+			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
+
 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
 }
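
For readers who want to see how the sender side and the context switch
side of the patch above fit together, the following is a rough
userspace model, not kernel code. Everything prefixed with model_ or
MODEL_ is hypothetical: a 32-bit bitmask stands in for struct cpumask,
an array of atomic_int stands in for the per-CPU cpu_tlbstate.state,
and C11 atomics replace the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numeric values as the TLBSTATE_* defines in this patch. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2, TLBSTATE_FLUSH = 3 };

#define MODEL_NR_CPUS 8	/* hypothetical CPU count, for the model only */

/* Stand-in for the per-CPU cpu_tlbstate.state variables. */
static atomic_int model_state[MODEL_NR_CPUS];

/*
 * Sender side: given the set of CPUs that should flush, return the
 * subset that still needs an IPI. Lazy CPUs are moved to
 * TLBSTATE_FLUSH and dropped from the result.
 */
static uint32_t model_filter_ipi_mask(uint32_t mask)
{
	uint32_t out = mask;	/* work on a copy of the caller's mask */

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		int expected = TLBSTATE_LAZY;

		if (!(mask & (1u << cpu)))
			continue;

		/* LAZY -> FLUSH succeeded, or a flush was already pending. */
		if (atomic_compare_exchange_strong(&model_state[cpu],
						   &expected, TLBSTATE_FLUSH) ||
		    expected == TLBSTATE_FLUSH)
			out &= ~(1u << cpu);
	}
	return out;
}

/*
 * Context switch side: the CPU goes back to TLBSTATE_OK and reports
 * whether it has to flush its TLB, either because a deferred flush
 * was pending or because it had left the mm's cpumask.
 */
static bool model_switch_in(int cpu, bool cpu_in_mm_cpumask)
{
	int old;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	/* Read the old state and set OK in one atomic step. */
	old = atomic_exchange(&model_state[cpu], TLBSTATE_OK);

	return old == TLBSTATE_FLUSH || !cpu_in_mm_cpumask;
}
```

Two choices in this sketch differ from the patch and are assumptions of
the model: the filter returns a new mask instead of clearing bits in the
caller's const mask, and the context switch side uses a single atomic
exchange rather than a separate read and write, so the model cannot lose
a concurrent LAZY to FLUSH update between the two steps.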