linux-kernel - Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier

Message-ID: <7E7CF02F-F0B1-493A-98B3-B078174811DA@zytor.com>
Date:   Thu, 25 Aug 2016 12:42:15 -0700
From:   "H. Peter Anvin" <hpa@...or.com>
To:     Rik van Riel <riel@...hat.com>, serebrin@...gle.com
CC:     mingo@...nel.org, peterz@...radead.org,
        linux-kernel@...r.kernel.org, luto@...nel.org, bp@...e.de,
        mgorman@...e.de, tglx@...utronix.de
Subject: Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier

On August 25, 2016 12:04:59 PM PDT, Rik van Riel <riel@...hat.com> wrote:
>Subject: x86,mm,sched: make lazy TLB mode even lazier
>
>Lazy TLB mode can result in an idle CPU being woken up for a TLB
>flush, when all it really needed to do was flush %cr3 before the
>next context switch.
>
>This is mostly fine on bare metal, though sub-optimal from a power
>saving point of view, and deeper C states could make TLB flushes
>take a little longer than desired.
>
>On virtual machines, the pain can be much worse, especially if a
>currently non-running VCPU is woken up for a TLB invalidation
>IPI, on a CPU that is busy running another task. It could take
>a while before that IPI is handled, leading to performance issues.
>
>This patch is still ugly, and the sched.h include needs to be cleaned
>up a lot (how would the scheduler people like to see the context switch
>blocking abstracted?)
>
>This patch deals with the issue by introducing a third tlb state,
>TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
>context switch. A CPU is transitioned from TLBSTATE_LAZY to
>TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
>
>Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
>
>This patch is totally untested, because I am at a conference right
>now, and Benjamin has the test case :)
>
>Signed-off-by: Rik van Riel <riel@...hat.com>
>Reported-by: Benjamin Serebrin <serebrin@...gle.com>
>---
> arch/x86/include/asm/tlbflush.h |  1 +
> arch/x86/mm/tlb.c               | 38 +++++++++++++++++++++++++++++++++++---
> 2 files changed, 36 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>index 4e5be94e079a..5ae8e4b174f8 100644
>--- a/arch/x86/include/asm/tlbflush.h
>+++ b/arch/x86/include/asm/tlbflush.h
>@@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> 
> #define TLBSTATE_OK	1
> #define TLBSTATE_LAZY	2
>+#define TLBSTATE_FLUSH	3
> 
> static inline void reset_lazy_tlbstate(void)
> {
>diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>index 5643fd0b1a7d..5b4cda49ac0c 100644
>--- a/arch/x86/mm/tlb.c
>+++ b/arch/x86/mm/tlb.c
>@@ -6,6 +6,7 @@
> #include <linux/interrupt.h>
> #include <linux/module.h>
> #include <linux/cpu.h>
>+#include "../../../kernel/sched/sched.h"
> 
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
>@@ -140,10 +141,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> 	}
> #ifdef CONFIG_SMP
> 	  else {
>+		int oldstate = this_cpu_read(cpu_tlbstate.state);
> 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> 		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> 
>-		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
>+		if (oldstate == TLBSTATE_FLUSH ||
>+				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> 			/*
> 			 * On established mms, the mm_cpumask is only changed
> 			 * from irq context, from ptep_clear_flush() while in
>@@ -242,11 +245,29 @@ static void flush_tlb_func(void *info)
> 
> }
> 
>+/*
>+ * This function moves a CPU from TLBSTATE_LAZY to TLBSTATE_FLUSH, which
>+ * will force it to flush %cr3 at the next context switch, effectively
>+ * doing a delayed TLB flush for a CPU in lazy TLB mode.
>+ * This takes the runqueue lock to protect against the race condition
>+ * of the target CPU rescheduling while we change its TLB state.
>+ * Do nothing if the TLB state is already set to TLBSTATE_FLUSH.
>+ */
>+static void set_lazy_tlbstate_flush(int cpu) {
>+	if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY) {
>+		raw_spin_lock(&cpu_rq(cpu)->lock);
>+		if (per_cpu(cpu_tlbstate.state, cpu) == TLBSTATE_LAZY)
>+			per_cpu(cpu_tlbstate.state, cpu) = TLBSTATE_FLUSH;
>+		raw_spin_unlock(&cpu_rq(cpu)->lock);
>+	}
>+}
>+
> void native_flush_tlb_others(const struct cpumask *cpumask,
> 				 struct mm_struct *mm, unsigned long start,
> 				 unsigned long end)
> {
> 	struct flush_tlb_info info;
>+	unsigned int cpu;
> 
> 	if (end == 0)
> 		end = start + PAGE_SIZE;
>@@ -262,8 +283,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> 				(end - start) >> PAGE_SHIFT);
> 
> 	if (is_uv_system()) {
>-		unsigned int cpu;
>-
> 		cpu = smp_processor_id();
> 		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
> 		if (cpumask)
>@@ -271,6 +290,19 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> 								&info, 1);
> 		return;
> 	}
>+
>+	/*
>+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
>+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
>+	 * at the next context switch.
>+	 */
>+	for_each_cpu(cpu, cpumask) {
>+		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK) {
>+			set_lazy_tlbstate_flush(cpu);
>+			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);
>+		}
>+	}
>+
> 	smp_call_function_many(cpumask, flush_tlb_func, &info, 1);
> }
> 

Why grab the runqueue lock here instead of using cmpxchg?
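
To make the question concrete, here is a rough user-space sketch of what a lock-free LAZY -> FLUSH transition might look like. It is not kernel code: it uses C11 atomics in place of the kernel's cmpxchg() and per-cpu accessors, and the names try_defer_flush() and switch_in_needs_flush() are hypothetical. It also assumes the context-switch side would claim the state with an atomic exchange (the patch as posted uses a separate read and write there), and it does not model memory ordering against page table or mm_cpumask updates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* State values mirror the TLBSTATE_* defines in the quoted patch. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2, TLBSTATE_FLUSH = 3 };

/*
 * Remote side (hypothetical): try to defer the flush instead of sending
 * an IPI. Returns true if the target will flush at its next context
 * switch, false if it is in TLBSTATE_OK and still needs an IPI.
 */
static bool try_defer_flush(_Atomic int *state)
{
	int expected = TLBSTATE_LAZY;

	assert(state != NULL);

	/* LAZY -> FLUSH in one atomic step; fails if the state changed. */
	if (atomic_compare_exchange_strong(state, &expected, TLBSTATE_FLUSH))
		return true;

	/* On failure, 'expected' holds the value that was actually seen. */
	return expected == TLBSTATE_FLUSH;
}

/*
 * Context-switch side (hypothetical): atomically set the state to OK
 * and report whether a deferred flush was pending.
 */
static bool switch_in_needs_flush(_Atomic int *state)
{
	assert(state != NULL);

	return atomic_exchange(state, TLBSTATE_OK) == TLBSTATE_FLUSH;
}
```

With this shape, a CPU that leaves lazy mode concurrently either has its exchange observe TLBSTATE_FLUSH and flushes, or it wins the race, the remote cmpxchg fails with TLBSTATE_OK, and the caller falls back to the IPI.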