linux-kernel - Re: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160827080316.GA11325@gmail.com>
Date:   Sat, 27 Aug 2016 10:03:16 +0200
From:   Ingo Molnar <mingo@...nel.org>
To:     Rik van Riel <riel@...hat.com>
Cc:     "H. Peter Anvin" <hpa@...or.com>, serebrin@...gle.com,
        peterz@...radead.org, linux-kernel@...r.kernel.org,
        luto@...nel.org, bp@...e.de, mgorman@...e.de, tglx@...utronix.de,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH RFC v3] x86,mm,sched: make lazy TLB mode even lazier


* Rik van Riel <riel@...hat.com> wrote:

> On Thu, 25 Aug 2016 12:42:15 -0700
> "H. Peter Anvin" <hpa@...or.com> wrote:
> 
> > Why grabbing a lock instead of cmpxchg?
> 
> ... and some more cleanups later, this might actually be
> good to merge, assuming it works for Benjamin :)
> 
> ---8<---

LGTM in principle (it's a pretty clever trick!), just some minor stylistic nits:

> 
> Subject: x86,mm,sched: make lazy TLB mode even lazier
> 
> Lazy TLB mode can result in an idle CPU being woken up for a TLB
> flush, when all it really needed to do was flush %cr3 before the
> next context switch.

s/%cr3/CR3

> This is mostly fine on bare metal, though sub-optimal from a power
> saving point of view, and deeper C states could make TLB flushes
> take a little longer than desired.

s/C state/C-state/

> On virtual machines, the pain can be much worse, especially if a
> currently non-running VCPU is woken up for a TLB invalidation
> IPI, on a CPU that is busy running another task. It could take
> a while before that IPI is handled, leading to performance issues.

So I was going to suggest:

 s/VCPU/vCPU

But then, out of curiosity, I ran:

   triton:~/tip> for N in $(git log v3.0.. | grep -i vcpu | sed 's/[^a-zA-Z\d\s]/ /g'); do echo $N; done | grep -iw vcpu  | sort | uniq -c | sort -n
      3 Vcpu
    125 vCPU
    527 VCPU
   1611 vcpu

So never mind, I guess we missed that boat by a few years already ;-)

> 
> This patch deals with the issue by introducing a third tlb state,
> TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
> context switch.

s/tlb state/TLB state

> 
> A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
> the attempted transition to TLBSTATE_FLUSH will get a TLB flush
> IPI, just like a CPU that was in TLBSTATE_OK to begin with.
> 
> Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
> 
> This patch is totally untested, because I am at a conference right
> now, and Benjamin has the test case :)
> 
> Benjamin, does this help your issue?
>
> Signed-off-by: Rik van Riel <riel@...hat.com>
> Reported-by: Benjamin Serebrin <serebrin@...gle.com>
> ---
>  arch/x86/include/asm/tlbflush.h |  1 +
>  arch/x86/mm/tlb.c               | 47 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4e5be94e079a..5ae8e4b174f8 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -310,6 +310,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  
>  #define TLBSTATE_OK	1
>  #define TLBSTATE_LAZY	2
> +#define TLBSTATE_FLUSH	3
>  
>  static inline void reset_lazy_tlbstate(void)
>  {
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 5643fd0b1a7d..4352db65a129 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -140,10 +140,12 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	}
>  #ifdef CONFIG_SMP
>  	  else {
> +		int oldstate = this_cpu_read(cpu_tlbstate.state);
>  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);

Please separate local variable definition blocks from statements by an extra 
newline.

>  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>  
> -		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> +		if (oldstate == TLBSTATE_FLUSH ||
> +				!cpumask_test_cpu(cpu, mm_cpumask(next))) {
>  			/*
>  			 * On established mms, the mm_cpumask is only changed
>  			 * from irq context, from ptep_clear_flush() while in
> @@ -242,11 +244,42 @@ static void flush_tlb_func(void *info)
>  
>  }
>  
> +/*
> + * Determine whether a CPU's TLB needs to be flushed now, or whether the
> + * flush can be delayed until the next context switch, by changing the
> + * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
> + */
> +static bool lazy_tlb_can_skip_flush(int cpu) {
> +	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
> +	int old;

Yeah, so this is not how the standard function definition looks like.

> +	/* A task on the CPU is actively using the mm. Flush the TLB. */
> +	if (*tlbstate == TLBSTATE_OK)
> +		return false;
> +
> +	/* The TLB will be flushed on the next context switch. */
> +	if (*tlbstate == TLBSTATE_FLUSH)
> +		return true;
> +
> +	/*
> +	 * The CPU is in TLBSTATE_LAZY, which could context switch back
> +	 * to TLBSTATE_OK, re-using the TLB state without a TLB flush.
> +	 * In that case, a TLB flush IPI needs to be sent.
> +	 *
> +	 * Otherwise, the TLB state is now TLBSTATE_FLUSH, and the
> +	 * TLB flush IPI can be skipped.
> +	 */
> +	old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
> +
> +	return old != TLBSTATE_OK;
> +}
> +
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>  				 struct mm_struct *mm, unsigned long start,
>  				 unsigned long end)
>  {
>  	struct flush_tlb_info info;
> +	unsigned int cpu;
>  
>  	if (end == 0)
>  		end = start + PAGE_SIZE;
> @@ -262,8 +295,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  				(end - start) >> PAGE_SHIFT);
>  
>  	if (is_uv_system()) {
> -		unsigned int cpu;
> -
>  		cpu = smp_processor_id();
>  		cpumask = uv_flush_tlb_others(cpumask, mm, start, end, cpu);
>  		if (cpumask)
> @@ -271,6 +302,16 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>  								&info, 1);
>  		return;
>  	}
> +
> +	/*
> +	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> +	 * CPUs TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
> +	 * at the next context switch.

s/CPUs/CPU's

> +	 */
> +	for_each_cpu(cpu, cpumask)
> +		if (lazy_tlb_can_skip_flush(cpu))
> +			cpumask_clear_cpu(cpu, (struct cpumask *)cpumask);

Please remove the 'const' from the cpumask type definition instead of this ugly 
cast!

Plus this needs one more pair of curly braces and it's perfect! ;-)

I'd also like to wait for the Tested-by from Benjamin as well before we can 
proceeed.

Thanks,

	Ingo