Date:	Fri, 13 Aug 2010 08:47:55 +0800
From:	Shaohua Li <shaohua.li@...el.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	lkml <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
	Andi Kleen <andi@...stfloor.org>,
	"hpa@...or.com" <hpa@...or.com>
Subject: Re: [patch]x86: avoid unnecessary tlb flush

On Fri, 2010-08-06 at 13:19 +0800, Andrew Morton wrote:
> On Fri, 06 Aug 2010 11:28:28 +0800 Shaohua Li <shaohua.li@...el.com> wrote:
> 
> > On x86, the access and dirty bits are set automatically by the CPU when it
> > accesses memory. By the time we reach the code path containing the
> > flush_tlb_nonprotect_page() below, we have already set the dirty bit in the
> > pte, so no TLB flush is needed. This may leave TLB entries on some CPUs
> > without the dirty bit set, but that doesn't matter: when those CPUs write to
> > the page, they check the bit automatically, with no software involvement.
> > 
> > On the other hand, flushing the TLB at this point is harmful. A test creates
> > one thread per CPU; each thread writes to the same, randomly chosen address
> > in the same vma range, and we measure the total time. On a 4-socket system
> > the original time is 1.96s, while with the patch it is 0.8s. On a 2-socket
> > system there is a 20% time reduction as well. perf shows much of the time is
> > spent sending and handling IPIs for the TLB flushes.
> > 
> > Signed-off-by: Shaohua Li <shaohua.li@...el.com>
> > 
> > ---
> >  arch/x86/include/asm/pgtable.h |    3 +++
> >  include/asm-generic/pgtable.h  |    4 ++++
> >  mm/memory.c                    |    2 +-
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> > 
> > Index: linux/arch/x86/include/asm/pgtable.h
> > ===================================================================
> > --- linux.orig/arch/x86/include/asm/pgtable.h	2010-07-29 13:25:12.000000000 +0800
> > +++ linux/arch/x86/include/asm/pgtable.h	2010-08-03 09:02:07.000000000 +0800
> > @@ -603,6 +603,9 @@ static inline void ptep_set_wrprotect(st
> >  	pte_update(mm, addr, ptep);
> >  }
> >  
> > +#define __HAVE_ARCH_FLUSH_TLB_NONPROTECT_PAGE
> > +#define flush_tlb_nonprotect_page(vma, address)
> > +
> >  /*
> >   * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
> >   *
> > Index: linux/include/asm-generic/pgtable.h
> > ===================================================================
> > --- linux.orig/include/asm-generic/pgtable.h	2010-07-29 13:25:12.000000000 +0800
> > +++ linux/include/asm-generic/pgtable.h	2010-08-03 09:02:07.000000000 +0800
> > @@ -129,6 +129,10 @@ static inline void ptep_set_wrprotect(st
> >  #define move_pte(pte, prot, old_addr, new_addr)	(pte)
> >  #endif
> >  
> > +#ifndef __HAVE_ARCH_FLUSH_TLB_NONPROTECT_PAGE
> > +#define flush_tlb_nonprotect_page(vma, address) flush_tlb_page(vma, address)
> > +#endif
> 
> The preferred technique here is
> 
> #ifndef flush_tlb_nonprotect_page
> #define flush_tlb_nonprotect_page(vma, address) flush_tlb_page(vma, address)
> #endif
> 
> so no need for __HAVE_ARCH_FLUSH_TLB_NONPROTECT_PAGE. 
> include/asm-generic/pgtable.h uses a mix of the two techniques.
OK, I've updated the patch.


On x86, the access and dirty bits are set automatically by the CPU when it
accesses memory. By the time we reach the code path containing the
flush_tlb_nonprotect_page() below, we have already set the dirty bit in the
pte, so no TLB flush is needed. This may leave TLB entries on some CPUs
without the dirty bit set, but that doesn't matter: when those CPUs write to
the page, they check the bit automatically, with no software involvement.

On the other hand, flushing the TLB at this point is harmful. A test creates
one thread per CPU; each thread writes to the same, randomly chosen address
in the same vma range, and we measure the total time. On a 4-socket system
the original time is 1.96s, while with the patch it is 0.8s. On a 2-socket
system there is a 20% time reduction as well. perf shows much of the time is
spent sending and handling IPIs for the TLB flushes.

Signed-off-by: Shaohua Li <shaohua.li@...el.com>

---
 arch/x86/include/asm/pgtable.h |    2 ++
 include/asm-generic/pgtable.h  |    4 ++++
 mm/memory.c                    |    2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

Index: linux/arch/x86/include/asm/pgtable.h
===================================================================
--- linux.orig/arch/x86/include/asm/pgtable.h	2010-08-13 08:23:13.000000000 +0800
+++ linux/arch/x86/include/asm/pgtable.h	2010-08-13 08:24:53.000000000 +0800
@@ -603,6 +603,8 @@ static inline void ptep_set_wrprotect(st
 	pte_update(mm, addr, ptep);
 }
 
+#define flush_tlb_nonprotect_page(vma, address)
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
Index: linux/include/asm-generic/pgtable.h
===================================================================
--- linux.orig/include/asm-generic/pgtable.h	2010-08-13 08:23:13.000000000 +0800
+++ linux/include/asm-generic/pgtable.h	2010-08-13 08:24:53.000000000 +0800
@@ -129,6 +129,10 @@ static inline void ptep_set_wrprotect(st
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef flush_tlb_nonprotect_page
+#define flush_tlb_nonprotect_page(vma, address) flush_tlb_page(vma, address)
+#endif
+
 #ifndef pgprot_noncached
 #define pgprot_noncached(prot)	(prot)
 #endif
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2010-08-13 08:23:13.000000000 +0800
+++ linux/mm/memory.c	2010-08-13 08:24:53.000000000 +0800
@@ -3116,7 +3116,7 @@ static inline int handle_pte_fault(struc
 		 * with threads.
 		 */
 		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_page(vma, address);
+			flush_tlb_nonprotect_page(vma, address);
 	}
 unlock:
 	pte_unmap_unlock(pte, ptl);


