Date:   Mon,  6 Jun 2022 20:37:13 +0000
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc:     torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
        mike.kravetz@...cle.com, mingo@...nel.org, luto@...nel.org,
        tglx@...utronix.de, bp@...en8.de, peterz@...radead.org,
        ak@...ux.intel.com, arnd@...db.de, jgg@...dia.com,
        jon.grimm@....com, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com, joao.m.martins@...cle.com,
        ankur.a.arora@...cle.com
Subject: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()

Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
With this, page clearing can bypass the cache hierarchy, providing a
non-cache-polluting implementation of clear_pages().

MOVNTI, from the Intel SDM, Volume 2B, 4-101:
 "The non-temporal hint is implemented by using a write combining (WC)
  memory type protocol when writing the data to memory. Using this
  protocol, the processor does not write the data into the cache
  hierarchy, nor does it fetch the corresponding cache line from memory
  into the cache hierarchy."

The AMD64 Architecture Programmer's Manual describes MOVNTI similarly.

One use case is zeroing large extents without bringing
never-to-be-accessed cachelines into the cache. In addition,
clear_pages_movnt() based clearing is often faster once extent sizes
are O(LLC-size).

Because it uses the WC protocol, MOVNTI is weakly ordered with respect
to other instructions operating on the memory hierarchy. The caller
needs to handle this by executing an SFENCE once clearing is done.
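
To illustrate the caller-side contract, here is a minimal user-space
sketch in C. It is not kernel code: nt_zero_extent() and nt_zero_done()
are hypothetical names, and the compiler intrinsics _mm_stream_si64()
and _mm_sfence() stand in for MOVNTI and SFENCE. On non-x86-64 builds it
falls back to memset() so that the sketch still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

/*
 * Hypothetical user-space analogue of the clearing loop; not kernel code.
 * Zeroes len bytes at dst with non-temporal stores. Like the patch, it
 * does NOT fence: the caller must call nt_zero_done() afterwards.
 */
static void nt_zero_extent(void *dst, size_t len)
{
	/* One cache line (8 quadwords) per outer iteration. */
	assert(((uintptr_t)dst % 8) == 0 && (len % 64) == 0);

#if defined(__x86_64__)
	long long *p = dst;

	for (size_t i = 0; i < len / 8; i += 8)
		for (size_t j = 0; j < 8; j++)
			_mm_stream_si64(&p[i + j], 0);	/* MOVNTI intrinsic */
#else
	memset(dst, 0, len);	/* fallback so the sketch builds elsewhere */
#endif
}

/* Orders the weakly ordered stores above before any later stores. */
static void nt_zero_done(void)
{
#if defined(__x86_64__)
	_mm_sfence();
#endif
}
```

A caller would clear one or more extents with nt_zero_extent() and then
call nt_zero_done() once, before publishing the memory to other CPUs.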

The implementation is straightforward: the inner loop is unrolled to
keep the code similar to memset_movnti(), so that we can gauge
clear_pages_movnt() performance via 'perf bench mem memset'.
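
For readers less familiar with the assembly, the loop structure can be
modelled roughly in C as below. This is only a sketch: it uses ordinary
stores rather than MOVNTI, model_clear_pages() and MODEL_PAGE_SHIFT are
hypothetical names, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Hypothetical C model of the assembly loop, using plain stores.
 * Returns the number of loop iterations executed.
 */
static uint64_t model_clear_pages(uint64_t *page, unsigned long npages)
{
	/* Byte count: npages << PAGE_SHIFT. */
	uint64_t remaining = (uint64_t)npages << MODEL_PAGE_SHIFT;
	uint64_t iterations = 0;
	int borrow;

	assert(page != NULL && npages > 0);

	do {
		for (int i = 0; i < 8; i++)	/* eight unrolled 8-byte stores */
			page[i] = 0;
		page += 8;			/* advance by 0x40 bytes */
		borrow = remaining < 0x40;
		remaining -= 0x40;		/* one cache line done */
		iterations++;
	} while (!borrow && remaining != 0);	/* ja: no borrow, non-zero */

	return iterations;
}
```

Each iteration clears one 64-byte cache line, so a 4 KiB page takes 64
iterations.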

 # Intel Icelakex
 # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
 # (X86_FEATURE_ERMS) and x86-64-movnt:

 System:      Oracle X9-2 (2 nodes * 32 cores * 2 threads)
 Processor:   Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
 Memory:      512 GB evenly split between nodes
 LLC-size:    48MB for each node (32-cores * 2-threads)
 no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance

              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
              ----------------------    ---------------------    --------
     size            BW   (   stdev)          BW    (   stdev)

      2MB      14.37 GB/s ( +- 1.55)     12.59 GB/s ( +- 1.20)   -12.38%
     16MB      16.93 GB/s ( +- 2.61)     15.91 GB/s ( +- 2.74)    -6.02%
    128MB      12.12 GB/s ( +- 1.06)     22.33 GB/s ( +- 1.84)   +84.24%
   1024MB      12.12 GB/s ( +- 0.02)     23.92 GB/s ( +- 0.14)   +97.35%
   4096MB      12.08 GB/s ( +- 0.02)     23.98 GB/s ( +- 0.18)   +98.50%
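
The Delta(%) column is the relative change of the movnt bandwidth over
the stosb bandwidth. A small C sketch of that arithmetic follows;
delta_pct() is a hypothetical helper, and recomputing from the rounded
GB/s figures above can differ from the listed deltas by about 0.01
percentage points.

```c
#include <assert.h>

/* Hypothetical helper: percentage change of movnt bandwidth over stosb. */
static double delta_pct(double stosb_gbps, double movnt_gbps)
{
	assert(stosb_gbps > 0.0);
	return (movnt_gbps - stosb_gbps) / stosb_gbps * 100.0;
}
```

For the 128MB row, delta_pct(12.12, 22.33) is about +84.24; for the
16MB row, delta_pct(16.93, 15.91) is about -6.02. The sign flips
between 16MB and 128MB, that is, once the extent exceeds the 48MB LLC.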

Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
---
 arch/x86/include/asm/page_64.h |  1 +
 arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index a88a3508888a..3affc4ecb8da 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 void clear_pages_orig(void *page, unsigned long npages);
 void clear_pages_rep(void *page, unsigned long npages);
 void clear_pages_erms(void *page, unsigned long npages);
+void clear_pages_movnt(void *page, unsigned long npages);
 
 #define __HAVE_ARCH_CLEAR_USER_PAGES
 static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 2cc3b681734a..83d14f1c9f57 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
 	RET
 SYM_FUNC_END(clear_pages_erms)
 EXPORT_SYMBOL_GPL(clear_pages_erms)
+
+SYM_FUNC_START(clear_pages_movnt)
+	xorl	%eax,%eax
+	movq	%rsi,%rcx
+	shlq    $PAGE_SHIFT, %rcx
+
+	.p2align 4
+.Lstart:
+	movnti  %rax, 0x00(%rdi)
+	movnti  %rax, 0x08(%rdi)
+	movnti  %rax, 0x10(%rdi)
+	movnti  %rax, 0x18(%rdi)
+	movnti  %rax, 0x20(%rdi)
+	movnti  %rax, 0x28(%rdi)
+	movnti  %rax, 0x30(%rdi)
+	movnti  %rax, 0x38(%rdi)
+	addq    $0x40, %rdi
+	subq    $0x40, %rcx
+	ja      .Lstart
+	RET
+SYM_FUNC_END(clear_pages_movnt)
-- 
2.31.1
