Message-Id: <20250710005926.1159009-15-ankur.a.arora@oracle.com>
Date: Wed,  9 Jul 2025 17:59:26 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, david@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
        acme@...nel.org, namhyung@...nel.org, tglx@...utronix.de,
        willy@...radead.org, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        ankur.a.arora@...cle.com
Subject: [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents

Define ARCH_HAS_CLEAR_PAGES so hugepage zeroing (via folio_zero_user())
can use clear_pages() to clear in page-extents. This allows the
processor -- when using string instructions (REP; STOS) -- to optimize
based on the extent size.

Also define ARCH_CLEAR_PAGE_EXTENT which is used by folio_zero_user() to
decide the maximum extent to be zeroed when running under cooperative
preemption models.
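
As a rough sketch of the consumer side (illustrative only: zero_range()
is a hypothetical stand-in for the folio_zero_user() changes in mm/,
and the exact clear_pages() signature is assumed):

  static void zero_range(void *addr, unsigned long npages)
  {
  	while (npages) {
  		unsigned long n = npages;

  		/*
  		 * A single clear_pages() call has no preemption point
  		 * under preempt=none|voluntary, so cap the extent to
  		 * bound scheduling latency; cond_resched() between
  		 * extents provides the scheduling point.
  		 */
  		if (!preempt_model_preemptible())
  			n = min(n, (unsigned long)ARCH_CLEAR_PAGE_EXTENT);

  		clear_pages(addr, n);

  		addr += n * PAGE_SIZE;
  		npages -= n;
  		cond_resched();
  	}
  }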

The resulting performance depends on the kinds of optimizations
available to the uarch for the extent being cleared. Two classes
of optimization (see the sketch after this list):

  - clearing iteration costs can be amortized over a range larger than
    a single page.
  - cacheline allocation elision (seen only on AMD Zen models).
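
Both classes come from issuing one long REP; STOS over the whole extent
rather than looping per page. A minimal sketch of such a primitive
(illustrative only, not the clear_pages() implementation from earlier
in this series):

  static inline void clear_pages_rep(void *addr, unsigned long npages)
  {
  	unsigned long len = npages * PAGE_SIZE;

  	/* REP STOSB: store AL into RCX bytes starting at RDI. */
  	asm volatile("rep stosb"
  		     : "+D" (addr), "+c" (len)
  		     : "a" (0)
  		     : "memory");
  }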

A demand fault workload shows an improved baseline due to the first
optimization and a larger improvement when the extent is large enough
for the second one.

AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%	preempt=*

   pg-sz=1GB       17.14  +- 1.39%        17.42  +-  0.98% [#]   +  1.6%	preempt=none|voluntary
   pg-sz=1GB       17.51  +- 1.19%        43.23  +-  5.22%       +146.8%	preempt=full|lazy

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than ARCH_CLEAR_PAGE_EXTENT, so
preempt=none|voluntary sees no improvement for pg-sz=1GB.

The improvement due to the hardware eliding cacheline allocation for
pg-sz=1GB can be seen in the reduced L1-dcache-loads:
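
(The counter deltas below come from perf stat; the exact invocation is
not shown in the original, but would be along these lines, with the
event list assumed:)

 $ perf stat -r 5 -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
       perf bench mem map -p 1GB -f demand -s 64GB -l 5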

   - 44,513,459,667      cycles                           #    2.420 GHz                         ( +-  0.44% )  (35.71%)
   -  1,378,032,592      instructions                     #    0.03  insn per cycle
   - 11,224,288,082      L1-dcache-loads                  #  610.187 M/sec                       ( +-  0.08% )  (35.72%)
   -  5,373,473,118      L1-dcache-load-misses            #   47.87% of all L1-dcache accesses   ( +-  0.00% )  (35.71%)

   + 20,093,219,076      cycles                           #    2.421 GHz                         ( +-  3.64% )  (35.69%)
   +  1,378,032,592      instructions                     #    0.03  insn per cycle
   +    186,525,095      L1-dcache-loads                  #   22.479 M/sec                       ( +-  2.11% )  (35.74%)
   +     73,479,687      L1-dcache-load-misses            #   39.39% of all L1-dcache accesses   ( +-  3.03% )  (35.74%)

Also, as mentioned earlier, the baseline improvement is not specific to
AMD Zen*. Intel Icelakex (pg-sz=2MB|1GB) sees an improvement (~35%)
similar to the Milan pg-sz=2MB workload above.

Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
---
 arch/x86/Kconfig               | 4 ++++
 arch/x86/include/asm/page_64.h | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..8a7ce6ab229b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -395,6 +395,10 @@ config GENERIC_CALIBRATE_DELAY
 config ARCH_HAS_CPU_RELAX
 	def_bool y
 
+config ARCH_HAS_CLEAR_PAGES
+	def_bool y
+	depends on X86_64 && !HIGHMEM
+
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 5625d616bd00..221c7404fc3a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,6 +40,13 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
+/*
+ * When running under voluntary preemption models, limit the max extent
+ * to 8MB worth of pages. With a clearing BW of ~10GBps, this bounds the
+ * worst case scheduling latency at ~1ms.
+ */
+#define ARCH_CLEAR_PAGE_EXTENT (8 << (20 - PAGE_SHIFT))
+
 void memzero_page_aligned_unrolled(void *addr, u64 len);
 
 /*
-- 
2.43.5

