Message-Id: <20250710005926.1159009-15-ankur.a.arora@oracle.com>
Date: Wed, 9 Jul 2025 17:59:26 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, david@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
acme@...nel.org, namhyung@...nel.org, tglx@...utronix.de,
willy@...radead.org, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
ankur.a.arora@...cle.com
Subject: [PATCH v5 14/14] x86/clear_pages: Support clearing of page-extents
Define ARCH_HAS_CLEAR_PAGES so hugepage zeroing (via folio_zero_user())
can use clear_pages() to clear in page-extents. This allows the
processor -- when using string instructions (REP; STOS) -- to optimize
based on the extent size.
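
For illustration, a rough sketch (not this patch; the actual x86
clear_pages() implementation is added earlier in the series, and the
helper name here is made up) of clearing a whole extent with a single
REP STOSB, which is what gives the processor visibility into the
extent size:

  static inline void clear_pages_sketch(void *addr, unsigned int npages)
  {
          unsigned long len = (unsigned long)npages << PAGE_SHIFT;

          /* One REP STOSB over the full extent, not one per 4KB page. */
          asm volatile("rep stosb"
                       : "+D" (addr), "+c" (len)
                       : "a" (0UL)
                       : "memory");
  }
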
Also define ARCH_CLEAR_PAGE_EXTENT which is used by folio_zero_user() to
decide the maximum extent to be zeroed when running under cooperative
preemption models.
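
As a rough sketch of the intended use (illustrative only, not the
actual folio_zero_user() code; the clear_pages() signature and the
preemption-model check are assumptions here), the clearing loop caps
each chunk at ARCH_CLEAR_PAGE_EXTENT under the cooperative preemption
models so it can reschedule between chunks:

  static void zero_extent_sketch(void *addr, unsigned long npages)
  {
          while (npages) {
                  unsigned long n = npages;

                  /* Cap the chunk so cooperative models can reschedule. */
                  if (!preempt_model_preemptible())
                          n = min_t(unsigned long, n, ARCH_CLEAR_PAGE_EXTENT);

                  clear_pages(addr, n);

                  addr += n * PAGE_SIZE;
                  npages -= n;
                  cond_resched();
          }
  }

With ~10GB/s of clearing bandwidth this bounds the time between
reschedule points at about 1ms, as the comment in the hunk below notes.
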
The resultant performance depends on the kinds of optimizations
available to the uarch for the extent being cleared. Two classes
of optimizations:
- clearing iteration costs can be amortized over a range larger than
  a single page.
- cacheline allocation elision (seen only on AMD Zen models).
A demand fault workload shows an improved baseline due to the first
optimization and a larger improvement when the extent is large enough
for the second one.
AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

               mm/folio_zero_user     x86/folio_zero_user        change
                (GB/s +- %stdev)       (GB/s +- %stdev)

 pg-sz=2MB      11.82 +- 0.67%         16.48 +- 0.30%            + 39.4%   preempt=*
 pg-sz=1GB      17.14 +- 1.39%         17.42 +- 0.98% [#]        +  1.6%   preempt=none|voluntary
 pg-sz=1GB      17.51 +- 1.19%         43.23 +- 5.22%            +146.8%   preempt=full|lazy
[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than ARCH_CLEAR_PAGE_EXTENT (8MB), so
preempt=none|voluntary sees no improvement for pg-sz=1GB.
The improvement due to the hardware eliding cacheline allocation for
pg-sz=1GB can be seen in the reduced L1-dcache-loads:
 -   44,513,459,667      cycles                 #   2.420 GHz             ( +- 0.44% )  (35.71%)
 -    1,378,032,592      instructions           #   0.03  insn per cycle
 -   11,224,288,082      L1-dcache-loads        # 610.187 M/sec           ( +- 0.08% )  (35.72%)
 -    5,373,473,118      L1-dcache-load-misses  #  47.87% of all L1-dcache accesses  ( +- 0.00% )  (35.71%)

 +   20,093,219,076      cycles                 #   2.421 GHz             ( +- 3.64% )  (35.69%)
 +    1,378,032,592      instructions           #   0.03  insn per cycle
 +      186,525,095      L1-dcache-loads        #  22.479 M/sec           ( +- 2.11% )  (35.74%)
 +       73,479,687      L1-dcache-load-misses  #  39.39% of all L1-dcache accesses  ( +- 3.03% )  (35.74%)
Also, as mentioned earlier, the baseline improvement is not specific to
AMD Zen*: Intel Icelakex (pg-sz=2MB|1GB) sees an improvement (~35%)
similar to the Milan pg-sz=2MB workload above.
Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/page_64.h | 7 +++++++
2 files changed, 11 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..8a7ce6ab229b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -395,6 +395,10 @@ config GENERIC_CALIBRATE_DELAY
 config ARCH_HAS_CPU_RELAX
 	def_bool y
 
+config ARCH_HAS_CLEAR_PAGES
+	def_bool y
+	depends on X86_64 && !HIGHMEM
+
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 5625d616bd00..221c7404fc3a 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,6 +40,13 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 #define __phys_reloc_hide(x)	(x)
 
+/*
+ * When running under voluntary preemption models, limit the max extent
+ * to 8MB worth of pages. With a clearing bandwidth of ~10GB/s, this
+ * should result in a worst case scheduling latency of ~1ms.
+ */
+#define ARCH_CLEAR_PAGE_EXTENT	(8 << (20 - PAGE_SHIFT))
+
 void memzero_page_aligned_unrolled(void *addr, u64 len);
 
 /*
--
2.43.5