Message-Id: <20251027202109.678022-1-ankur.a.arora@oracle.com>
Date: Mon, 27 Oct 2025 13:21:02 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, david@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
acme@...nel.org, namhyung@...nel.org, tglx@...utronix.de,
willy@...radead.org, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
ankur.a.arora@...cle.com
Subject: [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages
This series adds clearing of contiguous page ranges for hugepages,
improving on the current page-at-a-time approach in two ways:
- amortizes the per-page setup cost over a larger extent
- when using string instructions, exposes the real region size
to the processor.
A processor can use knowledge of the extent to optimize the
clearing. AMD Zen uarchs, for example, elide allocation of
cachelines for regions larger than the L3 size.
Demand faulting a 64GB region shows performance improvements:
$ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5
baseline +series change
(GB/s +- %stdev) (GB/s +- %stdev)
pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% [#] + 5.2% preempt=none|voluntary
pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% +144.3% preempt=full|lazy
[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than the maximum extent used on x86
(ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no improvement
with pg-sz=1GB.
The anon-w-seq test in the vm-scalability benchmark, however, does show
worse performance, with utime increasing by ~9%:
stime utime
baseline 1654.63 ( +- 3.84% ) 811.00 ( +- 3.84% )
+series 1630.32 ( +- 2.73% ) 886.37 ( +- 5.19% )
This is in part because anon-w-seq runs 384 processes that zero
anonymously mapped memory and then access it sequentially. That is a
likely uncommon pattern in which memory bandwidth is saturated while
the workload is also cache limited, because the entire region is
accessed.
Raghavendra also tested a previous version of the series on AMD Genoa [1].
Changelog:
v8:
- make clear_user_highpages(), clear_user_pages() and clear_pages()
more robust across architectures. (Thanks David!)
- split up folio_zero_user() changes into ones for clearing contiguous
regions and those for maintaining temporal locality since they have
different performance profiles (Suggested by Andrew Morton.)
- added Raghavendra's Reviewed-by, Tested-by.
- get rid of nth_page()
- perf related patches have been pulled already. Remove them.
v7:
- interface cleanups, comments for clear_user_highpages(), clear_user_pages(),
clear_pages().
- fixed build errors flagged by kernel test robot
(https://lore.kernel.org/lkml/20250917152418.4077386-1-ankur.a.arora@oracle.com/)
v6:
- perf bench mem: update man pages and other cleanups (Namhyung Kim)
- unify folio_zero_user() for HIGHMEM, !HIGHMEM options instead of
working through a new config option (David Hildenbrand).
- cleanups and simplification around that.
(https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)
v5:
- move the non HIGHMEM implementation of folio_zero_user() from x86
to common code (Dave Hansen)
- Minor naming cleanups, commit messages etc
(https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)
v4:
- adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
- inline stosb etc (PeterZ)
- handle cooperative preemption models (Ingo)
- interface and other cleanups all over (Ingo)
(https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)
v3:
- get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
was limited to preempt=full|lazy.
- override folio_zero_user() (Linus)
(https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)
v2:
- addressed review comments from peterz, tglx.
- Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
- General code cleanup
(https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
Comments appreciated!
Also at:
github.com/terminus/linux clear-pages.v7
[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/
Ankur Arora (6):
mm: introduce clear_pages() and clear_user_pages()
mm/highmem: introduce clear_user_highpages()
x86/mm: Simplify clear_page_*
x86/clear_page: Introduce clear_pages()
mm, folio_zero_user: support clearing page ranges
mm: folio_zero_user: cache neighbouring pages
David Hildenbrand (1):
treewide: provide a generic clear_user_page() variant
arch/alpha/include/asm/page.h | 1 -
arch/arc/include/asm/page.h | 2 +
arch/arm/include/asm/page-nommu.h | 1 -
arch/arm64/include/asm/page.h | 1 -
arch/csky/abiv1/inc/abi/page.h | 1 +
arch/csky/abiv2/inc/abi/page.h | 7 ---
arch/hexagon/include/asm/page.h | 1 -
arch/loongarch/include/asm/page.h | 1 -
arch/m68k/include/asm/page_mm.h | 1 +
arch/m68k/include/asm/page_no.h | 1 -
arch/microblaze/include/asm/page.h | 1 -
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 1 +
arch/openrisc/include/asm/page.h | 1 -
arch/parisc/include/asm/page.h | 1 -
arch/powerpc/include/asm/page.h | 1 +
arch/riscv/include/asm/page.h | 1 -
arch/s390/include/asm/page.h | 1 -
arch/sparc/include/asm/page_32.h | 2 +
arch/sparc/include/asm/page_64.h | 1 +
arch/um/include/asm/page.h | 1 -
arch/x86/include/asm/page.h | 6 ---
arch/x86/include/asm/page_32.h | 6 +++
arch/x86/include/asm/page_64.h | 64 ++++++++++++++++++-----
arch/x86/lib/clear_page_64.S | 39 +++-----------
arch/xtensa/include/asm/page.h | 1 -
include/linux/highmem.h | 29 +++++++++++
include/linux/mm.h | 69 +++++++++++++++++++++++++
mm/memory.c | 82 ++++++++++++++++++++++--------
mm/util.c | 13 +++++
30 files changed, 247 insertions(+), 91 deletions(-)
--
2.43.5