Message-Id: <20251215204922.475324-1-ankur.a.arora@oracle.com>
Date: Mon, 15 Dec 2025 12:49:14 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, david@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com, ankur.a.arora@...cle.com
Subject: [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages
[ Resending with the list this time. Some of the recipients might have
received a duplicate copy of this email. Apologies. ]
This series adds clearing of contiguous page ranges for hugepages.
Major change over v9:
- move clear_user_page(), clear_user_pages() into highmem.h and
  condition both on the arch not defining clear_user_highpage().
Described in more detail in the changelog.
The series improves on the current page-at-a-time approach in two ways:
- amortizes the per-page setup cost over a larger extent
- when using string instructions, exposes the real region size
to the processor.
A processor could use knowledge of the full extent to optimize the
clearing better than if it sees only a single page-sized extent at
a time. AMD Zen uarchs, for example, elide cacheline allocation
for regions larger than the LLC size.
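As an illustration, here is a minimal sketch (not the series' exact
implementation; the function name and shape are assumptions) of
clearing a physically contiguous extent with a single string
instruction on x86-64, so the CPU sees the whole length up front
rather than one page at a time:

    /*
     * Hedged sketch: clear npages contiguous pages with one
     * "rep stosb" (RDI = dest, RCX = byte count, AL = fill byte).
     * Assumes kernel context for PAGE_SIZE.
     */
    static inline void clear_pages_sketch(void *addr, unsigned int npages)
    {
            unsigned long len = (unsigned long)npages * PAGE_SIZE;

            asm volatile("rep stosb"
                         : "+D" (addr), "+c" (len)
                         : "a" (0)
                         : "memory");
    }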
Demand faulting a 64GB region shows performance improvement:
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
baseline +series change
(GB/s +- %stdev) (GB/s +- %stdev)
pg-sz=2MB 12.92 +- 2.55% 17.03 +- 0.70% + 31.8% preempt=*
pg-sz=1GB 17.14 +- 2.27% 18.04 +- 1.05% + 5.2% preempt=none|voluntary
pg-sz=1GB 17.26 +- 1.24% 42.17 +- 4.21% [#] +144.3% preempt=full|lazy
[#] Notice that we perform much better with preempt=full|lazy. That's
because preemptible models don't need explicit invocations of
cond_resched() to ensure reasonable preemption latency, which
allows us to clear the full extent (1GB) in a single unit.
In comparison, the maximum extent used for preempt=none|voluntary is
PROCESS_PAGES_NON_PREEMPT_BATCH (8MB); the batching is sketched below.
The larger extent allows the processor to elide cacheline
allocation (on Milan the threshold is the LLC size, 32MB).
(The hope is that eventually, in the fullness of time, the lazy
preemption model will be able to do the same job that none or
voluntary models are used for, allowing cond_resched() to go away.)
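To make the batching concrete, a rough sketch of the idea under the
different preemption models (the helper, the batch constant, and the
preemption predicate below are illustrative, not the exact upstream
code):

    /*
     * Hedged sketch of the batching idea, not the series' exact code:
     * preemptible models clear the full extent in one go; none/voluntary
     * clear in bounded chunks (~8MB, cf. PROCESS_PAGES_NON_PREEMPT_BATCH)
     * with explicit rescheduling points in between.
     * clear_contig_pages() and the batch constant are placeholders.
     */
    #define BATCH_NPAGES_SKETCH     (SZ_8M / PAGE_SIZE)     /* illustrative */

    static void clear_pages_batched(struct page *page, unsigned long npages)
    {
            unsigned long batch = preempt_model_preemptible() ?
                                  npages : BATCH_NPAGES_SKETCH;

            while (npages) {
                    unsigned long n = min(npages, batch);

                    clear_contig_pages(page, n);  /* e.g. rep stosb over n pages */
                    page += n;
                    npages -= n;

                    /* none/voluntary rely on explicit resched points. */
                    cond_resched();
            }
    }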
The anon-w-seq test in the vm-scalability benchmark, however, does show
worse performance with utime increasing by ~9%:
stime utime
baseline 1654.63 ( +- 3.84% ) 811.00 ( +- 3.84% )
+series 1630.32 ( +- 2.73% ) 886.37 ( +- 5.19% )
In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such, this is likely an uncommon pattern where the memory bandwidth
is saturated while we are also cache limited, because the workload
accesses the entire region.
Raghavendra also tested a previous version of the series on AMD Genoa
and sees improvement [1] with preempt=lazy.
(The pg-sz=2MB improvement is much higher on Genoa than I see on
Milan):
$ perf bench mem mmap -p $page-size -f populate -s 64GB -l 10
base patched change
pg-sz=2MB 12.731939 GB/sec 26.304263 GB/sec 106.6%
pg-sz=1GB 26.232423 GB/sec 61.174836 GB/sec 133.2%
Changelog:
v10:
- Condition the definition of clear_user_page(), clear_user_pages()
  on whether the architecture code defines clear_user_highpage(). This
  avoids issues with architectures that do not define clear_user_page()
  but do define clear_user_highpage().
  Also, instead of splitting them up across files, move both to
  linux/highmem.h. This gets rid of build errors when using
  clear_user_pages() on architectures that use macro magic (such as
  sparc, m68k).
(Suggested by Christophe Leroy).
- flesh out some of the comments around the x86 clear_pages()
definition (Suggested by Borislav Petkov and Mateusz Guzik).
(https://lore.kernel.org/lkml/20251121202352.494700-1-ankur.a.arora@oracle.com/)
v9:
- Define PROCESS_PAGES_NON_PREEMPT_BATCH in common code (instead of
inheriting ARCH_PAGE_CONTIG_NR.)
- Also document this in much greater detail, as clearing pages
  needing a constant dependent on the preemption model is
  facially quite odd.
(Suggested by David Hildenbrand, Andrew Morton, Borislav Petkov.)
- Switch architectural markers from __HAVE_ARCH_CLEAR_USER_PAGE (and
similar) to clear_user_page etc. (Suggested by David Hildenbrand)
- s/memzero_page_aligned_unrolled/__clear_pages_unrolled/
(Suggested by Borislav Petkov.)
- style, comment fixes
(https://lore.kernel.org/lkml/20251027202109.678022-1-ankur.a.arora@oracle.com/)
v8:
- make clear_user_highpages(), clear_user_pages() and clear_pages()
more robust across architectures. (Thanks David!)
- split up folio_zero_user() changes into ones for clearing contiguous
regions and those for maintaining temporal locality since they have
different performance profiles (Suggested by Andrew Morton.)
- added Raghavendra's Reviewed-by, Tested-by.
- get rid of nth_page()
- perf related patches have been pulled already. Remove them.
v7:
- interface cleanups, comments for clear_user_highpages(), clear_user_pages(),
clear_pages().
- fixed build errors flagged by kernel test robot
(https://lore.kernel.org/lkml/20250917152418.4077386-1-ankur.a.arora@oracle.com/)
v6:
- perf bench mem: update man pages and other cleanups (Namhyung Kim)
- unify folio_zero_user() for HIGHMEM, !HIGHMEM options instead of
working through a new config option (David Hildenbrand).
- cleanups and simplification around that.
(https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)
v5:
- move the non HIGHMEM implementation of folio_zero_user() from x86
to common code (Dave Hansen)
- Minor naming cleanups, commit messages etc
(https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)
v4:
- adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
- inline stosb etc (PeterZ)
- handle cooperative preemption models (Ingo)
- interface and other cleanups all over (Ingo)
(https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)
v3:
- get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
was limited to preempt=full|lazy.
- override folio_zero_user() (Linus)
(https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)
v2:
- addressed review comments from peterz, tglx.
- Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
- General code cleanup
(https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
Comments appreciated!
Also at:
github.com/terminus/linux clear-pages.v7
[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/
Ankur Arora (7):
highmem: introduce clear_user_highpages()
mm: introduce clear_pages() and clear_user_pages()
highmem: do range clearing in clear_user_highpages()
x86/mm: Simplify clear_page_*
x86/clear_page: Introduce clear_pages()
mm, folio_zero_user: support clearing page ranges
mm: folio_zero_user: cache neighbouring pages
David Hildenbrand (1):
treewide: provide a generic clear_user_page() variant
arch/alpha/include/asm/page.h | 1 -
arch/arc/include/asm/page.h | 2 +
arch/arm/include/asm/page-nommu.h | 1 -
arch/arm64/include/asm/page.h | 1 -
arch/csky/abiv1/inc/abi/page.h | 1 +
arch/csky/abiv2/inc/abi/page.h | 7 ---
arch/hexagon/include/asm/page.h | 1 -
arch/loongarch/include/asm/page.h | 1 -
arch/m68k/include/asm/page_no.h | 1 -
arch/microblaze/include/asm/page.h | 1 -
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 1 +
arch/openrisc/include/asm/page.h | 1 -
arch/parisc/include/asm/page.h | 1 -
arch/powerpc/include/asm/page.h | 1 +
arch/riscv/include/asm/page.h | 1 -
arch/s390/include/asm/page.h | 1 -
arch/sparc/include/asm/page_64.h | 1 +
arch/um/include/asm/page.h | 1 -
arch/x86/include/asm/page.h | 6 --
arch/x86/include/asm/page_32.h | 6 ++
arch/x86/include/asm/page_64.h | 76 ++++++++++++++++++-----
arch/x86/lib/clear_page_64.S | 39 +++---------
arch/xtensa/include/asm/page.h | 1 -
include/linux/highmem.h | 97 +++++++++++++++++++++++++++++-
include/linux/mm.h | 56 +++++++++++++++++
mm/memory.c | 86 +++++++++++++++++++-------
27 files changed, 298 insertions(+), 95 deletions(-)
--
2.31.1