linux-kernel - [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents

Message-Id: <20250710005926.1159009-1-ankur.a.arora@oracle.com>
Date: Wed,  9 Jul 2025 17:59:12 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, david@...hat.com, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
        acme@...nel.org, namhyung@...nel.org, tglx@...utronix.de,
        willy@...radead.org, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        ankur.a.arora@...cle.com
Subject: [PATCH v5 00/14] mm: folio_zero_user: clearing of page-extents

This series adds clearing of page-extents for hugepages, improving on the
current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent

 - when using string instructions, exposes the real region size to the
   processor. A processor could use that as a hint to optimize based
   on the full extent size. AMD Zen uarchs, as an example, elide
   allocation of cachelines for regions larger than L3-size.
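
As a rough userspace illustration of the two points above (not code from
this series), the sketch below contrasts the two shapes. memset() stands
in for the kernel's clearing primitive, and SKETCH_PAGE_SIZE and both
function names are made up for the example:

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size */

/*
 * Page-at-a-time: one clearing call per page. Any per-call setup is paid
 * npages times, and each call only ever sees a 4KB length.
 */
static void zero_page_at_a_time(unsigned char *base, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		memset(base + i * SKETCH_PAGE_SIZE, 0, SKETCH_PAGE_SIZE);
}

/*
 * Extent: one call covering the whole region. Setup is paid once, and the
 * full length is visible to whatever implements the clear.
 */
static void zero_extent(unsigned char *base, size_t npages)
{
	memset(base, 0, npages * SKETCH_PAGE_SIZE);
}
```

Both functions leave memory in the same state; the difference is only in
how many calls are made and what length each call is told about.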

Demand faulting a 64GB region shows performance improvements:

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change     preempt
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%    *

   pg-sz=1GB       17.14  +- 1.39%        17.42  +-  0.98% [#]   +  1.6%    none|voluntary
   pg-sz=1GB       17.51  +- 1.19%        43.23  +-  5.22%       +146.8%    full|lazy

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than ARCH_CLEAR_PAGE_EXTENT, so
preempt=none|voluntary sees no improvement with pg-sz=1GB.
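
One way to read the note above: under the cooperative preemption models
each clearing call is bounded by ARCH_CLEAR_PAGE_EXTENT with a reschedule
point between calls, so the processor never sees a length above its
threshold. The userspace sketch below illustrates that reading only. The
function name, the resched callback and the convention that max_extent == 0
means "unbounded" are all assumptions for the example, not the series'
interface:

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size */

/* Hypothetical stand-in for a voluntary reschedule point. */
typedef void (*resched_fn)(void);

/*
 * Clear npages starting at base in units of at most max_extent pages,
 * calling resched() between units. max_extent == 0 means no bound (one
 * unit covers everything). Returns the number of units used.
 */
static size_t zero_in_extents(unsigned char *base, size_t npages,
			      size_t max_extent, resched_fn resched)
{
	size_t units = 0;

	if (max_extent == 0)
		max_extent = npages;

	while (npages > 0) {
		size_t n = npages < max_extent ? npages : max_extent;

		/* Each call exposes only n pages' worth of length. */
		memset(base, 0, n * SKETCH_PAGE_SIZE);
		base += n * SKETCH_PAGE_SIZE;
		npages -= n;
		units++;

		if (npages > 0 && resched)
			resched();
	}
	return units;
}
```

With a bound smaller than the region, the clear is split into several
shorter calls; with no bound, a single call covers the whole region.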

Raghavendra also tested v3/v4 on AMD Genoa and sees similar improvement [1].

Structure of the series:

Patches 1-5, 8,
  "perf bench mem: Remove repetition around time measurement"
  "perf bench mem: Defer type munging of size to float"
  "perf bench mem: Move mem op parameters into a structure"
  "perf bench mem: Pull out init/fini logic"
  "perf bench mem: Switch from zalloc() to mmap()"
  "perf bench mem: Refactor mem_options"

refactor, and patches 6-7, 9
  "perf bench mem: Allow mapping of hugepages"
  "perf bench mem: Allow chunking on a memory region"
  "perf bench mem: Add mmap() workloads"

add a few new perf bench mem workloads (chunking and mapping
performance).
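
For readers unfamiliar with what a demand-fault workload exercises, here
is a minimal userspace sketch. It is not the perf bench implementation,
and demand_fault() is a name made up for the example. It maps anonymous
memory and writes one byte per page, so the kernel has to fault in (and
zero) each page on first touch:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and write one byte per pg_sz stride so
 * that each page is demand-faulted on first touch.
 * Returns the number of pages touched, or -1 on bad input or mmap failure.
 */
static long demand_fault(size_t len, size_t pg_sz)
{
	unsigned char *p;
	long touched = 0;

	if (pg_sz == 0)
		return -1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	for (size_t off = 0; off < len; off += pg_sz) {
		p[off] = 1;	/* first write triggers the fault */
		touched++;
	}

	munmap(p, len);
	return touched;
}
```

A hugepage variant would also need a hugepage-backed mapping (for example
via MAP_HUGETLB with pages reserved), which is omitted here.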

Patches 10-11,
  "x86/mm: Simplify clear_page_*"
  "x86/clear_page: Introduce clear_pages()"

inline the ERMS and REP_GOOD implementations used from clear_page()
and add clear_pages() to handle page extents.
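
To make the string-instruction idea concrete, below is a userspace sketch
of clearing a multi-page extent with a single "rep stosb". It is not the
kernel's clear_pages(); the name clear_pages_sketch(), its signature and
SKETCH_PAGE_SIZE are assumptions for the example, and non-x86-64 builds
fall back to memset():

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size */

/* Zero npages contiguous pages starting at addr. */
static inline void clear_pages_sketch(void *addr, size_t npages)
{
	size_t len = npages * SKETCH_PAGE_SIZE;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	/*
	 * rep stosb stores AL to [RDI] RCX times, advancing RDI and
	 * decrementing RCX, so both are read-write operands. The whole
	 * extent length goes into RCX in one instruction.
	 */
	__asm__ volatile("rep stosb"
			 : "+D"(addr), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(addr, 0, len);
#endif
}
```

The point of the single instruction is that the count register carries
the size of the whole extent rather than one page at a time.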

Patches 12-13,
  "mm: add config option for clearing page-extents"
  "mm: memory: support clearing page-extents"

add support for extent zeroing via folio_zero_user().

And finally, patch 14,
  "x86/clear_pages: Support clearing of page-extents"

adds x86 support so folio_zero_user() can take advantage of
clear_pages().

Changelog:

v5:
 - move the non-HIGHMEM implementation of folio_zero_user() from x86
   to common code (Dave Hansen)
 - minor cleanups to naming, commit messages, etc.

v4:
 - adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)
 (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)

v3:
 - get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - General code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages.v5

[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/

Ankur Arora (14):
  perf bench mem: Remove repetition around time measurement
  perf bench mem: Defer type munging of size to float
  perf bench mem: Move mem op parameters into a structure
  perf bench mem: Pull out init/fini logic
  perf bench mem: Switch from zalloc() to mmap()
  perf bench mem: Allow mapping of hugepages
  perf bench mem: Allow chunking on a memory region
  perf bench mem: Refactor mem_options
  perf bench mem: Add mmap() workloads
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  mm: add config option for clearing page-extents
  mm: memory: support clearing page-extents
  x86/clear_pages: Support clearing of page-extents

 arch/x86/Kconfig                             |   4 +
 arch/x86/include/asm/page_32.h               |  17 +-
 arch/x86/include/asm/page_64.h               |  63 ++-
 arch/x86/lib/clear_page_64.S                 |  39 +-
 mm/Kconfig                                   |   9 +
 mm/memory.c                                  |  86 +++-
 tools/perf/bench/bench.h                     |   1 +
 tools/perf/bench/mem-functions.c             | 391 ++++++++++++++-----
 tools/perf/bench/mem-memcpy-arch.h           |   2 +-
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
 tools/perf/bench/mem-memset-arch.h           |   2 +-
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
 tools/perf/builtin-bench.c                   |   1 +
 13 files changed, 472 insertions(+), 151 deletions(-)

-- 
2.43.5