Message-Id: <20260107113130.37231-1-lizhe.67@bytedance.com>
Date: Wed,  7 Jan 2026 19:31:22 +0800
From: "Li Zhe" <lizhe.67@...edance.com>
To: <muchun.song@...ux.dev>, <osalvador@...e.de>, <david@...nel.org>, 
	<akpm@...ux-foundation.org>, <fvdl@...gle.com>
Cc: <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>, 
	<lizhe.67@...edance.com>
Subject: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism

This patchset is based on the earlier patch[1] ("mm/hugetlb:
optionally pre-zero hugetlb pages").

Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 250
milliseconds for a 1G page on a Skylake machine).

This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.

However, there are some use cases where a large number of hugetlb
pages are touched when an application starts (such as a VM backed
by these pages), rendering the launch noticeably slow.

On a Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
pages takes about 16 seconds, roughly 250 ms per page. Even with
Ankur's optimizations[2], the time only drops to ~13 seconds,
~200 ms per page, still a noticeable delay.

To accelerate the above scenario, this patchset exports a per-node,
read-write "zeroable_hugepages" sysfs interface for every hugepage size.
This interface reports how many hugepages on that node can currently
be pre-zeroed and allows user space to request that any integer number
in the range [0, max] be zeroed in a single operation.
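
For example, on node 0 with 1 GiB hugepages (the counts shown below
are illustrative):

  $ cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
  16
  # Pre-zero four of them:
  $ echo 4 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
  # Or pre-zero everything that is currently zeroable:
  $ echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages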

This mechanism offers the following advantages:

(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.

(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.

(3) By binding the process to specific CPUs, users can confine zeroing
threads to cores that do not run latency-critical tasks, eliminating
interference.

(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.

(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide (see the
sketch after this list).
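
For instance, (3) and (5) compose with standard tooling; a minimal
sketch, assuming cgroup v2 and a hypothetical zeroing helper named
"prezero":

  # Cap the zeroing work at half of one CPU via cgroup v2 cpu.max:
  $ mkdir /sys/fs/cgroup/prezero
  $ echo "50000 100000" > /sys/fs/cgroup/prezero/cpu.max
  $ echo $$ > /sys/fs/cgroup/prezero/cgroup.procs
  # Pin it to housekeeping CPUs 60-63, away from latency-critical tasks:
  $ taskset -c 60-63 ./prezero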

On the same Skylake platform as above, with all 64 GiB pre-zeroed in
advance by this mechanism, the fault-in test completed in negligible
time.

In user space, we can use the epoll interface to sleep until huge
folios become eligible for zeroing, and write() to zero them as they
become available. The following sketch illustrates this approach; it
assumes the attribute signals readiness via the usual sysfs poll
convention (EPOLLPRI). The program spawns eight threads (each running
thread_fun()) that wait for huge pages on node 0 to become eligible
for zeroing; whenever such pages are available, the threads clear
them in parallel.

  #include <fcntl.h>
  #include <pthread.h>
  #include <stdlib.h>
  #include <sys/epoll.h>
  #include <unistd.h>

  #define ZEROABLE \
  	"/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"

  static void *thread_fun(void *arg)
  {
  	int fd = open(ZEROABLE, O_RDWR);
  	int epfd = epoll_create1(0);
  	/* sysfs attributes signal changes as EPOLLPRI (assumed here) */
  	struct epoll_event ev = { .events = EPOLLPRI };
  	char buf[32];

  	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
  	while (1) {
  		/* re-reading the attribute also re-arms the sysfs poll */
  		ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);
  		buf[n > 0 ? n : 0] = '\0';
  		/* zero every huge page that is currently zeroable */
  		if (atol(buf) > 0)
  			pwrite(fd, "max", 3, 0);
  		/* sleep until the kernel reports newly zeroable pages */
  		epoll_wait(epfd, &ev, 1, -1);
  	}
  	return NULL;
  }

  int main(void)
  {
  	pthread_t tid[8];

  	for (int i = 0; i < 8; i++)
  		pthread_create(&tid[i], NULL, thread_fun, NULL);
  	for (int i = 0; i < 8; i++)
  		pthread_join(tid[i], NULL);
  	return 0;
  }

[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
[2]: https://lore.kernel.org/all/20251215204922.475324-1-ankur.a.arora@oracle.com/T/#u

Li Zhe (8):
  mm/hugetlb: add pre-zeroed framework
  mm/hugetlb: convert to prep_account_new_hugetlb_folio()
  mm/hugetlb: move the huge folio to the end of the list during enqueue
  mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
  mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
  mm/hugetlb: relocate the per-hstate struct kobject pointer
  mm/hugetlb: add epoll support for interface "zeroable_hugepages"
  mm/hugetlb: limit event generation frequency of function
    do_zero_free_notify()

 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 +++++
 mm/hugetlb.c            | 131 ++++++++++++++++++++++---
 mm/hugetlb_internal.h   |   6 ++
 mm/hugetlb_sysfs.c      | 206 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 337 insertions(+), 35 deletions(-)

---
Changelogs:

v1->v2:
- Use guard() to simplify function hpage_wait_zeroing(). (pointed out
  by Raghu)
- Simplify the logic of zero_free_hugepages_nid() by removing
  redundant checks and exiting the loop upon encountering a
  pre-zeroed folio. (pointed out by Frank)
- Include in the cover letter a performance comparison with Ankur's
  optimization patch[2]. (pointed out by Andrew)

v1: https://lore.kernel.org/all/20251225082059.1632-1-lizhe.67@bytedance.com/

-- 
2.20.1
