Message-Id: <20251225082059.1632-1-lizhe.67@bytedance.com>
Date: Thu, 25 Dec 2025 16:20:51 +0800
From: 李喆 <lizhe.67@...edance.com>
To: <muchun.song@...ux.dev>, <osalvador@...e.de>, <david@...nel.org>,
<akpm@...ux-foundation.org>, <fvdl@...gle.com>
Cc: <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
<lizhe.67@...edance.com>
Subject: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
From: Li Zhe <lizhe.67@...edance.com>
This patchset builds on the patch "mm/hugetlb: optionally pre-zero
hugetlb pages" [1].
Fresh hugetlb pages are zeroed out when they are faulted in,
just like with all other page types. This can take up a good
amount of time for larger page sizes (e.g. around 40
milliseconds for a 1G page on a recent AMD-based system).
This normally isn't a problem, since hugetlb pages are typically
mapped by the application for a long time, and the initial
delay when touching them isn't much of an issue.
However, there are some use cases where a large number of hugetlb
pages are touched when an application (such as a VM backed by these
pages) starts. For 256 1G pages and 40ms per page, this would take
10 seconds, a noticeable delay.
To speed up this scenario, this patchset exports a per-node,
read-write zeroable_hugepages interface for every huge page size.
Reading it reports how many huge pages on that node can currently be
pre-zeroed; writing to it lets user space request that any integer
number of pages in the range [0, max] be zeroed in a single operation.
This mechanism offers the following advantages:
(1) User space gains full control over when zeroing is triggered,
enabling it to minimize the impact on both CPU and cache utilization.
(2) Applications can spawn as many zeroing processes as they need,
enabling concurrent background zeroing.
(3) By binding the zeroing processes to specific CPUs, users can confine
zeroing to cores that do not run latency-critical tasks, keeping it from
interfering with those tasks.
(4) A zeroing process can be interrupted at any time through standard
signal mechanisms, allowing immediate cancellation.
(5) The CPU consumption incurred by zeroing can be throttled and contained
with cgroups, ensuring that the cost is not borne system-wide.
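Point (3) can be approached with the standard Linux affinity API. The following minimal sketch assumes, as points (3) to (5) suggest, that zeroing runs in the context of the task that writes to zeroable_hugepages, so pinning that task also pins the zeroing work. The helper names pin_to_cpu() and first_allowed_cpu() are hypothetical; deciding which CPUs are free of latency-critical work is left to the deployment.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: confine the calling thread to 'cpu'. Assumption:
 * zeroing triggered by this thread's later writes runs in its context and
 * therefore on this CPU. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return -1;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means "the calling thread" for sched_setaffinity() */
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: return the lowest-numbered CPU the calling thread
 * is currently allowed to run on, or -1 if the mask cannot be read. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

A zeroing thread would call pin_to_cpu() once, before its first write to zeroable_hugepages; the same effect can be had from the shell with taskset(1).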
On an AMD Milan platform, each 1 GB huge-page fault is shortened by at
least 25628 us (figure taken from the test results in [1]).
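The per-fault saving scales linearly with the number of pages faulted in. As a back-of-the-envelope sketch (not a new measurement), assuming the faults are handled one after another as in the 256-page startup example above, the hypothetical helper serial_zeroing_us() multiplies a page count by a per-page cost in microseconds.

```c
#include <stdint.h>

/* Hypothetical helper: total microseconds when 'pages' huge pages each
 * cost (or each save) 'us_per_page' microseconds, with no overlap. */
static uint64_t serial_zeroing_us(uint64_t pages, uint64_t us_per_page)
{
	return pages * us_per_page;
}
```

With the figures quoted in this letter, serial_zeroing_us(256, 40000) is 10240000 us (about 10.2 s, the "10 seconds" above), and serial_zeroing_us(256, 25628) is 6560768 us, so on the Milan platform pre-zeroing would remove at least roughly 6.6 s from such a serialized startup.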
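In user space, a program can combine epoll with write() to zero huge
pages as they become available and sleep when none are ready. The
following C sketch illustrates this approach. It spawns eight threads
that wait for 1 GB huge pages on node 0 to become eligible for zeroing;
whenever such pages are available, the threads clear them in parallel.

	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <sys/epoll.h>
	#include <unistd.h>

	#define ZEROABLE "/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"

	static void *thread_fun(void *arg)
	{
		struct epoll_event ev = { .events = EPOLLPRI }, out;
		int fd = open(ZEROABLE, O_RDWR), epfd = epoll_create1(0);
		char buf[32];
		ssize_t n;

		if (fd < 0 || epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
			return NULL;
		while (1) {
			/* Read from offset 0; for a sysfs_notify()-style attribute (assumed) this re-arms EPOLLPRI. */
			n = pread(fd, buf, sizeof(buf) - 1, 0);
			buf[n > 0 ? n : 0] = '\0';
			if (strtoul(buf, NULL, 10) > 0)
				pwrite(fd, "max\n", 4, 0);	/* zero every currently zeroable page */
			epoll_wait(epfd, &out, 1, -1);		/* sleep until the count changes */
		}
	}

	int main(void)
	{
		pthread_t tid[8];

		for (int i = 0; i < 8; i++)	/* eight concurrent zeroing threads */
			pthread_create(&tid[i], NULL, thread_fun, NULL);
		for (int i = 0; i < 8; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}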
[1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
Li Zhe (8):
mm/hugetlb: add pre-zeroed framework
mm/hugetlb: convert to prep_account_new_hugetlb_folio()
mm/hugetlb: move the huge folio to the end of the list during enqueue
mm/hugetlb: introduce per-node sysfs interface "zeroable_hugepages"
mm/hugetlb: simplify function hugetlb_sysfs_add_hstate()
mm/hugetlb: relocate the per-hstate struct kobject pointer
mm/hugetlb: add epoll support for interface "zeroable_hugepages"
mm/hugetlb: limit event generation frequency of function do_zero_free_notify()
 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  26 ++++++
 mm/hugetlb.c            | 133 +++++++++++++++++++++++---
 mm/hugetlb_internal.h   |   6 ++
 mm/hugetlb_sysfs.c      | 202 ++++++++++++++++++++++++++++++++++++----
 5 files changed, 335 insertions(+), 35 deletions(-)
--
2.20.1