Message-Id: <20260107081921.0e189904060f49a555142e28@linux-foundation.org>
Date: Wed, 7 Jan 2026 08:19:21 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: "Li Zhe" <lizhe.67@...edance.com>
Cc: <muchun.song@...ux.dev>, <osalvador@...e.de>, <david@...nel.org>,
<fvdl@...gle.com>, <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
On Wed, 7 Jan 2026 19:31:22 +0800 "Li Zhe" <lizhe.67@...edance.com> wrote:
> This patchset is based on this commit[1]("mm/hugetlb: optionally
> pre-zero hugetlb pages").
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 250
> milliseconds for a 1G page on a Skylake machine).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application starts (such as a VM backed
> by these pages), rendering the launch noticeably slow.
>
> On a Skylake platform running v6.19-rc2, faulting in 64 × 1 GB huge
> pages takes about 16 seconds, roughly 250 ms per page. Even with
> Ankur’s optimizations[2], the time drops only to ~13 seconds,
> ~200 ms per page, still a noticeable delay.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write "zeroable_hugepages" sysfs interface for every hugepage size.
> Reading the interface reports how many hugepages on that node can
> currently be pre-zeroed; writing any integer in the range [0, max]
> requests that many pages be zeroed in a single operation.
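For concreteness, I assume the interface is driven like this (the node,
page size and exact path are my guesses from the description):

    $ cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages
    64
    $ echo 64 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages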
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference (see the affinity sketch after this list).
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
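On point (3): the zeroing threads can be confined with
sched_setaffinity(2). A minimal sketch, assuming CPUs 0 and 1 are set
aside as housekeeping cores (an arbitrary choice on my part):

    #define _GNU_SOURCE
    #include <sched.h>

    /* Confine the calling thread to housekeeping CPUs 0-1 so that
     * zeroing never competes with latency-critical tasks. */
    static int pin_to_housekeeping_cpus(void)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(0, &set);
            CPU_SET(1, &set);
            return sched_setaffinity(0, sizeof(set), &set);
    }

Each zeroing thread would call this once before entering its loop.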
>
> Tested on the same Skylake platform as above: with the 64 GiB of
> memory pre-zeroed in advance through this mechanism, the fault-in
> latency test completed in negligible time.
>
> In user space, we can use system calls such as epoll and write to zero
> huge folios as they become available, and sleep when none are ready.
> The pseudocode below illustrates this approach: it spawns eight
> threads (each running thread_fun()) that wait for huge pages on node 0
> to become eligible for zeroing and, whenever such pages are available,
> clear them in parallel.
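Presumably the pseudocode amounts to something like the sketch below.
The sysfs path and the POLLPRI-after-sysfs_notify() wakeup semantics
are assumptions on my part, and the one-page-per-write batching is
arbitrary:

    /* Hedged sketch, not from the patchset: assumes poll(POLLPRI)
     * wakes after the kernel calls sysfs_notify() on the attribute. */
    #include <fcntl.h>
    #include <poll.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ZEROABLE \
        "/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages"

    static void *thread_fun(void *arg)
    {
            int fd = open(ZEROABLE, O_RDWR);

            if (fd < 0)
                    return NULL;

            for (;;) {
                    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
                    char buf[32];
                    ssize_t len;

                    /* Re-read the attribute to learn how many huge
                     * pages are currently zeroable (and to re-arm the
                     * sysfs poll notification). */
                    lseek(fd, 0, SEEK_SET);
                    len = read(fd, buf, sizeof(buf) - 1);
                    if (len <= 0)
                            break;
                    buf[len] = '\0';

                    if (atol(buf) > 0) {
                            /* Zero one page per write so all eight
                             * threads can make progress in parallel. */
                            lseek(fd, 0, SEEK_SET);
                            if (write(fd, "1", 1) < 0)
                                    break;  /* e.g. interrupted */
                            continue;
                    }

                    /* Nothing to zero: sleep until notified. */
                    if (poll(&pfd, 1, -1) < 0)
                            break;
            }
            close(fd);
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[8];

            for (int i = 0; i < 8; i++)
                    pthread_create(&tid[i], NULL, thread_fun, NULL);
            for (int i = 0; i < 8; i++)
                    pthread_join(tid[i], NULL);
            return 0;
    }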
This seems to be quite a lot of messing around in userspace. Perhaps
unavoidable given the tradeoffs which are involved, and reasonable in
the sort of environments in which this will be used. I guess there are
many alternatives - let's see what others think.
> fs/hugetlbfs/inode.c | 3 +-
> include/linux/hugetlb.h | 26 +++++
> mm/hugetlb.c | 131 ++++++++++++++++++++++---
> mm/hugetlb_internal.h | 6 ++
> mm/hugetlb_sysfs.c | 206 ++++++++++++++++++++++++++++++++++++----
> 5 files changed, 337 insertions(+), 35 deletions(-)
Let's find places in Documentation/ (and Documentation/ABI) to document
the userspace interface?
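Something along these lines in Documentation/ABI/testing, say (the
exact file, path and wording below are just my guess at what the
patchset implements):

    What:           /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/zeroable_hugepages
    Date:           January 2026
    Contact:        linux-mm@...ck.org
    Description:
                    Read-write. Reading returns the number of huge
                    pages of this size on this node which are currently
                    eligible for pre-zeroing. Writing an integer N in
                    [0, max] causes the writing task to zero up to N of
                    those pages.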