Message-ID: <CAPTztWaiDDvgq1Q-GQjGO75Ujy30qy+m5wQXuFiTPmRLEm3aPw@mail.gmail.com>
Date: Fri, 26 Dec 2025 10:32:11 -0800
From: Frank van der Linden <fvdl@...gle.com>
To: Li Zhe <lizhe.67@...edance.com>
Cc: muchun.song@...ux.dev, osalvador@...e.de, david@...nel.org,
akpm@...ux-foundation.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/8] Introduce a huge-page pre-zeroing mechanism
On Thu, Dec 25, 2025 at 12:21 AM Li Zhe <lizhe.67@...edance.com> wrote:
>
> From: Li Zhe <lizhe.67@...edance.com>
>
> This patchset is based on the earlier work[1] ("mm/hugetlb: optionally
> pre-zero hugetlb pages").
>
> Fresh hugetlb pages are zeroed out when they are faulted in,
> just like with all other page types. This can take up a good
> amount of time for larger page sizes (e.g. around 40
> milliseconds for a 1G page on a recent AMD-based system).
>
> This normally isn't a problem, since hugetlb pages are typically
> mapped by the application for a long time, and the initial
> delay when touching them isn't much of an issue.
>
> However, there are some use cases where a large number of hugetlb
> pages are touched when an application (such as a VM backed by these
> pages) starts. For 256 1G pages and 40ms per page, this would take
> 10 seconds, a noticeable delay.
>
> To accelerate the above scenario, this patchset exports a per-node,
> read-write zeroable_hugepages interface for every hugepage size.
> This interface reports how many hugepages on that node can currently
> be pre-zeroed and allows user space to request that any integer number
> in the range [0, max] be zeroed in a single operation.
>
> This mechanism offers the following advantages:
>
> (1) User space gains full control over when zeroing is triggered,
> enabling it to minimize the impact on both CPU and cache utilization.
>
> (2) Applications can spawn as many zeroing processes as they need,
> enabling concurrent background zeroing.
>
> (3) By binding the process to specific CPUs, users can confine zeroing
> threads to cores that do not run latency-critical tasks, eliminating
> interference.
>
> (4) A zeroing process can be interrupted at any time through standard
> signal mechanisms, allowing immediate cancellation.
>
> (5) The CPU consumption incurred by zeroing can be throttled and contained
> with cgroups, ensuring that the cost is not borne system-wide.
>
> On an AMD Milan platform, pre-zeroing shortens each 1 GB huge-page
> fault by at least 25628 us (figure taken from the test results cited
> in [1]).
>
> In user space, a program can use system calls such as epoll and write
> to zero huge pages as they become eligible, sleeping when none are
> ready. The following pseudocode illustrates the approach: it spawns
> eight threads that wait for huge pages on node 0 to become eligible
> for zeroing and, whenever such pages are available, clear them in
> parallel.
>
> static void thread_fun(void)
> {
>         epoll_create();
>         epoll_ctl();
>         while (1) {
>                 val = read("/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 if (val > 0)
>                         system("echo max > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/zeroable_hugepages");
>                 epoll_wait();
>         }
> }
>
> static void start_pre_zero_thread(int thread_num)
> {
>         create_pre_zero_threads(thread_num, thread_fun);
> }
>
> int main(void)
> {
>         start_pre_zero_thread(8);
> }
>
> [1]: https://lore.kernel.org/linux-mm/202412030519.W14yll4e-lkp@intel.com/T/#t
Thanks for taking my patches and extending them!

As far as I can see, you took what I did and added a framework for the
zeroing to be done in user context, possibly by multiple threads,
right? There were one or two comments on my original patch set that
objected to the zeroing cost being taken by a system thread rather
than a user thread, so this should address that.

I'll go through the patches and provide comments inline.

- Frank