Message-ID: <5c30a825-b588-e3a9-83db-f8eef4cb9012@google.com>
Date: Tue, 2 Jan 2024 17:52:53 -0800 (PST)
From: David Rientjes <rientjes@...gle.com>
To: Gang Li <gang.li@...ux.dev>
cc: David Hildenbrand <david@...hat.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Muchun Song <muchun.song@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
Tim Chen <tim.c.chen@...ux.intel.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, ligang.bdlg@...edance.com
Subject: Re: [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot
On Tue, 2 Jan 2024, Gang Li wrote:
> Hi all, hugetlb init parallelization has now been updated to v3.
>
> This series was tested on next-20240102 and cannot be applied to v6.7-rc8.
>
> Update Summary:
> - Select CONFIG_PADATA as we use padata_do_multithreaded
> - Fix a race condition in h->next_nid_to_alloc
> - Fix local variable initialization issues
> - Remove RFC tag
>
> Thanks to David Rientjes's testing, we now know that this series reduces
> hugetlb 1G initialization time from 77s to 18.3s on a 12T machine[4].
>
> # Introduction
> Hugetlb initialization during boot takes up a considerable amount of time.
> For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
> seconds out of a roughly 10-second boot. Initializing 11,776 1GB pages on a
> 12TB Intel host takes more than 1 minute[1], which is a significant cost.
>
> Inspired by [2] and [3], hugetlb initialization can also be accelerated
> through parallelization. The kernel already provides the necessary
> infrastructure in padata_do_multithreaded, and this series uses it to
> achieve a significant speedup with minimal modifications.
>
> [1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/
> [2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
> [3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/
> [4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@google.com/
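>
> As a rough illustration of the mechanism (not the actual code in this
> series), a padata multithreaded job is driven like the sketch below.
> struct padata_mt_job and padata_do_multithreaded() come from
> include/linux/padata.h and require CONFIG_PADATA; the function names and
> job parameters here are purely illustrative:
>
>	#include <linux/init.h>
>	#include <linux/nodemask.h>
>	#include <linux/padata.h>
>
>	/* Worker: handles units [start, end) of the job on one thread. */
>	static void __init demo_init_chunk(unsigned long start,
>					   unsigned long end, void *arg)
>	{
>		unsigned long i;
>
>		for (i = start; i < end; i++)
>			; /* ... allocate/initialize one huge page here ... */
>	}
>
>	static void __init demo_parallel_init(unsigned long nr_pages)
>	{
>		struct padata_mt_job job = {
>			.thread_fn   = demo_init_chunk,
>			.fn_arg      = NULL,      /* per-job context, if any */
>			.start       = 0,         /* first unit of work */
>			.size        = nr_pages,  /* total units of work */
>			.align       = 1,         /* no chunk alignment needed */
>			.min_chunk   = 1,         /* smallest per-thread chunk */
>			.max_threads = num_node_state(N_MEMORY), /* illustrative cap */
>		};
>
>		/* Splits [start, start + size) into chunks run concurrently. */
>		padata_do_multithreaded(&job);
>	}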
>
> # Test result
> test                 no patch(ms)   patched(ms)   saved
> -------------------  -------------  ------------  --------
> 256c2t(4 node) 1G    4745           2024          57.34%
> 128c1t(2 node) 1G    3358           1712          49.02%
> 12t            1G    77000          18300         76.23%
>
> 256c2t(4 node) 2M    3336           1051          68.52%
> 128c1t(2 node) 2M    1943           716           63.15%
>
I tested 1GB hugetlb on a smaller AMD host with the following:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
 int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 	struct huge_bootmem_page *m = NULL; /* initialize for clang */
-	int nr_nodes, node;
+	int nr_nodes, node = nid;
 
 	/* do node specific alloc */
 	if (nid != NUMA_NO_NODE) {
After the build error is fixed, feel free to add:
Tested-by: David Rientjes <rientjes@...gle.com>
to each patch. I think Andrew will probably take a build fixup as a delta
on top of patch 4 rather than a whole new series, unless there is other
feedback that you receive.