linux-kernel - Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <14e38e95-2bc6-4571-b502-4e3954b4bcc4@linux.dev>
Date: Mon, 22 Jan 2024 18:12:53 +0800
From: Gang Li <gang.li@...ux.dev>
To: Muchun Song <muchun.song@...ux.dev>, David Hildenbrand
 <david@...hat.com>, David Rientjes <rientjes@...gle.com>,
 Mike Kravetz <mike.kravetz@...cle.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Tim Chen <tim.c.chen@...ux.intel.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 ligang.bdlg@...edance.com
Subject: Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and
 initialization

On 2024/1/22 15:10, Muchun Song wrote:> On 2024/1/18 20:39, Gang Li wrote:
>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned 
>> long end, void *arg)
>>   {
>> -    unsigned long i;
>> +    struct hstate *h = (struct hstate *)arg;
>> +    int i, num = end - start;
>> +    nodemask_t node_alloc_noretry;
>> +    unsigned long flags;
>> +    int next_node = 0;
> 
> This should be first_online_node which may be not zero.
> 

That's right. Thanks!

>> -    for (i = 0; i < h->max_huge_pages; ++i) {
>> -        if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>> +    /* Bit mask controlling how hard we retry per-node allocations.*/
>> +    nodes_clear(node_alloc_noretry);
>> +
>> +    for (i = 0; i < num; ++i) {
>> +        struct folio *folio = alloc_pool_huge_folio(h, 
>> &node_states[N_MEMORY],
>> +                        &node_alloc_noretry, &next_node);
>> +        if (!folio)
>>               break;
>> +        spin_lock_irqsave(&hugetlb_lock, flags);
>  > I suspect there will more contention on this lock when parallelizing.

In the worst case, there are only 'numa node number' of threads in
contention. And in my testing, it doesn't degrade performance, but
rather improves performance due to the reduced granularity.

> I want to know why you chose to drop prep_and_add_allocated_folios()
> call in the original hugetlb_pages_alloc_boot()?

Splitting him to parallelize hugetlb_vmemmap_optimize_folios.

>> +static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>> +{
>> +    struct padata_mt_job job = {
>> +        .fn_arg        = h,
>> +        .align        = 1,
>> +        .numa_aware    = true
>> +    };
>> +
>> +    job.thread_fn    = hugetlb_alloc_node;
>> +    job.start    = 0;
>> +    job.size    = h->max_huge_pages;
>> +    job.min_chunk    = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
>> +    job.max_threads    = num_node_state(N_MEMORY) * 2;
> 
> I am curious the magic number of 2 used in assignments of ->min_chunk
> and ->max_threads, does it from your experiment? I thinke it should
> be a comment here.
> 

This is tested and I can perform more detailed tests and provide data.

> And I am also sceptical about the optimization for a small amount of
> allocation of hugepages. Given 4 hugepags needed to be allocated on UMA
> system, job.min_chunk will be 2, job.max_threads will be 2. Then, 2
> workers will be scheduled, however each worker will just allocate 2 pages,
> how much the cost of scheduling? What if allocate 4 pages in single
> worker? Do you have any numbers on parallelism vs non-parallelism in
> a small allocation case? If we cannot gain from this case, I think we shold
> assign a reasonable value to ->min_chunk based on experiment.
> 
> Thanks.
>

That's a good suggestion, I'll run some tests and choose the best
values.