Message-ID: <3f4285cf-adb8-4fc5-ad18-c3a0d6de4db0@arm.com>
Date: Thu, 11 Dec 2025 21:08:07 +0530
From: Dev Jain <dev.jain@....com>
To: Ryan Roberts <ryan.roberts@....com>,
"Vishal Moola (Oracle)" <vishal.moola@...il.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Uladzislau Rezki <urezki@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

On 11/12/25 9:05 pm, Dev Jain wrote:
>
> On 11/12/25 8:58 pm, Ryan Roberts wrote:
>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>> Hi Vishal,
>>>>
>>>>
>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>> smaller number of times.
>>>>>
>>>>> We still split the large order pages down to order-0 as the rest of
>>>>> the vmalloc code (and some callers) depends on it. We still defer
>>>>> to the bulk allocator and fallback path in case of order-0 pages or
>>>>> failure.
>>>>>
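(For anyone joining the thread here, the approach under discussion is
roughly the following - my own sketch, not the actual patch, with error
handling and the order-0 bulk-allocator fallback omitted:)

	while (nr_remaining) {
		unsigned int order = min(MAX_PAGE_ORDER,
					 ilog2(nr_remaining));
		struct page *page = alloc_pages(gfp, order);

		if (!page)
			break;	/* fall back to the bulk allocator */

		/* The rest of vmalloc expects order-0 pages. */
		split_page(page, order);
		for (i = 0; i < (1U << order); i++)
			pages[nr_allocated++] = page + i;
		nr_remaining -= 1U << order;
	}
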
>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>
>>>>> 1000 2MB allocations:
>>>>>          [Baseline]   [This patch]
>>>>> real     46.310s      34.582s
>>>>> user      0.001s       0.006s
>>>>> sys      46.058s      34.365s
>>>>>
>>>>> 10000 200KB allocations:
>>>>>          [Baseline]   [This patch]
>>>>> real     56.104s      43.696s
>>>>> user      0.001s       0.003s
>>>>> sys      55.375s      42.995s
>>>> I'm seeing some big vmalloc micro-benchmark regressions on arm64,
>>>> for which bisect is pointing to this patch.
>>> Ulad had similar findings/concerns [1]. TL;DR: the numbers you are
>>> seeing are expected given how the test module is currently written.
>> Hmm... simplistically, I'd say that either the tests are bad, in which
>> case they should be deleted, or they are good, in which case we
>> shouldn't ignore the regressions. Having tests that we learn to ignore
>> is the worst of both worlds.
>
> AFAICR the test does some million-odd iterations by default, which is
> the real problem. On my RFC [1] I noticed that reducing the iterations
> reduces the regression - up to some multiple of ten thousand
> iterations, the regression is zero. Doing this alloc->free cycle a
> million times messes up the buddy allocator badly.
>
> [1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/
So this line:

__param(int, test_loop_count, 1000000,
	"Set test loop counter");

We should just change it to 20k or something and that should resolve it.
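Something like this (untested, against lib/test_vmalloc.c):

--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ ... @@
-__param(int, test_loop_count, 1000000,
-	"Set test loop counter");
+__param(int, test_loop_count, 20000,
+	"Set test loop counter");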
>
>>
>> But I see your point about the allocation pattern not being very
>> realistic.
>>
>>>> The tests are all originally from the vmalloc_test module. Note that
>>>> (R) indicates a statistically significant regression and (I)
>>>> indicates a statistically significant improvement.
>>>>
>>>> p is the number of pages in the allocation, h is huge. So it looks
>>>> like the regressions are all coming from the non-huge case, where we
>>>> want to split to order-0.
>>>>
>>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>> | Benchmark                       | Result Class                                             | 6-18-0     | 6-18-0-gc2f2b01b74be   |
>>>> +=================================+==========================================================+============+========================+
>>>> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          | 514126.58  | (R) -42.20%            |
>>>> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           | 320458.33  | -0.02%                 |
>>>> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | 399680.33  | (R) -23.43%            |
>>>> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | 788723.25  | (R) -23.66%            |
>>>> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          | 979839.58  | -1.05%                 |
>>>> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          | 481454.58  | (R) -23.99%            |
>>>> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          | 615924.00  | (I) 2.56%              |
>>>> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 | (R) -23.28%            |
>>>> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 | (I) 3.43%              |
>>>> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 | (R) -23.86%            |
>>>> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 | (R) -2.97%             |
>>>> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           | 487021.83  | (I) 4.95%              |
>>>> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 344466.33  | -0.65%                 |
>>>> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 342484.25  | -1.58%                 |
>>>> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 | (R) -25.35%            |
>>>> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               | 195973.42  | 0.57%                  |
>>>> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 643489.33  | (R) -47.63%            |
>>>> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 | (R) -27.88%            |
>>>> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               | 83557.08   | -0.22%                 |
>>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>>
>>>> I have a couple of thoughts from looking at the patch:
>>>>
>>>> - Perhaps split_page() is the bulk of the cost? Previously for this
>>>>   case we were allocating order-0 so there was no split to do. For
>>>>   h=1, split would have already been called, so that would explain
>>>>   why there is no regression for that case?
>>> For h=1, this patch shouldn't change anything (as long as nr_pages <
>>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see
>>> regressions in those cases.
>> arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages
>> reaches 16 we can take the huge path.
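For reference, the arm64 hook is roughly this (quoting from memory;
with a 4K granule CONT_PTE_SIZE is 64K, i.e. 16 pages):

	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		/* 16 order-0 pages can be mapped as one contig-PTE block. */
		if (size >= CONT_PTE_SIZE)
			return CONT_PTE_SHIFT;

		return PAGE_SHIFT;
	}
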
>>
>>>> - I guess we are bypassing the pcpu cache? Could this be having an
>>>>   effect? Dev (cc'ed) did some similar investigation a while back
>>>>   and saw increased vmalloc latencies when bypassing the pcpu cache.
>>> I'd say this is more a case of this test module targeting the pcpu
>>> cache. The module allocates then frees one at a time, which promotes
>>> reusing pcpu pages. [1] has some numbers after modifying the test
>>> such that all the allocations are made before freeing any.
>> OK fair enough.
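FWIW, the test change in [1] is essentially to batch the two phases,
something like (illustrative only):

	/* Before: each iteration's pages go straight back to the
	 * pcpu lists and get handed out again on the next alloc. */
	for (i = 0; i < test_loop_count; i++) {
		ptr = vmalloc(size);
		vfree(ptr);
	}

	/* After: allocate everything first, then free everything, so
	 * the pcpu cache cannot immediately recycle each page. */
	for (i = 0; i < test_loop_count; i++)
		ptrs[i] = vmalloc(size);
	for (i = 0; i < test_loop_count; i++)
		vfree(ptrs[i]);
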
>>
>> We are seeing a bunch of other regressions in higher-level benchmarks
>> too, but haven't yet concluded what's causing those. I'll report back
>> if this patch looks connected.
>>
>> Thanks,
>> Ryan
>>
>>
>>>> - Philosophically, is allocating physically contiguous memory when
>>>>   it is not strictly needed the right thing to do? Large physically
>>>>   contiguous blocks are a scarce resource so we don't want to waste
>>>>   them. Although I guess it could be argued that this actually
>>>>   preserves the contiguous blocks because the lifetime of all the
>>>>   pages is tied together. Anyway, I doubt this is the
>>> This was the primary incentive for this patch :)
>>>
>>>> reason for the slow down, since those benchmarks are not under
>>>> memory pressure.
>>>>
>>>> Anyway, it would be good to resolve the performance regressions if
>>>> we can.
>>> Imo, the appropriate way to address these is to modify the test module
>>> as seen in [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
>>
>