Message-ID: <d108a8ce-8919-459d-aeca-dfa75cab54e7@arm.com>
Date: Thu, 11 Dec 2025 15:28:56 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: "Vishal Moola (Oracle)" <vishal.moola@...il.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Uladzislau Rezki <urezki@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator
On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>> Hi Vishal,
>>
>>
>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>> allocator. Rather than making requests to the buddy allocator for at
>>> most 100 pages at a time, we can eagerly request large order pages a
>>> smaller number of times.
>>>
>>> We still split the large order pages down to order-0 as the rest of the
>>> vmalloc code (and some callers) depend on it. We still defer to the bulk
>>> allocator and fallback path in case of order-0 pages or failure.
>>>
>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>
>>> 1000 2mb allocations:
>>>         [Baseline]        [This patch]
>>>         real 46.310s      real 0m34.582
>>>         user 0.001s       user 0.006s
>>>         sys 46.058s       sys 0m34.365s
>>>
>>> 10000 200kb allocations:
>>>         [Baseline]        [This patch]
>>>         real 56.104s      real 0m43.696
>>>         user 0.001s       user 0.003s
>>>         sys 55.375s       sys 0m42.995s
>>
>> I'm seeing some big vmalloc micro benchmark regressions on arm64, for which
>> bisect is pointing to this patch.
>
> Ulad had similar findings/concerns[1]. TL;DR: the numbers you are seeing
> are expected for how the test module is currently written.
Hmm... simplistically, I'd say that either the tests are bad, in which case they
should be deleted, or they are good, in which case we shouldn't ignore the
regressions. Having tests that we learn to ignore is the worst of both worlds.
But I see your point about the allocation pattern not being very realistic.
>
>> The tests are all originally from the vmalloc_test module. Note that (R)
>> indicates a statistically significant regression and (I) indicates a
>> statistically significant improvement.
>>
>> p is number of pages in the allocation, h is huge. So it looks like the
>> regressions are all coming for the non-huge case, where we want to split to
>> order-0.
>>
>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>> | Benchmark | Result Class | 6-18-0 | 6-18-0-gc2f2b01b74be |
>> +=================================+==========================================================+============+========================+
>> | micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec) | 514126.58 | (R) -42.20% |
>> | | fix_size_alloc_test: p:1, h:0, l:500000 (usec) | 320458.33 | -0.02% |
>> | | fix_size_alloc_test: p:4, h:0, l:500000 (usec) | 399680.33 | (R) -23.43% |
>> | | fix_size_alloc_test: p:16, h:0, l:500000 (usec) | 788723.25 | (R) -23.66% |
>> | | fix_size_alloc_test: p:16, h:1, l:500000 (usec) | 979839.58 | -1.05% |
>> | | fix_size_alloc_test: p:64, h:0, l:100000 (usec) | 481454.58 | (R) -23.99% |
>> | | fix_size_alloc_test: p:64, h:1, l:100000 (usec) | 615924.00 | (I) 2.56% |
>> | | fix_size_alloc_test: p:256, h:0, l:100000 (usec) | 1799224.08 | (R) -23.28% |
>> | | fix_size_alloc_test: p:256, h:1, l:100000 (usec) | 2313859.25 | (I) 3.43% |
>> | | fix_size_alloc_test: p:512, h:0, l:100000 (usec) | 3541904.75 | (R) -23.86% |
>> | | fix_size_alloc_test: p:512, h:1, l:100000 (usec) | 3597577.25 | (R) -2.97% |
>> | | full_fit_alloc_test: p:1, h:0, l:500000 (usec) | 487021.83 | (I) 4.95% |
>> | | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 344466.33 | -0.65% |
>> | | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 342484.25 | -1.58% |
>> | | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) | 4034901.17 | (R) -25.35% |
>> | | pcpu_alloc_test: p:1, h:0, l:500000 (usec) | 195973.42 | 0.57% |
>> | | random_size_align_alloc_test: p:1, h:0, l:500000 (usec) | 643489.33 | (R) -47.63% |
>> | | random_size_alloc_test: p:1, h:0, l:500000 (usec) | 2029261.33 | (R) -27.88% |
>> | | vm_map_ram_test: p:1, h:0, l:500000 (usec) | 83557.08 | -0.22% |
>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>
>> I have a couple of thoughts from looking at the patch:
>>
>> - Perhaps split_page() is the bulk of the cost? Previously for this case we
>> were allocating order-0 so there was no split to do. For h=1, split would
>> have already been called so that would explain why no regression for that
>> case?
>
> For h=1, this patch shouldn't change anything (as long as nr_pages <
> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see regressions
> in those cases.
arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages >= 16 we
can take the huge path.
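To spell out the arithmetic behind that threshold, here is a minimal userspace
sketch. It assumes 4K base pages (implied by 64K corresponding to 16 pages, not
stated explicitly), and the macro and function names are hypothetical
illustrations, not kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4K base pages and a 64K contiguous-mapping size. */
#define BASE_PAGE_SHIFT 12u
#define CONT_MAP_SHIFT  16u

/* Smallest number of base pages that fills one contiguous mapping. */
static unsigned long min_pages_for_cont_mapping(unsigned int page_shift,
                                                unsigned int cont_shift)
{
    assert(cont_shift >= page_shift);
    return 1ul << (cont_shift - page_shift);
}

/*
 * Hypothetical predicate: an allocation is eligible for the huge path only
 * if huge mappings were requested (h:1) and it is large enough to fill at
 * least one contiguous mapping.
 */
static int may_take_huge_path(unsigned long nr_pages, int huge_requested)
{
    return huge_requested &&
           nr_pages >= min_pages_for_cont_mapping(BASE_PAGE_SHIFT,
                                                  CONT_MAP_SHIFT);
}
```

Under these assumptions, the h:1 rows with p:16 and above would be eligible for
the huge path, while any h:0 row would not, regardless of size.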
>
>> - I guess we are bypassing the pcpu cache? Could this be having an effect? Dev
>> (cc'ed) did some similar investigation a while back and saw increased vmalloc
>> latencies when bypassing pcpu cache.
>
> I'd say this is more a case of this test module targeting the pcpu
> cache. The module allocates then frees one at a time, which promotes
> reusing pcpu pages. [1] has some numbers after modifying the test such
> that all the allocations are made before freeing any.
OK fair enough.
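For reference, the two loop shapes being contrasted can be sketched in userspace
C. This is only an illustration: malloc()/free() stand in for vmalloc()/vfree(),
the function names are hypothetical, and it is not the code from the
vmalloc_test module or from [1]:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Shape A: allocate one buffer, free it, repeat.
 * Returns the number of successful allocations.
 */
static size_t alloc_free_interleaved(size_t iterations, size_t size)
{
    size_t ok = 0;

    assert(size > 0);
    for (size_t i = 0; i < iterations; i++) {
        void *p = malloc(size);      /* stand-in for vmalloc() */
        if (!p)
            break;
        ok++;
        free(p);                     /* stand-in for vfree() */
    }
    return ok;
}

/*
 * Shape B: make every allocation first, then free them all.
 * Returns the number of successful allocations.
 */
static size_t alloc_all_then_free(size_t iterations, size_t size)
{
    void **ptrs;
    size_t n = 0;

    assert(size > 0);
    ptrs = calloc(iterations, sizeof(*ptrs));
    if (!ptrs)
        return 0;

    for (; n < iterations; n++) {
        ptrs[n] = malloc(size);
        if (!ptrs[n])
            break;
    }
    for (size_t i = 0; i < n; i++)
        free(ptrs[i]);
    free(ptrs);
    return n;
}
```

In the first shape each freed page is immediately available to the next
allocation, which is the reuse pattern described above; in the second, every
allocation stays live until the end, so nothing can be reused during the
allocation phase.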
We are seeing a bunch of other regressions in higher-level benchmarks too, but
haven't yet concluded what's causing those. I'll report back if this patch looks
connected.
Thanks,
Ryan
>
>> - Philosophically is allocating physically contiguous memory when it is not
>> strictly needed the right thing to do? Large physically contiguous blocks are
>> a scarce resource so we don't want to waste them. Although I guess it could
>> be argued that this actually preserves the contiguous blocks because the
>> lifetime of all the pages is tied together. Anyway, I doubt this is the
>
> This was the primary incentive for this patch :)
>
>> reason for the slow down, since those benchmarks are not under memory
>> pressure.
>>
>> Anyway, it would be good to resolve the performance regressions if we can.
>
> Imo, the appropriate way to address these is to modify the test module
> as seen in [1].
>
> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/