Message-ID: <3f4285cf-adb8-4fc5-ad18-c3a0d6de4db0@arm.com>
Date: Thu, 11 Dec 2025 21:08:07 +0530
From: Dev Jain <dev.jain@....com>
To: Ryan Roberts <ryan.roberts@....com>,
"Vishal Moola (Oracle)" <vishal.moola@...il.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Uladzislau Rezki <urezki@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] mm/vmalloc: request large order pages from buddy allocator

On 11/12/25 9:05 pm, Dev Jain wrote:
>
> On 11/12/25 8:58 pm, Ryan Roberts wrote:
>> On 10/12/2025 22:28, Vishal Moola (Oracle) wrote:
>>> On Wed, Dec 10, 2025 at 01:21:22PM +0000, Ryan Roberts wrote:
>>>> Hi Vishal,
>>>>
>>>>
>>>> On 21/10/2025 20:44, Vishal Moola (Oracle) wrote:
>>>>> Sometimes, vm_area_alloc_pages() will want many pages from the buddy
>>>>> allocator. Rather than making requests to the buddy allocator for at
>>>>> most 100 pages at a time, we can eagerly request large order pages a
>>>>> smaller number of times.
>>>>>
>>>>> We still split the large order pages down to order-0 as the rest of
>>>>> the vmalloc code (and some callers) depends on it. We still defer
>>>>> to the bulk allocator and fallback path in case of order-0 pages or
>>>>> failure.
>>>>>
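(For anyone joining the thread here, the approach under discussion is
roughly the following - my own sketch, not the actual patch, with error
handling and the order-0 bulk-allocator fallback omitted:)

	while (nr_remaining) {
		unsigned int order = min(MAX_PAGE_ORDER,
					 ilog2(nr_remaining));
		struct page *page = alloc_pages(gfp, order);

		if (!page)
			break;	/* fall back to the bulk allocator */

		/* The rest of vmalloc expects order-0 pages. */
		split_page(page, order);
		for (i = 0; i < (1U << order); i++)
			pages[nr_allocated++] = page + i;
		nr_remaining -= 1U << order;
	}
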
>>>>> Running 1000 iterations of allocations on a small 4GB system finds:
>>>>>
>>>>> 1000 2MB allocations:
>>>>>          [Baseline]   [This patch]
>>>>> real     46.310s      34.582s
>>>>> user      0.001s       0.006s
>>>>> sys      46.058s      34.365s
>>>>>
>>>>> 10000 200KB allocations:
>>>>>          [Baseline]   [This patch]
>>>>> real     56.104s      43.696s
>>>>> user      0.001s       0.003s
>>>>> sys      55.375s      42.995s
>>>> I'm seeing some big vmalloc micro-benchmark regressions on arm64,
>>>> for which bisect is pointing to this patch.
>>> Ulad had similar findings/concerns [1]. TL;DR: the numbers you are
>>> seeing are expected given how the test module is currently written.
>> Hmm... simplistically, I'd say that either the tests are bad, in which
>> case they should be deleted, or they are good, in which case we
>> shouldn't ignore the regressions. Having tests that we learn to ignore
>> is the worst of both worlds.
>
> AFAICR the test does some million-odd iterations by default, which is
> the real problem. On my RFC [1] I noticed that reducing the iterations
> reduces the regression - up to some multiple of ten thousand
> iterations, the regression is zero. Doing this alloc->free cycle a
> million times messes up the buddy allocator badly.
>
> [1] https://lore.kernel.org/all/20251112110807.69958-1-dev.jain@arm.com/
So this line:

__param(int, test_loop_count, 1000000,
	"Set test loop counter");

We should just change it to 20k or something and that should resolve it.
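Something like this (untested, against lib/test_vmalloc.c):

--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ ... @@
-__param(int, test_loop_count, 1000000,
-	"Set test loop counter");
+__param(int, test_loop_count, 20000,
+	"Set test loop counter");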
>
>>
>> But I see your point about the allocation pattern not being very
>> realistic.
>>
>>>> The tests are all originally from the vmalloc_test module. Note that
>>>> (R) indicates a statistically significant regression and (I)
>>>> indicates a statistically significant improvement.
>>>>
>>>> p is the number of pages in the allocation, h is huge. So it looks
>>>> like the regressions are all coming from the non-huge case, where we
>>>> want to split to order-0.
>>>>
>>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>> | Benchmark                       | Result Class                                             | 6-18-0     | 6-18-0-gc2f2b01b74be   |
>>>> +=================================+==========================================================+============+========================+
>>>> | micromm/vmalloc                 | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          | 514126.58  | (R) -42.20%            |
>>>> |                                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           | 320458.33  | -0.02%                 |
>>>> |                                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | 399680.33  | (R) -23.43%            |
>>>> |                                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | 788723.25  | (R) -23.66%            |
>>>> |                                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          | 979839.58  | -1.05%                 |
>>>> |                                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          | 481454.58  | (R) -23.99%            |
>>>> |                                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          | 615924.00  | (I) 2.56%              |
>>>> |                                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         | 1799224.08 | (R) -23.28%            |
>>>> |                                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2313859.25 | (I) 3.43%              |
>>>> |                                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         | 3541904.75 | (R) -23.86%            |
>>>> |                                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         | 3597577.25 | (R) -2.97%             |
>>>> |                                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           | 487021.83  | (I) 4.95%              |
>>>> |                                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 344466.33  | -0.65%                 |
>>>> |                                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 342484.25  | -1.58%                 |
>>>> |                                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     | 4034901.17 | (R) -25.35%            |
>>>> |                                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               | 195973.42  | 0.57%                  |
>>>> |                                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 643489.33  | (R) -47.63%            |
>>>> |                                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        | 2029261.33 | (R) -27.88%            |
>>>> |                                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               | 83557.08   | -0.22%                 |
>>>> +---------------------------------+----------------------------------------------------------+------------+------------------------+
>>>>
>>>> I have a couple of thoughts from looking at the patch:
>>>>
>>>> - Perhaps split_page() is the bulk of the cost? Previously for this
>>>>   case we were allocating order-0 so there was no split to do. For
>>>>   h=1, split would have already been called, so that would explain
>>>>   why there is no regression for that case?
>>> For h=1, this patch shouldn't change anything (as long as nr_pages <
>>> arch_vmap_{pte,pmd}_supported_shift). This is why you don't see
>>> regressions in those cases.
>> arm64 supports 64K contiguous mappings with vmalloc, so once nr_pages
>> reaches 16 we can take the huge path.
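For reference, the arm64 hook is roughly this (quoting from memory;
with a 4K granule CONT_PTE_SIZE is 64K, i.e. 16 pages):

	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		/* 16 order-0 pages can be mapped as one contig-PTE block. */
		if (size >= CONT_PTE_SIZE)
			return CONT_PTE_SHIFT;

		return PAGE_SHIFT;
	}
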
>>
>>>> - I guess we are bypassing the pcpu cache? Could this be having an
>>>>   effect? Dev (cc'ed) did some similar investigation a while back
>>>>   and saw increased vmalloc latencies when bypassing the pcpu cache.
>>> I'd say this is more a case of this test module targeting the pcpu
>>> cache. The module allocates then frees one at a time, which promotes
>>> reusing pcpu pages. [1] has some numbers after modifying the test
>>> such that all the allocations are made before freeing any.
>> OK fair enough.
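FWIW, the test change in [1] is essentially to batch the two phases,
something like (illustrative only):

	/* Before: each iteration's pages go straight back to the
	 * pcpu lists and get handed out again on the next alloc. */
	for (i = 0; i < test_loop_count; i++) {
		ptr = vmalloc(size);
		vfree(ptr);
	}

	/* After: allocate everything first, then free everything, so
	 * the pcpu cache cannot immediately recycle each page. */
	for (i = 0; i < test_loop_count; i++)
		ptrs[i] = vmalloc(size);
	for (i = 0; i < test_loop_count; i++)
		vfree(ptrs[i]);
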
>>
>> We are seeing a bunch of other regressions in higher-level benchmarks
>> too, but haven't yet concluded what's causing those. I'll report back
>> if this patch looks connected.
>>
>> Thanks,
>> Ryan
>>
>>
>>>> - Philosophically, is allocating physically contiguous memory when
>>>>   it is not strictly needed the right thing to do? Large physically
>>>>   contiguous blocks are a scarce resource so we don't want to waste
>>>>   them. Although I guess it could be argued that this actually
>>>>   preserves the contiguous blocks because the lifetime of all the
>>>>   pages is tied together. Anyway, I doubt this is the
>>> This was the primary incentive for this patch :)
>>>
>>>> reason for the slow down, since those benchmarks are not under
>>>> memory pressure.
>>>>
>>>> Anyway, it would be good to resolve the performance regressions if
>>>> we can.
>>> Imo, the appropriate way to address these is to modify the test module
>>> as seen in [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/aPJ6lLf24TfW_1n7@milan/
>>
>