Message-ID: <dd370f92e9100e785aeafdc4d31f8cb5@kenip.in>
Date: Tue, 01 Jul 2025 18:53:47 +0530
From: siddhartha@...ip.in
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Dev Jain <dev.jain@....com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 2025-07-01 18:09, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@...ip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
>
> [...]
>
>>
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
>
> [...]
>
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
>
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
>
> There's errors in things you say here too.
>
> With respect, please don't do this.
>
> (I'm the co-maintainer of pretty much all the relevant code here and
> wrote the VMA merge logic you're referring to.)
>
>>
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory
>> allocation patterns:
>>
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>>
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
>> often 512KB–1.8MB.
>>
>> These allocations come in bursts — but due to forced alignment, the
>> kernel was placing them with artificial gaps, defeating THP
>> eligibility entirely.
>>
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps.
>> The result is that:
>>
>> a. VMAs remain adjacent → mergeable
>>
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>
>> c. THP utilization increases → fewer TLB misses → lower latency →
>> higher throughput
>>
>
> This is very useful information and it's appreciated! Let's not drown
> this out with restatements of stuff already covered.
>
>>
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar
>> logic tweak could apply there too — especially with shmem-backed
>> workloads (common in model servers using shared tensor memory). I’d be
>> happy to help test any changes proposed there to derive the consequent
>> results.
>
> Dev - could we hold off on any effort to do something like this until
> I've had a chance to refactor THP somewhat? This is already a mess and
> I'd like to avoid us piling on more complexity.
>
> We can revisit this at a later stage.
>
>>
>> Thanks again for the detailed discussion. Let me know if you’d like a
>> trace or VMA map from a Hugging Face benchmarked run (happy to
>> generate one locally).
>>
>
> Thanks! Much appreciated.
>
> Cheers, Lorenzo
Hi Lorenzo,
Thanks for your clarification, and I appreciate your patience —
especially given your role in maintaining and designing the VMA merge
logic.
I understand now that my earlier reply largely restated your own
explanation of VMA adjacency, and I regret the unintentional
repetition.
I’ll make sure to be more careful and direct going forward.
As for the THP alignment condition now being `IS_ALIGNED(len,
PMD_SIZE)`: I agree this resolves the regressions by skipping forced
alignment for mappings whose length is not a PMD multiple, which is
exactly the case that hurt workloads like cactusBSSN and some AI
inference loads.
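If useful, below is the kind of minimal userspace sketch I was planning
to run locally to eyeball placement (my own test idea, not part of the
patch or of any kernel code; the 1.5 MB length and the 8-mapping burst
are arbitrary stand-ins for the small allocations described above):

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#define NMAPS 8
	#define LEN   (3UL << 19)	/* 1.5 MB: deliberately not a 2 MB PMD multiple */

	int main(void)
	{
		char cmd[64];

		/* Map a burst of small anonymous regions, like an inference run would. */
		for (int i = 0; i < NMAPS; i++) {
			void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				return 1;
			}
			printf("mapping %d placed at %p\n", i, p);
		}

		/* Adjacent placements should show up as one merged anonymous VMA. */
		snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", (int)getpid());
		system(cmd);
		return 0;
	}

On a kernel with the IS_ALIGNED(len, PMD_SIZE) gating, I'd expect these
mappings to land back to back and appear as a single merged anonymous
VMA in /proc/<pid>/maps rather than being separated by alignment gaps.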
Thanks again for the guidance — I’m learning a lot from this thread.
Best Regards,
Siddhartha Sharma