Message-ID: <787639a1e6a27c0f3b0e3ae658e1b8e7@kenip.in>
Date: Tue, 01 Jul 2025 17:45:51 +0530
From: siddhartha@...ip.in
To: Dev Jain <dev.jain@....com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 2025-07-01 12:28, Dev Jain wrote:
> On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>>> Sorry I am not following, don't know in detail about the VMA merge
>>>>> stuff.
>>>>> Are you saying that after the patch, the VMAs will eventually get
>>>>> merged?
>>>>> Is it possible in the kernel to get a merge in the "future"; as I
>>>>> understand
>>>>> it only happens at mmap() time?
>>>>>
>>>>> Suppose before the patch, you have two consecutive VMAs between
>>>>> (PMD, 2*PMD) size.
>>>>> If they are able to get merged after the patch, why won't they be
>>>>> merged before the patch,
>>>>> since the VMA characteristics are the same?
>>>>>
>>>>>
>>>> Rik's patch aligned each to 2 MiB boundary. So you'd get gaps:
>>>>
>>>>
>>>> 0            2MB       4MB          6MB       8MB          10MB
>>>> |-------------.------| |-------------.------| |-------------.------|
>>>> |             .      | |             .      | |             .      |
>>>> |             .      | |             .      | |             .      |
>>>> |-------------.------| |-------------.------| |-------------.------|
>>>>   huge mapped  4k m'd
>>> The effort to draw this is appreciated!
>>>
>>> I understood the alignment, what I am asking is this:
>>>
>>> In __get_unmapped_area(), we will return a THP-aligned addr from
>>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>>> drawn, suppose that before the patch, we first mmap() the
>>> 8MB-start chunk. Then we mmap the 4MB start chunk.
>>> We go to __mmap_region(), and we see that the 8MB-start chunk
>>> has mergeable characteristics, so we merge. So the gap goes away?
>> No, because there's a gap; we only merge immediately adjacent VMAs. And
>> obviously gaps mean page tables wouldn't be adjacent either...
>
> Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left().
> Thanks.
>
>>
>> The get_unmapped_area() would have otherwise given adjacent mappings.
>> Vlasta's patch means in this case we no longer bother trying to align
>> these because their _length_ isn't PMD aligned.
Hi Lorenzo, Dev, all,
Thank you for raising excellent points. I’ll respond to each in turn to
clarify the mechanics and why this behavior matters for AI inference
workloads.
🧩 1. Does the patch cause VMAs to be merged eventually?
You're correct: VMA merging only happens at mmap() time (via
__mmap_region()). What the patch affects is the behavior of
thp_get_unmapped_area_vmflags() before the mmap is placed.
Before the patch (with Rik’s logic):
Every anonymous mmap() of at least 2MB returned an address rounded up to
the next 2MB boundary, regardless of whether the requested length was a
multiple of 2MB.
Result: consecutive mmap()s of, say, 3.5MB each end up non-adjacent, so
merging is impossible even if their VMA flags match.
After this patch:
If the allocation is not PMD-aligned in size, the returned address is
not forcibly aligned, increasing the likelihood that the next mmap()
lands directly after the previous one → enabling merging.
So, to be clear: this patch doesn’t cause merging by itself; it prevents
the unnecessary pre-mmap gaps that previously made merging impossible. A
toy model of the placement rule follows below.
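To make the difference concrete, here is a rough toy model of the two
placement rules (plain userspace C, not the kernel code; the place()
helper and the bottom-up cursor are invented purely for illustration,
standing in for the real unmapped-area search behind
thp_get_unmapped_area_vmflags()):

/* Toy model only -- NOT kernel code. A simple bottom-up cursor stands
 * in for the real unmapped-area search, just to show where gaps come
 * from when a non-2MB-multiple length gets force-aligned.
 */
#include <stdio.h>

#define MB        (1024UL * 1024UL)
#define PMD_SIZE  (2 * MB)

/* old rule: align every request of at least PMD_SIZE;
 * new rule: align only when len is a multiple of PMD_SIZE */
static unsigned long place(unsigned long *cursor, unsigned long len,
			   int align_any_large_request)
{
	unsigned long addr = *cursor;

	if (len >= PMD_SIZE &&
	    (align_any_large_request || len % PMD_SIZE == 0))
		addr = (addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1);

	*cursor = addr + len;
	return addr;
}

int main(void)
{
	unsigned long len = 3 * MB + 512 * 1024;	/* 3.5MB request */
	unsigned long cur, a, b;

	cur = 0;
	a = place(&cur, len, 1);
	b = place(&cur, len, 1);
	printf("old rule: %lu KB, %lu KB -> gap: %s\n",
	       a / 1024, b / 1024, b == a + len ? "no" : "yes");

	cur = 0;
	a = place(&cur, len, 0);
	b = place(&cur, len, 0);
	printf("new rule: %lu KB, %lu KB -> gap: %s\n",
	       a / 1024, b / 1024, b == a + len ? "no" : "yes");
	return 0;
}

With a 3.5MB request, the old rule puts the second mapping at 4096KB
(leaving a 512KB hole), while the new rule places it at 3584KB, flush
against the first.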
📐 2. Why aren’t the VMAs mergeable before the patch?
Great question. Even if the VMA flags are identical, gaps introduced by
forced alignment from get_unmapped_area() break the precondition for
merging:
can_vma_merge_left():
    prev->vm_end == vmg->start
With Rik’s patch in place:
Suppose you mmap() 3.5MB → the start is rounded up to a 2MB boundary,
say to 2MB, so the VMA spans 2MB–5.5MB.
The next 3.5MB mmap() → start rounded up to 6MB.
→ The kernel sees: prev->vm_end = 5.5MB, vmg->start = 6MB
→ No merge
With this patch, non-PMD-multiple lengths don’t get forcibly aligned, so
consecutive mmap()s often land immediately after the previous one, and
merging becomes possible again.
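If anyone wants to observe this on a running kernel, a minimal check
could look like the sketch below (my own quick test, not part of the
patch; whether the two regions end up adjacent, and whether they appear
as a single VMA in /proc/<pid>/maps, depends on the kernel version and
the usual top-down mmap layout):

/* Maps two anonymous regions whose length is not a multiple of 2MB and
 * reports whether the kernel placed them back to back. Inspecting
 * /proc/<pid>/maps while it waits shows whether they were merged into
 * a single VMA.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 3 * 1024 * 1024 + 512 * 1024;	/* 3.5MB */
	char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (a == MAP_FAILED || b == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("a = %p, b = %p\n", (void *)a, (void *)b);
	/* with the usual top-down layout, b lands just below a when no
	 * alignment gap is injected */
	printf("adjacent: %s\n",
	       (b + len == a || a + len == b) ? "yes" : "no");

	printf("check /proc/%d/maps, then press Enter\n", (int)getpid());
	getchar();
	return 0;
}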
🤖 3. How does this impact AI workloads like Hugging Face Transformers?
Tokenization and dynamic batching create non-deterministic memory
allocation patterns:
Models like BERT and T5 dynamically allocate intermediate buffers whose
sizes depend on token length, batch size, and attention window.
Hugging Face + ONNX Runtime issues many small-ish anonymous mmap()s,
often 512KB–1.8MB.
These allocations come in bursts, but forced alignment meant the kernel
placed them with artificial gaps, defeating THP eligibility entirely.
By not force-aligning mappings whose length isn’t a PMD multiple, we
avoid injecting gaps. The result is that:
a. VMAs remain adjacent → mergeable
b. The merged VMA can cover whole 2MB-aligned ranges → eligible for
khugepaged collapse
c. THP utilization increases → fewer TLB misses → lower latency → higher
throughput (a quick way to check this is sketched below)
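One quick way to verify (c) on a live run, assuming a kernel that
exposes /proc/<pid>/smaps_rollup, is to watch the AnonHugePages counter
for the inference process; a small sketch:

/* Prints the AnonHugePages total for the current process from
 * /proc/self/smaps_rollup. A value that grows after khugepaged has had
 * time to run indicates that collapses into huge pages are happening.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/smaps_rollup", "r");

	if (!f) {
		perror("smaps_rollup");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "AnonHugePages:", 14))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

For an external process one can read /proc/<pid>/smaps_rollup instead,
or compare AnonHugePages in /proc/meminfo before and after the benchmark.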
💡 4. Why this patch complements Rik’s rather than contradicts it:
Rik's patch made it easier to guarantee alignment for workloads that
benefit from explicit huge pages, but at the cost of breaking coalescence
for workloads with non-PMD-sized mappings, such as ML inference.
This patch simply refines that logic:
If the length is PMD-aligned → keep alignment
If it’s not → don’t inject alignment gaps that block merging
So, for workloads that can’t benefit from THP due to misalignment, this
patch removes artificial fragmentation without harming the original
intent.
⚙️ 5. mTHP note
Although this patch doesn’t target mTHP directly, I believe a similar
logic tweak could apply there too, especially for shmem-backed workloads
(common in model servers using shared tensor memory). I’d be happy to
help test any changes proposed there and report the results.
Thanks again for the detailed discussion. Let me know if you’d like a
trace or VMA map from a Hugging Face benchmark run (happy to generate
one locally).
Best Regards,
Siddhartha Sharma
+91 9015185601