Message-ID: <787639a1e6a27c0f3b0e3ae658e1b8e7@kenip.in>
Date: Tue, 01 Jul 2025 17:45:51 +0530
From: siddhartha@...ip.in
To: Dev Jain <dev.jain@....com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

On 2025-07-01 12:28, Dev Jain wrote:
> On 01/07/25 12:20 pm, Lorenzo Stoakes wrote:
>> On Tue, Jul 01, 2025 at 12:00:21PM +0530, Dev Jain wrote:
>>> On 01/07/25 11:23 am, Lorenzo Stoakes wrote:
>>>> On Tue, Jul 01, 2025 at 11:15:25AM +0530, Dev Jain wrote:
>>>>> Sorry, I am not following; I don't know the VMA merge stuff in
>>>>> detail. Are you saying that after the patch, the VMAs will
>>>>> eventually get merged? Is it possible in the kernel to get a merge
>>>>> in the "future"? As I understand it, merging only happens at
>>>>> mmap() time.
>>>>> 
>>>>> Suppose before the patch, you have two consecutive VMAs of size
>>>>> between (PMD, 2*PMD). If they are able to get merged after the
>>>>> patch, why won't they be merged before the patch, since the VMA
>>>>> characteristics are the same?
>>>>> 
>>>>> 
>>>> Rik's patch aligned each to a 2 MiB boundary. So you'd get gaps:
>>>>
>>>>
>>>>     0            2MB          4MB          6MB          8MB          10MB
>>>>     |------------.--|         |------------.--|         |------------.--|
>>>>     |            .  |         |            .  |         |            .  |
>>>>     |            .  |         |            .  |         |            .  |
>>>>     |------------.--|         |------------.--|         |------------.--|
>>>>      huge mapped  4k m'd
>>> The effort to draw this is appreciated!
>>> 
>>> I understood the alignment, what I am asking is this:
>>> 
>>> In __get_unmapped_area(), we will return a THP-aligned addr from
>>> thp_get_unmapped_area_vmflags(). Now for the diagram you have
>>> drawn, suppose that before the patch, we first mmap() the
>>> 8MB-start chunk. Then we mmap() the 4MB-start chunk.
>>> We go to __mmap_region(), and we see that the 8MB-start chunk
>>> has mergeable characteristics, so we merge. So the gap goes away?
>> No, because there's a gap; we only merge immediately adjacent VMAs.
>> And obviously gaps mean page tables wouldn't be adjacent either...
> 
> Ah shoot. That is prev->vm_end == vmg->start in can_vma_merge_left(). 
> Thanks.
> 
>> 
>> get_unmapped_area() would have otherwise given adjacent mappings.
>> Vlasta's patch means in this case we no longer bother trying to
>> align these because their _length_ isn't PMD-aligned.

Hi Lorenzo, Dev, all,

Thank you for raising excellent points. I'll respond to each in order,
to clarify the mechanics and the relevance of this behavior to AI
inference workloads.

🧩 1. Does the patch cause VMAs to be merged eventually?
You're correct: VMA merging only happens at mmap() time (via 
__mmap_region()). What the patch affects is the behavior of 
thp_get_unmapped_area_vmflags() before the mmap is placed.

Before the patch (with Rik's logic):

Every anonymous mmap() of at least PMD size returned an address rounded
up to the next 2MB boundary, regardless of whether the requested length
was a multiple of 2MB.

Result: even consecutive mmap()s (e.g., 3MB + 3MB) were placed
non-adjacently, so merging was impossible, even when their VMA flags
matched.

After this patch:

If the allocation's length is not a multiple of PMD size, the returned
address is not forcibly aligned, increasing the likelihood that the
next mmap() lands directly after the previous one → enabling merging.

So, to be clear: this patch doesn't cause merging, but it prevents
unnecessary pre-mmap gaps, which previously blocked merges from ever
happening.
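
For reference, here is a sketch of the relevant condition in
__get_unmapped_area() (mm/mmap.c). This is paraphrased from the patch
rather than quoted verbatim, so the exact context may differ:

	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
		   && !addr			/* no address hint */
		   && IS_ALIGNED(len, PMD_SIZE)) {	/* new condition */
		/* Only PMD-multiple lengths get THP alignment. */
		addr = thp_get_unmapped_area_vmflags(file, addr, len,
						     pgoff, flags, vm_flags);
	}

Lengths that aren't a 2MB multiple now fall through to the regular
unmapped-area search, which hands out adjacent addresses.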

📐 2. Why aren’t the VMAs mergeable before the patch?
Great question. Even if the VMA flags are identical, gaps introduced by 
forced alignment from get_unmapped_area() break the precondition for 
merging:

can_vma_merge_left()
  → return prev->vm_end == vmg->start

With Rik's patch in place:

Suppose you mmap() 3MB → it gets placed on a 2MB boundary, say at 2MB,
ending at 5MB.

The next 3MB mmap() → gets aligned to the next free 2MB boundary, 6MB.
→ The kernel sees: prev->vm_end = 5MB, vmg->start = 6MB
→ No merge

With this patch, non-PMD-multiple lengths don't get forcibly aligned,
so a consecutive mmap() often falls exactly after the previous one, and
merging becomes possible again.
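
If it helps to see this from userspace, here is a minimal sketch
(assumptions: x86-64 with 2MB PMDs; addresses will vary) that maps two
3MB anonymous regions and reports whether they ended up adjacent. On a
patched kernel they typically do, and /proc/self/maps then shows them
as a single merged VMA:

	#include <stdio.h>
	#include <sys/mman.h>

	#define MB (1024UL * 1024UL)

	int main(void)
	{
		size_t len = 3 * MB;	/* in (PMD, 2*PMD): not a 2MB multiple */
		void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (a == MAP_FAILED || b == MAP_FAILED)
			return 1;

		printf("a: %p .. %p\n", a, (void *)((char *)a + len));
		printf("b: %p .. %p\n", b, (void *)((char *)b + len));
		/* mmap() usually allocates top-down, so check both orders. */
		printf("adjacent: %s\n",
		       ((char *)a + len == b || (char *)b + len == a) ?
		       "yes" : "no");
		return 0;
	}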

🤖 3. How does this impact AI workloads like Hugging Face Transformers?
Tokenization and dynamic batching create non-deterministic memory 
allocation patterns:

Models like BERT and T5 dynamically allocate intermediate buffers per 
token-length, batch size, and attention window.

Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, 
often 512KB–1.8MB.

These allocations come in bursts, but due to forced alignment the
kernel was placing the larger ones with artificial gaps, defeating THP
eligibility entirely.

By not force-aligning mappings whose length is not a PMD multiple, we
avoid injecting gaps. The result is that:

a. VMAs remain adjacent → mergeable

b. Physical memory is contiguous → eligible for khugepaged collapse

c. THP utilization increases → fewer TLB misses → lower latency → higher 
throughput
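
As a rough way to observe (a), one can issue a burst of anonymous
mmap()s in the affected (PMD, 2*PMD) size range and count VMAs before
and after; note the 2.5MB size below is illustrative, not taken from a
real inference run:

	#include <stdio.h>
	#include <sys/mman.h>

	#define MB (1024UL * 1024UL)

	/* One line per VMA in /proc/self/maps. */
	static int count_vmas(void)
	{
		FILE *f = fopen("/proc/self/maps", "r");
		int c, lines = 0;

		if (!f)
			return -1;
		while ((c = fgetc(f)) != EOF)
			if (c == '\n')
				lines++;
		fclose(f);
		return lines;
	}

	int main(void)
	{
		int before = count_vmas();

		for (int i = 0; i < 8; i++)	/* 8 x 2.5MB bursts */
			mmap(NULL, 5 * MB / 2, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* With merging, far fewer than 8 new VMAs appear. */
		printf("VMAs: %d -> %d\n", before, count_vmas());
		return 0;
	}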

💡 4. Why this patch complements Rik’s rather than contradicts it:

Rik's patch made it easier to guarantee alignment for workloads that
benefit from huge pages, but at the cost of breaking coalescence in
workloads whose mapping lengths are not PMD multiples, like ML
inference.

This patch simply refines that logic:

If the length is PMD-aligned → keep alignment

If it’s not → don’t inject alignment gaps that block merging

So, for workloads that can’t benefit from THP due to misalignment, this 
patch removes artificial fragmentation without harming the original 
intent.
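
Concretely, that rule is just a power-of-two alignment check; a tiny
standalone illustration (assuming 2MB PMDs, as on x86-64):

	#include <stdio.h>

	#define PMD_SIZE	 (2UL << 20)	/* 2MB on x86-64 */
	#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

	int main(void)
	{
		unsigned long lens[] = { 2UL << 20, 3UL << 20, 4UL << 20 };

		for (int i = 0; i < 3; i++)
			printf("%lu MB -> %s\n", lens[i] >> 20,
			       IS_ALIGNED(lens[i], PMD_SIZE) ?
			       "THP-aligned" : "left unaligned (mergeable)");
		return 0;
	}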

⚙️ 5. mTHP note
Although this patch doesn't target mTHP directly, I believe a similar
logic tweak could apply there too, especially for shmem-backed
workloads (common in model servers using shared tensor memory). I'd be
happy to help test any changes proposed there and report the results.

Thanks again for the detailed discussion. Let me know if you'd like a
trace or VMA map from a benchmarked Hugging Face run (happy to generate
one locally).

Best Regards,
Siddhartha Sharma
+91 9015185601
