Message-ID: <415ab4bb-3c7a-47f1-937a-5b324d761f64@arm.com>
Date: Mon, 30 Jun 2025 10:58:27 +0530
From: Dev Jain <dev.jain@....com>
To: siddhartha@...ip.in
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 30/06/25 10:55 am, Dev Jain wrote:
>
> On 30/06/25 6:13 am, siddhartha@...ip.in wrote:
>> On 2025-06-28 09:19, Dev Jain wrote:
>>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>>> +cc Vlasta
>>>>
>>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@...ip.in wrote:
>>>>> Hi all,
>>>>>
>>>>> I wanted to share validation data from a Hugging Face-based AI
>>>>> inferencing
>>>>> workload,
>>>>> which was significantly impacted by the THP alignment logic
>>>>> introduced in
>>>>> commit efa7df3e3bb5.
>>>>>
>>>>> Using transformer models with dynamic input lengths on Intel Xeon
>>>>> (Cooper
>>>>> Lake),
>>>>> we observed up to a 3200% throughput improvement after applying
>>>>> the patch
>>>>> from Oct 2024:
>>>>>
>>>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>>> All congratulations are owed to Vlastimil Babka for doing this,
>>>> cc'd :)
>>>>
>>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>>
>>> I was wondering how the change can get us such a big optimization - the
>>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>>> something else I am missing?
>>>
>>> I ask because, when I was reading the code, I was wondering whether
>>> a similar change could be done for mTHPs.
>>>
>>>>
>>>>> Metrics:
>>>>> - Model: BERT-base
>>>>> - Inference engine: Transformers + ONNX Runtime
>>>>> - Kernel: 6.6 vs patched 6.6.8
>>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>>> - Metric: inference throughput (samples/sec)
>>>>>
>>>>> Thanks for the fix -- this change had real impact on a
>>>>> production-relevant
>>>>> workload.
>>>>>
>>>>> Best Regards,
>>>>> Siddhartha Sharma
>>>>> ISV @ Kenip
>>>>> Solution Link:
>>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>>
>>
>> Hi Dev Jain,
>>
>> Thank you for reviewing and for your thoughtful question.
>>
>> You're absolutely right that, in isolation, gaining one additional
>> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
>> (Hugging Face inference workloads with dynamic input sizes and many
>> allocations), the original PMD alignment logic caused a cascade of
>> side effects.
>>
>> The improvement comes from how that alignment interacts with the
>> dynamic memory allocation patterns of AI inference workloads,
>> especially those built on frameworks like Hugging Face Transformers.
>>
>> In our specific use case, the workloads were running on Intel
>> Developer Cloud, but I no longer have access to that particular
>> environment or the original profiling output. However, I’d like to
>> highlight why this patch had such an outsized effect:
>>
>> 🔹 1. Fragmentation Avoidance
>> In model shard loading (e.g., large BERT or GPT2 models split into
>> multiple memory segments), many medium-sized anonymous allocations
>> occur in rapid succession. These workloads tend to allocate many 512
>> KB – 1.5 MB buffers dynamically (token buffers, intermediate
>> tensors). Aligning each one to a PMD boundary, even when its length
>> wasn't PMD-aligned, left gaps between them, defeating natural
>> coalescing into a single THP.
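>>
>> As a rough stand-alone illustration (made-up buffer count and size,
>> not the actual inference code), a loop like the one below shows
>> where the kernel places such odd-sized anonymous mappings; comparing
>> the printed ranges on a kernel with and without the patch makes the
>> 2 MB gaps visible directly:
>>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> #define NBUFS  4
>> #define BUF_SZ (3UL << 19)  /* 1.5 MB, deliberately not 2 MB-sized */
>>
>> int main(void)
>> {
>>         for (int i = 0; i < NBUFS; i++) {
>>                 void *p = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
>>                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>                 if (p == MAP_FAILED) {
>>                         perror("mmap");
>>                         return 1;
>>                 }
>>                 /* With forced PMD alignment, consecutive buffers
>>                  * start on 2 MB boundaries and leave ~0.5 MB holes;
>>                  * on a patched kernel they can be packed. */
>>                 printf("buf %d: %p - %p\n", i, p,
>>                        (void *)((char *)p + BUF_SZ));
>>         }
>>         return 0;
>> }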
>>
>> 🔹 2. TLB aliasing and cache index pressure
>>
>> These fragmented mappings caused frequent TLB misses and poor L1/L2
>> cache reuse.
>>
>> The result was what looked like “memory thrashing,” with slow memory
>> access dominating total inference time.
>> When every mapping is PMD-aligned (even if not PMD-sized), the gaps
>> between them prevent Transparent Huge Pages (THPs) from activating
>> effectively.
>>
>> This breaks THP coalescence and causes fragmented page tables and
>> higher memory overhead per shard.
>>
>> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
>> This leads to higher TLB miss rates, especially under multi-threaded
>> load, which dramatically slows down token embedding and attention
>> calculations.
>>
>> When loading model shards, memory initialization becomes
>> cache-unfriendly, with poor reuse across cores.
>>
>> This affects not only inference latency but also model cold-start
>> time — which is critical in autoscaling deployments.
>>
>> 🔹 4. Qualitative Observation
>> Without this patch: shard loading stuttered, warm-up was slow, and we
>> saw CPU cycles dominated by page_fault and TLB miss handlers.
>>
>> With this patch: shard loading smoothed out, THPs were correctly
>> applied (based on smaps), and throughput shot up by an order of
>> magnitude.
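>>
>> For reference, this is roughly how the smaps check can be scripted
>> (a minimal sketch, not the exact tooling we used; the real check was
>> against /proc/<pid>/smaps of the inference process):
>>
>> #include <stdio.h>
>>
>> /* Sum the AnonHugePages: lines of /proc/self/smaps to see how much
>>  * of the process is actually backed by THPs. */
>> static long anon_huge_kb(void)
>> {
>>         FILE *f = fopen("/proc/self/smaps", "r");
>>         char line[256];
>>         long total = 0, kb;
>>
>>         if (!f)
>>                 return -1;
>>         while (fgets(line, sizeof(line), f)) {
>>                 if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
>>                         total += kb;
>>         }
>>         fclose(f);
>>         return total;
>> }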
>>
>> 🔹 5. Measured Impact
>> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
>> non-aligned sizes showed 11–32× worse performance.
>>
>> With the patched kernel (which skips alignment unless the length is
>> PMD-aligned), memory layout was contiguous again and THP was
>> consistently utilized.
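>>
>> My (possibly simplified) reading of the patched behaviour is that
>> the PMD-aligned placement is now only attempted when the requested
>> length is itself a multiple of PMD_SIZE, roughly:
>>
>> /* Paraphrase of my reading of the fix, not a verbatim quote of the
>>  * commit: only try the THP-aligned placement when the length can
>>  * actually be backed by whole PMD-sized huge pages. */
>> if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
>>     && !addr                       /* caller gave no address hint  */
>>     && IS_ALIGNED(len, PMD_SIZE))  /* whole huge pages possible    */
>>         addr = thp_get_unmapped_area(...);  /* aligned placement   */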
>>
>> This isn’t about one extra THP — it’s about preventing widespread THP
>> fragmentation and the resulting dramatic cache/TLB degradation. For
>> AI workloads with high concurrency and dynamic shapes, this small
>> patch has a massive effect on layout and locality.
>>
>> So, it's not just “1 more huge page” — it's avoiding massive
>> fragmentation that leads to:
>>
>> 1. TLB miss storms
>>
>> 2. Poor locality
>>
>> 3. Cache index thrashing
>>
>> 4. Degraded latency and throughput
>>
>> This applies across many adjacent, odd-length allocations typical of
>> AI inference workloads.
>>
>> The original alignment logic created a pattern of broken contiguity —
>> defeating THP benefits altogether.
>>
>> In AI workloads using Hugging Face Transformers, model shards and
>> intermediate tensors are dynamically allocated during inference.
>> These allocations often fall just below or above the 2MB threshold
>> that THP relies on. Misalignment or forced alignment to PMD
>> boundaries causes fragmentation and disrupts huge page coalescence,
>> affecting performance.
>>
>> 📊 Memory Allocation Pattern Diagram
>>
>> Without Patch (PMD Alignment Forced):
>>
>> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
>> | Alloc A | | Alloc B | | Alloc C |
>>
>> Each allocation is PMD-aligned, even if it’s not PMD-sized
>>
>> Gaps prevent THP coalescence → TLB/cache fragmentation
>>
>> With Patch (PMD Alignment Conditional):
>>
>> |<---------6MB Contiguous Region--------->|
>> | Alloc A | Alloc B | Alloc C | Padding |
>>
>> Contiguous anonymous memory region
>>
>> Coalesced into one or more THPs
>>
>> Improved locality and TLB efficiency
>>
>> While I regret not having the raw perf output at hand, I’d be happy
>> to replicate a similar test locally and share reproducible results if
>> helpful.
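>>
>> A skeleton of what that replication could look like (buffer sizes,
>> iteration counts and thread layout would still need to match the
>> real workload) is to time the first-touch pass over the odd-sized
>> buffers from the earlier sketch and record anon_huge_kb() alongside:
>>
>> #include <string.h>
>> #include <time.h>
>>
>> /* Time a first-touch pass over one buffer; on the unpatched kernel
>>  * this is where the page-fault and TLB-miss cost shows up. */
>> static double touch_ms(void *buf, size_t len)
>> {
>>         struct timespec a, b;
>>
>>         clock_gettime(CLOCK_MONOTONIC, &a);
>>         memset(buf, 0, len);               /* fault every page in */
>>         clock_gettime(CLOCK_MONOTONIC, &b);
>>         return (b.tv_sec - a.tv_sec) * 1e3 +
>>                (b.tv_nsec - a.tv_nsec) / 1e6;
>> }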
>>
>> Best Regards,
>>
>> Siddhartha Sharma
>
> Thanks for your detailed explanation! I had misunderstood and thought
> the optimization you were talking about was due to efa7df3e3bb5,
> when it was actually due to the alignment fix. Your explanation makes
> a lot of sense!
>
> For this workload, do you enable mTHPs on your system? My plan is to
> make a similar patch for the mTHP case, and I'd be grateful if you
> could get me some results : )
Oh I see that you are using the 6.6 kernel, which probably won't have
the mTHP patches.