Message-ID: <d8ffe547-5516-43e5-9f33-56b2698a0b4f@arm.com>
Date: Mon, 30 Jun 2025 10:55:52 +0530
From: Dev Jain <dev.jain@....com>
To: siddhartha@...ip.in
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 30/06/25 6:13 am, siddhartha@...ip.in wrote:
> On 2025-06-28 09:19, Dev Jain wrote:
>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>> +cc Vlastimil
>>>
>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@...ip.in wrote:
>>>> Hi all,
>>>>
>>>> I wanted to share validation data from a Hugging Face-based AI
>>>> inferencing
>>>> workload,
>>>> which was significantly impacted by the THP alignment logic
>>>> introduced in
>>>> commit efa7df3e3bb5.
>>>>
>>>> Using transformer models with dynamic input lengths on Intel Xeon
>>>> (Cooper
>>>> Lake),
>>>> we observed up to a 3200% throughput improvement after applying the
>>>> patch
>>>> from Oct 2024:
>>>>
>>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>>>
>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>
>> I was wondering how the change can get us such a big optimization - the
>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>> something else I am missing?
>>
>> I ask because when I was reading the code I was thinking whether a
>> similar
>> change can be done for mTHPs.
>>
>>>
>>>> Metrics:
>>>> - Model: BERT-base
>>>> - Inference engine: Transformers + ONNX Runtime
>>>> - Kernel: 6.6 vs patched 6.6.8
>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>> - Metric: inference throughput (samples/sec)
>>>>
>>>> Thanks for the fix -- this change had real impact on a
>>>> production-relevant
>>>> workload.
>>>>
>>>> Best Regards,
>>>> Siddhartha Sharma
>>>> ISV @ Kenip
>>>> Solution Link:
>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>
>
> Hi Dev Jain,
>
> Thank you for reviewing and for your thoughtful question.
>
> You're absolutely right that, in isolation, gaining one additional
> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
> (Hugging Face inference workloads with dynamic input sizes and many
> allocations), the original PMD alignment logic caused a cascade of
> side effects. The improvement comes from how that logic interacts
> with the dynamic memory allocation patterns of AI inference
> frameworks such as Hugging Face Transformers.
>
> In our specific use case, the workloads were running on Intel
> Developer Cloud, but I no longer have access to that particular
> environment or the original profiling output. However, I’d like to
> highlight why this patch had such an outsized effect:
>
> 🔹 1. Fragmentation Avoidance
> During model shard loading (e.g., large BERT or GPT-2 models split
> into multiple memory segments), many medium-sized anonymous
> allocations occur in rapid succession. These workloads tend to
> allocate many 512 KB – 1.5 MB buffers dynamically (token buffers,
> intermediate tensors). Aligning each one to a PMD boundary, even when
> its length wasn't a PMD multiple, left gaps between them, defeating
> natural coalescing into a single THP.
>
> 🔹 2. TLB aliasing and cache index pressure
>
> These fragmented mappings caused frequent TLB misses and poor L1/L2
> cache reuse. The result looked like “memory thrashing,” with slow
> memory access dominating total inference time.
>
> When every mapping is PMD-aligned (even if not PMD-sized), the gaps
> between mappings prevent Transparent Huge Pages (THPs) from forming
> effectively. This breaks THP coalescence and causes fragmented page
> tables and higher memory overhead per shard.
>
> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
> This leads to higher TLB miss rates, especially under multi-threaded
> load, which dramatically slows down token embedding and attention
> calculations.
>
> When loading model shards, memory initialization becomes
> cache-unfriendly, with poor reuse across cores.
>
> This affects not only inference latency but also model cold-start time
> — which is critical in autoscaling deployments.
>
> 🔹 4. Qualitative Observation
> Without this patch: shard loading stuttered, warm-up was slow, and we
> saw CPU cycles dominated by page_fault and TLB miss handlers.
>
> With this patch: shard loading smoothed out, THPs were correctly
> applied (based on smaps), and throughput shot up by an order of
> magnitude.
>
> 🔹 5. Measured Impact
> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
> non-aligned sizes showed 11–32× worse performance.
>
> With the patched kernel (which skips alignment unless the length is
> PMD-aligned), memory layout was contiguous again and THP was
> consistently utilized.
>
> This isn’t about one extra THP — it’s about preventing widespread THP
> fragmentation and the resulting dramatic cache/TLB degradation. For AI
> workloads with high concurrency and dynamic shapes, this small patch
> has a massive effect on layout and locality.
>
> So, it's not just “1 more huge page” — it's avoiding massive
> fragmentation that leads to:
>
> 1. TLB miss storms
>
> 2. Poor locality
>
> 3. Cache index thrashing
>
> 4. Degraded latency and throughput
>
> This applies across many adjacent, odd-length allocations typical of
> AI inference workloads.
>
> The original alignment logic created a pattern of broken contiguity —
> defeating THP benefits altogether.
>
> In AI workloads using Hugging Face Transformers, model shards and
> intermediate tensors are dynamically allocated during inference. These
> allocations often fall just below or above the 2MB threshold that THP
> relies on. Misalignment or forced alignment to PMD boundaries causes
> fragmentation and disrupts huge page coalescence, affecting performance.
>
> 📊 Memory Allocation Pattern Diagram
>
> Without Patch (PMD Alignment Forced):
>
> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
> | Alloc A | | Alloc B | | Alloc C |
>
> Each allocation is PMD-aligned, even if it’s not PMD-sized
>
> Gaps prevent THP coalescence → TLB/cache fragmentation
>
> With Patch (PMD Alignment Conditional):
>
> |<---------6MB Contiguous Region--------->|
> | Alloc A | Alloc B | Alloc C | Padding |
>
> Contiguous anonymous memory region
>
> Coalesced into one or more THPs
>
> Improved locality and TLB efficiency
>
> While I regret not having the raw perf output at hand, I’d be happy to
> replicate a similar test locally and share reproducible results if
> helpful.
>
> Best Regards,
>
> Siddhartha Sharma
Thanks for your detailed explanation! I had misunderstood: I thought the
optimization you were describing was due to efa7df3e3bb5 itself, whereas
it was actually due to the alignment behaviour. Your explanation makes a
lot of sense!
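
As a sanity check of that explanation, here is a toy placement model
(pure illustration, not kernel code: the 2 MiB PMD size is real, but the
buffer sizes and the simplified back-to-back placement policy are
assumptions) contrasting unconditional PMD alignment with the post-patch
conditional alignment:

```python
PMD = 2 << 20  # 2 MiB, the x86-64 PMD (huge page) size

def align_up(addr, align):
    return (addr + align - 1) // align * align

def place(sizes, align_unconditionally):
    """Place allocations back to back, aligning each start to a PMD
    boundary either always (pre-patch behaviour) or only when the
    length is itself a PMD multiple (post-patch behaviour)."""
    spans, cur = [], 0
    for size in sizes:
        if align_unconditionally or size % PMD == 0:
            cur = align_up(cur, PMD)
        spans.append((cur, cur + size))
        cur += size
    return spans

def thp_eligible_blocks(spans):
    """Count PMD-aligned, PMD-sized blocks fully covered by the
    mappings, merging spans that happen to be contiguous."""
    merged = []
    for start, end in sorted(spans):
        if merged and start == merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return sum((end - align_up(start, PMD)) // PMD
               for start, end in merged
               if end > align_up(start, PMD))

# Three 1.5 MiB buffers, as in the shard-loading pattern described above:
sizes = [3 * (1 << 20) // 2] * 3
print(thp_eligible_blocks(place(sizes, True)))   # pre-patch:  0
print(thp_eligible_blocks(place(sizes, False)))  # post-patch: 2
```

With unconditional alignment, each 1.5 MiB buffer straddles no complete
2 MiB block, so nothing is THP-eligible; packed contiguously, the same
buffers cover two full PMD blocks, which matches the fragmentation
argument above.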
For this workload, do you enable mTHPs on your system? My plan is to
make a similar patch for the mTHP case, and I'd be grateful if you could
get me some results : )
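
For gathering those results, one low-effort way to quantify THP uptake
is to sum the AnonHugePages fields of /proc/<pid>/smaps. A minimal
helper sketch (the field name and kB units follow the standard Linux
smaps format):

```python
import re

def anon_huge_kb(smaps_text):
    """Sum the AnonHugePages fields (in kB) across all mappings
    in smaps-format text."""
    return sum(int(kb) for kb in
               re.findall(r"^AnonHugePages:\s+(\d+) kB", smaps_text, re.M))

# Example on a live process (Linux only):
# with open(f"/proc/{pid}/smaps") as f:
#     print(anon_huge_kb(f.read()), "kB of anonymous huge pages")
```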