Message-ID: <415ab4bb-3c7a-47f1-937a-5b324d761f64@arm.com>
Date: Mon, 30 Jun 2025 10:58:27 +0530
From: Dev Jain <dev.jain@....com>
To: siddhartha@...ip.in
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 30/06/25 10:55 am, Dev Jain wrote:
>
> On 30/06/25 6:13 am, siddhartha@...ip.in wrote:
>> On 2025-06-28 09:19, Dev Jain wrote:
>>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>>> +cc Vlasta
>>>>
>>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@...ip.in wrote:
>>>>> Hi all,
>>>>>
>>>>> I wanted to share validation data from a Hugging Face-based AI
>>>>> inferencing
>>>>> workload,
>>>>> which was significantly impacted by the THP alignment logic
>>>>> introduced in
>>>>> commit efa7df3e3bb5.
>>>>>
>>>>> Using transformer models with dynamic input lengths on Intel Xeon
>>>>> (Cooper
>>>>> Lake),
>>>>> we observed up to a 3200% throughput improvement after applying
>>>>> the patch
>>>>> from Oct 2024:
>>>>>
>>>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>>> All congratulations are owed to Vlastimil Babka for doing this,
>>>> cc'd :)
>>>>
>>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>>
>>> I was wondering how the change can get us such a big optimization - the
>>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>>> something else I am missing?
>>>
>>> I ask because, when I was reading the code, I was wondering whether
>>> a similar change could be done for mTHPs.
>>>
>>>>
>>>>> Metrics:
>>>>> - Model: BERT-base
>>>>> - Inference engine: Transformers + ONNX Runtime
>>>>> - Kernel: 6.6 vs patched 6.6.8
>>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>>> - Metric: inference throughput (samples/sec)
>>>>>
>>>>> Thanks for the fix -- this change had real impact on a
>>>>> production-relevant
>>>>> workload.
>>>>>
>>>>> Best Regards,
>>>>> Siddhartha Sharma
>>>>> ISV @ Kenip
>>>>> Solution Link:
>>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>>
>>
>> Hi Dev Jain,
>>
>> Thank you for reviewing and for your thoughtful question.
>>
>> You're absolutely right that, in isolation, gaining one additional
>> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
>> (Hugging Face inference workloads with dynamic input sizes and many
>> allocations), the original PMD alignment logic caused a cascade of
>> side effects.
>>
>> The improvement comes from how that alignment interacts with the
>> dynamic memory allocation patterns of AI inference workloads,
>> especially those built on frameworks like Hugging Face Transformers.
>>
>> In our specific use case, the workloads were running on Intel
>> Developer Cloud, but I no longer have access to that particular
>> environment or the original profiling output. However, I’d like to
>> highlight why this patch had such an outsized effect:
>>
>> 🔹 1. Fragmentation Avoidance
>> In model shard loading (e.g., large BERT or GPT2 models split into
>> multiple memory segments), many medium-sized anonymous allocations
>> occur in rapid succession. These workloads tend to allocate many 512
>> KB – 1.5 MB buffers dynamically (token buffers, intermediate
>> tensors). Aligning each one to a PMD boundary, even when its length
>> wasn't PMD-aligned, left gaps between them, defeating natural
>> coalescing into a single THP.
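>>
>> As a rough stand-alone illustration (made-up buffer count and size,
>> not the actual inference code), a loop like the one below shows
>> where the kernel places such odd-sized anonymous mappings; comparing
>> the printed ranges on a kernel with and without the patch makes the
>> 2 MB gaps visible directly:
>>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> #define NBUFS  4
>> #define BUF_SZ (3UL << 19)  /* 1.5 MB, deliberately not 2 MB-sized */
>>
>> int main(void)
>> {
>>         for (int i = 0; i < NBUFS; i++) {
>>                 void *p = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
>>                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>                 if (p == MAP_FAILED) {
>>                         perror("mmap");
>>                         return 1;
>>                 }
>>                 /* With forced PMD alignment, consecutive buffers
>>                  * start on 2 MB boundaries and leave ~0.5 MB holes;
>>                  * on a patched kernel they can be packed. */
>>                 printf("buf %d: %p - %p\n", i, p,
>>                        (void *)((char *)p + BUF_SZ));
>>         }
>>         return 0;
>> }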
>>
>> 🔹 2. TLB aliasing and cache index pressure
>>
>> These fragmented mappings caused frequent TLB misses and poor L1/L2
>> cache reuse.
>>
>> The result was what looked like “memory thrashing,” with slow memory
>> access dominating total inference time.
>> When every mapping is PMD-aligned (even if not PMD-sized), the gaps
>> between them prevent Transparent Huge Pages (THPs) from activating
>> effectively.
>>
>> This breaks THP coalescence and causes fragmented page tables and
>> higher memory overhead per shard.
>>
>> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
>> This leads to higher TLB miss rates, especially under multi-threaded
>> load, which dramatically slows down token embedding and attention
>> calculations.
>>
>> When loading model shards, memory initialization becomes
>> cache-unfriendly, with poor reuse across cores.
>>
>> This affects not only inference latency but also model cold-start
>> time — which is critical in autoscaling deployments.
>>
>> 🔹 4. Qualitative Observation
>> Without this patch: shard loading stuttered, warm-up was slow, and we
>> saw CPU cycles dominated by page_fault and TLB miss handlers.
>>
>> With this patch: shard loading smoothed out, THPs were correctly
>> applied (based on smaps), and throughput shot up by an order of
>> magnitude.
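>>
>> For reference, this is roughly how the smaps check can be scripted
>> (a minimal sketch, not the exact tooling we used; the real check was
>> against /proc/<pid>/smaps of the inference process):
>>
>> #include <stdio.h>
>>
>> /* Sum the AnonHugePages: lines of /proc/self/smaps to see how much
>>  * of the process is actually backed by THPs. */
>> static long anon_huge_kb(void)
>> {
>>         FILE *f = fopen("/proc/self/smaps", "r");
>>         char line[256];
>>         long total = 0, kb;
>>
>>         if (!f)
>>                 return -1;
>>         while (fgets(line, sizeof(line), f)) {
>>                 if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
>>                         total += kb;
>>         }
>>         fclose(f);
>>         return total;
>> }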
>>
>> 🔹 5. Measured Impact
>> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
>> non-aligned sizes showed 11–32× worse performance.
>>
>> With the patched kernel (which skips alignment unless the length is
>> PMD-aligned), memory layout was contiguous again and THP was
>> consistently utilized.
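>>
>> My (possibly simplified) reading of the patched behaviour is that
>> the PMD-aligned placement is now only attempted when the requested
>> length is itself a multiple of PMD_SIZE, roughly:
>>
>> /* Paraphrase of my reading of the fix, not a verbatim quote of the
>>  * commit: only try the THP-aligned placement when the length can
>>  * actually be backed by whole PMD-sized huge pages. */
>> if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
>>     && !addr                       /* caller gave no address hint  */
>>     && IS_ALIGNED(len, PMD_SIZE))  /* whole huge pages possible    */
>>         addr = thp_get_unmapped_area(...);  /* aligned placement   */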
>>
>> This isn’t about one extra THP — it’s about preventing widespread THP
>> fragmentation and the resulting dramatic cache/TLB degradation. For
>> AI workloads with high concurrency and dynamic shapes, this small
>> patch has a massive effect on layout and locality.
>>
>> So, it's not just “1 more huge page” — it's avoiding massive
>> fragmentation that leads to:
>>
>> 1. TLB miss storms
>>
>> 2. Poor locality
>>
>> 3. Cache index thrashing
>>
>> 4. Degraded latency and throughput
>>
>> This applies across many adjacent, odd-length allocations typical of
>> AI inference workloads.
>>
>> The original alignment logic created a pattern of broken contiguity —
>> defeating THP benefits altogether.
>>
>> In AI workloads using Hugging Face Transformers, model shards and
>> intermediate tensors are dynamically allocated during inference.
>> These allocations often fall just below or above the 2MB threshold
>> that THP relies on. Misalignment or forced alignment to PMD
>> boundaries causes fragmentation and disrupts huge page coalescence,
>> affecting performance.
>>
>> 📊 Memory Allocation Pattern Diagram
>>
>> Without Patch (PMD Alignment Forced):
>>
>> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
>> | Alloc A | | Alloc B | | Alloc C |
>>
>> Each allocation is PMD-aligned, even if it’s not PMD-sized
>>
>> Gaps prevent THP coalescence → TLB/cache fragmentation
>>
>> With Patch (PMD Alignment Conditional):
>>
>> |<---------6MB Contiguous Region--------->|
>> | Alloc A | Alloc B | Alloc C | Padding |
>>
>> Contiguous anonymous memory region
>>
>> Coalesced into one or more THPs
>>
>> Improved locality and TLB efficiency
>>
>> While I regret not having the raw perf output at hand, I’d be happy
>> to replicate a similar test locally and share reproducible results if
>> helpful.
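>>
>> A skeleton of what that replication could look like (buffer sizes,
>> iteration counts and thread layout would still need to match the
>> real workload) is to time the first-touch pass over the odd-sized
>> buffers from the earlier sketch and record anon_huge_kb() alongside:
>>
>> #include <string.h>
>> #include <time.h>
>>
>> /* Time a first-touch pass over one buffer; on the unpatched kernel
>>  * this is where the page-fault and TLB-miss cost shows up. */
>> static double touch_ms(void *buf, size_t len)
>> {
>>         struct timespec a, b;
>>
>>         clock_gettime(CLOCK_MONOTONIC, &a);
>>         memset(buf, 0, len);               /* fault every page in */
>>         clock_gettime(CLOCK_MONOTONIC, &b);
>>         return (b.tv_sec - a.tv_sec) * 1e3 +
>>                (b.tv_nsec - a.tv_nsec) / 1e6;
>> }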
>>
>> Best Regards,
>>
>> Siddhartha Sharma
>
> Thanks for your detailed explanation! I had misunderstood and thought
> the optimization you were talking about was due to efa7df3e3bb5,
> when it was actually due to the alignment fix. Your explanation makes
> a lot of sense!
>
> For this workload, do you enable mTHPs on your system? My plan is to
> make a similar patch for the mTHP case, and I'd be grateful if you
> could get me some results : )
Oh I see that you are using the 6.6 kernel, which probably won't have
the mTHP patches.