Message-ID: <d8ffe547-5516-43e5-9f33-56b2698a0b4f@arm.com>
Date: Mon, 30 Jun 2025 10:55:52 +0530
From: Dev Jain <dev.jain@....com>
To: siddhartha@...ip.in
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 30/06/25 6:13 am, siddhartha@...ip.in wrote:
> On 2025-06-28 09:19, Dev Jain wrote:
>> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>>> +cc Vlastimil
>>>
>>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@...ip.in wrote:
>>>> Hi all,
>>>>
>>>> I wanted to share validation data from a Hugging Face-based AI
>>>> inferencing
>>>> workload,
>>>> which was significantly impacted by the THP alignment logic
>>>> introduced in
>>>> commit efa7df3e3bb5.
>>>>
>>>> Using transformer models with dynamic input lengths on Intel Xeon
>>>> (Cooper
>>>> Lake),
>>>> we observed up to a 3200% throughput improvement after applying the
>>>> patch
>>>> from Oct 2024:
>>>>
>>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>>> All congratulations are owed to Vlastimil Babka for doing this, cc'd :)
>>>
>>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>>
>> I was wondering how the change can get us such a big optimization - the
>> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
>> something else I am missing?
>>
>> I ask because when I was reading the code I was thinking whether a
>> similar
>> change can be done for mTHPs.
>>
>>>
>>>> Metrics:
>>>> - Model: BERT-base
>>>> - Inference engine: Transformers + ONNX Runtime
>>>> - Kernel: 6.6 vs patched 6.6.8
>>>> - Batch size: 8-32, input length: 64-512 tokens
>>>> - Metric: inference throughput (samples/sec)
>>>>
>>>> Thanks for the fix -- this change had real impact on a
>>>> production-relevant
>>>> workload.
>>>>
>>>> Best Regards,
>>>> Siddhartha Sharma
>>>> ISV @ Kenip
>>>> Solution Link:
>>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>>
>
> Hi Dev Jain,
>
> Thank you for reviewing and for your thoughtful question.
>
> You're absolutely right that, in isolation, gaining one additional
> PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
> (Hugging Face inference workloads with dynamic input sizes and many
> allocations), the original PMD alignment logic caused a cascade of
> side effects. The improvement comes from how that logic interacts
> with the dynamic memory allocation patterns of AI inference
> frameworks such as Hugging Face Transformers.
>
> In our specific use case, the workloads were running on Intel
> Developer Cloud, but I no longer have access to that particular
> environment or the original profiling output. However, I’d like to
> highlight why this patch had such an outsized effect:
>
> 🔹 1. Fragmentation Avoidance
> During model shard loading (e.g., large BERT or GPT-2 models split
> into multiple memory segments), many medium-sized anonymous
> allocations occur in rapid succession. These workloads tend to
> allocate many 512 KB – 1.5 MB buffers dynamically (token buffers,
> intermediate tensors). Aligning each one to a PMD boundary, even when
> its length wasn't a PMD multiple, left gaps between them, defeating
> natural coalescing into a single THP.
>
> 🔹 2. TLB aliasing and cache index pressure
>
> These fragmented mappings caused frequent TLB misses and poor L1/L2
> cache reuse. The result looked like “memory thrashing,” with slow
> memory access dominating total inference time.
>
> When every mapping is PMD-aligned (even if not PMD-sized), the gaps
> between mappings prevent Transparent Huge Pages (THPs) from forming
> effectively. This breaks THP coalescence and causes fragmented page
> tables and higher memory overhead per shard.
>
> 🔹 3. Latency & Throughput Penalty from Memory Misalignment
> This leads to higher TLB miss rates, especially under multi-threaded
> load, which dramatically slows down token embedding and attention
> calculations.
>
> When loading model shards, memory initialization becomes
> cache-unfriendly, with poor reuse across cores.
>
> This affects not only inference latency but also model cold-start time
> — which is critical in autoscaling deployments.
>
> 🔹 4. Qualitative Observation
> Without this patch: shard loading stuttered, warm-up was slow, and we
> saw CPU cycles dominated by page_fault and TLB miss handlers.
>
> With this patch: shard loading smoothed out, THPs were correctly
> applied (based on smaps), and throughput shot up by an order of
> magnitude.
>
> 🔹 5. Measured Impact
> On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
> non-aligned sizes showed 11–32× worse performance.
>
> With the patched kernel (which skips alignment unless the length is
> PMD-aligned), memory layout was contiguous again and THP was
> consistently utilized.
>
> This isn’t about one extra THP — it’s about preventing widespread THP
> fragmentation and the resulting dramatic cache/TLB degradation. For AI
> workloads with high concurrency and dynamic shapes, this small patch
> has a massive effect on layout and locality.
>
> So, it's not just “1 more huge page” — it's avoiding massive
> fragmentation that leads to:
>
> 1. TLB miss storms
>
> 2. Poor locality
>
> 3. Cache index thrashing
>
> 4. Degraded latency and throughput
>
> This applies across many adjacent, odd-length allocations typical of
> AI inference workloads.
>
> The original alignment logic created a pattern of broken contiguity —
> defeating THP benefits altogether.
>
> In AI workloads using Hugging Face Transformers, model shards and
> intermediate tensors are dynamically allocated during inference. These
> allocations often fall just below or above the 2MB threshold that THP
> relies on. Misalignment or forced alignment to PMD boundaries causes
> fragmentation and disrupts huge page coalescence, affecting performance.
>
> 📊 Memory Allocation Pattern Diagram
>
> Without Patch (PMD Alignment Forced):
>
> |<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
> | Alloc A | | Alloc B | | Alloc C |
>
> Each allocation is PMD-aligned, even if it’s not PMD-sized
>
> Gaps prevent THP coalescence → TLB/cache fragmentation
>
> With Patch (PMD Alignment Conditional):
>
> |<---------6MB Contiguous Region--------->|
> | Alloc A | Alloc B | Alloc C | Padding |
>
> Contiguous anonymous memory region
>
> Coalesced into one or more THPs
>
> Improved locality and TLB efficiency
>
> While I regret not having the raw perf output at hand, I’d be happy to
> replicate a similar test locally and share reproducible results if
> helpful.
>
> Best Regards,
>
> Siddhartha Sharma
Thanks for your detailed explanation! I had misunderstood: I thought the
optimization you were describing was due to efa7df3e3bb5 itself, whereas
it was actually due to the alignment behaviour. Your explanation makes a
lot of sense!
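
As a sanity check of that explanation, here is a toy placement model
(pure illustration, not kernel code: the 2 MiB PMD size is real, but the
buffer sizes and the simplified back-to-back placement policy are
assumptions) contrasting unconditional PMD alignment with the post-patch
conditional alignment:

```python
PMD = 2 << 20  # 2 MiB, the x86-64 PMD (huge page) size

def align_up(addr, align):
    return (addr + align - 1) // align * align

def place(sizes, align_unconditionally):
    """Place allocations back to back, aligning each start to a PMD
    boundary either always (pre-patch behaviour) or only when the
    length is itself a PMD multiple (post-patch behaviour)."""
    spans, cur = [], 0
    for size in sizes:
        if align_unconditionally or size % PMD == 0:
            cur = align_up(cur, PMD)
        spans.append((cur, cur + size))
        cur += size
    return spans

def thp_eligible_blocks(spans):
    """Count PMD-aligned, PMD-sized blocks fully covered by the
    mappings, merging spans that happen to be contiguous."""
    merged = []
    for start, end in sorted(spans):
        if merged and start == merged[-1][1]:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return sum((end - align_up(start, PMD)) // PMD
               for start, end in merged
               if end > align_up(start, PMD))

# Three 1.5 MiB buffers, as in the shard-loading pattern described above:
sizes = [3 * (1 << 20) // 2] * 3
print(thp_eligible_blocks(place(sizes, True)))   # pre-patch:  0
print(thp_eligible_blocks(place(sizes, False)))  # post-patch: 2
```

With unconditional alignment, each 1.5 MiB buffer straddles no complete
2 MiB block, so nothing is THP-eligible; packed contiguously, the same
buffers cover two full PMD blocks, which matches the fragmentation
argument above.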
For this workload, do you enable mTHPs on your system? My plan is to
make a similar patch for the mTHP case, and I'd be grateful if you could
get me some results : )
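
For gathering those results, one low-effort way to quantify THP uptake
is to sum the AnonHugePages fields of /proc/<pid>/smaps. A minimal
helper sketch (the field name and kB units follow the standard Linux
smaps format):

```python
import re

def anon_huge_kb(smaps_text):
    """Sum the AnonHugePages fields (in kB) across all mappings
    in smaps-format text."""
    return sum(int(kb) for kb in
               re.findall(r"^AnonHugePages:\s+(\d+) kB", smaps_text, re.M))

# Example on a live process (Linux only):
# with open(f"/proc/{pid}/smaps") as f:
#     print(anon_huge_kb(f.read()), "kB of anonymous huge pages")
```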