[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3ee2e7fea6f263aa884e3e715632b09f@kenip.in>
Date: Mon, 30 Jun 2025 06:13:28 +0530
From: siddhartha@...ip.in
To: Dev Jain <dev.jain@....com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 2025-06-28 09:19, Dev Jain wrote:
> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>> +cc Vlata
>>
>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@...ip.in wrote:
>>> Hi all,
>>>
>>> I wanted to share validation data from a Hugging Face-based AI
>>> inferencing
>>> workload,
>>> which was significantly impacted by the THP alignment logic
>>> introduced in
>>> commit efa7df3e3bb5.
>>>
>>> Using transformer models with dynamic input lengths on Intel Xeon
>>> (Cooper
>>> Lake),
>>> we observed up to a 3200% throughput improvement after applying the
>>> patch
>>> from Oct 2024:
>>>
>>> mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>> All congratulations are owed to Vlastimil Babka for doing this, cc'd
>> :)
>>
>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
>
> I was wondering how the change can get us such a big optimization - the
> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
> something else I am missing?
>
> I ask because when I was reading the code I was thinking whether a
> similar
> change can be done for mTHPs.
>
>>
>>> Metrics:
>>> - Model: BERT-base
>>> - Inference engine: Transformers + ONNX Runtime
>>> - Kernel: 6.6 vs patched 6.6.8
>>> - Batch size: 8-32, input length: 64-512 tokens
>>> - Metric: inference throughput (samples/sec)
>>>
>>> Thanks for the fix -- this change had real impact on a
>>> production-relevant
>>> workload.
>>>
>>> Best Regards,
>>> Siddhartha Sharma
>>> ISV @ Kenip
>>> Solution Link:
>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>>
Hi Dev Jain,
Thank you for reviewing and for your thoughtful question.
You're absolutely right that, in isolation, gaining one additional
PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case
(Hugging Face inference workloads with dynamic input sizes and many
allocations), the original PMD alignment logic caused a cascade of side
effects:
The performance improvement comes from how that interacts with dynamic
memory allocation patterns in AI inference workloads, especially those
using frameworks like Hugging Face Transformers.
In our specific use case, the workloads were running on Intel Developer
Cloud, but I no longer have access to that particular environment or the
original profiling output. However, I’d like to highlight why this patch
had such an outsized effect:
🔹 1. Fragmentation Avoidance
In model shard loading (e.g., large BERT or GPT2 models split into
multiple memory segments), many medium-sized anonymous allocations occur
in rapid succession. These workloads tend to allocate many 512 KB – 1.5
MB buffers dynamically (token buffers, intermediate tensors). Aligning
each one to PMD size, even when their length wasn’t PMD-aligned, led to
gaps between them — defeating natural coalescing into a single THP.
🔹 2. TLB aliasing and cache index pressure
These fragmented mappings caused frequent TLB misses and poor L1/L2
cache reuse.
The result was what looks like “memory thrashing,” with slow memory
access dominating total inference time.
When every mapping is PMD-aligned (even if not PMD-sized), the gaps
between them prevent Transparent Huge Pages (THPs) from activating
effectively.
This breaks THP coalescence and causes fragmented page tables and higher
memory overhead per shard.
🔹 3. Latency & Throughput Penalty from Memory Misalignment
This leads to higher TLB miss rates, especially under multi-threaded
load, which dramatically slows down token embedding and attention
calculations.
When loading model shards, memory initialization becomes
cache-unfriendly, with poor reuse across cores.
This affects not only inference latency but also model cold-start time —
which is critical in autoscaling deployments.
🔹 4. Qualitative Observation
Without this patch: shard loading stuttered, warm-up was slow, and we
saw CPU cycles dominated by page_fault and TLB miss handlers.
With this patch: shard loading smoothed out, THPs were correctly applied
(based on smaps), and throughput shot up by an order of magnitude.
🔹 5. Measured Impact
On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on
non-aligned sizes showed 11–32× worse performance.
With the patched kernel (which skips alignment unless the length is
PMD-aligned), memory layout was contiguous again and THP was
consistently utilized.
This isn’t about one extra THP — it’s about preventing widespread THP
fragmentation and the resulting dramatic cache/TLB degradation. For AI
workloads with high concurrency and dynamic shapes, this small patch has
a massive effect on layout and locality.
So, it's not just “1 more huge page” — it's avoiding massive
fragmentation that leads to:
1. TLB miss storms
2. Poor locality
3. Cache index thrashing
4. Improvement in latency and throughput
This applies across many adjacent, odd-length allocations typical of AI
inference workloads.
The original alignment logic created a pattern of broken contiguity —
defeating THP benefits altogether.
In AI workloads using Hugging Face Transformers, model shards and
intermediate tensors are dynamically allocated during inference. These
allocations often fall just below or above the 2MB threshold that THP
relies on. Misalignment or forced alignment to PMD boundaries causes
fragmentation and disrupts huge page coalescence, affecting performance.
📊 Memory Allocation Pattern Diagram
Without Patch (PMD Alignment Forced):
|<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
| Alloc A | | Alloc B | | Alloc C |
Each allocation is PMD-aligned, even if it’s not PMD-sized
Gaps prevent THP coalescence → TLB/cache fragmentation
With Patch (PMD Alignment Conditional):
|<---------6MB Contiguous Region--------->|
| Alloc A | Alloc B | Alloc C | Padding |
Contiguous anonymous memory region
Coalesced into one or more THPs
Improved locality and TLB efficiency
While I regret not having the raw perf output at hand, I’d be happy to
replicate a similar test locally and share reproducible results if
helpful.
Best Regards,
Siddhartha Sharma
Powered by blists - more mailing lists