linux-kernel - Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3ee2e7fea6f263aa884e3e715632b09f@kenip.in>
Date: Mon, 30 Jun 2025 06:13:28 +0530
From: siddhartha@...ip.in
To: Dev Jain <dev.jain@....com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads

On 2025-06-28 09:19, Dev Jain wrote:
> On 27/06/25 9:00 pm, Lorenzo Stoakes wrote:
>> +cc Vlata
>> 
>> On Fri, Jun 27, 2025 at 04:09:16PM +0530, siddhartha@...ip.in wrote:
>>> Hi all,
>>> 
>>> I wanted to share validation data from a Hugging Face-based AI 
>>> inferencing
>>> workload,
>>> which was significantly impacted by the THP alignment logic 
>>> introduced in
>>> commit efa7df3e3bb5.
>>> 
>>> Using transformer models with dynamic input lengths on Intel Xeon 
>>> (Cooper
>>> Lake),
>>> we observed up to a 3200% throughput improvement after applying the 
>>> patch
>>> from Oct 2024:
>>> 
>>>    mm: limit THP alignment of anonymous mappings to PMD-aligned sizes
>> All congratulations are owed to Vlastimil Babka for doing this, cc'd 
>> :)
>> 
>> I gather he enjoys novelty beer mugs as tokens of thanks ;)
> 
> I was wondering how the change can get us such a big optimization - the
> alignment causes us to gain at most 1 extra PMD-THP mapping. Is there
> something else I am missing?
> 
> I ask because when I was reading the code I was thinking whether a 
> similar
> change can be done for mTHPs.
> 
>> 
>>> Metrics:
>>> - Model: BERT-base
>>> - Inference engine: Transformers + ONNX Runtime
>>> - Kernel: 6.6 vs patched 6.6.8
>>> - Batch size: 8-32, input length: 64-512 tokens
>>> - Metric: inference throughput (samples/sec)
>>> 
>>> Thanks for the fix -- this change had real impact on a 
>>> production-relevant
>>> workload.
>>> 
>>> Best Regards,
>>> Siddhartha Sharma
>>> ISV @ Kenip
>>> Solution Link: 
>>> https://www.intel.com/content/www/us/en/partner/showcase/offering/a5bHo00000045YUIAY/deadlock-clearance.html
>>> 

Hi Dev Jain,

Thank you for reviewing and for your thoughtful question.

You're absolutely right that, in isolation, gaining one additional 
PMD-THP mapping wouldn't explain a 3200% speedup. But in our use case 
(Hugging Face inference workloads with dynamic input sizes and many 
allocations), the original PMD alignment logic caused a cascade of side 
effects:

The performance improvement comes from how that interacts with dynamic 
memory allocation patterns in AI inference workloads, especially those 
using frameworks like Hugging Face Transformers.

In our specific use case, the workloads were running on Intel Developer 
Cloud, but I no longer have access to that particular environment or the 
original profiling output. However, I’d like to highlight why this patch 
had such an outsized effect:

🔹 1. Fragmentation Avoidance
In model shard loading (e.g., large BERT or GPT2 models split into 
multiple memory segments), many medium-sized anonymous allocations occur 
in rapid succession. These workloads tend to allocate many 512 KB – 1.5 
MB buffers dynamically (token buffers, intermediate tensors). Aligning 
each one to PMD size, even when their length wasn’t PMD-aligned, led to 
gaps between them — defeating natural coalescing into a single THP.

🔹 2. TLB aliasing and cache index pressure

These fragmented mappings caused frequent TLB misses and poor L1/L2 
cache reuse.

The result was what looks like “memory thrashing,” with slow memory 
access dominating total inference time.
When every mapping is PMD-aligned (even if not PMD-sized), the gaps 
between them prevent Transparent Huge Pages (THPs) from activating 
effectively.

This breaks THP coalescence and causes fragmented page tables and higher 
memory overhead per shard.

🔹 3. Latency & Throughput Penalty from Memory Misalignment
This leads to higher TLB miss rates, especially under multi-threaded 
load, which dramatically slows down token embedding and attention 
calculations.

When loading model shards, memory initialization becomes 
cache-unfriendly, with poor reuse across cores.

This affects not only inference latency but also model cold-start time — 
which is critical in autoscaling deployments.

🔹 4. Qualitative Observation
Without this patch: shard loading stuttered, warm-up was slow, and we 
saw CPU cycles dominated by page_fault and TLB miss handlers.

With this patch: shard loading smoothed out, THPs were correctly applied 
(based on smaps), and throughput shot up by an order of magnitude.

🔹 5. Measured Impact
On Intel Xeon (Cooper Lake), a 6.6.0 kernel with PMD alignment on 
non-aligned sizes showed 11–32× worse performance.

With the patched kernel (which skips alignment unless the length is 
PMD-aligned), memory layout was contiguous again and THP was 
consistently utilized.

This isn’t about one extra THP — it’s about preventing widespread THP 
fragmentation and the resulting dramatic cache/TLB degradation. For AI 
workloads with high concurrency and dynamic shapes, this small patch has 
a massive effect on layout and locality.

So, it's not just “1 more huge page” — it's avoiding massive 
fragmentation that leads to:

1. TLB miss storms

2. Poor locality

3. Cache index thrashing

4. Improvement in latency and throughput

This applies across many adjacent, odd-length allocations typical of AI 
inference workloads.

The original alignment logic created a pattern of broken contiguity — 
defeating THP benefits altogether.

In AI workloads using Hugging Face Transformers, model shards and 
intermediate tensors are dynamically allocated during inference. These 
allocations often fall just below or above the 2MB threshold that THP 
relies on. Misalignment or forced alignment to PMD boundaries causes 
fragmentation and disrupts huge page coalescence, affecting performance.

📊 Memory Allocation Pattern Diagram

Without Patch (PMD Alignment Forced):

|<--2MB-->|<--Gap-->|<--2MB-->|<--Gap-->|<--2MB-->
| Alloc A |         | Alloc B |         | Alloc C |

Each allocation is PMD-aligned, even if it’s not PMD-sized

Gaps prevent THP coalescence → TLB/cache fragmentation

With Patch (PMD Alignment Conditional):

|<---------6MB Contiguous Region--------->|
|  Alloc A  | Alloc B | Alloc C | Padding |

Contiguous anonymous memory region

Coalesced into one or more THPs

Improved locality and TLB efficiency

While I regret not having the raw perf output at hand, I’d be happy to 
replicate a similar test locally and share reproducible results if 
helpful.

Best Regards,

Siddhartha Sharma