Message-ID: <dd370f92e9100e785aeafdc4d31f8cb5@kenip.in>
Date: Tue, 01 Jul 2025 18:53:47 +0530
From: siddhartha@...ip.in
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Dev Jain <dev.jain@....com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads
On 2025-07-01 18:09, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@...ip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
>
> [...]
>
>>
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
>
> [...]
>
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
>
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
>
> There's errors in things you say here too.
>
> With respect, please don't do this.
>
> (I'm the co-maintainer of pretty much all the relevant code here and
> wrote the VMA merge logic you're referring to.)
>
>>
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory
>> allocation patterns:
>>
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>>
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s,
>> often 512KB–1.8MB.
>>
>> These allocations come in bursts — but due to forced alignment, the
>> kernel was placing them with artificial gaps, defeating THP
>> eligibility entirely.
>>
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps.
>> The result is that:
>>
>> a. VMAs remain adjacent → mergeable
>>
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>
>> c. THP utilization increases → fewer TLB misses → lower latency →
>> higher throughput
>>
>
> This is very useful information and it's appreciated! Let's not drown
> this out with restatements of stuff already covered.
>
>>
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar
>> logic tweak could apply there too — especially with shmem-backed
>> workloads (common in model servers using shared tensor memory). I’d be
>> happy to help test any changes proposed there to derive the consequent
>> results.
>
> Dev - could we hold off on any effort to do something like this until
> I've had a chance to refactor THP somewhat? This is already a mess and
> I'd like to avoid us piling on more complexity.
>
> We can revisit this at a later stage.
>
>>
>> Thanks again for the detailed discussion. Let me know if you’d like a
>> trace or VMA map from a Hugging Face benchmarked run (happy to
>> generate one locally).
>>
>
> Thanks! Much appreciated.
>
> Cheers, Lorenzo
Hi Lorenzo,
Thanks for your clarification, and I appreciate your patience —
especially given your role in maintaining and designing the VMA merge
logic.
I understand now that my earlier reply largely restated your own
explanation of VMA adjacency, and I regret the unintentional
repetition.
I’ll make sure to be more careful and direct going forward.
As for the THP alignment condition now being `IS_ALIGNED(len,
PMD_SIZE)`: I agree this resolves the regressions by skipping forced
alignment for mappings whose length is not a PMD multiple, which is
exactly the case that hurt workloads like cactusBSSN and some AI
inference loads.
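If useful, below is the kind of minimal userspace sketch I was planning
to run locally to eyeball placement (my own test idea, not part of the
patch or of any kernel code; the 1.5 MB length and the 8-mapping burst
are arbitrary stand-ins for the small allocations described above):

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#define NMAPS 8
	#define LEN   (3UL << 19)	/* 1.5 MB: deliberately not a 2 MB PMD multiple */

	int main(void)
	{
		char cmd[64];

		/* Map a burst of small anonymous regions, like an inference run would. */
		for (int i = 0; i < NMAPS; i++) {
			void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED) {
				perror("mmap");
				return 1;
			}
			printf("mapping %d placed at %p\n", i, p);
		}

		/* Adjacent placements should show up as one merged anonymous VMA. */
		snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", (int)getpid());
		system(cmd);
		return 0;
	}

On a kernel with the IS_ALIGNED(len, PMD_SIZE) gating, I'd expect these
mappings to land back to back and appear as a single merged anonymous
VMA in /proc/<pid>/maps rather than being separated by alignment gaps.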
Thanks again for the guidance — I’m learning a lot from this thread.
Best Regards,
Siddhartha Sharma