Open Source and information security mailing list archives
Message-ID: <6eaaa2e4-9067-47bc-8dd4-d8ef56c26b3b@arm.com>
Date: Tue, 1 Jul 2025 21:50:38 +0530
From: Dev Jain <dev.jain@....com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, siddhartha@...ip.in
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [PATCH] mm: limit THP alignment – performance gain observed in AI inference workloads


On 01/07/25 6:09 pm, Lorenzo Stoakes wrote:
> On Tue, Jul 01, 2025 at 05:45:51PM +0530, siddhartha@...ip.in wrote:
>> 🧩 1. Does the patch cause VMAs to be merged eventually?
>> You're correct: VMA merging only happens at mmap() time (via
>> __mmap_region()). What the patch affects is the behavior of
>> thp_get_unmapped_area_vmflags() before the mmap is placed.
> [...]
>
>> 📐 2. Why aren’t the VMAs mergeable before the patch?
>> Great question. Even if the VMA flags are identical, gaps introduced by
>> forced alignment from get_unmapped_area() break the precondition for
>> merging:
> [...]
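[A minimal numeric sketch of the effect described above; the addresses and sizes are illustrative, not taken from a real trace. Anonymous VMAs can only merge when they are exactly adjacent (prev->vm_end == next->vm_start), so any forced PMD alignment of a sub-PMD request leaves a hole:]

```python
# Illustration (made-up addresses): forced 2 MiB alignment of a
# sub-PMD-sized mapping leaves a gap after the previous VMA, which
# breaks the adjacency precondition for VMA merging.
PMD_SIZE = 2 * 1024 * 1024  # 2 MiB

def align_up(addr, align):
    # Round addr up to the next multiple of align (align is a power of two).
    return (addr + align - 1) & ~(align - 1)

prev_end = 0x7f0000100000       # end of the previous anonymous VMA
length = 1536 * 1024            # a 1.5 MiB request (not PMD-sized)

natural = prev_end                        # placed right after prev: mergeable
aligned = align_up(prev_end, PMD_SIZE)    # forced PMD alignment

gap = aligned - natural
print(hex(natural), hex(aligned), gap)    # gap > 0 -> VMAs are not adjacent
```

[With the alignment skipped for non-PMD-sized requests, `natural` would be used instead, the gap would be zero, and the new VMA could merge with its predecessor.]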
>
>> 💡 4. Why this patch complements Rik’s rather than contradicts it:
> I'm really perplexed as to why you felt the need to (seemingly via LLM)
> reply with the explanation I've already provided here?...
>
> There are errors in things you say here too.
>
> With respect, please don't do this.
>
> (I'm the co-maintainer of pretty much all the relevant code here and wrote
> the VMA merge logic you're referring to.)
>
>> 🤖 3. How does this impact AI workloads like Hugging Face Transformers?
>> Tokenization and dynamic batching create non-deterministic memory allocation
>> patterns:
>>
>> Models like BERT and T5 dynamically allocate intermediate buffers per
>> token-length, batch size, and attention window.
>>
>> Hugging Face + ONNX Runtime uses multiple small-ish anonymous mmap()s, often
>> 512KB–1.8MB.
>>
>> These allocations come in bursts — but due to forced alignment, the kernel
>> was placing them with artificial gaps, defeating THP eligibility entirely.
>>
>> By not force-aligning non-PMD-sized mappings, we avoid injecting gaps. The
>> result is that:
>>
>> a. VMAs remain adjacent → mergeable
>>
>> b. Physical memory is contiguous → eligible for khugepaged collapse
>>
>> c. THP utilization increases → fewer TLB misses → lower latency → higher
>> throughput
>>
> This is very useful information and it's appreciated! Let's not drown this
> out with restatements of stuff already covered.
>
>> ⚙️ 5. mTHP note
>> Although this patch doesn’t target mTHP directly, I believe a similar logic
>> tweak could apply there too — especially with shmem-backed workloads (common
>> in model servers using shared tensor memory). I’d be happy to help test any
>> changes proposed there and share the results.
> Dev - could we hold off on any effort to do something like this until I've
> had a chance to refactor THP somewhat? This is already a mess and I'd like
> to avoid us piling on more complexity.
>
> We can revisit this at a later stage.

Yes, of course. I ran a small benchmark on a quick, dumb patch I wrote and don't
see any measurable perf improvement, probably because the highest THP order
being chosen is always PMD size.

Out of curiosity, where do you plan to do the refactoring?

>
>> Thanks again for the detailed discussion. Let me know if you’d like a trace
>> or VMA map from a Hugging Face benchmarked run (happy to generate one
>> locally).
>>
> Thanks! Much appreciated.
>
> Cheers, Lorenzo
