Message-ID: <E99A40AF-1535-4FC0-BEE5-6F0F5B3FF840@nvidia.com>
Date: Wed, 21 Jan 2026 15:31:59 -0500
From: Zi Yan <ziy@...dia.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Kiryl Shutsemau <kas@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Muchun Song <muchun.song@...ux.dev>, David Hildenbrand <david@...nel.org>,
Matthew Wilcox <willy@...radead.org>, Usama Arif <usamaarif642@...il.com>,
Frank van der Linden <fvdl@...gle.com>, Oscar Salvador <osalvador@...e.de>,
Mike Rapoport <rppt@...nel.org>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Baoquan He <bhe@...hat.com>,
Michal Hocko <mhocko@...e.com>, Johannes Weiner <hannes@...xchg.org>,
Jonathan Corbet <corbet@....net>, kernel-team@...a.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization
On 21 Jan 2026, at 13:44, Vlastimil Babka wrote:
> On 1/21/26 17:22, Kiryl Shutsemau wrote:
>> This series removes "fake head pages" from the HugeTLB vmemmap
>> optimization (HVO) by changing how tail pages encode their relationship
>> to the head page.
>>
>> It simplifies compound_head() and page_ref_add_unless(). Both are in the
>> hot path.
>
> We never got a definitive answer in the previous version's discussion on
> whether it's worth doing this now, given the upcoming memdesc work, right?
>
>> Background
>> ==========
>>
>> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
>> and remapping the freed virtual addresses to a single physical page.
>> Previously, all tail page vmemmap entries were remapped to the first
>> vmemmap page (containing the head struct page), creating "fake heads" -
>> tail pages that appear to have PG_head set when accessed through the
>> deduplicated vmemmap.
>>
>> This required special handling in compound_head() to detect and work
>> around fake heads, adding complexity and overhead to a very hot path.
>
> So a very stupid question, why did we remap everything to the first page,
> and not instead create two pages, where the first one would contain the head
> and the first batch of tails, and the second one would be used for the rest
> of the tails? I'd expect it wouldn't make the memory savings that much
> worse, and eliminate most of the issues?

I think it was using 2 pages before [1]. The benefit of using one page is:

"It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB
compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB type)."

[1] https://lore.kernel.org/all/20211101031651.75851-1-songmuchun@bytedance.com/T/#u
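
The 12.5% and 2GB figures can be reproduced with a little arithmetic. The
sketch below is one reading that matches both numbers. It assumes 4 KiB base
pages and a 64-byte struct page (typical for x86-64, but not stated in this
thread), and the function names are hypothetical, made up for illustration.

```c
#include <assert.h>

/* Assumptions, not taken from the thread: 4 KiB base pages, 64-byte struct page. */
#define BASE_PAGE_SIZE   4096ULL
#define STRUCT_PAGE_SIZE 64ULL
#define HUGE_PAGE_SIZE   (2ULL << 20)   /* 2 MiB HugeTLB page */
#define ONE_TIB          (1ULL << 40)

static_assert(HUGE_PAGE_SIZE % BASE_PAGE_SIZE == 0,
              "huge page must be a whole number of base pages");

/* vmemmap pages that describe one 2 MiB HugeTLB page before any optimization. */
static unsigned long long vmemmap_pages_per_hugepage(void)
{
    unsigned long long nr_struct_pages = HUGE_PAGE_SIZE / BASE_PAGE_SIZE; /* 512 */

    return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;           /* 8 */
}

/*
 * Extra saving from keeping one vmemmap page instead of two, as a
 * percentage of the unoptimized struct page overhead (1 page out of 8).
 */
static double extra_saving_percent(void)
{
    return 100.0 / (double)vmemmap_pages_per_hugepage();
}

/* Extra bytes saved per 1 TiB of memory handed out as 2 MiB HugeTLB pages. */
static unsigned long long extra_saving_bytes_per_tib(void)
{
    return (ONE_TIB / HUGE_PAGE_SIZE) * BASE_PAGE_SIZE;
}
```

Under these assumptions, one 2 MiB huge page needs 8 vmemmap pages, so freeing
one more of them is 1/8 = 12.5% of the original overhead, and 524288 huge
pages per TiB times 4 KiB gives 2 GiB.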
>
>> New Approach
>> ============
>>
>> For architectures/configs where sizeof(struct page) is a power of 2 (the
>> common case), this series changes how the position of the head page is
>> encoded in the tail pages.
>>
>> Instead of storing a pointer to the head page, the ->compound_info
>> (renamed from ->compound_head) now stores a mask.
>>
>> The mask can be applied to any tail page's virtual address to compute
>> the head page address.
>>
>> The key insight is that all tail pages of the same order now have
>> identical compound_info values, regardless of which compound page they
>> belong to. This allows a single page of tail struct pages to be shared
>> across all huge pages of the same order on a NUMA node.
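
To make the mask idea concrete, here is a minimal userspace model. It is not
the code from the series: struct model_page, the model_* helpers, the use of
bit 0 as the tail marker, and the particular branchless expression are all
assumptions made for this illustration. It relies on the memmap being aligned
to the span of struct pages covered by one compound page, which is the
alignment requirement the cover letter mentions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page, padded to 64 bytes (a power of 2). */
struct model_page {
    uintptr_t compound_info;   /* 0 for non-tail pages, (mask | 1) for tails */
    unsigned char pad[64 - sizeof(uintptr_t)];
};
static_assert(sizeof(struct model_page) == 64,
              "model assumes a 64-byte struct page");

/*
 * Value stored in every tail page of the given order. It depends only on
 * the order, never on which compound page the tail belongs to.
 * Bit 0 marking a tail is an assumption of this model.
 */
static uintptr_t model_tail_info(unsigned int order)
{
    uintptr_t span = (uintptr_t)sizeof(struct model_page) << order;

    return ~(span - 1) | 1;
}

/* Branchless lookup: tails use their mask, non-tails get an all-ones mask. */
static struct model_page *model_compound_head(const struct model_page *page)
{
    uintptr_t info = page->compound_info;
    uintptr_t mask = info | ((info & 1) - 1);

    /* Page addresses are 64-byte aligned, so bit 0 of the result is 0. */
    return (struct model_page *)((uintptr_t)page & mask);
}

/* Build a model memmap of nr compound pages, aligned to one compound span. */
static struct model_page *model_alloc_memmap(unsigned int order, size_t nr)
{
    size_t per_compound = (size_t)1 << order;
    size_t span = sizeof(struct model_page) * per_compound;
    struct model_page *map = aligned_alloc(span, span * nr);

    if (!map)
        return NULL;
    assert((uintptr_t)map % span == 0);

    for (size_t i = 0; i < per_compound * nr; i++)
        map[i].compound_info = (i % per_compound) ? model_tail_info(order) : 0;

    return map;
}
```

With order 9 and two compound pages, model_compound_head(&map[700]) resolves
to &map[512], while map[1] and map[700] hold the same compound_info value.
That identical content is what lets one physical page of tail struct pages
back the tails of many huge pages.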
>>
>> Benefits
>> ========
>>
>> 1. Simplified compound_head(): No fake head detection needed, can be
>> implemented in a branchless manner.
>>
>> 2. Simplified page_ref_add_unless(): RCU protection removed since there's
>> no race with fake head remapping.
>>
>> 3. Cleaner architecture: The shared tail pages are truly read-only and
>> contain valid tail page metadata.
>>
>> If sizeof(struct page) is not power-of-2, there are no functional changes.
>> HVO is not supported in this configuration.
>>
>> I had hoped to see performance improvement, but my testing thus far has
>> shown either no change or only a slight improvement within the noise.
>>
>> Series Organization
>> ===================
>>
>> Patch 1: Preparation - move MAX_FOLIO_ORDER to mmzone.h
>> Patches 2-4: Refactoring - interface changes, field rename, code movement
>> Patch 5: Core change - new mask-based compound_head() encoding
>> Patch 6: Correctness fix - page_zonenum() must use head page
>> Patch 7: Add memmap alignment check for compound_info_has_mask()
>> Patch 8: Refactor vmemmap_walk for new design
>> Patch 9: Eliminate fake heads with shared tail pages
>> Patches 10-13: Cleanup - remove fake head infrastructure
>> Patch 14: Documentation update
>>
>> Changes in v4:
>> ==============
>> - Fix build issues due to the linux/mmzone.h <-> linux/pgtable.h
>> dependency loop by no longer including linux/pgtable.h in
>> linux/mmzone.h
>>
>> - Rework vmemmap_remap_alloc() interface. (Muchun)
>>
>> - Use &folio->page instead of folio address for optimization
>> target. (Muchun)
>>
>> Changes in v3:
>> ==============
>> - Fixed error recovery path in vmemmap_remap_free() to pass correct start
>> address for TLB flush. (Muchun)
>>
>> - Wrapped the mask-based compound_info encoding within CONFIG_SPARSEMEM_VMEMMAP
>> check via compound_info_has_mask(). For other memory models, alignment
>> guarantees are harder to verify. (Muchun)
>>
>> - Updated vmemmap_dedup.rst documentation wording: changed "vmemmap_tail
>> shared for the struct hstate" to "A single, per-node page frame shared
>> among all hugepages of the same size". (Muchun)
>>
>> - Fixed build error with MAX_FOLIO_ORDER expanding to undefined PUD_ORDER
>> in certain configurations. (kernel test robot)
>>
>> Changes in v2:
>> ==============
>>
>> - Handle boot-allocated huge pages correctly. (Frank)
>>
>> - Changed from per-hstate vmemmap_tail to per-node vmemmap_tails[] array
>> in pglist_data. (Muchun)
>>
>> - Added spin_lock(&hugetlb_lock) protection in vmemmap_get_tail() to fix
>> a race condition where two threads could both allocate tail pages.
>> The losing thread now properly frees its allocated page. (Usama)
>>
>> - Add warning if memmap is not aligned to MAX_FOLIO_SIZE, which is
>> required for the mask approach. (Muchun)
>>
>> - Make page_zonenum() use head page - correctness fix since shared
>> tail pages cannot have valid zone information. (Muchun)
>>
>> - Added 'const' qualifier to head parameter in set_compound_head() and
>> prep_compound_tail(). (Usama)
>>
>> - Updated commit messages.
>>
>> Kiryl Shutsemau (14):
>> mm: Move MAX_FOLIO_ORDER definition to mmzone.h
>> mm: Change the interface of prep_compound_tail()
>> mm: Rename the 'compound_head' field in the 'struct page' to
>> 'compound_info'
>> mm: Move set/clear_compound_head() next to compound_head()
>> mm: Rework compound_head() for power-of-2 sizeof(struct page)
>> mm: Make page_zonenum() use head page
>> mm/sparse: Check memmap alignment for compound_info_has_mask()
>> mm/hugetlb: Refactor code around vmemmap_walk
>> mm/hugetlb: Remove fake head pages
>> mm: Drop fake head checks
>> hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
>> mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
>> mm: Remove the branch from compound_head()
>> hugetlb: Update vmemmap_dedup.rst
>>
>> .../admin-guide/kdump/vmcoreinfo.rst | 2 +-
>> Documentation/mm/vmemmap_dedup.rst | 62 ++--
>> include/linux/mm.h | 31 --
>> include/linux/mm_types.h | 20 +-
>> include/linux/mmzone.h | 47 +++
>> include/linux/page-flags.h | 167 +++++-----
>> include/linux/page_ref.h | 8 +-
>> include/linux/types.h | 2 +-
>> kernel/vmcore_info.c | 2 +-
>> mm/hugetlb.c | 8 +-
>> mm/hugetlb_vmemmap.c | 300 ++++++++----------
>> mm/internal.h | 12 +-
>> mm/mm_init.c | 2 +-
>> mm/page_alloc.c | 4 +-
>> mm/slab.h | 2 +-
>> mm/sparse-vmemmap.c | 44 ++-
>> mm/sparse.c | 5 +
>> mm/util.c | 16 +-
>> 18 files changed, 369 insertions(+), 365 deletions(-)
>>
Best Regards,
Yan, Zi