linux-kernel - Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization

Message-ID: <E99A40AF-1535-4FC0-BEE5-6F0F5B3FF840@nvidia.com>
Date: Wed, 21 Jan 2026 15:31:59 -0500
From: Zi Yan <ziy@...dia.com>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: Kiryl Shutsemau <kas@...nel.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Muchun Song <muchun.song@...ux.dev>, David Hildenbrand <david@...nel.org>,
 Matthew Wilcox <willy@...radead.org>, Usama Arif <usamaarif642@...il.com>,
 Frank van der Linden <fvdl@...gle.com>, Oscar Salvador <osalvador@...e.de>,
 Mike Rapoport <rppt@...nel.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Baoquan He <bhe@...hat.com>,
 Michal Hocko <mhocko@...e.com>, Johannes Weiner <hannes@...xchg.org>,
 Jonathan Corbet <corbet@....net>, kernel-team@...a.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org
Subject: Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap
 optimization

On 21 Jan 2026, at 13:44, Vlastimil Babka wrote:

> On 1/21/26 17:22, Kiryl Shutsemau wrote:
>> This series removes "fake head pages" from the HugeTLB vmemmap
>> optimization (HVO) by changing how tail pages encode their relationship
>> to the head page.
>>
>> It simplifies compound_head() and page_ref_add_unless(). Both are in the
>> hot path.
>
> We never got a definitive answer in the previous version's discussion on
> whether it's worth doing this now, given the upcoming memdesc work, right?
>
>> Background
>> ==========
>>
>> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
>> and remapping the freed virtual addresses to a single physical page.
>> Previously, all tail page vmemmap entries were remapped to the first
>> vmemmap page (containing the head struct page), creating "fake heads" -
>> tail pages that appear to have PG_head set when accessed through the
>> deduplicated vmemmap.
>>
>> This required special handling in compound_head() to detect and work
>> around fake heads, adding complexity and overhead to a very hot path.
>
> So a very stupid question, why did we remap everything to the first page,
> and not instead create two pages, where the first one would contain the head
> and the first batch of tails, and the second one would be used for the rest
> of the tails? I'd expect it wouldn't make the memory savings that much
> worse, and eliminate most of the issues?

I think it was using 2 pages before [1]. The benefit of using one page is:

  "It further reduces the overhead of struct page by 12.5% for a 2MB
  HugeTLB compared to the previous approach, which means 2GB per 1TB
  HugeTLB (2MB type)."

[1] https://lore.kernel.org/all/20211101031651.75851-1-songmuchun@bytedance.com/T/#u
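
For reference, the quoted figures can be reproduced with a little
arithmetic. This is only a sketch: it assumes 4 KiB base pages and a
64-byte struct page (neither is stated in this mail), and the function
names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE   4096u  /* assumption: 4 KiB base pages */
#define STRUCT_PAGE_SIZE 64u    /* assumption: sizeof(struct page) == 64 */

/* Number of vmemmap pages needed to describe one huge page. */
uint64_t vmemmap_pages_per_hugepage(uint64_t hugepage_size)
{
    assert(hugepage_size % BASE_PAGE_SIZE == 0);
    return hugepage_size / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Share of the struct page overhead that one vmemmap page represents. */
double one_page_share_percent(uint64_t hugepage_size)
{
    return 100.0 / (double)vmemmap_pages_per_hugepage(hugepage_size);
}

/*
 * Extra bytes saved over `total` bytes of huge pages when each huge page
 * keeps one vmemmap page instead of two.
 */
uint64_t extra_saving_bytes(uint64_t total, uint64_t hugepage_size)
{
    return total / hugepage_size * BASE_PAGE_SIZE;
}
```

Under these assumptions a 2MB huge page needs 8 vmemmap pages, so
dropping one more of them is 1/8 = 12.5% of the overhead, and 1TB of 2MB
huge pages is 524288 huge pages times 4 KiB, i.e. 2GB.
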

>
>> New Approach
>> ============
>>
>> For architectures/configs where sizeof(struct page) is a power of 2 (the
>> common case), this series changes how the position of the head page is
>> encoded in the tail pages.
>>
>> Instead of storing a pointer to the head page, the ->compound_info
>> (renamed from ->compound_head) now stores a mask.
>>
>> The mask can be applied to any tail page's virtual address to compute
>> the head page address.
>>
>> The key insight is that all tail pages of the same order now have
>> identical compound_info values, regardless of which compound page they
>> belong to. This allows a single page of tail struct pages to be shared
>> across all huge pages of the same order on a NUMA node.
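
A rough userspace illustration of the mask idea described above (not the
code from the series): assuming a 64-byte struct page and a memmap
aligned so that each compound page's struct pages start on a
(sizeof(struct page) << order) boundary, the head address falls out of a
single AND. The names tail_mask() and head_addr() are hypothetical, and
any tag bits the real compound_info may carry are ignored here.

```c
#include <assert.h>
#include <stdint.h>

#define STRUCT_PAGE_SIZE 64u  /* assumption: power-of-2 sizeof(struct page) */

/*
 * Mask shared by every tail page of the given order. It clears the bits
 * that select a struct page within one compound page's span of the memmap.
 */
uint64_t tail_mask(unsigned int order)
{
    assert(order < 32);
    return ~(((uint64_t)STRUCT_PAGE_SIZE << order) - 1);
}

/*
 * Head struct page address computed from a tail's own address and the
 * shared mask: no per-compound-page pointer and no branch.
 */
uint64_t head_addr(uint64_t tail_page_addr, uint64_t mask)
{
    return tail_page_addr & mask;
}
```

With order 9 (a 2MB page on 4 KiB base pages), tail_mask(9) clears the
low 15 bits, so two tails that belong to different compound pages carry
the same mask value yet each resolves to its own head.
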
>>
>> Benefits
>> ========
>>
>> 1. Simplified compound_head(): No fake head detection needed, can be
>>    implemented in a branchless manner.
>>
>> 2. Simplified page_ref_add_unless(): RCU protection removed since there's
>>    no race with fake head remapping.
>>
>> 3. Cleaner architecture: The shared tail pages are truly read-only and
>>    contain valid tail page metadata.
>>
>> If sizeof(struct page) is not power-of-2, there are no functional changes.
>> HVO is not supported in this configuration.
>>
>> I had hoped to see performance improvement, but my testing thus far has
>> shown either no change or only a slight improvement within the noise.
>>
>> Series Organization
>> ===================
>>
>> Patch 1: Preparation - move MAX_FOLIO_ORDER to mmzone.h
>> Patches 2-4: Refactoring - interface changes, field rename, code movement
>> Patch 5: Core change - new mask-based compound_head() encoding
>> Patch 6: Correctness fix - page_zonenum() must use head page
>> Patch 7: Add memmap alignment check for compound_info_has_mask()
>> Patch 8: Refactor vmemmap_walk for new design
>> Patch 9: Eliminate fake heads with shared tail pages
>> Patches 10-13: Cleanup - remove fake head infrastructure
>> Patch 14: Documentation update
>>
>> Changes in v4:
>> ==============
>>   - Fix build issues due to linux/mmzone.h <-> linux/pgtable.h
>>     dependency loop by avoiding including linux/pgtable.h into
>>     linux/mmzone.h
>>
>>   - Rework vmemmap_remap_alloc() interface. (Muchun)
>>
>>   - Use &folio->page instead of folio address for optimization
>>     target. (Muchun)
>>
>> Changes in v3:
>> ==============
>>   - Fixed error recovery path in vmemmap_remap_free() to pass correct start
>>     address for TLB flush. (Muchun)
>>
>>   - Wrapped the mask-based compound_info encoding within CONFIG_SPARSEMEM_VMEMMAP
>>     check via compound_info_has_mask(). For other memory models, alignment
>>     guarantees are harder to verify. (Muchun)
>>
>>   - Updated vmemmap_dedup.rst documentation wording: changed "vmemmap_tail
>>     shared for the struct hstate" to "A single, per-node page frame shared
>>     among all hugepages of the same size". (Muchun)
>>
>>   - Fixed build error with MAX_FOLIO_ORDER expanding to undefined PUD_ORDER
>>     in certain configurations. (kernel test robot)
>>
>> Changes in v2:
>> ==============
>>
>> - Handle boot-allocated huge pages correctly. (Frank)
>>
>> - Changed from per-hstate vmemmap_tail to per-node vmemmap_tails[] array
>>   in pglist_data. (Muchun)
>>
>> - Added spin_lock(&hugetlb_lock) protection in vmemmap_get_tail() to fix
>>   a race condition where two threads could both allocate tail pages.
>>   The losing thread now properly frees its allocated page. (Usama)
>>
>> - Add warning if memmap is not aligned to MAX_FOLIO_SIZE, which is
>>   required for the mask approach. (Muchun)
>>
>> - Make page_zonenum() use head page - correctness fix since shared
>>   tail pages cannot have valid zone information. (Muchun)
>>
>> - Added 'const' qualifier to head parameter in set_compound_head() and
>>   prep_compound_tail(). (Usama)
>>
>> - Updated commit messages.
>>
>> Kiryl Shutsemau (14):
>>   mm: Move MAX_FOLIO_ORDER definition to mmzone.h
>>   mm: Change the interface of prep_compound_tail()
>>   mm: Rename the 'compound_head' field in the 'struct page' to
>>     'compound_info'
>>   mm: Move set/clear_compound_head() next to compound_head()
>>   mm: Rework compound_head() for power-of-2 sizeof(struct page)
>>   mm: Make page_zonenum() use head page
>>   mm/sparse: Check memmap alignment for compound_info_has_mask()
>>   mm/hugetlb: Refactor code around vmemmap_walk
>>   mm/hugetlb: Remove fake head pages
>>   mm: Drop fake head checks
>>   hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
>>   mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
>>   mm: Remove the branch from compound_head()
>>   hugetlb: Update vmemmap_dedup.rst
>>
>>  .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
>>  Documentation/mm/vmemmap_dedup.rst            |  62 ++--
>>  include/linux/mm.h                            |  31 --
>>  include/linux/mm_types.h                      |  20 +-
>>  include/linux/mmzone.h                        |  47 +++
>>  include/linux/page-flags.h                    | 167 +++++-----
>>  include/linux/page_ref.h                      |   8 +-
>>  include/linux/types.h                         |   2 +-
>>  kernel/vmcore_info.c                          |   2 +-
>>  mm/hugetlb.c                                  |   8 +-
>>  mm/hugetlb_vmemmap.c                          | 300 ++++++++----------
>>  mm/internal.h                                 |  12 +-
>>  mm/mm_init.c                                  |   2 +-
>>  mm/page_alloc.c                               |   4 +-
>>  mm/slab.h                                     |   2 +-
>>  mm/sparse-vmemmap.c                           |  44 ++-
>>  mm/sparse.c                                   |   5 +
>>  mm/util.c                                     |  16 +-
>>  18 files changed, 369 insertions(+), 365 deletions(-)
>>


Best Regards,
Yan, Zi