Message-ID: <cover.1747264138.git.ackerleytng@google.com>
Date: Wed, 14 May 2025 16:41:39 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	x86@...nel.org, linux-fsdevel@...r.kernel.org
Cc: ackerleytng@...gle.com, aik@....com, ajones@...tanamicro.com, 
	akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com, 
	anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com, 
	binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com, 
	chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com, 
	david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk, 
	erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com, 
	haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com, 
	isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com, 
	jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com, 
	jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com, 
	kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev, 
	kirill.shutemov@...el.com, liam.merwick@...cle.com, 
	maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org, 
	mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au, 
	muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es, 
	oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com, 
	paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk, 
	peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com, 
	quic_cvanscha@...cinc.com, quic_eberman@...cinc.com, 
	quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com, 
	quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com, 
	rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk, 
	rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com, 
	steven.sistare@...cle.com, suzuki.poulose@....com, tabba@...gle.com, 
	thomas.lendacky@....com, usama.arif@...edance.com, vannapurve@...gle.com, 
	vbabka@...e.cz, viro@...iv.linux.org.uk, vkuznets@...hat.com, 
	wei.w.wang@...el.com, will@...nel.org, willy@...radead.org, 
	xiaoyao.li@...el.com, yan.y.zhao@...el.com, yilun.xu@...el.com, 
	yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: [RFC PATCH v2 00/51] 1G page support for guest_memfd

Hello,

This patchset builds upon discussion at LPC 2024 and many guest_memfd
upstream calls to provide 1G page support for guest_memfd by taking
pages from HugeTLB.

This patchset is based on Linux v6.15-rc6, and requires the mmap support
for guest_memfd patchset (Thanks Fuad!) [1].

For ease of testing, this series is also available, stitched together,
at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2

This patchset can be divided into two sections:

(a) Patches from the beginning up to and including "KVM: selftests:
    Update script to map shared memory from guest_memfd" are a modified
    version of "conversion support for guest_memfd", which Fuad is
    managing [2].

(b) Patches after "KVM: selftests: Update script to map shared memory
    from guest_memfd" till the end are patches that actually bring in 1G
    page support for guest_memfd.

These are the significant differences between (a) and [2]:

+ [2] uses an xarray to track shareability, but I used a maple tree
  because for 1G pages, iterating pagewise to update shareability was
  prohibitively slow even for testing. I was choosing from among
  multi-index xarrays, interval trees and maple trees [3], and picked
  maple trees because
    + With maple trees I didn't have to compute the correct multi-index
      order or handle edge cases when the converted range wasn't a neat
      power of 2.
    + Updating part of a stored range was easier to figure out with
      maple trees than with a multi-index xarray.
    + Maple trees had an easier API to use than interval trees.
+ [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
  support end-to-end.
+ (a) removes guest_memfd from participating in LRU. I needed this to
  get conversion selftests to work as expected, since participation in
  LRU was causing unexpected refcounts on folios, which blocked
  conversions.
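
To illustrate the multi-index concern above, here is a minimal userspace
sketch (not kernel code, and not taken from the patches). It counts how
many naturally aligned power-of-two chunks are needed to cover an
arbitrary index range, which is roughly the decomposition a multi-index
structure would require. chunk_order() and count_aligned_chunks() are
hypothetical helper names used only for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: largest order such that 'index' is aligned to
 * 2^order and 2^order does not exceed 'remaining'.
 */
static unsigned int chunk_order(unsigned long index, unsigned long remaining)
{
	unsigned int order = 0;

	assert(remaining > 0);

	while (order + 1 < sizeof(unsigned long) * CHAR_BIT &&
	       (index & ((1UL << (order + 1)) - 1)) == 0 &&
	       (1UL << (order + 1)) <= remaining)
		order++;

	return order;
}

/*
 * Hypothetical helper: number of aligned power-of-two chunks needed to
 * cover the index range [start, start + nr).
 */
static size_t count_aligned_chunks(unsigned long start, unsigned long nr)
{
	size_t chunks = 0;

	while (nr) {
		unsigned int order = chunk_order(start, nr);

		start += 1UL << order;
		nr -= 1UL << order;
		chunks++;
	}

	return chunks;
}
```

With 4K pages, a 1G range is 262144 indices. An aligned range
(count_aligned_chunks(0, 262144)) needs a single order-18 chunk, but the
same length starting at index 1 needs 19 chunks. A range-keyed structure
such as a maple tree can store either range as one entry.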

I am sending (a) in emails as well, as opposed to just leaving it on
GitHub, so that we can discuss by commenting inline on emails. If you'd
like to just look at 1G page support, here are some key takeaways from
the first section (a):

+ If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
  creation, guest_memfd will
    + Track shareability (whether an index in the inode is guest-only or
      if the host is allowed to fault memory at a given index).
    + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
      will be used to provide pages for the guest.
    + Always be used by KVM to check private/shared status of a gfn.
+ guest_memfd now has conversion ioctls, allowing conversion to
  private/shared
    + Conversion can fail if there are unexpected refcounts on any
      folios in the range.
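
As a rough mental model of the takeaways above, here is a toy userspace
sketch (not the actual guest_memfd code). All names are hypothetical.
The expected refcount of 1, the all-or-nothing behavior and the -EBUSY
return value are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_shareability { TOY_PRIVATE, TOY_SHARED };

struct toy_folio {
	int refcount;			/* references currently held */
	enum toy_shareability state;	/* guest-only or host-faultable */
};

/* Assumption: exactly one reference is expected on an idle folio. */
#define TOY_EXPECTED_REFCOUNT 1

/* Convert [start, start + nr) to 'target', or fail without changes. */
static int toy_convert(struct toy_folio *folios, size_t start, size_t nr,
		       enum toy_shareability target)
{
	size_t i;

	/* Pass 1: refuse the conversion if any folio has extra references. */
	for (i = start; i < start + nr; i++)
		if (folios[i].refcount != TOY_EXPECTED_REFCOUNT)
			return -EBUSY;

	/* Pass 2: nothing can fail any more, so update the whole range. */
	for (i = start; i < start + nr; i++)
		folios[i].state = target;

	return 0;
}

/* The host may only fault indices that are currently shared. */
static bool toy_host_may_fault(const struct toy_folio *folio)
{
	return folio->state == TOY_SHARED;
}
```

In this model, a single folio with an extra reference anywhere in the
requested range is enough to make the whole conversion fail, which is
why stray references (such as those from LRU participation) matter.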

Focusing on (b) 1G page support, here's an overview:

1. A bunch of refactoring patches for HugeTLB that isolate the
   allocation of a HugeTLB folio from other HugeTLB concepts such as
   VMA-level reservations, and from HugeTLBfs-specific concepts such as
   where memory policy is stored in the VMA, or where the subpool is
   stored on the inode.
2. A few patches that add a guestmem_hugetlb allocator within mm/. The
   guestmem_hugetlb allocator is a wrapper around HugeTLB to modularize
   the memory management functions, and to cleanly handle cleanup, so
   that folio cleanup can happen after the guest_memfd inode (and even
   KVM) goes away.
3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
4. Selftests for 1G page support.

Here are some remaining issues/TODOs:

1. Memory error handling, such as for machine check errors, has not
   been implemented.
2. I've not looked into preparedness of pages; only zeroing has been
   considered.
3. When allocating HugeTLB pages, if two threads allocate indices
   mapping to the same huge page, the utilization in guest_memfd inode's
   subpool may momentarily go over the subpool limit (the requested size
   of the inode at guest_memfd creation time), causing one of the two
   threads to get -ENOMEM. Suggestions to solve this are appreciated!
4. The max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
   pages should be correct but still needs testing, so it could be
   wrong.
5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
   HugeTLB pages after splitting should be correct but still needs
   testing, so it could be wrong.
6. Page cache accounting: When a HugeTLB page is split, guest_memfd
   pages are counted in both NR_HUGETLB (counted at HugeTLB allocation
   time) and NR_FILE_PAGES (counted when split pages are added to the
   filemap). Is this aligned with what people expect?
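
For TODO 3, the interleaving can be pictured with the following toy
userspace model (not kernel code). It assumes that each allocating
thread charges the subpool before finding out whether another thread is
already providing the same huge page. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a subpool with a hard limit, counted in huge pages. */
struct toy_subpool {
	long max_pages;		/* limit: inode size at creation time */
	long used_pages;	/* pages currently charged */
};

/* Charge one huge page, rolling back if the limit is exceeded. */
static int toy_subpool_charge(struct toy_subpool *sp)
{
	sp->used_pages++;		/* may momentarily exceed the limit */
	if (sp->used_pages > sp->max_pages) {
		sp->used_pages--;	/* undo the over-limit charge */
		return -ENOMEM;
	}
	return 0;
}

/* Return one huge page to the subpool. */
static void toy_subpool_uncharge(struct toy_subpool *sp)
{
	sp->used_pages--;
}
```

With max_pages = 1, thread A's charge succeeds. If thread B then charges
for an index in the same huge page before A's page becomes visible, B
sees -ENOMEM even though it never needed a second page.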

Here are some optimizations that could be explored in future series:

1. Pages could be split from 1G to 2M first and only split to 4K if
   necessary.
2. Zeroing could be skipped for CoCo VMs if hardware already zeroes the
   pages.
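
To put rough numbers on optimization 1, here is a small userspace
arithmetic sketch (not kernel code). It assumes 4K base pages and 2M
intermediate pages, as on x86. The macro and function names are
hypothetical.

```c
#include <assert.h>

#define TOY_SZ_4K (4UL << 10)
#define TOY_SZ_2M (2UL << 20)
#define TOY_SZ_1G (1UL << 30)

/* Folios tracked after splitting one 1G page directly into 4K pages. */
static unsigned long toy_folios_direct_split(void)
{
	return TOY_SZ_1G / TOY_SZ_4K;
}

/*
 * Folios tracked after splitting one 1G page into 2M pages, then
 * splitting only 'nr_2m_split' of those 2M pages further into 4K pages.
 */
static unsigned long toy_folios_two_stage_split(unsigned long nr_2m_split)
{
	unsigned long nr_2m = TOY_SZ_1G / TOY_SZ_2M;		/* 512 */
	unsigned long nr_4k_per_2m = TOY_SZ_2M / TOY_SZ_4K;	/* 512 */

	assert(nr_2m_split <= nr_2m);

	return (nr_2m - nr_2m_split) + nr_2m_split * nr_4k_per_2m;
}
```

A direct split produces 262144 folios. If only one 2M region needs 4K
granularity, the two-stage split leaves 511 folios of 2M plus 512 folios
of 4K, for 1023 in total.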

Here's RFC v1 [4] if you're interested in the motivation behind choosing
HugeTLB, or the history of this patch series.

[1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
[3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/

---

Ackerley Tng (49):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: selftests: Test flag validity after guest_memfd supports
    conversions
  KVM: selftests: Test faulting with respect to
    GUEST_MEMFD_FLAG_INIT_PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Allow cleanup of ucall_pool from host
  KVM: selftests: Test conversion flows for guest_memfd
  KVM: selftests: Add script to exercise private_mem_conversions_test
  KVM: selftests: Update private_mem_conversions_test to mmap
    guest_memfd
  KVM: selftests: Update script to map shared memory from guest_memfd
  mm: hugetlb: Consolidate interpretation of gbl_chg within
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of gbl_chg in
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of map_chg_state within
    alloc_hugetlb_folio()
  mm: hugetlb: Rename alloc_surplus_hugetlb_folio
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Inline huge_node() into callers
  mm: hugetlb: Refactor hugetlb allocation functions
  mm: hugetlb: Refactor out hugetlb_alloc_folio()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: Introduce guestmem_hugetlb to support folio_put() handling of
    guestmem pages
  mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  mm: truncate: Expose truncate_inode_folio()
  KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff
    misalignment
  KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  KVM: guest_memfd: Allocate and truncate from custom allocator
  mm: hugetlb: Add functions to add/delete folio from hugetlb lists
  mm: guestmem_hugetlb: Add support for splitting and merging pages
  mm: Convert split_folio() macro to function
  KVM: guest_memfd: Split allocator pages for guest_memfd use
  KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page
    status
  KVM: Add CAP to indicate support for HugeTLB as custom allocator
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Update conversion flows test for HugeTLB
  KVM: selftests: Test truncation paths of guest_memfd
  KVM: selftests: Test allocation and conversion of subfolios
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Update private_mem_conversions_test.sh to test with
    HugeTLB pages
  KVM: selftests: Add script to test HugeTLB statistics
  KVM: selftests: Test guest_memfd for accuracy of st_blocks

Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()

Fuad Tabba (1):
  mm: Consolidate freeing of typed folios on final folio_put()

 Documentation/filesystems/locking.rst         |    2 +-
 Documentation/filesystems/vfs.rst             |   15 +-
 Documentation/virt/kvm/api.rst                |    5 +
 arch/arm64/include/asm/kvm_host.h             |    5 -
 arch/x86/include/asm/kvm_host.h               |   10 -
 arch/x86/kvm/x86.c                            |   53 +-
 fs/hugetlbfs/inode.c                          |    2 +-
 fs/nfs/dir.c                                  |    9 +-
 fs/orangefs/inode.c                           |    3 +-
 include/linux/fs.h                            |    2 +-
 include/linux/guestmem.h                      |   23 +
 include/linux/huge_mm.h                       |    6 +-
 include/linux/hugetlb.h                       |   19 +-
 include/linux/kvm_host.h                      |   32 +-
 include/linux/mempolicy.h                     |   11 +-
 include/linux/mm.h                            |    2 +
 include/linux/page-flags.h                    |   32 +
 include/uapi/linux/guestmem.h                 |   29 +
 include/uapi/linux/kvm.h                      |   16 +
 include/uapi/linux/magic.h                    |    1 +
 mm/Kconfig                                    |   13 +
 mm/Makefile                                   |    1 +
 mm/debug.c                                    |    1 +
 mm/filemap.c                                  |   12 +-
 mm/guestmem_hugetlb.c                         |  512 +++++
 mm/guestmem_hugetlb.h                         |    9 +
 mm/hugetlb.c                                  |  488 ++---
 mm/internal.h                                 |    1 -
 mm/memcontrol.c                               |    2 +
 mm/memory.c                                   |    1 +
 mm/mempolicy.c                                |   44 +-
 mm/secretmem.c                                |    3 +-
 mm/swap.c                                     |   32 +-
 mm/truncate.c                                 |   27 +-
 mm/vmscan.c                                   |    4 +-
 tools/testing/selftests/kvm/Makefile.kvm      |    2 +
 .../kvm/guest_memfd_conversions_test.c        |  797 ++++++++
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  384 ++++
 ...uest_memfd_provide_hugetlb_cgroup_mount.sh |   36 +
 .../testing/selftests/kvm/guest_memfd_test.c  |  293 ++-
 ...memfd_wrap_test_check_hugetlb_reporting.sh |   95 +
 .../testing/selftests/kvm/include/kvm_util.h  |  104 +-
 .../testing/selftests/kvm/include/test_util.h |   20 +-
 .../selftests/kvm/include/ucall_common.h      |    1 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  465 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |  102 +
 .../testing/selftests/kvm/lib/ucall_common.c  |   16 +-
 .../kvm/x86/private_mem_conversions_test.c    |  195 +-
 .../kvm/x86/private_mem_conversions_test.sh   |  100 +
 virt/kvm/Kconfig                              |    5 +
 virt/kvm/guest_memfd.c                        | 1655 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   14 +-
 virt/kvm/kvm_mm.h                             |    9 +-
 53 files changed, 5080 insertions(+), 640 deletions(-)
 create mode 100644 include/linux/guestmem.h
 create mode 100644 include/uapi/linux/guestmem.h
 create mode 100644 mm/guestmem_hugetlb.c
 create mode 100644 mm/guestmem_hugetlb.h
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
 create mode 100755 tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
 create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh

--
2.49.0.1045.g170613ef41-goog