Message-ID: <cover.1747264138.git.ackerleytng@google.com>
Date: Wed, 14 May 2025 16:41:39 -0700
From: Ackerley Tng <ackerleytng@...gle.com>
To: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
x86@...nel.org, linux-fsdevel@...r.kernel.org
Cc: ackerleytng@...gle.com, aik@....com, ajones@...tanamicro.com,
akpm@...ux-foundation.org, amoorthy@...gle.com, anthony.yznaga@...cle.com,
anup@...infault.org, aou@...s.berkeley.edu, bfoster@...hat.com,
binbin.wu@...ux.intel.com, brauner@...nel.org, catalin.marinas@....com,
chao.p.peng@...el.com, chenhuacai@...nel.org, dave.hansen@...el.com,
david@...hat.com, dmatlack@...gle.com, dwmw@...zon.co.uk,
erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com, graf@...zon.com,
haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com, ira.weiny@...el.com,
isaku.yamahata@...el.com, jack@...e.cz, james.morse@....com,
jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com, jhubbard@...dia.com,
jroedel@...e.de, jthoughton@...gle.com, jun.miao@...el.com,
kai.huang@...el.com, keirf@...gle.com, kent.overstreet@...ux.dev,
kirill.shutemov@...el.com, liam.merwick@...cle.com,
maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org,
mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au,
muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es,
oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com,
paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk,
peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com,
quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com,
rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk,
rppt@...nel.org, seanjc@...gle.com, shuah@...nel.org, steven.price@....com,
steven.sistare@...cle.com, suzuki.poulose@....com, tabba@...gle.com,
thomas.lendacky@....com, usama.arif@...edance.com, vannapurve@...gle.com,
vbabka@...e.cz, viro@...iv.linux.org.uk, vkuznets@...hat.com,
wei.w.wang@...el.com, will@...nel.org, willy@...radead.org,
xiaoyao.li@...el.com, yan.y.zhao@...el.com, yilun.xu@...el.com,
yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: [RFC PATCH v2 00/51] 1G page support for guest_memfd
Hello,
This patchset builds upon discussion at LPC 2024 and many guest_memfd
upstream calls to provide 1G page support for guest_memfd by taking
pages from HugeTLB.
This patchset is based on Linux v6.15-rc6, and requires the mmap support
for guest_memfd patchset (Thanks Fuad!) [1].
For ease of testing, this series is also available, stitched together,
at https://github.com/googleprodkernel/linux-cc/tree/gmem-1g-page-support-rfc-v2
This patchset can be divided into two sections:
(a) Patches from the beginning up to and including "KVM: selftests:
Update script to map shared memory from guest_memfd" are a modified
version of "conversion support for guest_memfd", which Fuad is
managing [2].
(b) Patches after "KVM: selftests: Update script to map shared memory
from guest_memfd" till the end are patches that actually bring in 1G
page support for guest_memfd.
These are the significant differences between (a) and [2]:
+ [2] uses an xarray to track shareability, but I used a maple tree
because for 1G pages, iterating pagewise to update shareability was
prohibitively slow even for testing. I chose among multi-index
xarrays, interval trees and maple trees [3], and picked maple trees
because
  + Maple trees were easier to figure out since I didn't have to
    compute the correct multi-index order and handle edge cases if the
    converted range wasn't a neat power of 2.
  + Updating part of a range in a maple tree was easier to figure out
    than updating parts of a multi-index xarray.
  + Maple trees had an easier API to use than interval trees.
+ [2] doesn't yet have a conversion ioctl, but I needed it to test 1G
support end-to-end.
+ (a) removes guest_memfd folios from the LRU. I needed this to get the
conversion selftests to work as expected, since participation in the
LRU was causing unexpected refcounts on folios, which blocked
conversions.
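To illustrate the multi-index order point above, here is a rough
userspace sketch (not kernel code, and not the xarray or maple tree
API). It assumes that each multi-index entry must cover a naturally
aligned power-of-2 block of indices; the function names are made up for
illustration. It counts how many such blocks an arbitrary range of
indices decomposes into, whereas a range-based structure can describe
the same range as a single [first, last] entry.

```c
#include <assert.h>

/*
 * Largest order such that a block of (1 << order) indices starting at
 * 'start' is naturally aligned and fits within 'nr' remaining indices.
 */
static unsigned int chunk_order(unsigned long start, unsigned long nr)
{
	unsigned int order = 0;

	assert(nr > 0);

	while (order + 1 < sizeof(unsigned long) * 8 &&
	       (start & ((1UL << (order + 1)) - 1)) == 0 &&
	       (1UL << (order + 1)) <= nr)
		order++;

	return order;
}

/*
 * Number of aligned power-of-2 blocks needed to cover the index range
 * [start, start + nr). This is the bookkeeping a range-based structure
 * avoids.
 */
static unsigned int nr_aligned_chunks(unsigned long start, unsigned long nr)
{
	unsigned int chunks = 0;

	while (nr) {
		unsigned long size = 1UL << chunk_order(start, nr);

		start += size;
		nr -= size;
		chunks++;
	}

	return chunks;
}
```

For example, one aligned 1G page (262144 indices with 4K base pages) is
a single order-18 block, but converting 10 indices starting at index 3
takes 4 blocks: [3], [4-7], [8-11] and [12].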
I am sending (a) in emails as well, as opposed to just leaving it on
GitHub, so that we can discuss by commenting inline on emails. If you'd
like to just look at 1G page support, here are some key takeaways from
the first section (a):
+ If GUEST_MEMFD_FLAG_SUPPORT_SHARED is requested during guest_memfd
creation, guest_memfd will
  + Track shareability (whether an index in the inode is guest-only or
    if the host is allowed to fault memory at a given index).
  + Always be used for guest faults - specifically, kvm_gmem_get_pfn()
    will be used to provide pages for the guest.
  + Always be used by KVM to check private/shared status of a gfn.
+ guest_memfd now has conversion ioctls, allowing conversion to
private/shared
  + Conversion can fail if there are unexpected refcounts on any
    folios in the range.
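A toy model of that last point, to make the failure mode concrete. This
is a userspace sketch, not code from the series: the struct, the
function name, the "expected" refcount of 1 and the -EAGAIN return value
are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio: only the refcount matters here. */
struct toy_folio {
	int refcount;
};

/* Assumption: a folio held only by the filemap has a refcount of 1. */
#define TOY_EXPECTED_REFCOUNT 1

/*
 * Try to convert folios[start .. start + nr - 1] to private.
 * Returns 0 on success. On failure, returns -EAGAIN (an assumed error
 * code) and stores the index of the first offending folio in *bad_index.
 */
static int toy_convert_to_private(const struct toy_folio *folios,
				  size_t start, size_t nr, size_t *bad_index)
{
	assert(bad_index != NULL);

	for (size_t i = start; i < start + nr; i++) {
		if (folios[i].refcount != TOY_EXPECTED_REFCOUNT) {
			*bad_index = i;
			return -EAGAIN;
		}
	}

	return 0;
}
```

In this model, a single extra reference anywhere in the range (for
example, from a transient user such as the LRU machinery) is enough to
fail the whole conversion.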
Focusing on (b) 1G page support, here's an overview:
1. A bunch of refactoring patches for HugeTLB that isolate the
allocation of a HugeTLB folio from other HugeTLB concepts such as
VMA-level reservations, and HugeTLBfs-specific concepts, such as
where memory policy is stored in the VMA, or where the subpool is
stored on the inode.
2. A few patches that add a guestmem_hugetlb allocator within mm/. The
guestmem_hugetlb allocator is a wrapper around HugeTLB to modularize
the memory management functions, and to cleanly handle cleanup, so
that folio cleanup can happen after the guest_memfd inode (and even
KVM) goes away.
3. Some updates to guest_memfd to use the guestmem_hugetlb allocator.
4. Selftests for 1G page support.
Here are some remaining issues/TODOs:
1. Memory error handling, such as for machine check errors, has not been
implemented.
2. I've not looked into preparedness of pages; only zeroing has been
considered.
3. When allocating HugeTLB pages, if two threads allocate indices
mapping to the same huge page, the utilization in guest_memfd inode's
subpool may momentarily go over the subpool limit (the requested size
of the inode at guest_memfd creation time), causing one of the two
threads to get -ENOMEM. Suggestions to solve this are appreciated!
4. The max_usage_in_bytes statistic (cgroups v1) for guest_memfd HugeTLB
pages should be correct, but this still needs testing.
5. memcg charging (charge_memcg()) for cgroups v2 for guest_memfd
HugeTLB pages after splitting should be correct, but this still needs
testing.
6. Page cache accounting: When a HugeTLB page is split, guest_memfd will
count its pages in both NR_HUGETLB (counted at HugeTLB allocation
time) and NR_FILE_PAGES (counted when split pages are added to
the filemap). Is this aligned with what people expect?
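Issue 3 above can be modelled with a deterministic toy in userspace C.
This is only a guess at the shape of the race, assuming each thread
charges the subpool before checking whether the huge page is already
present; the types and functions are hypothetical and are not HugeTLB's
subpool API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical subpool: counts huge pages charged against a limit. */
struct toy_subpool {
	long limit;
	long used;
};

struct toy_inode {
	struct toy_subpool sp;
	int huge_page_present;	/* is the one huge page in the filemap? */
};

/* Step 1 of an allocation: charge the subpool, or fail with -ENOMEM. */
static int toy_charge(struct toy_inode *inode)
{
	if (inode->sp.used + 1 > inode->sp.limit)
		return -ENOMEM;
	inode->sp.used++;
	return 0;
}

/* Step 2: install the page, or drop the charge if another thread won. */
static void toy_install(struct toy_inode *inode)
{
	if (inode->huge_page_present) {
		assert(inode->sp.used > 0);
		inode->sp.used--;
	} else {
		inode->huge_page_present = 1;
	}
}

/*
 * Fixed interleaving: A charges, B charges, A installs, B installs.
 * Returns thread B's result.
 */
static int toy_race(long limit)
{
	struct toy_inode inode = { { limit, 0 }, 0 };
	int a = toy_charge(&inode);
	int b = toy_charge(&inode);

	if (a == 0)
		toy_install(&inode);
	if (b == 0)
		toy_install(&inode);

	return b;
}
```

With a limit of one huge page, toy_race(1) returns -ENOMEM for thread B
even though B would have found A's page; with headroom, toy_race(2)
returns 0 and B's extra charge is dropped.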
Here are some optimizations that could be explored in future series:
1. Pages could be split from 1G to 2M first and only split to 4K if
necessary.
2. Zeroing could be skipped for CoCo VMs if hardware already zeroes the
pages.
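As a rough sense of scale for optimization 1, the arithmetic below
counts the folios that result from each strategy. It is a back-of-the-
envelope sketch assuming 4K, 2M and 1G page sizes; the function names
are made up for illustration.

```c
#include <assert.h>

#define PAGES_2M_PER_1G 512UL	/* 1G / 2M */
#define PAGES_4K_PER_2M 512UL	/* 2M / 4K */

/* Folios produced by splitting one 1G page all the way down to 4K. */
static unsigned long folios_direct_split(void)
{
	return PAGES_2M_PER_1G * PAGES_4K_PER_2M;
}

/*
 * Folios produced by splitting one 1G page to 2M first, and then
 * splitting only nr_2m_split of the resulting 2M folios down to 4K.
 */
static unsigned long folios_staged_split(unsigned long nr_2m_split)
{
	assert(nr_2m_split <= PAGES_2M_PER_1G);

	return (PAGES_2M_PER_1G - nr_2m_split) +
	       nr_2m_split * PAGES_4K_PER_2M;
}
```

Splitting directly to 4K yields 262144 folios, while splitting to 2M and
then splitting a single 2M folio to 4K yields 511 + 512 = 1023 folios.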
Here's RFC v1 [4] if you're interested in the motivation behind choosing
HugeTLB, or the history of this patch series.
[1] https://lore.kernel.org/all/20250513163438.3942405-11-tabba@google.com/T/
[2] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/T/
[3] https://lore.kernel.org/all/diqzzfih8q7r.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/cover.1726009989.git.ackerleytng@google.com/T/
---
Ackerley Tng (49):
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: Introduce and use shareability to guard faulting
  KVM: selftests: Update guest_memfd_test for INIT_PRIVATE flag
  KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls
  KVM: guest_memfd: Skip LRU for guest_memfd folios
  KVM: Query guest_memfd for private/shared status
  KVM: guest_memfd: Add CAP KVM_CAP_GMEM_CONVERSION
  KVM: selftests: Test flag validity after guest_memfd supports
    conversions
  KVM: selftests: Test faulting with respect to
    GUEST_MEMFD_FLAG_INIT_PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Allow cleanup of ucall_pool from host
  KVM: selftests: Test conversion flows for guest_memfd
  KVM: selftests: Add script to exercise private_mem_conversions_test
  KVM: selftests: Update private_mem_conversions_test to mmap
    guest_memfd
  KVM: selftests: Update script to map shared memory from guest_memfd
  mm: hugetlb: Consolidate interpretation of gbl_chg within
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of gbl_chg in
    alloc_hugetlb_folio()
  mm: hugetlb: Cleanup interpretation of map_chg_state within
    alloc_hugetlb_folio()
  mm: hugetlb: Rename alloc_surplus_hugetlb_folio
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Inline huge_node() into callers
  mm: hugetlb: Refactor hugetlb allocation functions
  mm: hugetlb: Refactor out hugetlb_alloc_folio()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: Introduce guestmem_hugetlb to support folio_put() handling of
    guestmem pages
  mm: guestmem_hugetlb: Wrap HugeTLB as an allocator for guest_memfd
  mm: truncate: Expose truncate_inode_folio()
  KVM: x86: Set disallow_lpage on base_gfn and guest_memfd pgoff
    misalignment
  KVM: guest_memfd: Support guestmem_hugetlb as custom allocator
  KVM: guest_memfd: Allocate and truncate from custom allocator
  mm: hugetlb: Add functions to add/delete folio from hugetlb lists
  mm: guestmem_hugetlb: Add support for splitting and merging pages
  mm: Convert split_folio() macro to function
  KVM: guest_memfd: Split allocator pages for guest_memfd use
  KVM: guest_memfd: Merge and truncate on fallocate(PUNCH_HOLE)
  KVM: guest_memfd: Update kvm_gmem_mapping_order to account for page
    status
  KVM: Add CAP to indicate support for HugeTLB as custom allocator
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Update conversion flows test for HugeTLB
  KVM: selftests: Test truncation paths of guest_memfd
  KVM: selftests: Test allocation and conversion of subfolios
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Update private_mem_conversions_test.sh to test with
    HugeTLB pages
  KVM: selftests: Add script to test HugeTLB statistics
  KVM: selftests: Test guest_memfd for accuracy of st_blocks
Elliot Berman (1):
  filemap: Pass address_space mapping to ->free_folio()
Fuad Tabba (1):
  mm: Consolidate freeing of typed folios on final folio_put()
Documentation/filesystems/locking.rst | 2 +-
Documentation/filesystems/vfs.rst | 15 +-
Documentation/virt/kvm/api.rst | 5 +
arch/arm64/include/asm/kvm_host.h | 5 -
arch/x86/include/asm/kvm_host.h | 10 -
arch/x86/kvm/x86.c | 53 +-
fs/hugetlbfs/inode.c | 2 +-
fs/nfs/dir.c | 9 +-
fs/orangefs/inode.c | 3 +-
include/linux/fs.h | 2 +-
include/linux/guestmem.h | 23 +
include/linux/huge_mm.h | 6 +-
include/linux/hugetlb.h | 19 +-
include/linux/kvm_host.h | 32 +-
include/linux/mempolicy.h | 11 +-
include/linux/mm.h | 2 +
include/linux/page-flags.h | 32 +
include/uapi/linux/guestmem.h | 29 +
include/uapi/linux/kvm.h | 16 +
include/uapi/linux/magic.h | 1 +
mm/Kconfig | 13 +
mm/Makefile | 1 +
mm/debug.c | 1 +
mm/filemap.c | 12 +-
mm/guestmem_hugetlb.c | 512 +++++
mm/guestmem_hugetlb.h | 9 +
mm/hugetlb.c | 488 ++---
mm/internal.h | 1 -
mm/memcontrol.c | 2 +
mm/memory.c | 1 +
mm/mempolicy.c | 44 +-
mm/secretmem.c | 3 +-
mm/swap.c | 32 +-
mm/truncate.c | 27 +-
mm/vmscan.c | 4 +-
tools/testing/selftests/kvm/Makefile.kvm | 2 +
.../kvm/guest_memfd_conversions_test.c | 797 ++++++++
.../kvm/guest_memfd_hugetlb_reporting_test.c | 384 ++++
...uest_memfd_provide_hugetlb_cgroup_mount.sh | 36 +
.../testing/selftests/kvm/guest_memfd_test.c | 293 ++-
...memfd_wrap_test_check_hugetlb_reporting.sh | 95 +
.../testing/selftests/kvm/include/kvm_util.h | 104 +-
.../testing/selftests/kvm/include/test_util.h | 20 +-
.../selftests/kvm/include/ucall_common.h | 1 +
tools/testing/selftests/kvm/lib/kvm_util.c | 465 +++--
tools/testing/selftests/kvm/lib/test_util.c | 102 +
.../testing/selftests/kvm/lib/ucall_common.c | 16 +-
.../kvm/x86/private_mem_conversions_test.c | 195 +-
.../kvm/x86/private_mem_conversions_test.sh | 100 +
virt/kvm/Kconfig | 5 +
virt/kvm/guest_memfd.c | 1655 ++++++++++++++++-
virt/kvm/kvm_main.c | 14 +-
virt/kvm/kvm_mm.h | 9 +-
53 files changed, 5080 insertions(+), 640 deletions(-)
create mode 100644 include/linux/guestmem.h
create mode 100644 include/uapi/linux/guestmem.h
create mode 100644 mm/guestmem_hugetlb.c
create mode 100644 mm/guestmem_hugetlb.h
create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
create mode 100755 tools/testing/selftests/kvm/guest_memfd_provide_hugetlb_cgroup_mount.sh
create mode 100755 tools/testing/selftests/kvm/guest_memfd_wrap_test_check_hugetlb_reporting.sh
create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.sh
--
2.49.0.1045.g170613ef41-goog