Message-ID: <MN0PR11MB61813367958D393369C0AD8399662@MN0PR11MB6181.namprd11.prod.outlook.com>
Date: Sat, 14 Sep 2024 01:08:26 +0000
From: "Du, Fan" <fan.du@...el.com>
To: Ackerley Tng <ackerleytng@...gle.com>, "tabba@...gle.com"
	<tabba@...gle.com>, "quic_eberman@...cinc.com" <quic_eberman@...cinc.com>,
	"roypat@...zon.co.uk" <roypat@...zon.co.uk>, "jgg@...dia.com"
	<jgg@...dia.com>, "peterx@...hat.com" <peterx@...hat.com>, "david@...hat.com"
	<david@...hat.com>, "rientjes@...gle.com" <rientjes@...gle.com>,
	"fvdl@...gle.com" <fvdl@...gle.com>, "jthoughton@...gle.com"
	<jthoughton@...gle.com>, "seanjc@...gle.com" <seanjc@...gle.com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>, "Li, Zhiquan1"
	<zhiquan1.li@...el.com>, "Miao, Jun" <jun.miao@...el.com>, "Yamahata, Isaku"
	<isaku.yamahata@...el.com>, "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
	"mike.kravetz@...cle.com" <mike.kravetz@...cle.com>
CC: "Aktas, Erdem" <erdemaktas@...gle.com>, "Annapurve, Vishal"
	<vannapurve@...gle.com>, "qperret@...gle.com" <qperret@...gle.com>,
	"jhubbard@...dia.com" <jhubbard@...dia.com>, "willy@...radead.org"
	<willy@...radead.org>, "shuah@...nel.org" <shuah@...nel.org>,
	"brauner@...nel.org" <brauner@...nel.org>, "bfoster@...hat.com"
	<bfoster@...hat.com>, "kent.overstreet@...ux.dev"
	<kent.overstreet@...ux.dev>, "pvorel@...e.cz" <pvorel@...e.cz>,
	"rppt@...nel.org" <rppt@...nel.org>, "richard.weiyang@...il.com"
	<richard.weiyang@...il.com>, "anup@...infault.org" <anup@...infault.org>,
	"Xu, Haibo1" <haibo1.xu@...el.com>, "ajones@...tanamicro.com"
	<ajones@...tanamicro.com>, "vkuznets@...hat.com" <vkuznets@...hat.com>,
	"Wieczor-Retman, Maciej" <maciej.wieczor-retman@...el.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "oliver.upton@...ux.dev"
	<oliver.upton@...ux.dev>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
	"linux-fsdevel@...ck.org" <linux-fsdevel@...ck.org>, "Du, Fan"
	<fan.du@...el.com>
Subject: RE: [RFC PATCH 00/39] 1G page support for guest_memfd



> -----Original Message-----
> From: Ackerley Tng <ackerleytng@...gle.com>
> Sent: Wednesday, September 11, 2024 7:44 AM
> To: tabba@...gle.com; quic_eberman@...cinc.com; roypat@...zon.co.uk;
> jgg@...dia.com; peterx@...hat.com; david@...hat.com;
> rientjes@...gle.com; fvdl@...gle.com; jthoughton@...gle.com;
> seanjc@...gle.com; pbonzini@...hat.com; Li, Zhiquan1
> <zhiquan1.li@...el.com>; Du, Fan <fan.du@...el.com>; Miao, Jun
> <jun.miao@...el.com>; Yamahata, Isaku <isaku.yamahata@...el.com>;
> muchun.song@...ux.dev; mike.kravetz@...cle.com
> Cc: Aktas, Erdem <erdemaktas@...gle.com>; Annapurve, Vishal
> <vannapurve@...gle.com>; ackerleytng@...gle.com; qperret@...gle.com;
> jhubbard@...dia.com; willy@...radead.org; shuah@...nel.org;
> brauner@...nel.org; bfoster@...hat.com; kent.overstreet@...ux.dev;
> pvorel@...e.cz; rppt@...nel.org; richard.weiyang@...il.com;
> anup@...infault.org; Xu, Haibo1 <haibo1.xu@...el.com>;
> ajones@...tanamicro.com; vkuznets@...hat.com; Wieczor-Retman, Maciej
> <maciej.wieczor-retman@...el.com>; pgonda@...gle.com;
> oliver.upton@...ux.dev; linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> kvm@...r.kernel.org; linux-kselftest@...r.kernel.org;
> linux-fsdevel@...ck.org
> Subject: [RFC PATCH 00/39] 1G page support for guest_memfd
> 
> Hello,
> 
> This patchset is our exploration of how to support 1G pages in guest_memfd, and
> how the pages will be used in Confidential VMs.
> 
> The patchset covers:
> 
> + How to get 1G pages
> + Allowing mmap() of guest_memfd to userspace so that both private and shared

Hi Ackerley,

Thanks for posting the new version :)

Regarding the description above and the snippet quoted below for patches 26-29:
does this new design aim to back both shared and private GPAs with a single
HugeTLB spool whose size equals the VM instance's total memory?

As I understand it, before these changes the shared memfd and the gmem fd each
had a dedicated HugeTLB pool, i.e. two copies/reservations of the HugeTLB spool.

Does QEMU require changes as well? I'd like to test this series, if you can
share a QEMU branch.

> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages

Thanks!

>   memory can use the same physical pages
> + Splitting and reconstructing pages to support conversions and mmap()
> + How the VM, userspace and guest_memfd interact to support conversions
> + Selftests to test all the above
>     + Selftests also demonstrate the conversion flow between VM, userspace and
>       guest_memfd.
> 
> Why 1G pages in guest memfd?
> 
> Bring guest_memfd to performance and memory savings parity with VMs that are
> backed by HugeTLBfs.
> 
> + Performance is improved with 1G pages by more TLB hits and faster page walks
>   on TLB misses.
> + Memory savings from 1G pages comes from HugeTLB Vmemmap Optimization (HVO).
> 
> Options for 1G page support:
> 
> 1. HugeTLB
> 2. Contiguous Memory Allocator (CMA)
> 3. Other suggestions are welcome!
> 
> Comparison between options:
> 
> 1. HugeTLB
>     + Refactor HugeTLB to separate allocator from the rest of HugeTLB
>     + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
>         + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
>     + Pro: Can provide iterative steps toward new future allocator
>         + Unexplored: Managing userspace-visible changes
>             + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
>               but not when future allocator is used
> 2. CMA
>     + Port some HugeTLB features to be applied on CMA
>     + Pro: Clean slate
> 
> What would refactoring HugeTLB involve?
> 
> (Some refactoring was done in this RFC, more can be done.)
> 
> 1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
>     + Brings more modularity to HugeTLB
>     + No functionality change intended
>     + Likely step towards HugeTLB's integration into core-mm
> 2. guest_memfd will use just the allocator component of HugeTLB, not including
>    the complex parts of HugeTLB like
>     + Userspace reservations (resv_map)
>     + Shared PMD mappings
>     + Special page walkers
> 
> What features would need to be ported to CMA?
> 
> + Improved allocation guarantees
>     + Per NUMA node pool of huge pages
>     + Subpools per guest_memfd
> + Memory savings
>     + Something like HugeTLB Vmemmap Optimization
> + Configuration/reporting features
>     + Configuration of number of pages available (and per NUMA node) at and
>       after host boot
>     + Reporting of memory usage/availability statistics at runtime
> 
> HugeTLB was picked as the source of 1G pages for this RFC because it allows a
> graceful transition, and retains memory savings from HVO.
> 
> To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
> confidential VM were to be scheduled on that host, some HugeTLBfs pages would
> have to be given up and returned to CMA for guest_memfd pages to be rebuilt from
> that memory. This requires memory to be reserved for HVO to be removed and
> reapplied on the new guest_memfd memory. This not only slows down memory
> allocation but also trims the benefits of HVO. Memory would have to be reserved
> on the host to facilitate these transitions.
> 
> Improving how guest_memfd uses the allocator in a future revision of this RFC:
> 
> To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
> should be limited to these allocator functions:
> 
> + reserve(node, page_size, num_pages) => opaque handle
>     + Used when a guest_memfd inode is created to reserve memory from backend
>       allocator
> + allocate(handle, mempolicy, page_size) => folio
>     + To allocate a folio from guest_memfd's reservation
> + split(handle, folio, target_page_size) => void
>     + To take a huge folio, and split it to smaller folios, restore to filemap
> + reconstruct(handle, first_folio, nr_pages) => void
>     + To take a folio, and reconstruct a huge folio out of nr_pages from the
>       first_folio
> + free(handle, folio) => void
>     + To return folio to guest_memfd's reservation
> + error(handle, folio) => void
>     + To handle memory errors
> + unreserve(handle) => void
>     + To return guest_memfd's reservation to allocator backend
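
To make the handle-based lifecycle listed above concrete, here is a minimal userspace sketch of the reserve/allocate/free/unreserve part only. Every name in it (toy_resv, toy_folio, toy_reserve and so on) is hypothetical and does not come from the series; the node and mempolicy arguments, split(), reconstruct() and error() are left out, and malloc() stands in for the real backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names come from the series. */
struct toy_folio {
	size_t page_size;	/* bytes covered by this folio */
};

struct toy_resv {		/* plays the role of the "opaque handle" */
	size_t page_size;
	size_t nr_total;	/* pages reserved at creation time */
	size_t nr_free;		/* pages not currently handed out */
};

/* reserve(page_size, num_pages) => handle (NULL on failure) */
struct toy_resv *toy_reserve(size_t page_size, size_t num_pages)
{
	struct toy_resv *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->page_size = page_size;
	r->nr_total = num_pages;
	r->nr_free = num_pages;
	return r;
}

/* allocate(handle) => folio, or NULL once the reservation is exhausted */
struct toy_folio *toy_allocate(struct toy_resv *r)
{
	struct toy_folio *f;

	if (r->nr_free == 0)
		return NULL;
	f = malloc(sizeof(*f));
	if (!f)
		return NULL;
	f->page_size = r->page_size;
	r->nr_free--;
	return f;
}

/* free(handle, folio): return the folio to the reservation */
void toy_free(struct toy_resv *r, struct toy_folio *f)
{
	assert(r->nr_free < r->nr_total);
	free(f);
	r->nr_free++;
}

/* unreserve(handle): 0 on success, -1 if folios are still outstanding */
int toy_unreserve(struct toy_resv *r)
{
	if (r->nr_free != r->nr_total)
		return -1;
	free(r);
	return 0;
}
```

In this model a reservation of two 1G pages yields exactly two folios, a third toy_allocate() returns NULL, and toy_unreserve() refuses to tear down the reservation until both folios have been returned.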
> 
> Userspace should only provide a page size when creating a guest_memfd and should
> not have to specify HugeTLB.
> 
> Overview of patches:
> 
> + Patches 01-12
>     + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
>       HugeTLB, and to expose HugeTLB functions.
> + Patches 13-16
>     + Letting guest_memfd use HugeTLB
>     + Creation of each guest_memfd reserves pages from HugeTLB's global hstate
>       and puts it into the guest_memfd inode's subpool
>     + Each folio allocation takes a page from the guest_memfd inode's subpool
> + Patches 17-21
>     + Selftests for new HugeTLB features in guest_memfd
> + Patches 22-24
>     + More small changes on the HugeTLB side to expose functions needed by
>       guest_memfd
> + Patch 25:
>     + Uses the newly available functions from patches 22-24 to split HugeTLB
>       pages. In this patch, HugeTLB folios are always split to 4K before any
>       usage, private or shared.
> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages
> + Patch 30
>     + Required to zero folios after conversions to avoid leaking initialized
>       kernel memory
> + Patch 31-38
>     + Add selftests to test mapping pages to userspace, guest/host memory
>       sharing and update conversions tests
>     + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
> + Patch 39
>     + Dynamically split and reconstruct HugeTLB pages instead of always
>       splitting before use. All earlier selftests are expected to still pass.
> 
> TODOs:
> 
> + Add logic to wait for safe_refcount [1]
> + Look into lazy splitting/reconstruction of pages
>     + Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only is the
>       mem_attr_array and faultability updated, the pages in the requested range
>       are also split/reconstructed as necessary. We want to look into delaying
>       splitting/reconstruction to fault time.
> + Solve race between folios being faulted in and being truncated
>     + When running private_mem_conversions_test with more than 1 vCPU, a folio
>       getting truncated may get faulted in by another process, causing elevated
>       mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
> + Add intermediate splits (1G should first split to 2M and not split directly to
>   4K)
> + Use guest's lock instead of hugetlb_lock
> + Use multi-index xarray/replace xarray with some other data struct for
>   faultability flag
> + Refactor HugeTLB better, present generic allocator interface
> 
> Please let us know your thoughts on:
> 
> + HugeTLB as the choice of transitional allocator backend
> + Refactoring HugeTLB to provide generic allocator interface
> + Shared/private conversion flow
>     + Requiring user to request kernel to unmap pages from userspace using
>       madvise(MADV_DONTNEED)
>     + Failing conversion on elevated mapcounts/pincounts/refcounts
> + Process of splitting/reconstructing page
> + Anything else!
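
As a rough illustration of the madvise(MADV_DONTNEED) step in the conversion flow above, the sketch below runs it on a private anonymous mapping as a stand-in, since mmap() of guest_memfd only exists with this series applied. drop_and_check() is a hypothetical helper; how guest_memfd itself behaves after the call is defined by the RFC, not by this example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of private anonymous memory, dirty it, then ask the kernel
 * to drop the pages with MADV_DONTNEED. For private anonymous mappings on
 * Linux, the next access is zero-filled, which shows the old pages are gone.
 * Returns 0 if that is what we observe, -1 on any failure.
 */
int drop_and_check(size_t len)
{
	unsigned char *p;
	int ret = -1;

	if (len == 0)
		return -1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);		/* fault the pages in and dirty them */

	if (madvise(p, len, MADV_DONTNEED) == 0 &&
	    p[0] == 0 && p[len - 1] == 0)
		ret = 0;

	munmap(p, len);
	return ret;
}
```

In the flow described in the cover letter, userspace would issue this kind of call on the shared range before asking KVM to convert it to private, so that no userspace mapping is left holding the pages.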
> 
> [1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@...cinc.com/T/
> 
> Ackerley Tng (37):
>   mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
>   mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
>   mm: hugetlb: Remove unnecessary check for avoid_reserve
>   mm: mempolicy: Refactor out policy_node_nodemask()
>   mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
>     interpret mempolicy instead of vma
>   mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
>   mm: hugetlb: Refactor out hugetlb_alloc_folio
>   mm: truncate: Expose preparation steps for truncate_inode_pages_final
>   mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
>   mm: hugetlb: Add option to create new subpool without using surplus
>   mm: hugetlb: Expose hugetlb_acct_memory()
>   mm: hugetlb: Move and expose hugetlb_zero_partial_page()
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>     anonymous inodes
>   KVM: guest_memfd: hugetlb: initialization and cleanup
>   KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
>   KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
>   KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
>   KVM: selftests: Support various types of backing sources for private
>     memory
>   KVM: selftests: Update test for various private memory backing source
>     types
>   KVM: selftests: Add private_mem_conversions_test.sh
>   KVM: selftests: Test that guest_memfd usage is reported via hugetlb
>   mm: hugetlb: Expose vmemmap optimization functions
>   mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
>   mm: hugetlb: Add functions to add/move/remove from hugetlb lists
>   KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
>   KVM: guest_memfd: Allow mmapping guest_memfd files
>   KVM: guest_memfd: Use vm_type to determine default faultability
>   KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
>   KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
>   KVM: selftests: Allow vm_set_memory_attributes to be used without
>     asserting return value of 0
>   KVM: selftests: Test using guest_memfd memory from userspace
>   KVM: selftests: Test guest_memfd memory sharing between guest and host
>   KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
>     guest_memfd
>   KVM: selftests: Test that pinned pages block KVM from setting memory
>     attributes to PRIVATE
>   KVM: selftests: Refactor vm_mem_add to be more flexible
>   KVM: selftests: Add helper to perform madvise by memslots
>   KVM: selftests: Update private_mem_conversions_test for mmap()able
>     guest_memfd
> 
> Vishal Annapurve (2):
>   KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
>   KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
> 
>  fs/hugetlbfs/inode.c                          |   35 +-
>  include/linux/hugetlb.h                       |   54 +-
>  include/linux/kvm_host.h                      |    1 +
>  include/linux/mempolicy.h                     |    2 +
>  include/linux/mm.h                            |    1 +
>  include/uapi/linux/kvm.h                      |   26 +
>  include/uapi/linux/magic.h                    |    1 +
>  mm/hugetlb.c                                  |  346 ++--
>  mm/hugetlb_vmemmap.h                          |   11 -
>  mm/mempolicy.c                                |   36 +-
>  mm/truncate.c                                 |   26 +-
>  tools/include/linux/kernel.h                  |    4 +-
>  tools/testing/selftests/kvm/Makefile          |    3 +
>  .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
>  .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
>  .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
>  .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
>  .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
>  .../testing/selftests/kvm/include/test_util.h |   18 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
>  tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
>  .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
>  .../x86_64/private_mem_conversions_test.sh    |   91 +
>  .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
>  virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
>  virt/kvm/kvm_main.c                           |   17 +
>  virt/kvm/kvm_mm.h                             |   16 +
>  27 files changed, 3288 insertions(+), 443 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
>  create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
> 
> --
> 2.46.0.598.g6f2099f65c-goog
