Message-ID: <20250916222801.dlew6mq7kog2q5ni@amd.com>
Date: Tue, 16 Sep 2025 17:28:01 -0500
From: Michael Roth <michael.roth@....com>
To: Ackerley Tng <ackerleytng@...gle.com>
CC: <kvm@...r.kernel.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <x86@...nel.org>,
	<linux-fsdevel@...r.kernel.org>, <aik@....com>, <ajones@...tanamicro.com>,
	<akpm@...ux-foundation.org>, <amoorthy@...gle.com>,
	<anthony.yznaga@...cle.com>, <anup@...infault.org>, <aou@...s.berkeley.edu>,
	<bfoster@...hat.com>, <binbin.wu@...ux.intel.com>, <brauner@...nel.org>,
	<catalin.marinas@....com>, <chao.p.peng@...el.com>, <chenhuacai@...nel.org>,
	<dave.hansen@...el.com>, <david@...hat.com>, <dmatlack@...gle.com>,
	<dwmw@...zon.co.uk>, <erdemaktas@...gle.com>, <fan.du@...el.com>,
	<fvdl@...gle.com>, <graf@...zon.com>, <haibo1.xu@...el.com>,
	<hch@...radead.org>, <hughd@...gle.com>, <ira.weiny@...el.com>,
	<isaku.yamahata@...el.com>, <jack@...e.cz>, <james.morse@....com>,
	<jarkko@...nel.org>, <jgg@...pe.ca>, <jgowans@...zon.com>,
	<jhubbard@...dia.com>, <jroedel@...e.de>, <jthoughton@...gle.com>,
	<jun.miao@...el.com>, <kai.huang@...el.com>, <keirf@...gle.com>,
	<kent.overstreet@...ux.dev>, <kirill.shutemov@...el.com>,
	<liam.merwick@...cle.com>, <maciej.wieczor-retman@...el.com>,
	<mail@...iej.szmigiero.name>, <maz@...nel.org>, <mic@...ikod.net>,
	<mpe@...erman.id.au>, <muchun.song@...ux.dev>, <nikunj@....com>,
	<nsaenz@...zon.es>, <oliver.upton@...ux.dev>, <palmer@...belt.com>,
	<pankaj.gupta@....com>, <paul.walmsley@...ive.com>, <pbonzini@...hat.com>,
	<pdurrant@...zon.co.uk>, <peterx@...hat.com>, <pgonda@...gle.com>,
	<pvorel@...e.cz>, <qperret@...gle.com>, <quic_cvanscha@...cinc.com>,
	<quic_eberman@...cinc.com>, <quic_mnalajal@...cinc.com>,
	<quic_pderrin@...cinc.com>, <quic_pheragu@...cinc.com>,
	<quic_svaddagi@...cinc.com>, <quic_tsoni@...cinc.com>,
	<richard.weiyang@...il.com>, <rick.p.edgecombe@...el.com>,
	<rientjes@...gle.com>, <roypat@...zon.co.uk>, <rppt@...nel.org>,
	<seanjc@...gle.com>, <shuah@...nel.org>, <steven.price@....com>,
	<steven.sistare@...cle.com>, <suzuki.poulose@....com>, <tabba@...gle.com>,
	<thomas.lendacky@....com>, <usama.arif@...edance.com>,
	<vannapurve@...gle.com>, <vbabka@...e.cz>, <viro@...iv.linux.org.uk>,
	<vkuznets@...hat.com>, <wei.w.wang@...el.com>, <will@...nel.org>,
	<willy@...radead.org>, <xiaoyao.li@...el.com>, <yan.y.zhao@...el.com>,
	<yilun.xu@...el.com>, <yuzenghui@...wei.com>, <zhiquan1.li@...el.com>
Subject: Re: [RFC PATCH v2 35/51] mm: guestmem_hugetlb: Add support for
 splitting and merging pages

On Wed, May 14, 2025 at 04:42:14PM -0700, Ackerley Tng wrote:
> These functions allow guest_memfd to split and merge HugeTLB pages,
> and clean them up on freeing the page.
> 
> For merging and splitting pages on conversion, guestmem_hugetlb
> expects the refcount on the pages to already be 0. The caller must
> ensure that.
> 
> For conversions, guest_memfd ensures that the refcounts are already 0
> by checking that there are no unexpected refcounts, and then freezing
> the expected refcounts away. On unexpected refcounts, guest_memfd will
> return an error to userspace.
> 
> For truncation, on unexpected refcounts, guest_memfd will return an
> error to userspace.
> 
> For truncation on closing, guest_memfd will just remove its own
> refcounts (the filemap refcounts) and mark split pages with
> PGTY_guestmem_hugetlb.
> 
> The presence of PGTY_guestmem_hugetlb will trigger the folio_put()
> callback to handle further cleanup. This cleanup process will merge
> pages (with refcount 0, since cleanup is triggered from folio_put())
> before returning the pages to HugeTLB.
> 
> Since the merging process is long, it is deferred to a worker thread
> since folio_put() could be called from atomic context.
> 
> Change-Id: Ib04a3236f1e7250fd9af827630c334d40fb09d40
> Signed-off-by: Ackerley Tng <ackerleytng@...gle.com>
> Co-developed-by: Vishal Annapurve <vannapurve@...gle.com>
> Signed-off-by: Vishal Annapurve <vannapurve@...gle.com>
> ---
>  include/linux/guestmem.h |   3 +
>  mm/guestmem_hugetlb.c    | 349 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 347 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/guestmem.h b/include/linux/guestmem.h
> index 4b2d820274d9..3ee816d1dd34 100644
> --- a/include/linux/guestmem.h
> +++ b/include/linux/guestmem.h
> @@ -8,6 +8,9 @@ struct guestmem_allocator_operations {
>  	void *(*inode_setup)(size_t size, u64 flags);
>  	void (*inode_teardown)(void *private, size_t inode_size);
>  	struct folio *(*alloc_folio)(void *private);
> +	int (*split_folio)(struct folio *folio);
> +	void (*merge_folio)(struct folio *folio);
> +	void (*free_folio)(struct folio *folio);
>  	/*
>  	 * Returns the number of PAGE_SIZE pages in a page that this guestmem
>  	 * allocator provides.
> diff --git a/mm/guestmem_hugetlb.c b/mm/guestmem_hugetlb.c
> index ec5a188ca2a7..8727598cf18e 100644
> --- a/mm/guestmem_hugetlb.c
> +++ b/mm/guestmem_hugetlb.c
> @@ -11,15 +11,12 @@
>  #include <linux/mm.h>
>  #include <linux/mm_types.h>
>  #include <linux/pagemap.h>
> +#include <linux/xarray.h>
>  
>  #include <uapi/linux/guestmem.h>
>  
>  #include "guestmem_hugetlb.h"
> -
> -void guestmem_hugetlb_handle_folio_put(struct folio *folio)
> -{
> -	WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> -}
> +#include "hugetlb_vmemmap.h"
>  
>  struct guestmem_hugetlb_private {
>  	struct hstate *h;
> @@ -34,6 +31,339 @@ static size_t guestmem_hugetlb_nr_pages_in_folio(void *priv)
>  	return pages_per_huge_page(private->h);
>  }
>  
> +static DEFINE_XARRAY(guestmem_hugetlb_stash);
> +
> +struct guestmem_hugetlb_metadata {
> +	void *_hugetlb_subpool;
> +	void *_hugetlb_cgroup;
> +	void *_hugetlb_hwpoison;
> +	void *private;
> +};
> +
> +struct guestmem_hugetlb_stash_item {
> +	struct guestmem_hugetlb_metadata hugetlb_metadata;
> +	/* hstate tracks the original size of this folio. */
> +	struct hstate *h;
> +	/* Count of split pages, individually freed, waiting to be merged. */
> +	atomic_t nr_pages_waiting_to_be_merged;
> +};
> +
> +struct workqueue_struct *guestmem_hugetlb_wq __ro_after_init;
> +static struct work_struct guestmem_hugetlb_cleanup_work;
> +static LLIST_HEAD(guestmem_hugetlb_cleanup_list);
> +
> +static inline void guestmem_hugetlb_register_folio_put_callback(struct folio *folio)
> +{
> +	__folio_set_guestmem_hugetlb(folio);
> +}
> +
> +static inline void guestmem_hugetlb_unregister_folio_put_callback(struct folio *folio)
> +{
> +	__folio_clear_guestmem_hugetlb(folio);
> +}
> +
> +static inline void guestmem_hugetlb_defer_cleanup(struct folio *folio)
> +{
> +	struct llist_node *node;
> +
> +	/*
> +	 * Reuse the folio->mapping pointer as a struct llist_node, since
> +	 * folio->mapping is NULL at this point.
> +	 */
> +	BUILD_BUG_ON(sizeof(folio->mapping) != sizeof(struct llist_node));
> +	node = (struct llist_node *)&folio->mapping;
> +
> +	/*
> +	 * Only schedule work if list is previously empty. Otherwise,
> +	 * schedule_work() had been called but the workfn hasn't retrieved the
> +	 * list yet.
> +	 */
> +	if (llist_add(node, &guestmem_hugetlb_cleanup_list))
> +		queue_work(guestmem_hugetlb_wq, &guestmem_hugetlb_cleanup_work);
> +}
> +
> +void guestmem_hugetlb_handle_folio_put(struct folio *folio)
> +{
> +	guestmem_hugetlb_unregister_folio_put_callback(folio);
> +
> +	/*
> +	 * folio_put() can be called in interrupt context, hence do the work
> +	 * outside of interrupt context
> +	 */
> +	guestmem_hugetlb_defer_cleanup(folio);
> +}
> +
> +/*
> + * Stash existing hugetlb metadata. Use this function just before splitting a
> + * hugetlb page.
> + */
> +static inline void
> +__guestmem_hugetlb_stash_metadata(struct guestmem_hugetlb_metadata *metadata,
> +				  struct folio *folio)
> +{
> +	/*
> +	 * (folio->page + 1) doesn't have to be stashed since those fields are
> +	 * known on split/reconstruct and will be reinitialized anyway.
> +	 */
> +
> +	/*
> +	 * subpool is created for every guest_memfd inode, but the folios will
> +	 * outlive the inode, hence we store the subpool here.
> +	 */
> +	metadata->_hugetlb_subpool = folio->_hugetlb_subpool;
> +	/*
> +	 * _hugetlb_cgroup has to be stored for freeing
> +	 * later. _hugetlb_cgroup_rsvd does not, since it is NULL for
> +	 * guest_memfd folios anyway. guest_memfd reservations are handled in
> +	 * the inode.
> +	 */
> +	metadata->_hugetlb_cgroup = folio->_hugetlb_cgroup;
> +	metadata->_hugetlb_hwpoison = folio->_hugetlb_hwpoison;
> +
> +	/*
> +	 * HugeTLB flags are stored in folio->private. stash so that ->private
> +	 * can be used by core-mm.
> +	 */
> +	metadata->private = folio->private;
> +}
> +
> +static int guestmem_hugetlb_stash_metadata(struct folio *folio)
> +{
> +	XA_STATE(xas, &guestmem_hugetlb_stash, 0);
> +	struct guestmem_hugetlb_stash_item *stash;
> +	void *entry;
> +
> +	stash = kzalloc(sizeof(*stash), GFP_KERNEL);
> +	if (!stash)
> +		return -ENOMEM;
> +
> +	stash->h = folio_hstate(folio);
> +	__guestmem_hugetlb_stash_metadata(&stash->hugetlb_metadata, folio);
> +
> +	xas_set_order(&xas, folio_pfn(folio), folio_order(folio));
> +
> +	xas_lock(&xas);
> +	entry = xas_store(&xas, stash);
> +	xas_unlock(&xas);
> +
> +	if (xa_is_err(entry)) {
> +		kfree(stash);
> +		return xa_err(entry);
> +	}
> +
> +	return 0;
> +}
> +
> +static inline void
> +__guestmem_hugetlb_unstash_metadata(struct guestmem_hugetlb_metadata *metadata,
> +				    struct folio *folio)
> +{
> +	folio->_hugetlb_subpool = metadata->_hugetlb_subpool;
> +	folio->_hugetlb_cgroup = metadata->_hugetlb_cgroup;
> +	folio->_hugetlb_cgroup_rsvd = NULL;
> +	folio->_hugetlb_hwpoison = metadata->_hugetlb_hwpoison;
> +
> +	folio_change_private(folio, metadata->private);

Hi Ackerley,

We've been doing some testing with this series on top of David's
guestmemfd-preview branch, with some SNP enablement[1][2] to exercise
this code along with the NUMA support from Shivank. (BTW, I know you
have v3 in the works, so let me know if we can help with testing that
as well.)

One issue we hit: after a split->merge sequence, unstashing the private
data causes folio_test_hugetlb_vmemmap_optimized() to report true even
though hugetlb_vmemmap_optimize_folio() hasn't actually been run on the
merged folio. When it does get called later, it is skipped because the
flag is already set, so some HVO savings can be silently lost this way.

More troublesome, however, is that if you later split the folio again,
hugetlb_vmemmap_restore_folio() may trigger a BUG_ON(), since the flags
are in a state that's inconsistent with the actual state of the
folio/vmemmap.

The following patch seems to resolve the issue, but I'm not sure what
the best approach would be:

  https://github.com/AMDESE/linux/commit/b1f25956f18d32730b8d4ded6d77e980091eb4d3

Thanks,

Mike

[1] https://github.com/AMDESE/linux/commits/snp-hugetlb-v2-wip0/
[2] https://github.com/AMDESE/qemu/tree/snp-hugetlb-dev-wip0

