Message-ID: <aN_fJEZXo6wkcHOh@google.com>
Date: Fri, 3 Oct 2025 07:35:16 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: kvm@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
x86@...nel.org, linux-fsdevel@...r.kernel.org, aik@....com,
ajones@...tanamicro.com, akpm@...ux-foundation.org, amoorthy@...gle.com,
anthony.yznaga@...cle.com, anup@...infault.org, aou@...s.berkeley.edu,
bfoster@...hat.com, binbin.wu@...ux.intel.com, brauner@...nel.org,
catalin.marinas@....com, chao.p.peng@...el.com, chenhuacai@...nel.org,
dave.hansen@...el.com, david@...hat.com, dmatlack@...gle.com,
dwmw@...zon.co.uk, erdemaktas@...gle.com, fan.du@...el.com, fvdl@...gle.com,
graf@...zon.com, haibo1.xu@...el.com, hch@...radead.org, hughd@...gle.com,
ira.weiny@...el.com, isaku.yamahata@...el.com, jack@...e.cz,
james.morse@....com, jarkko@...nel.org, jgg@...pe.ca, jgowans@...zon.com,
jhubbard@...dia.com, jroedel@...e.de, jthoughton@...gle.com,
jun.miao@...el.com, kai.huang@...el.com, keirf@...gle.com,
kent.overstreet@...ux.dev, kirill.shutemov@...el.com, liam.merwick@...cle.com,
maciej.wieczor-retman@...el.com, mail@...iej.szmigiero.name, maz@...nel.org,
mic@...ikod.net, michael.roth@....com, mpe@...erman.id.au,
muchun.song@...ux.dev, nikunj@....com, nsaenz@...zon.es,
oliver.upton@...ux.dev, palmer@...belt.com, pankaj.gupta@....com,
paul.walmsley@...ive.com, pbonzini@...hat.com, pdurrant@...zon.co.uk,
peterx@...hat.com, pgonda@...gle.com, pvorel@...e.cz, qperret@...gle.com,
quic_cvanscha@...cinc.com, quic_eberman@...cinc.com,
quic_mnalajal@...cinc.com, quic_pderrin@...cinc.com, quic_pheragu@...cinc.com,
quic_svaddagi@...cinc.com, quic_tsoni@...cinc.com, richard.weiyang@...il.com,
rick.p.edgecombe@...el.com, rientjes@...gle.com, roypat@...zon.co.uk,
rppt@...nel.org, shuah@...nel.org, steven.price@....com,
steven.sistare@...cle.com, suzuki.poulose@....com, tabba@...gle.com,
thomas.lendacky@....com, usama.arif@...edance.com, vannapurve@...gle.com,
vbabka@...e.cz, viro@...iv.linux.org.uk, vkuznets@...hat.com,
wei.w.wang@...el.com, will@...nel.org, willy@...radead.org,
xiaoyao.li@...el.com, yan.y.zhao@...el.com, yilun.xu@...el.com,
yuzenghui@...wei.com, zhiquan1.li@...el.com
Subject: Re: [RFC PATCH v2 29/51] mm: guestmem_hugetlb: Wrap HugeTLB as an
allocator for guest_memfd
On Wed, May 14, 2025, Ackerley Tng wrote:
> guestmem_hugetlb is an allocator for guest_memfd. It wraps HugeTLB to
> provide huge folios for guest_memfd.
>
> This patch also introduces guestmem_allocator_operations as a set of
> operations that allocators for guest_memfd can provide. In a later
> patch, guest_memfd will use these operations to manage pages from an
> allocator.
>
> The allocator operations are memory-management specific and are placed
> in mm/ so key mm-specific functions do not have to be exposed
> unnecessarily.
This code doesn't have to be put in mm/; all of the #includes are to <linux/xxx.h>.
Unless I'm missing something, what you actually want to avoid is _exporting_ mm/
APIs, and for that all that's needed is to ensure the code is built into the kernel
binary, not into kvm.ko.
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index d047d4cf58c9..c18c77e8a638 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -13,3 +13,5 @@ kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
kvm-$(CONFIG_KVM_GUEST_MEMFD) += $(KVM)/guest_memfd.o
+
+obj-$(subst m,y,$(CONFIG_KVM_GUEST_MEMFD)) += $(KVM)/guest_memfd_hugepages.o
\ No newline at end of file
People may want the code to live in mm/ for maintenance and ownership reasons
(or not, I haven't followed the discussions on hugepage support), but that's a
very different justification than what's described in the changelog.
And if the _only_ user is guest_memfd, putting this in mm/ feels quite weird.
And if we anticipate other users, the name guestmem_hugetlb is weird, because
AFAICT there's nothing in here that is in any way guest specific, it's just a
few APIs for allocating and accounting hugepages.
Personally, I don't see much point in trying to make this a "generic" library,
in quotes because the whole guestmem_xxx namespace makes it anything but generic.
I don't see anything in mm/guestmem_hugetlb.c that makes me go "ooh, that's nasty,
I'm glad this is handled by a library". But if we want to go straight to a
library, it should be something that is really truly generic, i.e. not "guest"
specific in any way.
> Signed-off-by: Ackerley Tng <ackerleytng@...gle.com>
>
> Change-Id: I3cafe111ea7b3c84755d7112ff8f8c541c11136d
> ---
> include/linux/guestmem.h | 20 +++++
> include/uapi/linux/guestmem.h | 29 +++++++
> mm/Kconfig | 5 +-
> mm/guestmem_hugetlb.c | 159 ++++++++++++++++++++++++++++++++++
> 4 files changed, 212 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/guestmem.h
> create mode 100644 include/uapi/linux/guestmem.h
..
> diff --git a/include/uapi/linux/guestmem.h b/include/uapi/linux/guestmem.h
> new file mode 100644
> index 000000000000..2e518682edd5
> --- /dev/null
> +++ b/include/uapi/linux/guestmem.h
With my KVM hat on, NAK to defining uAPI in a library like this. This subtly
defines uAPI for KVM, and effectively any other userspace-facing entity that
utilizes the library/allocator. KVM's uAPI needs to be defined by KVM, period.
There's absolutely zero reason to have guestmem_hugetlb_setup() take in flags.
Explicitly pass the page size, or if preferred, the page_size_log, and let the
caller figure out how to communicate the size to the kernel.
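E.g. a rough sketch (the signature and the helper below are purely illustrative,
not the patch as posted), leaning on hugetlb's existing hstate_sizelog():

	void *guestmem_hugetlb_setup(size_t size, int page_size_log)
	{
		struct hstate *h = hstate_sizelog(page_size_log);

		if (!h)
			return ERR_PTR(-EINVAL);

		/* hypothetical helper standing in for the rest of the setup */
		return guestmem_hugetlb_setup_hstate(h, size);
	}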
IMO, the whole MAP_HUGE_xxx approach is a (clever) hack to squeeze the desired
size into mmap() flags. I don't see any reason to carry that forward to guest_memfd.
For once, we had the foresight to reserve some space in KVM's uAPI structure, so
there's no need to squeeze things into flags.
E.g. we could do something like this:
diff --git include/uapi/linux/kvm.h include/uapi/linux/kvm.h
index 42053036d38d..b79914472d27 100644
--- include/uapi/linux/kvm.h
+++ include/uapi/linux/kvm.h
@@ -1605,11 +1605,16 @@ struct kvm_memory_attributes {
#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
#define GUEST_MEMFD_FLAG_MMAP (1ULL << 0)
#define GUEST_MEMFD_FLAG_INIT_SHARED (1ULL << 1)
+#define GUEST_MEMFD_FLAG_HUGE_PAGES (1ULL << 2)
struct kvm_create_guest_memfd {
__u64 size;
__u64 flags;
- __u64 reserved[6];
+ __u8 huge_page_size_log2;
+ __u8 reserve8;
+ __u16 reserve16;
+ __u32 reserve32;
+ __u64 reserved[5];
};
#define KVM_PRE_FAULT_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_pre_fault_memory)
And not have to burn 6 bits of flags to encode the size in a weird location.
But that's a detail for KVM to sort out, which is exactly my point; how this is
presented to userspace for guest_memfd is a question for KVM.
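Userspace usage of something like the above would then be trivial, e.g.
(illustrative only, obviously untested):

	struct kvm_create_guest_memfd gmem = {
		.size  = gmem_size,
		.flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_HUGE_PAGES,
		.huge_page_size_log2 = 21,	/* 2MiB huge pages */
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);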
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_GUESTMEM_H
> +#define _UAPI_LINUX_GUESTMEM_H
> +
> +/*
> + * Huge page size must be explicitly defined when using the guestmem_hugetlb
> + * allocator for guest_memfd. It is the responsibility of the application to
> + * know which sizes are supported on the running system. See mmap(2) man page
> + * for details.
> + */
> +
> +#define GUESTMEM_HUGETLB_FLAG_SHIFT 58
> +#define GUESTMEM_HUGETLB_FLAG_MASK 0x3fUL
> +
> +#define GUESTMEM_HUGETLB_FLAG_16KB (14UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_64KB (16UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_512KB (19UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_1MB (20UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_2MB (21UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_8MB (23UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_16MB (24UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_32MB (25UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_256MB (28UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_512MB (29UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_1GB (30UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_2GB (31UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +#define GUESTMEM_HUGETLB_FLAG_16GB (34UL << GUESTMEM_HUGETLB_FLAG_SHIFT)
> +
> +#endif /* _UAPI_LINUX_GUESTMEM_H */
...
> +const struct guestmem_allocator_operations guestmem_hugetlb_ops = {
> + .inode_setup = guestmem_hugetlb_setup,
> + .inode_teardown = guestmem_hugetlb_teardown,
> + .alloc_folio = guestmem_hugetlb_alloc_folio,
> + .nr_pages_in_folio = guestmem_hugetlb_nr_pages_in_folio,
> +};
> +EXPORT_SYMBOL_GPL(guestmem_hugetlb_ops);
Why are these bundled into a structure? AFAICT, that adds layers of indirection
for absolutely no reason. And especially on the KVM guest_memfd side, implementing
a pile of infrastructure to support "custom" allocators is very premature. Without
a second "custom" allocator, it's impossible to determine if the indirection
provided is actually a good design. I.e. all of the kvm_gmem_has_custom_allocator()
logic in guest_memfd.c is just HugeTLB logic buried behind a layer of unnecessary
indirection.
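E.g. (hand-wavy, signatures assumed purely for illustration) a direct call in the
allocation path reads just as well without any ops table:

	/*
	 * Illustrative only: guestmem_hugetlb_alloc_folio()'s real signature may
	 * differ, and kvm_gmem_has_hugetlb() is made up; the point is the direct
	 * call vs. bouncing through allocator ops.
	 */
	if (kvm_gmem_has_hugetlb(inode))
		folio = guestmem_hugetlb_alloc_folio(inode, index);
	else
		folio = filemap_grab_folio(inode->i_mapping, index);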