[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aQO3Zko6Qrk7O96u@x1.local>
Date: Thu, 30 Oct 2025 15:07:18 -0400
From: Peter Xu <peterx@...hat.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
	David Hildenbrand <david@...hat.com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, Mike Rapoport <rppt@...nel.org>,
	Muchun Song <muchun.song@...ux.dev>,
	Nikita Kalyazin <kalyazin@...zon.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	James Houghton <jthoughton@...gle.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Hugh Dickins <hughd@...gle.com>, Michal Hocko <mhocko@...e.com>,
	Ujwal Kundur <ujwal.kundur@...il.com>,
	Oscar Salvador <osalvador@...e.de>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types
On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@...hat.com> [251021 12:28]:
> 
> ...
> 
> > Can you send some patches and show us the code, help everyone to support
> > guest-memfd minor fault, please?
> 
> Patches are here:
> 
> https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem
Great!  Finally we have something solid to discuss on top.
Yes, I'm extremely happy to see whatever code there is, I'm happy to review
it.  I'm happy to see it rolling.  If it is better, we can adopt it.
> 
> This is actually modularized memory types.  That means there is no
> hugetlb.h or shmem.h included in mm/userfaultfd.c code.
> 
> uffd_flag_t has been removed.  This was turning into a middleware and
> it is not necessary.  Neither is supported_ioctls.
> 
> hugetlb now uses the same functions as every other memory type,
> including anon memory.
> 
> Any memory type can change functionality without adding instructions or
> flags or anything to some other code.
> 
> This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap
> tests).
> 
> guest-memfd can implement whatever it needs to (or use others
> implementations), like shmem_uffd_ops here:
I didn't look at the patches in details, however I already commented
previously on a similar comment you left.  Since you have solid code this
time, let me ask one by one clearly this time inline:
> 
> static const struct vm_uffd_ops shmem_uffd_ops = {
>         .copy                   =       shmem_mfill_atomic_pte_copy,
This is optional, if you did the convertion it's perfect (though it's buggy
right now, more below).  Yes, UFFDIO_COPY might be a good fit for a global
API like this, however that's the least useful, as I mentioned, I do not
expect a new user..
It means, what you did on this may not grow any user.  The whole change may
not be useful to anyone..
Then I see what you introduced as the API:
struct vm_uffd_ops {
	int (*copy)(struct uffd_info *info);
	int (*zeropage)(struct uffd_info *info);
	int (*cont)(struct uffd_info *info);
	int (*poison)(struct uffd_info *info);
	int (*writeprotect)(struct uffd_info *info);
	/* Required features below */
	ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma,
				unsigned long dst_start, unsigned long len);
	unsigned long (*increment)(struct vm_area_struct *vma);
	ssize_t (*failed_do_unlock)(struct uffd_info *info);
	unsigned int (*page_shift)(struct vm_area_struct *src_vma);
	void (*complete_register)(struct vm_area_struct *vma);
};
The copy() interface (and most of the rest) takes something called
uffd_info*.  Then it looks like this:
struct uffd_info {
	unsigned long dst_addr;			/* Target address */
	unsigned long src_addr;			/* Source address */
	unsigned long len;			/* Total length to copy */
	unsigned long src_last;			/* starting src_addr + len */
	unsigned long dst_last;			/* starting dst_addr + len */
	struct folio *foliop;			/* folio pointer for retry */
	struct userfaultfd_ctx *ctx;		/* The userfaultfd context */
	struct vm_area_struct *dst_vma;		/* Target vma */
	unsigned long increment;		/* Size of each operation */
	bool wp;				/* Operation is requesting write protection */
	const struct vm_uffd_ops *uffd_ops;	/* The vma uffd_ops pointer */
	int (*op)(struct uffd_info *info);	/* The operation to perform on the dst */
	long copied;
};
You went almost mad when I introduced uffd_copy() in v1.  It might be
because there used to have pmd* and something around pgtables.  However
I'll still need to question whether this is a better and easier to adopt
interface if a mem type wants to opt-in any uffd features.
So are you happy yourself now with above complicated struct that memtype
needs to implement and support?
>         .zeropage               =       shmem_mfill_atomic_pte_zeropage,
Why a memory type needs to provide a separate hook to inject a zeropage?
It should almost always be the same of UFFDIO_COPY except copying.
>         .cont                   =       shmem_mfill_atomic_pte_continue,
It's OK to have it.  However said that, what we really need is "fetching a
cache folio".  I'll need to check how you exposed the userfaultfd helpers
so that it will support mem types to opt-in this.  To me, this is really an
overkill.
Shmem impl:
static int shmem_mfill_atomic_pte_continue(struct uffd_info *info)
{
	struct vm_area_struct *dst_vma = info->dst_vma;
	struct inode *inode = file_inode(dst_vma->vm_file);
	pgoff_t pgoff = linear_page_index(dst_vma, info->dst_addr);
	pmd_t *dst_pmd;
	struct folio *folio;
	struct page *page;
	int ret;
	ret = uffd_get_dst_pmd(dst_vma, info->dst_addr, &dst_pmd);
	if (ret)
		return ret;
	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
	/* Our caller expects us to return -EFAULT if we failed to find folio */
	if (ret == -ENOENT)
		ret = -EFAULT;
	if (ret)
		goto out;
	if (!folio) {
		ret = -EFAULT;
		goto out;
	}
	page = folio_file_page(folio, pgoff);
	if (PageHWPoison(page)) {
		ret = -EIO;
		goto out_release;
	}
	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, info->dst_addr,
				       page, false, info->wp);
	if (ret)
		goto out_release;
	folio_unlock(folio);
	ret = 0;
out:
	return ret;
out_release:
	folio_unlock(folio);
	folio_put(folio);
	goto out;
}
So are you sure this is better than the oneliner?
In your new API, the driver also needs to operate on pmd*.  Is it a concern
to you?  Maybe you don't think it's a concern now, even if you used to
think uffd_copy() has concerns exposing pmd* pointers?
The current series I proposed is not only simpler, but only expose folio*.
That's at least safer at least from your theory, is that right?
>         .poison                 =       mfill_atomic_pte_poison,
Why this is needed if UFFDIO_POISON should exactly do the same thing for
each memory type?
>         .writeprotect           =       uffd_writeprotect,
Same question.  Wr-protect over a pagecache folio should really behave the
same.  Why do you need to introduce it?
After all, even in your branch you reused change_protection(), where the
hugetlb special operations reside.  I don't see much point on why it's
needed.
>         .is_dst_valid           =       shmem_is_dst_valid,
It's definitely not obvious what this is for.
Looks like it's trying to verify vma validity.  However then I see shmem
impl has this:
static ssize_t shmem_is_dst_valid(struct vm_area_struct *dst_vma,
		unsigned long dst_start, unsigned long len)
{
	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
	    return -EINVAL;
	return 0;
}
Why shmem allows anon vma?
>         .increment              =       mfill_size,
This is almost only useful for hugetlbfs, same to page_shift and
complete_register.
It's ok if you want to clean hugetlbfs.  But if we want to fetch the
granule of the folios inside a vma, IMHO we should make it a vma attribute,
not something special to userfaultfd.
>         .failed_do_unlock       =       uffd_failed_do_unlock,
You'd better at least change the name of it.  It so far unlocks some
misterious locks then try to copy the folio without mmap/vma lock.
inline ssize_t
uffd_failed_do_unlock(struct uffd_info *info)
{
	ssize_t err;
	void *kaddr;
	up_read(&info->ctx->map_changing_lock);
	uffd_mfill_unlock(info->dst_vma);
	VM_WARN_ON_ONCE(!info->foliop);
	kaddr = kmap_local_folio(info->foliop, 0);
	err = copy_from_user(kaddr, (const void __user *) info->src_addr,
			     PAGE_SIZE);
	kunmap_local(kaddr);
	if (unlikely(err))
		return -EFAULT;
	flush_dcache_folio(info->foliop);
	return 0;
}
How do the mem type know what locks to unlock if they do not lock it first
themselves?
>         .page_shift             =       uffd_page_shift,
>         .complete_register      =       uffd_complete_register,
Hugetlbfs specific hooks.  It's OK if you prefer rewritting code for
hugetlbfs.  I don't have objections.
> };   
> 
> Where guest-memfd needs to write the one function:
> guest_memfd_pte_continue(), from what I understand.
You did mention what is required in your new API:
	/* Required features below */
	ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma,
				unsigned long dst_start, unsigned long len);
	unsigned long (*increment)(struct vm_area_struct *vma);
	ssize_t (*failed_do_unlock)(struct uffd_info *info);
	unsigned int (*page_shift)(struct vm_area_struct *src_vma);
	void (*complete_register)(struct vm_area_struct *vma);
So guest-memfd needs to implement these 6 APIs to support minor fault.
> 
> Obviously some of the shmem_ functions would need to be added to a
> header, or such.
> 
> And most of that can come from shmem_mfill_atomic_pte_continue(), from
> what I understand.  This is about 40 lines of code, but may require
> exposing some shmem functions to keep the code that compact.
> 
> So we don't need to expose getting a folio to a module, or decode any
> special flags or whatever.  We just call the function that needs to be
I didn't reply before, but I don't think supported_ioctls is special flags
or magic values.  They're simply flags showing what the mem type supports
on the ioctls.  One can set it wrong, but I don't think what you proposed
with above 6 APIs would be easier to get right.  Any module can also make
things wrong in any of above.
UFFDIO_CONTINUE itself is definitely getting much more complicated, it used
to only set a flag in supported_ioctls, provide a oneliner to fetch a
cache.  Now it needs all above 6 APIs, one of them taking uffd_info* which
further contains 10+ fields to process.
> called on the vma that is found.
> 
> If anyone has tests I can use for guest-memfd and instructions on
> guest-memfd setup, I'll just write it instead of expanding the
> userfaultfd middleware.
Personally, this whole thing is an overkill to me:
$ git diff --stat 63f84ba525ea04ef376eac851efce2f82dd05f21..HEAD
 fs/userfaultfd.c              |  14 +--
 include/linux/hugetlb.h       |  21 ----
 include/linux/mm.h            |  11 ++
 include/linux/shmem_fs.h      |  14 ---
 include/linux/userfaultfd_k.h | 108 ++++++++++------
 mm/hugetlb.c                  | 359 +++++++++++++++++++++++++++++++++++++++++++++--------
 mm/shmem.c                    | 245 ++++++++++++++++++++++++------------
 mm/userfaultfd.c              | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------------
 8 files changed, 962 insertions(+), 679 deletions(-)
I can wait for some second opinions, but if you are confident this will be
welcomed by others, I suggest you prepare a minimum changeset preparing
support for guest-memfd minor fault, then post it formally upstream.
Thanks,
-- 
Peter Xu
Powered by blists - more mailing lists
 
