linux-kernel - Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6odeeo7bgxgq4v6y3jercrriqyreynuelofrw6k6roh7ws5vy2@wyvx7uiztb5y>
Date: Thu, 30 Oct 2025 16:24:46 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Peter Xu <peterx@...hat.com>
Cc: David Hildenbrand <david@...hat.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, Mike Rapoport <rppt@...nel.org>,
        Muchun Song <muchun.song@...ux.dev>,
        Nikita Kalyazin <kalyazin@...zon.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        James Houghton <jthoughton@...gle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Hugh Dickins <hughd@...gle.com>, Michal Hocko <mhocko@...e.com>,
        Ujwal Kundur <ujwal.kundur@...il.com>,
        Oscar Salvador <osalvador@...e.de>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types

* Peter Xu <peterx@...hat.com> [251030 15:07]:
> On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@...hat.com> [251021 12:28]:
> > 
> > ...
> > 
> > > Can you send some patches and show us the code, help everyone to support
> > > guest-memfd minor fault, please?
> > 
> > Patches are here:
> > 
> > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem
> 
> Great!  Finally we have something solid to discuss on top.
> 
> Yes, I'm extremely happy to see whatever code there is, I'm happy to review
> it.  I'm happy to see it rolling.  If it is better, we can adopt it.
> 
> > 
> > This is actually modularized memory types.  That means there is no
> > hugetlb.h or shmem.h included in mm/userfaultfd.c code.
> > 
> > uffd_flag_t has been removed.  This was turning into a middleware and
> > it is not necessary.  Neither is supported_ioctls.
> > 
> > hugetlb now uses the same functions as every other memory type,
> > including anon memory.
> > 
> > Any memory type can change functionality without adding instructions or
> > flags or anything to some other code.
> > 
> > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap
> > tests).
> > 
> > guest-memfd can implement whatever it needs to (or use others
> > implementations), like shmem_uffd_ops here:
> 
> I didn't look at the patches in details, however I already commented
> previously on a similar comment you left.  Since you have solid code this
> time, let me ask one by one clearly this time inline:
> 
> > 
> > static const struct vm_uffd_ops shmem_uffd_ops = {
> >         .copy                   =       shmem_mfill_atomic_pte_copy,
> 
> This is optional, if you did the convertion it's perfect (though it's buggy
> right now, more below).  Yes, UFFDIO_COPY might be a good fit for a global
> API like this, however that's the least useful, as I mentioned, I do not
> expect a new user..
> 
> It means, what you did on this may not grow any user.  The whole change may
> not be useful to anyone..

We currently have two users, that's why it's here.  It is useful because
we don't have hugetlb in the code anymore.  We don't even have shmem
code in the userfaultfd code.

> 
> Then I see what you introduced as the API:
> 
> struct vm_uffd_ops {
> 	int (*copy)(struct uffd_info *info);
> 	int (*zeropage)(struct uffd_info *info);
> 	int (*cont)(struct uffd_info *info);
> 	int (*poison)(struct uffd_info *info);
> 	int (*writeprotect)(struct uffd_info *info);
> 	/* Required features below */
> 	ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma,
> 				unsigned long dst_start, unsigned long len);
> 	unsigned long (*increment)(struct vm_area_struct *vma);
> 	ssize_t (*failed_do_unlock)(struct uffd_info *info);
> 	unsigned int (*page_shift)(struct vm_area_struct *src_vma);
> 	void (*complete_register)(struct vm_area_struct *vma);
> };
> 
> The copy() interface (and most of the rest) takes something called
> uffd_info*.  Then it looks like this:
> 
> struct uffd_info {
> 	unsigned long dst_addr;			/* Target address */
> 	unsigned long src_addr;			/* Source address */
> 	unsigned long len;			/* Total length to copy */
> 	unsigned long src_last;			/* starting src_addr + len */
> 	unsigned long dst_last;			/* starting dst_addr + len */
> 	struct folio *foliop;			/* folio pointer for retry */
> 	struct userfaultfd_ctx *ctx;		/* The userfaultfd context */
> 
> 	struct vm_area_struct *dst_vma;		/* Target vma */
> 	unsigned long increment;		/* Size of each operation */
> 	bool wp;				/* Operation is requesting write protection */
> 	const struct vm_uffd_ops *uffd_ops;	/* The vma uffd_ops pointer */
> 	int (*op)(struct uffd_info *info);	/* The operation to perform on the dst */
> 	long copied;
> };
> 
> You went almost mad when I introduced uffd_copy() in v1.  It might be
> because there used to have pmd* and something around pgtables.  However
> I'll still need to question whether this is a better and easier to adopt
> interface if a mem type wants to opt-in any uffd features.
> 
> So are you happy yourself now with above complicated struct that memtype
> needs to implement and support?

The entire thing is crazy, really.  I don't think modules should be
doing any of this.  It's nuts.  How can you register for a memory type
and not know how to deal with the memory type?  It makes no sense in the
first place.

The only reason that this is approaching acceptable is that memfd
already does the things it needs to do with it.  So we're trying to hand
something back to someone who already made the damn thing, but has no
way of finding it themselves.

Otherwise, this is stupid.

> 
> >         .zeropage               =       shmem_mfill_atomic_pte_zeropage,
> 
> Why a memory type needs to provide a separate hook to inject a zeropage?
> It should almost always be the same of UFFDIO_COPY except copying.

This also has two users.

> 
> >         .cont                   =       shmem_mfill_atomic_pte_continue,
> 
> It's OK to have it.  However said that, what we really need is "fetching a
> cache folio".  I'll need to check how you exposed the userfaultfd helpers
> so that it will support mem types to opt-in this.  To me, this is really an
> overkill.

This is literally doing what your subject said: modularize memory types.

> 
> Shmem impl:
> 
> static int shmem_mfill_atomic_pte_continue(struct uffd_info *info)
> {
> 	struct vm_area_struct *dst_vma = info->dst_vma;
> 	struct inode *inode = file_inode(dst_vma->vm_file);
> 	pgoff_t pgoff = linear_page_index(dst_vma, info->dst_addr);
> 	pmd_t *dst_pmd;
> 	struct folio *folio;
> 	struct page *page;
> 	int ret;
> 
> 	ret = uffd_get_dst_pmd(dst_vma, info->dst_addr, &dst_pmd);
> 	if (ret)
> 		return ret;
> 
> 	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
> 	/* Our caller expects us to return -EFAULT if we failed to find folio */
> 	if (ret == -ENOENT)
> 		ret = -EFAULT;
> 	if (ret)
> 		goto out;
> 	if (!folio) {
> 		ret = -EFAULT;
> 		goto out;
> 	}
> 
> 	page = folio_file_page(folio, pgoff);
> 	if (PageHWPoison(page)) {
> 		ret = -EIO;
> 		goto out_release;
> 	}
> 
> 	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, info->dst_addr,
> 				       page, false, info->wp);
> 	if (ret)
> 		goto out_release;
> 
> 	folio_unlock(folio);
> 	ret = 0;
> out:
> 	return ret;
> out_release:
> 	folio_unlock(folio);
> 	folio_put(folio);
> 	goto out;
> }
> 
> So are you sure this is better than the oneliner?
> 
> In your new API, the driver also needs to operate on pmd*.  Is it a concern
> to you?  Maybe you don't think it's a concern now, even if you used to
> think uffd_copy() has concerns exposing pmd* pointers?
> 
> The current series I proposed is not only simpler, but only expose folio*.
> That's at least safer at least from your theory, is that right?

I disagree, it's adding yet-another-mode to the code instead of function
pointers.  When do we stop bolting on and start fixing existing issues?

This change stands on its own without guest_memfd support.  Yours is an
anti-pattern.

> 
> >         .poison                 =       mfill_atomic_pte_poison,
> 
> Why this is needed if UFFDIO_POISON should exactly do the same thing for
> each memory type?

Yet, it doesn't so there's a function pointer for it.

> 
> >         .writeprotect           =       uffd_writeprotect,
> 
> Same question.  Wr-protect over a pagecache folio should really behave the
> same.  Why do you need to introduce it?

hugetlb is different.

> 
> After all, even in your branch you reused change_protection(), where the
> hugetlb special operations reside.  I don't see much point on why it's
> needed.

hugetlb uses hugetlb_change_protection().

> 
> >         .is_dst_valid           =       shmem_is_dst_valid,
> 
> It's definitely not obvious what this is for.

is_dst_valid checks to see if the destination is valid.  I can add a
comment if you want.

> 
> Looks like it's trying to verify vma validity.  However then I see shmem
> impl has this:
> 
> static ssize_t shmem_is_dst_valid(struct vm_area_struct *dst_vma,
> 		unsigned long dst_start, unsigned long len)
> {
> 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
> 	    return -EINVAL;
> 
> 	return 0;
> }
> 
> Why shmem allows anon vma?

We can drop the !vma_is_shmem() check from here, but I just moved them.
This is an existing check today.


> 
> >         .increment              =       mfill_size,
> 
> This is almost only useful for hugetlbfs, same to page_shift and
> complete_register.

Correct.

> 
> It's ok if you want to clean hugetlbfs.  But if we want to fetch the
> granule of the folios inside a vma, IMHO we should make it a vma attribute,
> not something special to userfaultfd.

If there is a user outside of uffd, then we can move it.  Until then,
let's not pollute outside of mm/userfaultfd.c please.

> 
> >         .failed_do_unlock       =       uffd_failed_do_unlock,
> 
> You'd better at least change the name of it.  It so far unlocks some
> misterious locks then try to copy the folio without mmap/vma lock.
> 
> inline ssize_t
> uffd_failed_do_unlock(struct uffd_info *info)
> {
> 	ssize_t err;
> 	void *kaddr;
> 
> 	up_read(&info->ctx->map_changing_lock);
> 	uffd_mfill_unlock(info->dst_vma);
> 	VM_WARN_ON_ONCE(!info->foliop);
> 
> 	kaddr = kmap_local_folio(info->foliop, 0);
> 	err = copy_from_user(kaddr, (const void __user *) info->src_addr,
> 			     PAGE_SIZE);
> 	kunmap_local(kaddr);
> 	if (unlikely(err))
> 		return -EFAULT;
> 
> 	flush_dcache_folio(info->foliop);
> 	return 0;
> }
> 
> How do the mem type know what locks to unlock if they do not lock it first
> themselves?

We can change this if you want to move the lock to the internal code,
but the failed exit due to the folio mess that copy makes required this
block of code before.

hugetlb has a different block of code.


> 
> >         .page_shift             =       uffd_page_shift,
> >         .complete_register      =       uffd_complete_register,
> 
> Hugetlbfs specific hooks.  It's OK if you prefer rewritting code for
> hugetlbfs.  I don't have objections.

Yes, hugetlb is a memory type and so it is also modularized.

> 
> > };   
> > 
> > Where guest-memfd needs to write the one function:
> > guest_memfd_pte_continue(), from what I understand.
> 
> You did mention what is required in your new API:
> 
> 	/* Required features below */
> 	ssize_t (*is_dst_valid)(struct vm_area_struct *dst_vma,
> 				unsigned long dst_start, unsigned long len);
> 	unsigned long (*increment)(struct vm_area_struct *vma);
> 	ssize_t (*failed_do_unlock)(struct uffd_info *info);
> 	unsigned int (*page_shift)(struct vm_area_struct *src_vma);
> 	void (*complete_register)(struct vm_area_struct *vma);
> 
> So guest-memfd needs to implement these 6 APIs to support minor fault.

Or, you know, call the existing functions.

> 
> > 
> > Obviously some of the shmem_ functions would need to be added to a
> > header, or such.
> > 
> > And most of that can come from shmem_mfill_atomic_pte_continue(), from
> > what I understand.  This is about 40 lines of code, but may require
> > exposing some shmem functions to keep the code that compact.
> > 
> > So we don't need to expose getting a folio to a module, or decode any
> > special flags or whatever.  We just call the function that needs to be
> 
> I didn't reply before, but I don't think supported_ioctls is special flags
> or magic values.  They're simply flags showing what the mem type supports
> on the ioctls.  One can set it wrong, but I don't think what you proposed
> with above 6 APIs would be easier to get right.  Any module can also make
> things wrong in any of above.
> 
> UFFDIO_CONTINUE itself is definitely getting much more complicated, it used
> to only set a flag in supported_ioctls, provide a oneliner to fetch a
> cache.  Now it needs all above 6 APIs, one of them taking uffd_info* which
> further contains 10+ fields to process.
> 
> > called on the vma that is found.
> > 
> > If anyone has tests I can use for guest-memfd and instructions on
> > guest-memfd setup, I'll just write it instead of expanding the
> > userfaultfd middleware.
> 
> Personally, this whole thing is an overkill to me:

Absolutely.  And it needed to be done.  You are welcome.

> 
> $ git diff --stat 63f84ba525ea04ef376eac851efce2f82dd05f21..HEAD
>  fs/userfaultfd.c              |  14 +--
>  include/linux/hugetlb.h       |  21 ----
>  include/linux/mm.h            |  11 ++
>  include/linux/shmem_fs.h      |  14 ---
>  include/linux/userfaultfd_k.h | 108 ++++++++++------
>  mm/hugetlb.c                  | 359 +++++++++++++++++++++++++++++++++++++++++++++--------
>  mm/shmem.c                    | 245 ++++++++++++++++++++++++------------
>  mm/userfaultfd.c              | 869 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------------
>  8 files changed, 962 insertions(+), 679 deletions(-)


Right, so the existing code a huge mess today and you won't fix
anything, ever.

We currently have a uffd_flags_t that sets wp or not then passes it
through to set another part of the flag to pass it through again.  This
is stupid.

And then it gets more stupid.  We then check that flag exactly once for
the second argument in a global function and NEVER LOOK AT THAT PART
AGAIN.  So we now have a type to contain a boolean, but we keep passing
through the uffd_flags_t, so we can see if wp is set or not.

And you're fine with it.  In fact, let's add another one.  I mean, we
tried for two but people didn't like two so lets whine and whine and
whine until people get frustrated and let you push a second one of those
gems into the code you REFUSED TO MAINTAIN.

This is what owning a problem looks like: I removed the uufd_flags_t,
because it's stupid.  I removed all the hugetlb checks because I
modularized the memory types.

I'm happy to address changes, but I'm not happy to accept more
middleware and "it's not part of the patch set" to address any problem
as you push more trash into an already horrible code base.

We need to fix things too.

So I'm fixing it.
Liam