linux-kernel - Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <boylgf5thjyblsvjlumxrwg5qfl43e4revh5i7yyh2yiddhgah@gtht57bgkuzn>
Date: Thu, 6 Nov 2025 11:16:20 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: David Hildenbrand <dhildenb@...hat.com>
Cc: Peter Xu <peterx@...hat.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, Mike Rapoport <rppt@...nel.org>,
        Muchun Song <muchun.song@...ux.dev>,
        Nikita Kalyazin <kalyazin@...zon.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        James Houghton <jthoughton@...gle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Hugh Dickins <hughd@...gle.com>, Michal Hocko <mhocko@...e.com>,
        Ujwal Kundur <ujwal.kundur@...il.com>,
        Oscar Salvador <osalvador@...e.de>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types

* David Hildenbrand <dhildenb@...hat.com> [251105 16:23]:
> On 30.10.25 18:13, Liam R. Howlett wrote:
> > * Peter Xu <peterx@...hat.com> [251021 12:28]:
> > 
> > ...
> > 
> > > Can you send some patches and show us the code, help everyone to support
> > > guest-memfd minor fault, please?
> > 
> > Patches are here:
> > 
> 
> Hi Liam,
> 
> thanks for showing us what userfaultfd could look like when refactored
> according to your idea. I think most of the userfaultfd core code is easier
> to get in your tree.
> 
> > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem
> > 
> > This is actually modularized memory types.  That means there is no
> > hugetlb.h or shmem.h included in mm/userfaultfd.c code.
> 
> Yeah, I think there is this confusion of "modulize memory types" and
> "support minor fault in guest_memfd".
> 
> So I see what you did here as trying to see how far we could go to remove
> any traces of shmem/hugetlb from userfaultd core.

Right, yes.  This is an example of what we can do to completely
modularize the memory types.  If this were the reality, it might be
possible to create a guest_memfd type that has the necessary changes and
only export what is required for the module to execute - that is, avoid
exporting memory functions and also avoid having the bulk of guest_memfd
remain where it is today.

> 
> So I'll comment based on that and rather see it as a bigger, more extreme
> rework.

Sounds good, I was hoping the patches could be a starting point for a
conversation.

> 
> > 
> > uffd_flag_t has been removed.  This was turning into a middleware and
> > it is not necessary.  Neither is supported_ioctls.
> 
> I assume you mean the entries that were proposed in Peters series, not
> something that is upstream.

No.  This is upstream today.  uffd_flags_t is used for two purposes
today: 1. deciding which operation to call and 2. pass through if the
request includes wp.  In fact, some of the callers just create and send
the flag only within the mm/userfaultfd.c code because wp doesn't make
sense.  Once dispatched to the operation code, the flag is only ever
used to check for wp.

If the code was structured to use the call path to know what underlying
operation, then the flag can be reduced to a boolean - which is what I
ended up doing.

> 
> > 
> > hugetlb now uses the same functions as every other memory type,
> > including anon memory.
> > 
> > Any memory type can change functionality without adding instructions or
> > flags or anything to some other code.
> > 
> > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap
> > tests).
> > 
> > guest-memfd can implement whatever it needs to (or use others
> > implementations), like shmem_uffd_ops here:
> 
> There is obviously some downside to be had with this approach (some of which
> Mike raised), regarding the interface to "memory types" implementing this,
> but I'll discuss that later.
> 
> > 
> > static const struct vm_uffd_ops shmem_uffd_ops = {
> >          .copy                   =       shmem_mfill_atomic_pte_copy,
> >          .zeropage               =       shmem_mfill_atomic_pte_zeropage,
> >          .cont                   =       shmem_mfill_atomic_pte_continue,
> >          .poison                 =       mfill_atomic_pte_poison,
> >          .writeprotect           =       uffd_writeprotect,
> >          .is_dst_valid           =       shmem_is_dst_valid,
> >          .increment              =       mfill_size,
> 
> See below, I wonder if that could be performed by the callbacks invoked as
> part of the prior calls to mfill_loop() etc.
> 
> >          .failed_do_unlock       =       uffd_failed_do_unlock,
> 
> That one is a bit unfortunate (read: ugly :) ).

Agreed.

> 
> failed_do_unlock() is only called from mfill_copy_loop(). Where we perform a
> prior info.uffd_ops->copy.
> 
> 
> After calling err = info->op(info);
> 
> Couldn't that callback just deal with the -ENOENT case?
> 
> So in case of increment/failed_do_unlock, maybe we could find a way to just
> let the ->copy etc communicate/perform that directly.

The failure case is only detected after getting a folio, but will need
to 'retry' (copy is the only one that does a retry).  Retry gets the
destination vma, where the vm_ops comes from.  This is why you need to
return to the loop.  So it's not that simple to moving it into the
function.

In upstream today, the return -ENOENT can only happen for copy (in fact
others mask it out..), but every operation checks for -ENOENT since they
are all using the same code block.

All of this code is more complicated because we have to find the vma
before we know what variation of the operation we need.  Annoyingly,
this is decided per-mfill_size even though the vma doesn't change,
currently in  mfill_atomic_pte().

I think we could find a way to do what you are suggesting, and I think
it's easier and less risky if the logical operations are not being
dispatched based on uffd_flag_t.

> 
> >          .page_shift             =       uffd_page_shift,
> 
> Fortunately, this is not required. The only user in move_present_ptes()
> moves *real* PTEs, and nothing else (no hugetlb PTEs that are PMDs etc. in
> disguise).

The hugetlb code had a different value, so I did extract it when I
Iunited mfill_atomic() and mfill_atomic_hugetlb().  I am sure there are
other changes that could be removed as well, but to logically follow the
changes through each step it seemed easier to extract everything that
was different into its own function pointer.

> 
> >          .complete_register      =       uffd_complete_register,
> > };
> > 
> 
> So, the design is to callback into the memory-type handler, which will then
> use exported uffd functionality to get the job done.
> 
> This nicely abstracts hugetlb handling, but could mean that any code
> implementing this interface has to built up on exported uffd functionality
> (not judging, just saying).
> 
> As we're using the callbacks as an indication whether features are
> supported, we cannot easily leave them unset to fallback to the default
> handling.
> 
> Of course, we could use some placeholder, magic UFFD_DEFAULT_HANDLER keyword
> to just use the uffd_* stuff without exporting them.
> 
> So NULL would mean "not supported" and "UFFD_DEFAULT_HANDLER" would mean "no
> special handling needed".
> 
> Not sure how often that would be the case, though. For shmem it would
> probably only be the poison callback, for others, I am not sure.

There are certainly a lot of this we would not want to export.  My
initial thought was to create two function pointers: one for operations
that can be replaced, and one for basic functions that always have a
default.  We could do this with two function pointers, either tiered or
at the same level.

Most of this is to do with hugetlb having its own code branch into its
own loop.  We could even create an op that is returned that only lives
in mm/userfaultfd.c and has two variants: hugetlb and not_hugetlb.  This
would indeed need the hugetlb.h again, but I'm pretty sure that removing
the header is 'too big of a change' anyways.


> 
> > Where guest-memfd needs to write the one function:
> > guest_memfd_pte_continue(), from what I understand.
> 
> It would be interesting to see how that one would look like.
> 
> I'd assume fairly similar to shmem_mfill_atomic_pte_continue()?
> 
> Interesting question would be, how to avoid the code duplication there.

Yes, this is where I was going here.  I was going to try and finish this
off by creating that one function.  That, and reducing the vm_ops to a
more sensible size (as mentioned above).

shmem_mfill_atomic_pte_continue() could be cut up into function segments
to avoid the duplication.  Or we could make a wrapper that accepts a
function pointer.. there are certainly ways we can mitigate duplication.

Really, what is happening here is that the code was written to try and
use common code.  Then hugetlb came in and duplicated everything
anyways, so we lost what we were gaining by creating a
dispatcher/middleware in the end.  Then handling errors complicated it
further.

The code has also bee __alway_inline'ed, so the assembly duplicates the
code anyways.  It's really an attempt to avoid updating multiple
functions when issues arise.  But now we have error checks that don't
need to happen and they are running in a loop... also hugetlb has its
own loop.  And returning -ENOENT has a special meaning so we have to be
careful of that too.

I don't think the compliers are smart enough to know that the retry
loop can be removed for 3/4 cases, so the assembly is probably
sub-optimal.

> 
> (as a side note, I wonder if we would want to call most of these uffd helper
> uffd_*)

Yeah, sure.  Does it matter for static inlines?  Naming is hard and I
think shmem_mfill_atomic_pte_continue() is a dumb name as well.. it's
too short really :)

> 
> 
> I'll have to think about some of this some more. In particular, alternatives
> to at least get all the shmem logic cleanly out of there and maybe only have
> a handful callback into hugetlb.
> 
> IOW, not completely make everything fit the "odd case" and rather focus on
> the "normal cases" when designing this vm_ops interface here.
> 
> Not sure if that makes sense, just wanted to raise it.

Thanks for looking.  I think there is some use to having this example
and that's why I was asking what it might look like.  I certainly went
overboard in the last few commits to remove hugetlb all together, but at
that point it was so close..

I think there might be value uniting both hugetlb and the normal code
path, even if the operations call signatures are aligned and we just use
a pointer to a struct within the "while (src_addr < src_start + len)"
loop that exists today.

The removal of the uffd_flags_t is also something that might be worth
exploring.

Cheers,
Liam