Message-ID: <43lezmpxcs63qpgefapvjkrxmswmbloc4dba3i26pdxfoaycur@2bfofnrxlwjj>
Date: Mon, 3 Nov 2025 13:43:23 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Mike Rapoport <rppt@...nel.org>
Cc: Peter Xu <peterx@...hat.com>, David Hildenbrand <david@...hat.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Muchun Song <muchun.song@...ux.dev>,
        Nikita Kalyazin <kalyazin@...zon.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        James Houghton <jthoughton@...gle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        Hugh Dickins <hughd@...gle.com>, Michal Hocko <mhocko@...e.com>,
        Ujwal Kundur <ujwal.kundur@...il.com>,
        Oscar Salvador <osalvador@...e.de>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH v4 0/4] mm/userfaultfd: modulize memory types

* Mike Rapoport <rppt@...nel.org> [251103 11:11]:
> Hi Liam,
> 
> On Thu, Oct 30, 2025 at 01:13:24PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@...hat.com> [251021 12:28]:
> > 
> > ...
> > 
> > > Can you send some patches and show us the code, help everyone to support
> > > guest-memfd minor fault, please?
> > 
> > Patches are here:
> > 
> > https://git.infradead.org/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/modularized_mem
> 
> It's really cool you picked up the gauntlet and invested this effort into
> refactoring of uffd!

Thanks.

I have found that a few places really needed attention to avoid
expanding this code into more middleware.  The existing code has bundled
all operations through one loop, and now every operation needs to check
for special cases even when they cannot be triggered.

And that loop is duplicated for hugetlb, so if there is an issue we
already have to fix the code in two places, and we also have to check
unnecessary error conditions for 3 out of 4 operations, in both loops.
This seems like the worst of both approaches - multiple call paths
and unnecessary checks in loops.

> 
> I agree that userfault code begs for cleanups after the sh^W stuff has been
> piling over and over, but ...
>  
> > This is actually modularized memory types.  That means there is no
> > hugetlb.h or shmem.h included in mm/userfaultfd.c code.
> > 
> > uffd_flag_t has been removed.  This was turning into a middleware and
> > it is not necessary.  Neither is supported_ioctls.
> > 
> > hugetlb now uses the same functions as every other memory type,
> > including anon memory.
> > 
> > Any memory type can change functionality without adding instructions or
> > flags or anything to some other code.
> > 
> > This code passes uffd-unit-test and uffd-wp-mremap (skipped the swap
> > tests).
> > 
> > guest-memfd can implement whatever it needs to (or use others
> > implementations), like shmem_uffd_ops here:
> > 
> > static const struct vm_uffd_ops shmem_uffd_ops = {
> >         .copy                   =       shmem_mfill_atomic_pte_copy,
> >         .zeropage               =       shmem_mfill_atomic_pte_zeropage,
> >         .cont                   =       shmem_mfill_atomic_pte_continue,
> >         .poison                 =       mfill_atomic_pte_poison,
> >         .writeprotect           =       uffd_writeprotect,
> >         .is_dst_valid           =       shmem_is_dst_valid,
> >         .increment              =       mfill_size,
> >         .failed_do_unlock       =       uffd_failed_do_unlock,
> >         .page_shift             =       uffd_page_shift,
> >         .complete_register      =       uffd_complete_register,
> > };   
> 
> ... I don't think it's the right level of abstraction to add as uffd_ops to
> vmap_ops.
>  
> As I see it, we have two levels where things are different: hugetlb vs
> others at the very core of mfill_atomic() and then how different pte-based
> types implement a single page operations, i.e copy/zeropage/continue.
> 
> So to separate hugetlb code from userfault we need something like 
> 
> 	->get_parent_pagetable()
> 	->pagesize()
> 	->mfill_atomic_page()
> 
> and apparently something like your complete_register() and maybe
> is_dst_valid().
> 
> But to provide hooks for shmem, anon and potentially guest_memfd() we
> should be looking at callbacks to get a folio to populate a PTE, either for
> copy or continue, and Peter's ->minor_get_folio() seems to me the right
> level of abstraction to expose at vm_uffd_ops.

This was entirely a cleanup; nothing has been exported.

When adding the guest_memfd type, I want to export the smallest set of
these that we need.  I've come up with a few ways of doing that so far,
and that's why I was asking how to test the guest_memfd use case.

Both ideas involve splitting these ops into two: external and internal.
We can either nest one ops struct inside the other or add both directly
to the vma struct.  I'm leaning towards embedding a second op struct
into uffd_ops.

We can default to the common cases and allow the hugetlb code to set its
own (it could never be a module anyway..).  As in the example code I've
sent out, you can see we fall back to defaults for anon vmas if
!vma->vm_ops || !vma->vm_ops->userfaultfd.  We can use this same method.

The other thing about going this far was that, by the time I reached the
point of having the targeted mfill_atomic operations modularized, it was
very close to being able to drop shmem.h and hugetlb.h from the code
entirely, so a few of these exist to completely decouple the code.

I'm not sure that it is necessary, but I was trying to show how to
completely modularize the memory types, specifically since I was asked
how.

> 
> I believe we can extract ->get_folio() and ->put_folio() from
> shmem_mfill_atomic_pte() and call them from mfill_atomic_pte_copy().
> 
> > Where guest-memfd needs to write the one function:
> > guest_memfd_pte_continue(), from what I understand.
> > 
> > Obviously some of the shmem_ functions would need to be added to a
> > header, or such.
> > 
> > And most of that can come from shmem_mfill_atomic_pte_continue(), from
> > what I understand.  This is about 40 lines of code, but may require
> > exposing some shmem functions to keep the code that compact.
> 
> This seems to me an overkill to implement MFILL_ATOMIC_CONTINUE for
> guest_memfd().
> I think it should be able to register a callback to provide a single folio
> at a given file offset if that folio is in the guest_memfd's page cache.
> No reason for guest_memfd to start re-implementing locking, acquiring of
> PMD and updating the page table, even if it only needs to call functions
> from userfaultfd.

There are certainly ways around this as well; a guest_memfd and shmem
function that passes the pointer to the core function comes to mind.
That would avoid the safety concerns about handing out the folio.

We could also have a single function pointer that is exported and only
used in shmem functions when !null or fall back to the default.

It's not unsolvable, and it could build on the fact that this is less
middleware and more of an interface.

> 
> > So we don't need to expose getting a folio to a module, or decode any
> > special flags or whatever.  We just call the function that needs to be
> > called on the vma that is found.
> 
> Agree about exposing flags to a module and about limiting API to functions
> only.

Unfortunately, the middleware and flags will continue to grow.

Given the tone of the response I received on my clean up work after
being asked, I will not be continuing to work on this.

Regards,
Liam

