linux-kernel - Re: [PATCH v2 1/4] mm: Introduce vm_uffd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aMnAscxj_h42wOAC@x1.local>
Date: Tue, 16 Sep 2025 15:55:29 -0400
From: Peter Xu <peterx@...hat.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
	David Hildenbrand <david@...hat.com>,
	Mike Rapoport <rppt@...nel.org>,
	Nikita Kalyazin <kalyazin@...zon.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>,
	Muchun Song <muchun.song@...ux.dev>,
	Hugh Dickins <hughd@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	James Houghton <jthoughton@...gle.com>,
	Michal Hocko <mhocko@...e.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Oscar Salvador <osalvador@...e.de>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	Ujwal Kundur <ujwal.kundur@...il.com>
Subject: Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API

On Fri, Jul 04, 2025 at 03:39:32PM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@...hat.com> [250704 11:00]:
> > On Fri, Jul 04, 2025 at 11:34:15AM +0200, David Hildenbrand wrote:
> > > On 03.07.25 19:48, Mike Rapoport wrote:
> > > > On Wed, Jul 02, 2025 at 03:46:57PM -0400, Peter Xu wrote:
> > > > > On Wed, Jul 02, 2025 at 08:39:32PM +0300, Mike Rapoport wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > The main target of this change is the implementation of UFFD for
> > > > > > > KVM/guest_memfd (examples: [1], [2]) to avoid bringing KVM-specific code
> > > > > > > into the mm codebase.  We usually mean KVM by the "drivers" in this context,
> > > > > > > and it is already somewhat "knowledgeable" of the mm.  I don't think there
> > > > > > > are existing use cases for other drivers to implement this at the moment.
> > > > > > > 
> > > > > > > Although I can't see new exports in this series, there is now a way to limit
> > > > > > > exports to particular modules [3].  Would it help if we only do it for KVM
> > > > > > > initially (if/when actually needed)?
> > > > > > 
> > > > > > There were talks about pulling out guest_memfd core into mm, but I don't
> > > > > > remember patches about it. If parts of guest_memfd were already in mm/ that
> > > > > > would make easier to export uffd ops to it.
> > > > > 
> > > > > Do we have a link to such discussion?  I'm also curious whether that idea
> > > > > was acknowledged by KVM maintainers.
> > > > 
> > > > AFAIR it was discussed at one of David's guest_memfd calls
> > > 
> > > While it was discussed in the call a couple of times in different context
> > > (guest_memfd as a library / guest_memfd shim), I think we already discussed
> > > it back at LPC last year.
> > > 
> > > One of the main reasons for doing that is supporting guest_memfd in other
> > > hypervisors -- the gunyah hypervisor in the kernel wants to make use of it
> > > as well.
> > 
> > I see, thanks for the info. I found the series, it's here:
> > 
> > https://lore.kernel.org/all/20241113-guestmem-library-v3-0-71fdee85676b@quicinc.com/
> > 
> > Here, the question is whether do we still want to keep explicit calls to
> > shmem, hugetlbfs and in the future, guest-memfd.  The library-ize of
> > guest-memfd doesn't change a huge lot on answering this question, IIUC.
> 
> Can you explore moving hugetlb_mfill_atomic_pte and
> shmem_mfill_atomic_pte into mm/userfaultfd.c and generalizing them to
> use the same code?
> 
> That is, split the similar blocks into functions and reduce duplication.
> 
> These are under the UFFD config option and are pretty similar.  This
> will also limit the bleeding of mfill_atomic_mode out of uffd.

These are different problems to solve.

I did propose a refactoring of hugetlbfs in general years ago.  I didn't
finish that.  It's not only because it's a huge amount of work that almost
nobody likes to review (even if everybody likes to see landing), also that
since we decided to not add more complexity into hugetlbfs code like HGM
there's also less point on refactoring.

I hope we can be on the same page to understand the problem we want to
solve here.  The problem is we want to support userfaultfd on guest-memfd.
Your suggestion could be helpful, but IMHO it's orthogonal to the current
problem.  It's also not a dependency.  If it is, you should see code
duplications introduced in this series, which might not be needed after a
cleanup.  It's not like that.

This series simply resolves a totally different problem.  The order on
solving this problem or cleaning things elsewhere (or we decide to not do
it at all) are not deterministic, IMHO.

> 
> 
> 
> If you look at what the code does in userfaultfd.c, you can see that
> some refactoring is necessary for other reasons:
> 
> mfill_atomic() calls mfill_atomic_hugetlb(), or it enters a while
> (src_addr < src_start + len) to call mfill_atomic_pte().. which might
> call shmem_mfill_atomic_pte() in mm/shmem.c
> 
> mfill_atomic_hugetlb() calls, in a while (src_addr < src_start + len)
> loop and calls hugetlb_mfill_atomic_pte() in mm/hugetlb.c

Again, IMHO this still falls into hugetlbfs refactoring category.  I would
sincerely request that we put that as a separate topic to discuss, because
it's huge and community resources are limited on both developments and
reviews.

Shmem is already almost sharing code with anonymous, after all these two
memory types are the only ones that support userfaultfd besides hugetlbfs.

> 
> The shmem call already depends on the vma flags.. which it still does in
> your patch 4 here.  So you've replaced this:
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
>         shmem_mfill_atomic_pte()
> }
> 
> With...
> 
> if (!(dst_vma->vm_flags & VM_SHARED)) {
> ...
> } else {
> ...
>         uffd_ops->uffd_copy()
> }

I'm not 100% sure I get this comment.  It's intentional to depend on vma
flags here for shmem.  See Andrea's commit:

commit 5b51072e97d587186c2f5390c8c9c1fb7e179505
Author: Andrea Arcangeli <aarcange@...hat.com>
Date:   Fri Nov 30 14:09:28 2018 -0800

    userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem

Are you suggesting we should merge it somehow?  I'd appreciate some
concrete example here if so.

> 
> So, really, what needs to happen first is userfaultfd needs to be
> sorted.
> 
> There's no point of creating a vm_ops_uffd if it will just serve as
> replacing the call locations of the functions like this, as it has done
> nothing to simplify the logic.

I had a vague feeling that you may not have understood the goal of this
series.  I thought we were on the same page there.  It could be partly my
fault if that's not obvious, I can try to be more explicit in the cover
letter next if I'm going to repost.

Please consider supporting guest-memfd as the goal, vm_ops_uffd is to allow
all changes to happen in guest-memfd.c.  Without it, it can't happen.
That's the whole point so far.

> 
> > However if we want to generalize userfaultfd capability for a type of
> > memory, we will still need something like the vm_uffd_ops hook to report
> > such information.  It means drivers can still overwrite these, with/without
> > an exported mfill_atomic_install_pte() functions.  I'm not sure whether
> > that eases the concern.
> 
> If we work through the duplication and reduction where possible, the
> path forward may be easier to see.
> 
> > 
> > So to me, generalizing the mem type looks helpful with/without moving
> > guest-memfd under mm/.
> 
> Yes, it should decrease the duplication across hugetlb.c and shmem.c,
> but I think that userfaultfd is the place to start.
> 
> > 
> > We do have the option to keep hard-code guest-memfd like shmem or
> > hugetlbfs. This is still "doable", but this likely means guest-memfd
> > support for userfaultfd needs to be done after that work.  I did quickly
> > check the status of gunyah hypervisor [1,2,3], I found that all of the
> > efforts are not yet continued in 2025.  The hypervisor last update was Jan
> > 2024 with a batch push [1].
> > 
> > I still prefer generalizing uffd capabilities using the ops.  That makes
> > guest-memfd support on MINOR not be blocked and it should be able to be
> > done concurrently v.s. guest-memfd library.  If guest-memfd library idea
> > didn't move on, it's non-issue either.
> > 
> > I've considered dropping uffd_copy() and MISSING support for vm_uffd_ops if
> > I'm going to repost - that looks like the only thing that people are
> > against with, even though that is not my preference, as that'll make the
> > API half-broken on its own.
> 
> The generalisation you did does not generalize much, as I pointed out
> above, and so it seems less impactful than it could be.
> 
> These patches also do not explore what this means for guest_memfd.  So
> it is not clear that the expected behaviour will serve the need.
> 
> You sent a link to an example user.  Can you please keep this work
> together in the patch set so that we know it'll work for your use case
> and allows us an easier way to pull down this work so we can examine it.
> 
> Alternatively, you can reduce and combine the memory types without
> exposing the changes externally, if they stand on their own.  But I
> don't think anyone is going to accept using a vm_ops change where a
> direct function call could be used.

While I was absent, Nikita sent a version that is based on the library
code.  That uses direct function calls:

https://lore.kernel.org/all/20250915161815.40729-3-kalyazin@amazon.com/

I think it's great we have the other sample of implementing this.

I believe Nikita has no strong opinion how this will land at last, whatever
way the community prefers.  I prefer this solution I sent.

Do you have a preference, when explicitly there's the other one to compare?

> 
> > Said that, I still prefer this against
> > hard-code and/or use CONFIG_GUESTMEM in userfaultfd code.
> 
> It also does nothing with regards to the CONFIG_USERFAULTFD in other
> areas.  My hope is that you could pull out the common code and make the
> CONFIG_ sections smaller.
> 
> And, by the way, it isn't the fact that we're going to copy something
> (or mfill_atomic it, I guess?) but the fact that we're handing out the
> pointer for something else to do that.  Please handle this manipulation
> in the core mm code without handing out pointers to folios, or page
> tables.

Is "return pointers to folios" also an issue now in the API?  Are you
NACKing uffd_get_folio() API that this series proposed?

> 
> You could do this with the address being passed in and figure out the
> type, or even a function pointer that you specifically pass in an enum
> of the type (I think this is what Lorenzo was suggesting somewhere),
> maybe with multiple flags for actions and fallback (MFILL|ZERO for
> example).

I didn't quickly get it.  I appreciate if there's some more elaborations.

Thanks,

-- 
Peter Xu