Message-ID: <swfs7qpgrezamnijhheiggwdfklfqdc6ahp5g7nvprr64m7wz5@msf2mqajzbuz>
Date: Thu, 18 Sep 2025 21:50:49 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Peter Xu <peterx@...hat.com>
Cc: David Hildenbrand <david@...hat.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Nikita Kalyazin <kalyazin@...zon.com>, Mike Rapoport <rppt@...nel.org>,
Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>,
Muchun Song <muchun.song@...ux.dev>, Hugh Dickins <hughd@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
James Houghton <jthoughton@...gle.com>, Michal Hocko <mhocko@...e.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Oscar Salvador <osalvador@...e.de>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Ujwal Kundur <ujwal.kundur@...il.com>
Subject: Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API
* Peter Xu <peterx@...hat.com> [250918 17:07]:
> On Thu, Sep 18, 2025 at 03:43:34PM -0400, Liam R. Howlett wrote:
> > * Peter Xu <peterx@...hat.com> [250918 14:21]:
> > > On Thu, Sep 18, 2025 at 07:53:46PM +0200, David Hildenbrand wrote:
> > > > Re Nikita: If we could just reuse fault() for userfaultfd purposes, that
> > > > might actually be pretty nice.
> > >
> > > I commented on that.
> > >
> > > https://lore.kernel.org/all/aEiwHjl4tsUt98sh@x1.local/
>
> [1]
>
> > >
> > > That'll need to leak FAULT_FLAG_USERFAULT_CONTINUE, make it extremely
> > > hard to know when to set the flag, and complicate the fault path, none
> > > of which is necessary.
> > >
> > > I think Mike's comment was spot on, that the new API is literally
> > > do_fault() for shmem, but only used in userfaultfd context, so it's
> > > even a one-liner.
> > >
> > > I do not maintain mm, so the above is only my two cents; I don't make
> > > the decisions. Personally I still prefer the current approach of
> > > keeping the main mm fault path clean.
> >
> > What we are trying to say is you can have a fault path that takes a type
> > enum and calls into a function that does whatever you want. It can
> > even live in mm/userfaultfd.c. It can then jump off to mm/guest-memfd.c,
> > which can contain super unique copying of memory if that's needed.
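To make that concrete, here's roughly the shape I had in mind. Every
name below is invented for illustration; none of it is from the series:

enum uffd_mem_type {
        UFFD_MEM_SHMEM,
        UFFD_MEM_HUGETLB,
        UFFD_MEM_GUEST_MEMFD,
};

/* Each handler lives with the mm code that owns the memory type. */
int uffd_continue_shmem(struct vm_area_struct *vma, unsigned long addr);
int uffd_continue_guest_memfd(struct vm_area_struct *vma,
                              unsigned long addr);

/*
 * mm/userfaultfd.c: the single entry point, so drivers never touch the
 * folio handling directly.
 */
static int uffd_handle_continue(struct vm_area_struct *vma,
                                unsigned long addr, enum uffd_mem_type type)
{
        switch (type) {
        case UFFD_MEM_SHMEM:
                return uffd_continue_shmem(vma, addr);
        case UFFD_MEM_GUEST_MEMFD:
                /* jumps off to mm/guest-memfd.c */
                return uffd_continue_guest_memfd(vma, addr);
        default:
                /* other types elided from the sketch */
                return -EINVAL;
        }
}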
>
> Since you mention mm/guest-memfd.c, are you suggesting we take the
> guest-memfd library approach?
> We have at least three proposals in total:
>
> (a) https://lore.kernel.org/all/20250404154352.23078-1-kalyazin@amazon.com/
> (b) this one
> (c) https://lore.kernel.org/all/20250915161815.40729-1-kalyazin@amazon.com/
>
> I reviewed (a) and (c) and provided my comments. If you prefer the
> library approach, feel free to reply directly to my email in the (c)
> thread.
>
> I chose (b) from when it was posted.
Honestly, I don't know what I'd vote for because I don't like any of
them. I'd choose (d), the do-nothing option.
>
>
> >
> > That way driver/i_know_better_than_everyone.c or fs/stature.c don't
> > decide they can register their uffd and do cool stuff that totally won't
> > tank the system in random strange ways.
>
> What is the difference if they are allowed to register ->fault() and tank
> the system?
One less problem.
- More people with mm experience looking at the handling of folios.
- The common code not being cloned, so a fix to the original doesn't
  need to be replicated when an issue is discovered.
- Having to update only a few fault handlers when there is a folio or
  other mm change.
- Hopefully better testing?
> >
> > Seriously, how many fault handlers are you expecting to have here?
>
> First of all, it's not about "how many". We can assume one user as of now.
> Talking about any future user doesn't really help. The choice I made above
> of (b) is, I think, the best solution given all known possible users. The
> plan might change when more use cases pop up; for now we can only try to
> make a fair decision based on the status quo.
Planning to handle one, five, or two billion makes a difference in what
you do. Your plan right now enables everyone to do whatever they want,
wherever they want. I don't think we need this sort of flexibility for
such a limited number of users.
>
> OTOH, vm_uffd_ops also provides other fields (besides the uffd_*() hooks).
> I wouldn't be surprised if a driver wants to opt in via some of the fields
> with zero hooks attached at all, when a userfaultfd feature is
> automatically friendly to all kinds of memory types.
>
> Consider one VMA that sets UFFDIO_WRITEPROTECT but not most of the
> rest.
>
> As I discussed, IMHO (b) is the clean way to describe userfaultfd demand
> for any memory type.
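If I'm reading that right, you mean something like this? The field
name below is my guess for illustration, not necessarily the actual
struct from the series:

static const struct vm_uffd_ops foo_uffd_ops = {
        /* say this memory type only supports write-protect... */
        .uffd_ioctls    = BIT(_UFFDIO_WRITEPROTECT),
        /* ...with zero hooks attached: no uffd_get_folio, no uffd_copy */
};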
>
> >
> > I'd be surprised if a lot of the code in these memory types isn't
> > shared, but I guess if they are all over the kernel they'll just clone
> > the code and the bugs (like the arch code does, but with less reason).
> >
> > > Besides, this series also cleans up other places all over the tree;
> > > vm_uffd_ops is the most simplified description of a memory type.
> >
> > 6 files changed, 207 insertions(+), 76 deletions(-)
> >
> > Can you please point me to which patch has the cleanup?
>
> Patch 4. If you want me to explain every change I touched that is a
> cleanup, I can go into details. Maybe it's faster if you read them;
> it's not a huge patch.
I responded here [1]. I actually put a lot of effort into that response
and took a lot of time digging into some of this to figure out whether
it was possible, and I suggested some ideas.
That was back in July, so the details aren't that fresh anymore. Maybe
you missed my reply?
I was hoping that, if the code was cleaned up, a solution might be
clearer.
I think the ops idea has a lot of positives. I also think you don't
need to return a folio pointer to make it work.
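As a sketch of what I mean (hypothetical names, not the struct from the
series), the callback could perform the whole operation and return a
status, so no folio pointer ever crosses the API:

struct vm_uffd_ops {
        /*
         * Resolve the page at pgoff and install it at addr.  The
         * folio is looked up, locked, and released inside the call
         * (which can share mm-internal helpers for the folio work).
         */
        int (*uffd_continue)(struct vm_area_struct *vma,
                             unsigned long addr, pgoff_t pgoff);
};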
> >
> > How is a generic callback that splits out into (probably four?)
> > functions the deal breaker here?
> >
> > How is leaking a flag the line that we won't cross?
> >
> > > So IMHO it's beneficial in other aspects as well. If uffd_copy() is a
> > > concern, fine, we drop it. We don't plan to have more use of UFFDIO_COPY
> > > outside of the known three memory types after all.
> >
> > EXACTLY! There are three memory types and we're going with the most
> > flexible interface possible, with the most danger. With guest_memfd
> > we're up to four functions we'd need. Why not keep the mm code in mm
> > and have four functions to choose from? If you want five we can always
> > add another.
>
> I assume for most of the remaining comments you're suggesting the library
> approach. If so, again, I suggest we discuss that explicitly in that thread.
>
> The library code may (or may not) be useful for other purposes. For the
> support of userfaultfd, it definitely doesn't need to be a dependency.
> OTOH, we may still want some abstraction like this one with or without the
> library. If so, I don't really see why we need to push back on this.
>
> AFAIU, the only concern (after dropping uffd_copy) is that we'll expose
> uffd_get_folio().
If you read my response [1], then you can see that I find the ops
underutilized and lacking in code simplification.
> Please help explain why ->fault() isn't worse.
I'm not sure it is worse, but I don't think it's necessary to return a
folio for four users. And I think it could be better if we handled the
operations on the folio internally, if at all possible.
>
> If we accepted ->fault() for all these years, I don't see a reason we
> should reject ->uffd_get_folio(), especially when one of the goals is to
> keep the core mm path clean, per my comment on proposal (a).
I see this argument as saying "there's a hole in our boat, so why can't
I make another?" It's not the direction we have to go to get what we
need right now, so why are we doing it? Like you said, it can be
evaluated later if things change.
My thinking is that we only really need to do a limited number of
operations on that pointer you are returning. Those operations may
share code, and could be internal to mm. I don't see this as (a), (b),
or (c), but maybe as an addition to (b)? Maybe we need more ops to
cover the uses? A sketch of the shape is below.
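Something like this, with invented helper names (and guessing at the
series' uffd_get_folio() signature) just to show the shape:

/*
 * mm-internal: mm calls the type's lookup, then does the shared work
 * itself, so the folio never escapes this function.
 */
static int uffd_do_continue(struct vm_area_struct *vma,
                            unsigned long addr, pgoff_t pgoff)
{
        const struct vm_uffd_ops *ops = vma_uffd_ops(vma); /* hypothetical */
        struct folio *folio;
        int err;

        err = ops->uffd_get_folio(file_inode(vma->vm_file), pgoff, &folio);
        if (err)
                return err;

        /* shared by every memory type: lock, install the pte, accounting */
        err = uffd_install_folio(vma, addr, folio); /* hypothetical */
        folio_put(folio);
        return err;
}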
So, I think I do want the vm_uffd_ops idea, but not as it is written
right now.
Thanks,
Liam
[1]. https://lore.kernel.org/all/e7vr62s73dftijeveyg6lfgivctijz4qcar3teswjbuv6gog3k@4sbpuj35nbbh/