Open Source and information security mailing list archives
 
Message-ID: <CAJuCfpH1mtKiphP8ipZeD4CNG9Mr4QERJTQAQm_gtZow5G7AAQ@mail.gmail.com>
Date: Thu, 3 Jul 2025 08:01:12 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Peter Xu <peterx@...hat.com>
Cc: Mike Rapoport <rppt@...nel.org>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Vlastimil Babka <vbabka@...e.cz>, 
	Muchun Song <muchun.song@...ux.dev>, Hugh Dickins <hughd@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, James Houghton <jthoughton@...gle.com>, 
	"Liam R . Howlett" <Liam.Howlett@...cle.com>, Nikita Kalyazin <kalyazin@...zon.com>, 
	Michal Hocko <mhocko@...e.com>, David Hildenbrand <david@...hat.com>, 
	Andrea Arcangeli <aarcange@...hat.com>, Oscar Salvador <osalvador@...e.de>, 
	Axel Rasmussen <axelrasmussen@...gle.com>, Ujwal Kundur <ujwal.kundur@...il.com>
Subject: Re: [PATCH v2 1/4] mm: Introduce vm_uffd_ops API

On Wed, Jul 2, 2025 at 1:23 PM Peter Xu <peterx@...hat.com> wrote:
>
> On Wed, Jul 02, 2025 at 09:16:31PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 01, 2025 at 10:04:28AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Jun 30, 2025 at 3:16 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@...cle.com> wrote:
> > > >
> > > > It seems like we're assuming a _lot_ of mm understanding in the underlying
> > > > driver here.
> > > >
> > > > I'm not sure it's really normal to be handing around page table state and
> > > > folios etc. to a driver like this, this is really... worrying to me.
> > > >
> > > > This feels like you're trying to put mm functionality outside of mm?
> > >
> > > To second that, two things stick out for me here:
> > > 1. uffd_copy and uffd_get_folio seem to be at different abstraction
> > > levels. uffd_copy is almost the entire copy operation for VM_SHARED
> > > VMAs while uffd_get_folio is a small part of the continue operation.
> > > 2. shmem_mfill_atomic_pte which becomes uffd_copy for shmem in the
> > > last patch is quite a complex function which itself calls some IMO
> > > pretty internal functions like mfill_atomic_install_pte(). Expecting
> > > modules to implement such functionality seems like a stretch to me but
> > > maybe this is for some specialized modules which are written by mm
> > > experts only?
> >
> > shmem_mfill_atomic_pte() differs from the anonymous memory version
> > (mfill_atomic_pte_copy()) largely in how the allocated folio is accounted
> > and whether it's added to the page cache. So instead of uffd_copy(...) we
> > might add
> >
> >       int (*folio_alloc)(struct vm_area_struct *vma, unsigned long dst_addr);
> >       void (*folio_release)(struct vm_area_struct *vma, struct folio *folio);
>
> Thanks for digging into this, Mike.  It's just that IMHO it may not be
> enough..
>
> I actually tried to think about a more complicated API, but the more I
> thought about it, the more it looked like overkill.  I can list a few
> things here to show why I chose the simplest solution with uffd_copy()
> for now.

TBH, the below does not sound like overkill to me if it keeps mm parts
internal rather than over-exposing them to modules.

>
> Firstly, see shmem_inode_acct_blocks() at the entrance: that's shmem
> accounting we need to do before allocation, whether or not we end up
> allocating.

Ok, this results in an additional folio_prealloc() hook.

> That accounting can't be moved into folio_alloc() even if we add one,
> because the folio may already be allocated when we reach here (that is,
> when *foliop != NULL).  That was a deliberate decision we made back in
> 2016, to do shmem accounting separately:
>
> https://lore.kernel.org/all/20161216144821.5183-37-aarcange@redhat.com/
>
> Then, there's also the complexity of how the page cache should be managed
> for any mem type.  For shmem, the folio is only injected right before the
> pgtable installation.  We can't inject it in folio_alloc(), because then
> others could fault it in before the data is populated.  So we'd need at
> least one more API to do the page cache injection for the folio just
> allocated by folio_alloc().

folio_add_to_cache() hook?

>
> We also may want different treatment of how the folio flags are set up.
> It may not always happen in folio_alloc().  E.g. for shmem right now we
> do this right before the page cache injection:
>
>         VM_BUG_ON(folio_test_locked(folio));
>         VM_BUG_ON(folio_test_swapbacked(folio));
>         __folio_set_locked(folio);
>         __folio_set_swapbacked(folio);
>         __folio_mark_uptodate(folio);
>
> We may not want to do exactly the same for all the other mem types.
> E.g. we likely don't want to set swapbacked for guest-memfd folios.  We
> may need one more API for that.

Can we do that inside folio_add_to_cache() hook before doing the injection?
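As a rough sketch of what I mean, here is a user-space mock (all types
and names below are invented stand-ins, not kernel API) showing how the
per-type flag setup could live inside such a hook, with shmem setting
swapbacked while a guest-memfd-style implementation skips it:

```c
#include <stdbool.h>

/* Toy stand-in for struct folio; real folios use atomic page flags. */
struct folio {
	bool locked, swapbacked, uptodate, in_cache;
};

/* shmem-like hook: mirrors the flag sequence quoted above,
 * with the page cache injection done last. */
static int shmem_add_to_cache(struct folio *f)
{
	f->locked = true;
	f->swapbacked = true;	/* shmem folios are swap-backed */
	f->uptodate = true;
	f->in_cache = true;	/* page cache injection */
	return 0;
}

/* guest-memfd-like hook: same shape, but never swap-backed. */
static int gmem_add_to_cache(struct folio *f)
{
	f->locked = true;
	f->uptodate = true;
	f->in_cache = true;
	return 0;
}
```

The point being that the core mfill path would only ever call the hook;
each mem type keeps its own flag policy private.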

>
> Then, thinking about the failure path, given the shmem accounting
> question above we may want yet another post_failure hook that does
> shmem_inode_unacct_blocks() properly for shmem.  Maybe we don't need
> that for guest-memfd, but we still need it for shmem to unacct properly
> when the folio allocation succeeded but copy_from_user failed somehow.

Failure handling hook.
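A toy, user-space illustration of that shape (hypothetical names, not
the actual proposal): the core mfill path charges the accounting up
front and invokes the type's failure hook to unacct when the copy step
fails; a type with no such accounting can leave the hook NULL:

```c
#include <stddef.h>

/* Invented stand-in for per-type state and its failure hook. */
struct mem_type {
	long acct_blocks;	/* blocks charged so far */
	void (*copy_failed)(struct mem_type *mt, long blocks);
};

/* shmem-like hook: undo the accounting done at prealloc time,
 * analogous to shmem_inode_unacct_blocks(). */
static void shmem_copy_failed(struct mem_type *mt, long blocks)
{
	mt->acct_blocks -= blocks;
}

/* Core mfill path: charge, "copy", and roll back on failure. */
static int mfill_one(struct mem_type *mt, int copy_ok)
{
	mt->acct_blocks += 1;		/* prealloc-time accounting */
	if (!copy_ok) {
		if (mt->copy_failed)	/* e.g. copy_from_user failed */
			mt->copy_failed(mt, 1);
		return -1;
	}
	return 0;
}
```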

>
> Then the question is, do we really need all this fuss for almost nothing?

If it helps keep modules simpler and mm details contained inside the
core kernel, I think it is worth doing. I imagine that if shmem were a
module, implementing the current API would require exporting functions
like mfill_atomic_install_pte(). That seems like over-exposure to me.
And if we can avoid all the locking and refcounting that Liam
mentioned, that would definitely be worth it IMHO.
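To make that concrete, a rough user-space mock of how a finer-grained op
table could keep the PTE installation (the mfill_atomic_install_pte()
step) inside core mm, with modules only handling accounting, allocation
and page cache insertion. Everything below is invented for illustration:

```c
#include <stddef.h>

struct folio { int populated; };

/* Hypothetical finer-grained op table, per the hooks discussed above. */
struct vm_uffd_ops_alt {
	int (*folio_prealloc)(void *inode);	/* accounting */
	struct folio *(*folio_alloc)(void *inode);
	int (*folio_add_to_cache)(void *inode, struct folio *f);
};

static int installed;	/* stands in for PTEs installed by core mm */

/* Core-side mfill: modules never see page table state. */
static int mfill_atomic_core(const struct vm_uffd_ops_alt *ops, void *inode)
{
	struct folio *f;

	if (ops->folio_prealloc && ops->folio_prealloc(inode))
		return -1;
	f = ops->folio_alloc(inode);
	if (!f)
		return -1;
	f->populated = 1;			/* copy user data (core) */
	if (ops->folio_add_to_cache(inode, f))
		return -1;
	installed++;	/* mfill_atomic_install_pte() stays in core */
	return 0;
}

/* Trivial toy implementation playing the module's role. */
static struct folio the_folio;
static int toy_prealloc(void *inode) { (void)inode; return 0; }
static struct folio *toy_alloc(void *inode) { (void)inode; return &the_folio; }
static int toy_add(void *inode, struct folio *f)
{
	(void)inode; (void)f; return 0;
}

static const struct vm_uffd_ops_alt toy_ops = {
	.folio_prealloc = toy_prealloc,
	.folio_alloc = toy_alloc,
	.folio_add_to_cache = toy_add,
};
```

With this split, nothing like mfill_atomic_install_pte() would need to
be exported; the module only supplies folios and its own accounting.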

>
> Note that if we want, IIUC we can still change this in the future.  An
> initial isolation like this series would still be good to land early; we
> don't even plan to support MISSING for guest-memfd in the near future,
> only MINOR mode for now.  We don't necessarily need to work out MISSING
> immediately to move forward with landing useful features in Linux.  Even
> if we do add it for guest-memfd, it shouldn't be a challenge to maintain
> both.  This is just not a contract we need to sign forever yet.
>
> Hope the above clarifies a bit why I chose the simplest solution for
> now.  I also don't like this API, as I mentioned in the cover letter;
> it's really a trade-off, and at least for now the best I can come up
> with.
>
> That said, if anyone has a better solution, please shoot.  I'm always
> open to better alternatives.

I didn't explore this code as deeply as you have and therefore might
not see all the pitfalls, but it looks like you have already considered
an alternative that sounds better to me. What are the drawbacks that I
might be missing?
Thanks,
Suren.

>
> Thanks,
>
> --
> Peter Xu
>
